
World Value Models for Robotic Manipulation

Zhihao Wang1,2,3, Jianxiong Li1,3,†, Yu Cui1,§, Yuan Gao3, Xianyuan Zhan3, Junzhi Yu2,§, Xiao Ma1

1ByteDance Seed  ·  2Peking University  ·  3Tsinghua University

†Project Lead  ·  §Corresponding Author

Preprint 2026  ·  value model  ·  world model

World Value Model — marrying world models with value estimation

Abstract

Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding: a model must both ground its current belief in historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, so they lack the temporal modeling capabilities that value estimation requires. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we marry world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), which provides accurate task-progress estimates for assessing data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. To complement standard evaluation suites that contain only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulated and real-world deployment, providing robust guidance for learning from mixed-quality data.

Overview


An overview of our work. CV by Doubao.

WVM Architecture

Architecture of WVM. A video DiT is coupled with a lightweight value DiT through a Mixture-of-Transformers (MoT) architecture. The values are formulated as a chunk of value flow, with value prefix randomization applied during training.
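The caption only states that values are modeled as a chunk of value flow and that a value prefix is randomized during training. As a rough illustration of what such a training example could look like, the sketch below assumes a linear flow-matching path from Gaussian noise to the value chunk and a randomly sized clean prefix that is excluded from the loss. The function names, the path, the noise distribution, and the masking rule are all assumptions made for illustration, not the authors' implementation.

```python
import random


def make_value_flow_example(value_chunk, rng, max_prefix=None):
    """Build one flow-matching training example for a chunk of values.

    Assumptions (not from the source): a linear noise-to-data path,
    Gaussian noise, and a clean prefix that is excluded from the loss.
    """
    n = len(value_chunk)
    if max_prefix is None:
        max_prefix = n - 1  # always leave at least one value to predict
    prefix_len = rng.randint(0, max_prefix)  # randomized length, bounds inclusive
    t = rng.random()  # flow time in [0, 1)

    noisy, target, loss_mask = [], [], []
    for i, v in enumerate(value_chunk):
        if i < prefix_len:
            # Prefix values are given as clean context and not supervised.
            noisy.append(v)
            target.append(0.0)
            loss_mask.append(0.0)
        else:
            eps = rng.gauss(0.0, 1.0)
            noisy.append((1.0 - t) * eps + t * v)  # point on the path at time t
            target.append(v - eps)                 # constant velocity of that path
            loss_mask.append(1.0)
    return {"noisy": noisy, "target": target, "loss_mask": loss_mask,
            "t": t, "prefix_len": prefix_len}


def masked_flow_loss(pred_velocity, example):
    """Mean squared velocity error over the non-prefix positions only."""
    num = sum(m * (p - u) ** 2 for p, u, m in
              zip(pred_velocity, example["target"], example["loss_mask"]))
    return num / sum(example["loss_mask"])
```

Under these assumptions, a network that predicts the velocity target exactly on the non-prefix positions reaches zero loss, while the prefix positions act purely as conditioning.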

Suboptimal-Value-Bench

Suboptimal-Value-Bench. A multi-embodiment benchmark of 800 fully human-verified trajectories spanning 15 tasks across both simulation and the real world. By featuring two prevalent suboptimal behavioral modes—retries and hesitations—it enables a comprehensive evaluation of generalist value models that extends far beyond the scope of existing metrics.

Value Estimation

Qualitative value estimation. WVM more faithfully reflects the temporal dynamics of hesitation, retry, and expert trajectories.


Retry VOC↑ (higher is better)

Split                 GVL    VLAC   Robometer  TopReward  WVM
Suboptimal-AgileX     0.73   −0.37  0.32       0.15       0.79
Suboptimal-ARX        0.76   /      −0.27      −0.19      0.79
Suboptimal-RoboSuite  0.43   /      −0.37      0.00       0.75
Average               0.62   −0.37  −0.16      0.00       0.78

Hesitation RMSE↓ (lower is better)

Split                 GVL    VLAC   Robometer  TopReward  RoboReward  Robo-Dopamine  WVM
Suboptimal-AgileX     0.11   0.47   0.13       0.36       0.12        0.41           0.07
Suboptimal-ARX        0.14   0.50   0.12       0.24       0.17        0.52           0.05
Suboptimal-RoboSuite  0.16   0.54   0.16       0.33       0.31        0.51           0.04
Average               0.14   0.51   0.14       0.31       0.21        0.49           0.05

Expert VOC↑ (higher is better)

Dataset                         GVL    VLAC   Robometer  TopReward  RoboReward  Robo-Dopamine  WVM
OXE                             0.67   0.48   0.63       0.19       0.92        0.72           0.94
RoboCOIN                        0.70   0.60   0.77       0.47       0.85        0.75           0.95
EgoDex                          0.82   0.62   0.86       0.37       0.95        0.88           0.92
Self-collected (3 embodiments)  0.93   0.50   0.93       0.58       0.84        0.76           0.99
Average                         0.78   0.59   0.81       0.42       0.88        0.82           0.95
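For readers unfamiliar with the two metrics in the tables above, the sketch below shows one common way to compute them. It assumes VOC is the Spearman rank correlation between predicted per-frame values and a reference ordering (the frame index for expert data, or a labeled progress signal where one exists), and that RMSE is taken between predicted and labeled per-frame values. The function names and the handling of constant predictions are illustrative assumptions, and the page does not specify the exact evaluation protocol.

```python
import math


def _ranks(xs):
    """1-based ranks; tied entries share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def _pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # assumption: a constant sequence carries no order information
    return cov / math.sqrt(va * vb)


def value_order_correlation(pred_values, reference=None):
    """Spearman correlation between predictions and a reference ordering.

    With reference=None the chronological frame index is used.
    """
    if reference is None:
        reference = list(range(len(pred_values)))
    return _pearson(_ranks(pred_values), _ranks(reference))


def rmse(pred_values, target_values):
    """Root-mean-square error between predicted and labeled values."""
    n = len(pred_values)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(pred_values, target_values)) / n)
```

A monotonically increasing prediction on an expert trajectory gives a VOC of 1 under this definition, and a strictly decreasing one gives −1.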

Downstream Policy Learning

Task setups. Downstream policy learning experiments are conducted across both simulation and the real world.

Policy improvement. WVM boosts downstream policy learning with both Advantage-Weighted Regression (AWR) and Filtered BC.
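To make the two policy extraction approaches concrete, the sketch below shows how per-frame value estimates could be turned into training signals. It assumes the advantage is the change in estimated value over a short horizon, that AWR weights are clipped exponentials of that advantage, and that Filtered BC keeps the top fraction of samples by advantage. The names `horizon`, `beta`, `max_weight`, and `keep_fraction`, along with their defaults, are hypothetical and not taken from the page.

```python
import math


def advantages(values, horizon=1):
    """A_t = V_{t+horizon} - V_t, clamped at the end of the trajectory (assumption)."""
    n = len(values)
    return [values[min(t + horizon, n - 1)] - values[t] for t in range(n)]


def awr_weights(adv, beta=0.1, max_weight=20.0):
    """Advantage-Weighted Regression weights: exp(A / beta), clipped for stability."""
    return [min(math.exp(a / beta), max_weight) for a in adv]


def filtered_bc_mask(adv, keep_fraction=0.5):
    """Filtered BC: keep samples whose advantage is in the top fraction.

    Ties at the threshold are all kept, so slightly more than the
    requested fraction may survive.
    """
    k = max(1, int(round(keep_fraction * len(adv))))
    threshold = sorted(adv, reverse=True)[k - 1]
    return [a >= threshold for a in adv]


# Toy trajectory: progress stalls, regresses, then recovers.
values = [0.0, 0.2, 0.2, 0.1, 0.5]
adv = advantages(values)
weights = awr_weights(adv)                        # per-sample loss weights for AWR
mask = filtered_bc_mask(adv, keep_fraction=0.4)   # samples kept for Filtered BC
```

In either case the behavior-cloning loss is unchanged. AWR rescales each sample's loss by its weight, while Filtered BC drops the samples whose mask entry is false.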

What Matters?

World models matter. Ablations over video DiT variants, prefix randomization, and value head designs.

BibTeX

@article{wang2026wvm,
  title   = {World Value Models for Robotic Manipulation},
  author  = {Wang, Zhihao and Li, Jianxiong and Cui, Yu and Gao, Yuan
             and Zhan, Xianyuan and Yu, Junzhi and Ma, Xiao},
  journal = {Preprint},
  year    = {2026}
}