World Value Models for Robotic Manipulation
1ByteDance Seed · 2Peking University · 3Tsinghua University
†Project Lead · §Corresponding Author
World Value Model — marrying world models with value estimation
Abstract
Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding: a model must both ground its current belief in historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, and therefore lack the temporal modeling capabilities that value estimation requires. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we marry world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), which provides accurate task-progress estimates for assessing data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. To complement standard evaluation suites, which contain only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulated and real-world deployment, providing robust guidance for learning from mixed-quality data.
Overview
An overview of our work. CV by Doubao.
WVM Architecture
Architecture of WVM. A video DiT is coupled with a lightweight value DiT through a Mixture-of-Transformers (MoT) architecture. The values are formulated as a chunk of value flow, with value prefix randomization applied during training.
Suboptimal-Value-Bench
Suboptimal-Value-Bench. A multi-embodiment benchmark of 800 fully human-verified trajectories spanning 15 tasks across both simulation and the real world. By featuring two prevalent suboptimal behavioral modes—retries and hesitations—it enables a comprehensive evaluation of generalist value models that extends far beyond the scope of existing metrics.
Value Estimation
Qualitative value estimation. WVM more faithfully reflects the temporal dynamics of hesitation, retry, and expert trajectories.
Retry VOC↑ (higher is better)
| Split | GVL | VLAC | Robometer | TopReward | WVM |
|---|---|---|---|---|---|
| Suboptimal-AgileX | 0.73 | −0.37 | 0.32 | 0.15 | 0.79 |
| Suboptimal-ARX | 0.76 | / | −0.27 | −0.19 | 0.79 |
| Suboptimal-RoboSuite | 0.43 | / | −0.37 | 0.00 | 0.75 |
| Average | 0.62 | −0.37 | −0.16 | 0.00 | 0.78 |
Hesitation RMSE↓ (lower is better)
| Split | GVL | VLAC | Robometer | TopReward | RoboReward | Robo-Dopamine | WVM |
|---|---|---|---|---|---|---|---|
| Suboptimal-AgileX | 0.11 | 0.47 | 0.13 | 0.36 | 0.12 | 0.41 | 0.07 |
| Suboptimal-ARX | 0.14 | 0.50 | 0.12 | 0.24 | 0.17 | 0.52 | 0.05 |
| Suboptimal-RoboSuite | 0.16 | 0.54 | 0.16 | 0.33 | 0.31 | 0.51 | 0.04 |
| Average | 0.14 | 0.51 | 0.14 | 0.31 | 0.21 | 0.49 | 0.05 |
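As a rough illustration of the hesitation metric, the sketch below computes a per-trajectory RMSE between predicted values and per-frame progress labels. It assumes both sequences are normalized to [0, 1] and aligned frame by frame; the function name `progress_rmse` and the example numbers are hypothetical and are not taken from the benchmark code.

```python
import math


def progress_rmse(predicted, labeled):
    """Root-mean-square error between predicted and labeled per-frame progress.

    Assumption: both sequences have one entry per frame, in the same order.
    """
    if len(predicted) != len(labeled) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted, labeled)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hesitation trajectory: labeled progress stalls for three frames.
labeled = [0.0, 0.25, 0.25, 0.25, 0.5, 1.0]
# A predictor that simply ramps linearly with time ignores the stall.
linear_ramp = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
error = progress_rmse(linear_ramp, labeled)
```

A value model that only tracks elapsed time is penalized on the stalled frames, which is why a lower RMSE on this split suggests the model follows actual task progress rather than frame index.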
Expert VOC↑ (higher is better)
| Dataset | GVL | VLAC | Robometer | TopReward | RoboReward | Robo-Dopamine | WVM |
|---|---|---|---|---|---|---|---|
| OXE | 0.67 | 0.48 | 0.63 | 0.19 | 0.92 | 0.72 | 0.94 |
| RoboCOIN | 0.70 | 0.60 | 0.77 | 0.47 | 0.85 | 0.75 | 0.95 |
| EgoDex | 0.82 | 0.62 | 0.86 | 0.37 | 0.95 | 0.88 | 0.92 |
| Self-collected (3 embodiments) | 0.93 | 0.50 | 0.93 | 0.58 | 0.84 | 0.76 | 0.99 |
| Average | 0.78 | 0.59 | 0.81 | 0.42 | 0.88 | 0.82 | 0.95 |
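For expert trajectories, VOC is commonly computed as the Spearman rank correlation between the predicted per-frame values and the chronological frame order, so a value curve that rises monotonically scores 1. The sketch below assumes that definition; the helper names are illustrative and the exact evaluation protocol used for the tables above may differ.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def value_order_correlation(predicted_values):
    """Spearman correlation between predicted values and frame order."""
    frame_order = list(range(len(predicted_values)))
    return pearson(average_ranks(predicted_values), average_ranks(frame_order))
```

Under this definition, a strictly increasing value curve yields a VOC of 1, a strictly decreasing one yields -1, and a constant prediction is treated as uncorrelated.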
Downstream Policy Learning
Task setups. Downstream policy learning experiments are conducted across both simulation and the real world.
Policy improvement. WVM boosts downstream policy learning with both Advantage-Weighted Regression (AWR) and Filtered BC.
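To make the two policy extraction approaches concrete, here is a minimal sketch of how per-frame values could drive them. It assumes a one-step advantage defined as the change in predicted value between consecutive frames; that choice, along with the function names, the temperature, the weight cap, and the threshold, is an assumption for illustration and not the configuration used in the experiments.

```python
import math


def one_step_advantages(values):
    """Advantage proxy A_t = V(s_{t+1}) - V(s_t) (an assumed definition)."""
    return [nxt - cur for cur, nxt in zip(values, values[1:])]


def awr_weights(advantages, temperature=0.1, max_weight=20.0):
    """AWR-style weights exp(A / temperature), clipped for numerical stability."""
    return [min(math.exp(a / temperature), max_weight) for a in advantages]


def filtered_bc_mask(advantages, threshold=0.0):
    """Filtered BC: keep only transitions whose advantage exceeds the threshold."""
    return [a > threshold for a in advantages]


# Hypothetical value curve: progress, a stall, then a regression (failed grasp).
values = [0.0, 0.5, 0.5, 0.25]
advantages = one_step_advantages(values)   # one entry per transition
weights = awr_weights(advantages)          # soft re-weighting of the BC loss
keep = filtered_bc_mask(advantages)        # hard selection of training samples
```

AWR keeps every transition but scales its imitation loss by the weight, whereas Filtered BC discards low-advantage transitions outright; both depend on the value model ranking good and bad segments correctly.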
What Matters?
World models matter. Ablations over video DiT variants, prefix randomization, and value head designs.
BibTeX
@article{wang2026wvm,
title = {World Value Models for Robotic Manipulation},
author = {Wang, Zhihao and Li, Jianxiong and Cui, Yu and Gao, Yuan
and Zhan, Xianyuan and Yu, Junzhi and Ma, Xiao},
journal = {Preprint},
year = {2026}
}