Success rate and grasping rate in percent. Best is bolded and second best is underlined.
Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jm-this.github.io/mvp_lam/.
MVP-LAM learns action-centric latent actions by training on time-synchronized multi-view videos with a cross-viewpoint reconstruction objective. Self-viewpoint reconstruction predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{v})$. Cross-viewpoint reconstruction swaps latent actions across synchronized views and predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{\tilde v})$ for $v \neq \tilde v$.
We measure action-centricity with mutual information between latent actions and ground-truth actions and with a linear probe that predicts actions from latent actions, reporting NMSE. MVP-LAM achieves the highest estimated $\mathcal{I}(Z;A)$ across estimators and the lowest NMSE on Bridge V2.
Pretraining with MVP-LAM latent actions improves downstream manipulation. The average success rate increases from 39.6 percent to 60.4 percent on SIMPLER. On LIBERO-Long, MVP-LAM reaches 90.8 percent success, improving over UniVLA pretrained on Bridge V2 at 79.4 percent.
Success rate and grasping rate in percent. Best is bolded and second best is underlined.
| Success Rate | MVP-LAM | UniVLA | LAPA | OpenVLA | Octo-Small | Octo-Base | $\pi_0$ |
|---|---|---|---|---|---|---|---|
| StackG2Y | 33.3 | 16.7 | 54.2 | 41.6 | 8.3 | 0.0 | 37.5 |
| Carrot2Plate | 66.7 | 20.8 | 45.8 | 50.0 | 33.3 | 37.5 | 33.3 |
| Spoon2Towel | 66.7 | 54.2 | 70.8 | 37.5 | 25.0 | 12.5 | 29.2 |
| Eggplant2Bask | 75.0 | 66.7 | 58.3 | 16.7 | 12.5 | 20.8 | 45.8 |
| AVG | 60.4 | 39.6 | 57.3 | 36.4 | 19.8 | 17.7 | 36.5 |
Success rate in percent on LIBERO suites for VLAs pretrained on OXE (upper) and Bridge V2 (lower). $\ast$ indicates methods that use additional wrist-view images and proprioceptive states. Best is bolded and second best is underlined.
| Method | Spatial | Object | Goal | Long | AVG |
|---|---|---|---|---|---|
| Octo | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| LAPA | 73.8 | 74.6 | 58.8 | 55.4 | 65.7 |
| $\pi_0\ast$ | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| UniVLA | 95.2 | 95.4 | 91.9 | 87.5 | 92.5 |
| MVP-LAM | 96.0 | 94.6 | 94.8 | 90.8 | 94.1 |
We show example discrete codes selected for representative frame transitions. Similar motion patterns tend to activate similar codes across sources.
@misc{lee2026mvplamlearningactioncentriclatent,
title = {MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction},
author = {Jung Min Lee and Dohyeok Lee and Seokhun Ju and Taehyun Cho and Jin Woo Koo and Li Zhao and Sangwoo Hong and Jungwoo Lee},
year = {2026},
eprint = {2602.03668},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2602.03668}
}