MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jm-this.github.io/mvp_lam/.

MVP-LAM learns action-centric latent actions by training on time-synchronized multi-view videos with a cross-viewpoint reconstruction objective. Self-viewpoint reconstruction predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{v})$. Cross-viewpoint reconstruction swaps latent actions across synchronized views and predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{\tilde v})$ for $v \neq \tilde v$.

We measure action-centricity with mutual information between latent actions and ground-truth actions and with a linear probe that predicts actions from latent actions, reporting NMSE. MVP-LAM achieves the highest estimated $\mathcal{I}(Z;A)$ across estimators and the lowest NMSE on Bridge V2.

Pretraining with MVP-LAM latent actions improves downstream manipulation. The average success rate increases from 39.6 percent to 60.4 percent on SIMPLER. On LIBERO-Long, MVP-LAM reaches 90.8 percent success, improving over UniVLA pretrained on Bridge V2 at 79.4 percent.

Success rate and grasping rate in percent. Best is bolded and second best is underlined.

Success Rate	MVP-LAM	UniVLA	LAPA	OpenVLA	Octo-Small	Octo-Base	$\pi_0$
StackG2Y	33.3	16.7	54.2	41.6	8.3	0.0	37.5
Carrot2Plate	66.7	20.8	45.8	50.0	33.3	37.5	33.3
Spoon2Towel	66.7	54.2	70.8	37.5	25.0	12.5	29.2
Eggplant2Bask	75.0	66.7	58.3	16.7	12.5	20.8	45.8
AVG	60.4	39.6	57.3	36.4	19.8	17.7	36.5

Success rate in percent on LIBERO suites for VLAs pretrained on OXE (upper) and Bridge V2 (lower). $\ast$ indicates methods that use additional wrist-view images and proprioceptive states. Best is bolded and second best is underlined.

Method	Spatial	Object	Goal	Long	AVG
Octo	78.9	85.7	84.6	51.1	75.1
OpenVLA	84.7	88.4	79.2	53.7	76.5
LAPA	73.8	74.6	58.8	55.4	65.7
$\pi_0\ast$	96.8	98.8	95.8	85.2	94.2

UniVLA	95.2	95.4	91.9	87.5	92.5
MVP-LAM	96.0	94.6	94.8	90.8	94.1

We show example discrete codes selected for representative frame transitions. Similar motion patterns tend to activate similar codes across sources.

BibTeX

@inproceedings{
lee2026mvplam,
title={{MVP}-{LAM}: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction},
author={Jung Min Lee and Dohyeok Lee and Seokhun Ju and Taehyun Cho and Jin Woo Koo and Li Zhao and Sangwoo Hong and Jungwoo Lee},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=r0SYyjDhxY}
}

MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

Cross-viewpoint reconstruction trains a latent action inferred from one view to explain the future in another view.

Abstract

Method

Experiments

RQ1. Are MVP-LAM latent actions more action-centric

RQ2. Is MVP-LAM effective for manipulation

SIMPLER benchmark

LIBERO benchmark

Visualization

Rollouts

Stack green cube on yellow block

Place carrot on plate

Place spoon on towel

Place eggplant in basket

Put the black bowl in the bottom drawer of the cabinet and close it

Put both moka pots on the stove

Put the yellow and white mug in the microwave and close it

BibTeX