MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

¹Seoul National University, ²Microsoft Research Asia, ³Konkuk University, ⁴HodooAI Labs
Concept figure

Cross-viewpoint reconstruction trains a latent action inferred from one view to explain the future in another view.

Abstract

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but they provide effective supervision only if they remain informative about the underlying ground-truth actions, even though those actions are inaccessible. We propose the Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jm-this.github.io/mvp_lam/.

Method

MVP-LAM learns action-centric latent actions by training on time-synchronized multi-view videos with a cross-viewpoint reconstruction objective. Let $o_t^{v}$ denote the observation from view $v$ at time $t$ and $z_t^{v}$ the latent action inferred from that view. Self-viewpoint reconstruction predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{v})$. Cross-viewpoint reconstruction swaps latent actions across synchronized views and predicts $o_{t+1}^{v}$ from $(o_t^{v}, z_t^{\tilde v})$ for $v \neq \tilde v$.
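The difference between the two reconstruction terms can be illustrated with a minimal sketch. Everything below is a toy assumption made for illustration: `encode` and `decode` are hypothetical stand-ins (a frame difference and an addition) rather than the MVP-LAM networks, observations are short lists of floats rather than images, and the mean squared error is an assumed choice of reconstruction loss.

```python
# Toy sketch of self- vs. cross-viewpoint reconstruction.
# encode/decode are hypothetical stand-ins, NOT the MVP-LAM networks.

def encode(o_t, o_next):
    """Infer a latent action z_t^v from one view's transition (o_t^v, o_{t+1}^v)."""
    return [b - a for a, b in zip(o_t, o_next)]

def decode(o_t, z):
    """Predict the next observation from the current observation and a latent action."""
    return [a + d for a, d in zip(o_t, z)]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def reconstruction_losses(frames_t, frames_next):
    """frames_t[v] and frames_next[v] are time-synchronized observations of view v."""
    views = range(len(frames_t))
    z = [encode(frames_t[v], frames_next[v]) for v in views]
    # Self-viewpoint: view v is reconstructed with its own latent action z[v].
    self_terms = [mse(decode(frames_t[v], z[v]), frames_next[v]) for v in views]
    # Cross-viewpoint: view v is reconstructed with the latent action z[u] of another view.
    cross_terms = [
        mse(decode(frames_t[v], z[u]), frames_next[v])
        for v in views for u in views if u != v
    ]
    return sum(self_terms) / len(self_terms), sum(cross_terms) / len(cross_terms)

# Two synthetic views of the same motion; view 1 is view 0 shifted by a fixed offset.
view0_t, view0_next = [0.0, 1.0, 2.0], [0.5, 1.0, 2.5]
view1_t = [x + 10.0 for x in view0_t]
view1_next = [x + 10.0 for x in view0_next]

self_loss, cross_loss = reconstruction_losses(
    [view0_t, view1_t], [view0_next, view1_next]
)
```

In this toy setup the latent action captures only the motion, so swapping it across views leaves the cross-viewpoint loss at zero. A latent action that also encoded the view offset would reconstruct its own view well but fail when swapped, which is the kind of viewpoint-specific cue the cross-viewpoint term is meant to penalize.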

Architecture figure

Experiments

RQ1. Are MVP-LAM latent actions more action-centric?

We measure action-centricity in two ways: the mutual information between latent actions and ground-truth actions, and a linear probe that predicts actions from latent actions, for which we report the normalized mean squared error (NMSE). MVP-LAM achieves the highest estimated $\mathcal{I}(Z;A)$ across estimators and the lowest NMSE on Bridge V2.
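As a rough illustration of the probing metric, the sketch below fits a linear probe in the simplest scalar case and scores it with NMSE. It rests on assumptions not stated in this section: the latent and the action are each one-dimensional, NMSE is taken as the squared error divided by the total variation of the targets, and the probe is scored on the data it was fitted on. The data are synthetic and the helper names are hypothetical.

```python
from statistics import fmean

def fit_linear_probe(z, a):
    """Least-squares fit of a ≈ w * z + b for scalar latents z and scalar actions a."""
    z_mean, a_mean = fmean(z), fmean(a)
    var_z = sum((x - z_mean) ** 2 for x in z)
    cov_za = sum((x - z_mean) * (y - a_mean) for x, y in zip(z, a))
    w = cov_za / var_z
    b = a_mean - w * z_mean
    return w, b

def nmse(pred, target):
    """Squared error normalized by the target's total variation (assumed definition)."""
    t_mean = fmean(target)
    num = sum((p - t) ** 2 for p, t in zip(pred, target))
    den = sum((t - t_mean) ** 2 for t in target)
    return num / den

def probe_nmse(z, a):
    w, b = fit_linear_probe(z, a)
    return nmse([w * x + b for x in z], a)

# Synthetic data: actions are an exact linear function of the informative latent.
actions = [1.0, 3.0, 5.0, 7.0]
z_informative = [0.0, 1.0, 2.0, 3.0]      # a = 2 * z + 1
z_uninformative = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with the actions

nmse_informative = probe_nmse(z_informative, actions)
nmse_uninformative = probe_nmse(z_uninformative, actions)
```

Under this normalization, an NMSE near 0 means the latent linearly determines the action, while an NMSE near 1 means the probe does no better than predicting the mean action. A real evaluation would typically use vector-valued latents and held-out data.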

Mutual information and NMSE figure

RQ2. Is MVP-LAM effective for manipulation?

Pretraining with MVP-LAM latent actions improves downstream manipulation. On SIMPLER, the average success rate increases from 39.6 percent (UniVLA) to 60.4 percent. On LIBERO-Long, MVP-LAM reaches 90.8 percent success, compared with 79.4 percent for UniVLA pretrained on Bridge V2.

SIMPLER benchmark

Success rate and grasping rate in percent. Best is bolded and second best is underlined.

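| Success Rate | MVP-LAM | UniVLA | LAPA | OpenVLA | Octo-Small | Octo-Base | $\pi_0$ |
|---|---|---|---|---|---|---|---|
| StackG2Y | 33.3 | 16.7 | 54.2 | 41.6 | 8.3 | 0.0 | 37.5 |
| Carrot2Plate | 66.7 | 20.8 | 45.8 | 50.0 | 33.3 | 37.5 | 33.3 |
| Spoon2Towel | 66.7 | 54.2 | 70.8 | 37.5 | 25.0 | 12.5 | 29.2 |
| Eggplant2Bask | 75.0 | 66.7 | 58.3 | 16.7 | 12.5 | 20.8 | 45.8 |
| AVG | 60.4 | 39.6 | 57.3 | 36.4 | 19.8 | 17.7 | 36.5 |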
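The AVG row can be checked with a few lines of arithmetic. The sketch below copies the per-task success rates from the SIMPLER table and recomputes each method's mean. It assumes AVG is the unweighted mean over the four tasks, which the table does not state explicitly. Since the per-task entries are already rounded to one decimal, the recomputed means are only expected to match the reported ones to within rounding.

```python
# Per-task success rates (percent) copied from the SIMPLER table.
# Task order: StackG2Y, Carrot2Plate, Spoon2Towel, Eggplant2Bask.
SUCCESS = {
    "MVP-LAM":    [33.3, 66.7, 66.7, 75.0],
    "UniVLA":     [16.7, 20.8, 54.2, 66.7],
    "LAPA":       [54.2, 45.8, 70.8, 58.3],
    "OpenVLA":    [41.6, 50.0, 37.5, 16.7],
    "Octo-Small": [8.3, 33.3, 25.0, 12.5],
    "Octo-Base":  [0.0, 37.5, 12.5, 20.8],
    "pi_0":       [37.5, 33.3, 29.2, 45.8],
}

# AVG row as reported in the table.
REPORTED_AVG = {
    "MVP-LAM": 60.4, "UniVLA": 39.6, "LAPA": 57.3, "OpenVLA": 36.4,
    "Octo-Small": 19.8, "Octo-Base": 17.7, "pi_0": 36.5,
}

# Assumption: AVG is the unweighted mean over the four tasks.
recomputed = {name: sum(rates) / len(rates) for name, rates in SUCCESS.items()}

# Largest disagreement between recomputed and reported averages.
max_gap = max(abs(recomputed[name] - REPORTED_AVG[name]) for name in SUCCESS)

# Absolute gain of MVP-LAM over UniVLA, in percentage points.
gain_over_univla = REPORTED_AVG["MVP-LAM"] - REPORTED_AVG["UniVLA"]
```

The recomputed means agree with the reported AVG row to within rounding of the per-task entries, and the gain of MVP-LAM over UniVLA comes to about 20.8 percentage points.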

LIBERO benchmark

Success rate in percent on LIBERO suites for VLAs pretrained on OXE (upper) and Bridge V2 (lower). $\ast$ indicates methods that use additional wrist-view images and proprioceptive states. Best is bolded and second best is underlined.

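| Method | Spatial | Object | Goal | Long | AVG |
|---|---|---|---|---|---|
| Octo | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| LAPA | 73.8 | 74.6 | 58.8 | 55.4 | 65.7 |
| $\pi_0\ast$ | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| UniVLA | 95.2 | 95.4 | 91.9 | 87.5 | 92.5 |
| MVP-LAM | 96.0 | 94.6 | 94.8 | 90.8 | 94.1 |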

Visualization

We show example discrete codes selected for representative frame transitions. Similar motion patterns tend to activate similar codes across sources.
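One common way to obtain discrete codes is nearest-neighbor lookup in a learned codebook, as in vector quantization. This section does not say how MVP-LAM selects its codes, so the sketch below is only an assumed illustration: the two-dimensional codebook, the `quantize` helper, and the example latent vectors are all hypothetical.

```python
def quantize(z, codebook):
    """Return the index of the codebook entry closest to z (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((zi - ci) ** 2 for zi, ci in zip(z, entry))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

# Hypothetical codebook with four entries: right, left, up, down.
codebook = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

code_a = quantize([0.9, 0.1], codebook)   # rightward motion from one source
code_b = quantize([1.2, -0.2], codebook)  # similar rightward motion from another source
code_c = quantize([0.1, 0.8], codebook)   # upward motion
```

In this toy example the two rightward motions select the same code and the upward motion selects a different one, which mirrors the qualitative pattern described above.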

Latent action visualization

Rollouts

Stack green cube on yellow block

MVP-LAM: Success
UniVLA: Fail
Octo-B: Fail
$\pi_0$: Fail

Place carrot on plate

MVP-LAM: Success
UniVLA: Fail
Octo-B: Success
$\pi_0$: Success

Place spoon on towel

MVP-LAM: Success
UniVLA: Success
Octo-B: Success
$\pi_0$: Fail

Place eggplant in basket

MVP-LAM: Success
UniVLA: Success
Octo-B: Fail
$\pi_0$: Success

Put the black bowl in the bottom drawer of the cabinet and close it

MVP-LAM: Success
UniVLA: Success
$\pi_0$: Fail

Put both moka pots on the stove

MVP-LAM: Success
UniVLA: Fail
$\pi_0$: Success

Put the yellow and white mug in the microwave and close it

MVP-LAM: Success
UniVLA: Fail
$\pi_0$: Success

BibTeX

@misc{lee2026mvplamlearningactioncentriclatent,
  title     = {MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction},
  author    = {Jung Min Lee and Dohyeok Lee and Seokhun Ju and Taehyun Cho and Jin Woo Koo and Li Zhao and Sangwoo Hong and Jungwoo Lee},
  year      = {2026},
  eprint    = {2602.03668},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url       = {https://arxiv.org/abs/2602.03668}
}