Generating action space data
Using retargeting, we generate cross-modal correspondences between different action spaces.
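As an illustrative sketch of what such correspondences look like, the snippet below linearly retargets a normalized parallel-gripper closure value to joint targets for a multi-fingered hand and collects the resulting action pairs. The joint limits and the linear mapping are invented placeholders for illustration, not the paper's actual retargeting procedure.

```python
# Hypothetical retargeting sketch: map one normalized gripper closure
# value in [0, 1] to joint-angle targets for a multi-fingered hand.
# The per-joint (min, max) flexion limits below are invented placeholders.
GRIPPER_OPEN, GRIPPER_CLOSED = 0.0, 1.0
HAND_JOINT_LIMITS = [(0.0, 1.6), (0.0, 1.6), (0.0, 1.5), (0.0, 1.5), (0.0, 1.2)]

def retarget_gripper_to_hand(closure: float) -> list[float]:
    """Linearly retarget gripper closure in [0, 1] to hand joint angles."""
    closure = min(max(closure, GRIPPER_OPEN), GRIPPER_CLOSED)
    return [lo + closure * (hi - lo) for lo, hi in HAND_JOINT_LIMITS]

# Cross-modal correspondences: (gripper action, hand action) pairs.
pairs = [(c / 10, retarget_gripper_to_hand(c / 10)) for c in range(11)]
```

Each pair then serves as aligned training data for the two action spaces.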
To learn latent actions, we use a pairwise contrastive loss that keeps paired inputs aligned in the latent space.
We train decoders for each modality to reconstruct ground-truth end-effector poses from the encoded latents.
We then learn cross-embodiment policies in this latent action space; at inference time, we use the embodiment-specific decoders.
We demonstrate our method for learning a latent action space by encoding parallel gripper actions from 0 to 1 and decoding them to actions for the Faive hand and the human hand.
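The round trip through a shared latent space can be sketched with toy linear maps: a (hypothetical) encoder lifts the gripper closure scalar into a latent, and per-embodiment decoders read that same latent back out as actions. All weights below are invented placeholders; the paper's encoders and decoders are learned networks.

```python
# Toy round trip through a shared latent action space.
ENC_W = [0.8, -0.3]        # gripper closure scalar -> 2-D latent (invented)

def encode_gripper(closure: float) -> list[float]:
    return [w * closure for w in ENC_W]

# Embodiment-specific decoders read the shared latent back out.
DEC_GRIPPER = [1.25, 0.0]  # recovers closure (1/0.8 on the first axis)
DEC_HAND_J0 = [2.0, 0.0]   # first hand joint, flexing 0..1.6 rad (invented)

def decode(weights: list[float], z: list[float]) -> float:
    return sum(w * v for w, v in zip(weights, z))

for closure in (0.0, 0.5, 1.0):
    z = encode_gripper(closure)
    grip = decode(DEC_GRIPPER, z)   # ~= closure
    joint = decode(DEC_HAND_J0, z)  # ~= 1.6 * closure
```

The key property is that one latent trajectory drives both embodiments, which is what allows a single policy to act in the shared space.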
We roll out single-embodiment policies trained with explicit actions and latent actions for both embodiments. The single-embodiment policies were trained with 100 demonstrations per embodiment. The cross-embodiment policy was co-trained on 100 episodes of data from both embodiments with our proposed methodology. As observations, we use the same external camera for both embodiments.
With the cross-embodiment policy, we can control both end-effectors with a single policy. Additionally, it exhibits improved manipulation skills: the co-trained policy improves upon the baselines by 10% and 7.5% for the Faive hand and Franka gripper, respectively. We also observe a reduced standard deviation, indicating improved spatial generalization.
The single-embodiment policies were trained with 250 demonstrations per embodiment. The cross-embodiment policy was trained on all 500 episodes. As observations, we use an external camera and the arm pose. Furthermore, for the mimic hand, we use a wrist camera, which is replaced by zero-padding for the Franka gripper.
For the mimic hand, the cross-embodiment policy outperforms single-embodiment baselines by 13%. For the Franka gripper, performance decreases with the co-trained policy; we attribute this to the asymmetric observations. This observation is consistent with prior work on co-training with missing camera views.
@misc{bauer2025latentactiondiffusioncrossembodiment,
      title={Latent Action Diffusion for Cross-Embodiment Manipulation},
      author={Erik Bauer and Elvis Nava and Robert K. Katzschmann},
      year={2025},
      eprint={2506.14608},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.14608},
}