Generating action space data
Using retargeting, we generate cross-modal correspondences between different action spaces.
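As an illustrative sketch of what such correspondences look like, the snippet below linearly retargets a normalized parallel-gripper closure value to joint targets for a multi-fingered hand and collects the resulting action pairs. The joint limits and the linear mapping are invented placeholders for illustration, not the paper's actual retargeting procedure.

```python
# Hypothetical retargeting sketch: map one normalized gripper closure
# value in [0, 1] to joint-angle targets for a multi-fingered hand.
# The per-joint (min, max) flexion limits below are invented placeholders.
GRIPPER_OPEN, GRIPPER_CLOSED = 0.0, 1.0
HAND_JOINT_LIMITS = [(0.0, 1.6), (0.0, 1.6), (0.0, 1.5), (0.0, 1.5), (0.0, 1.2)]

def retarget_gripper_to_hand(closure: float) -> list[float]:
    """Linearly retarget gripper closure in [0, 1] to hand joint angles."""
    closure = min(max(closure, GRIPPER_OPEN), GRIPPER_CLOSED)
    return [lo + closure * (hi - lo) for lo, hi in HAND_JOINT_LIMITS]

# Cross-modal correspondences: (gripper action, hand action) pairs.
pairs = [(c / 10, retarget_gripper_to_hand(c / 10)) for c in range(11)]
```

Each pair then serves as aligned training data for the two action spaces.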
To learn latent actions, we use a pairwise contrastive loss that keeps paired inputs aligned in the latent space.
We train decoders for each modality to reconstruct ground-truth end-effector poses from the encoded latents.
We then learn cross-embodiment policies in this latent action space; at inference time, we use the embodiment-specific decoders.
We demonstrate our method for learning a latent action space by encoding parallel gripper actions from 0 to 1 and decoding them to actions for the Faive hand and the human hand.
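The round trip through a shared latent space can be sketched with toy linear maps: a (hypothetical) encoder lifts the gripper closure scalar into a latent, and per-embodiment decoders read that same latent back out as actions. All weights below are invented placeholders; the paper's encoders and decoders are learned networks.

```python
# Toy round trip through a shared latent action space.
ENC_W = [0.8, -0.3]        # gripper closure scalar -> 2-D latent (invented)

def encode_gripper(closure: float) -> list[float]:
    return [w * closure for w in ENC_W]

# Embodiment-specific decoders read the shared latent back out.
DEC_GRIPPER = [1.25, 0.0]  # recovers closure (1/0.8 on the first axis)
DEC_HAND_J0 = [2.0, 0.0]   # first hand joint, flexing 0..1.6 rad (invented)

def decode(weights: list[float], z: list[float]) -> float:
    return sum(w * v for w, v in zip(weights, z))

for closure in (0.0, 0.5, 1.0):
    z = encode_gripper(closure)
    grip = decode(DEC_GRIPPER, z)   # ~= closure
    joint = decode(DEC_HAND_J0, z)  # ~= 1.6 * closure
```

The key property is that one latent trajectory drives both embodiments, which is what allows a single policy to act in the shared space.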
We roll out single-embodiment policies trained with explicit actions and latent actions for both embodiments. The single-embodiment policies were trained with 100 demonstrations per embodiment. The cross-embodiment policy was co-trained on 100 episodes of data from both embodiments with our proposed methodology. As observations, we use the same external camera for both embodiments.
With the cross-embodiment policy, we can control both end-effectors with a single policy. Additionally, it exhibits improved manipulation skills: the co-trained policy improves upon the baselines by 10% and 7.5% for the Faive hand and Franka gripper, respectively. We also observe a reduced standard deviation, indicating improved spatial generalization.
The single-embodiment policies were trained with 250 demonstrations per embodiment. The cross-embodiment policy was trained on all 500 episodes. As observations, we use an external camera and the arm pose. Furthermore, for the mimic hand, we use a wrist camera, which is replaced by zero-padding for the Franka gripper.
For the mimic hand, the cross-embodiment policy outperforms single-embodiment baselines by 13%. For the Franka gripper, performance decreases with the co-trained policy; we attribute this to the asymmetric observations. This observation is consistent with prior work on co-training with missing camera views.
@misc{bauer2025latentactiondiffusioncrossembodiment,
      title={Latent Action Diffusion for Cross-Embodiment Manipulation},
      author={Erik Bauer and Elvis Nava and Robert K. Katzschmann},
      year={2025},
      eprint={2506.14608},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.14608},
}