Abstract
Dual-robot collaboration enables tasks that exceed the reach and payload of a single robot, such as collaboratively transporting objects across environments and executing coordinated handovers. Data acquisition is the primary bottleneck for training these systems. To this end, we introduce DUET, a dual-robot learning framework for mobile manipulation. For efficient data collection, we build a unified, synchronized VR teleoperation system that collects in-domain data from two heterogeneous robots. We further develop a complementary tracking pipeline that records human-human demonstrations, capturing priors on coordination and collaborative mobile manipulation. For efficient learning, we introduce an architecture based on the Action Chunking Transformer that first pretrains collaborative policies on efficiently collected human-human demonstrations, then finetunes them on a minimal set of real-robot teleoperation trajectories. We develop a benchmark of four collaborative tasks to evaluate our framework using a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator. The results demonstrate that harnessing human priors not only yields superior task performance compared to baselines trained only on robot data, but also reduces the total human effort required for data collection. Our human data collection pipeline is 5.4× faster on average than teleoperation, yet the resulting policies perform as well as or better than policies trained only on robot data across all tasks.
The Problem
Collaborative tasks demand tight spatial-temporal coordination, but collecting demonstrations for a heterogeneous robot team requires complex teleoperation setups, multiple operators, and robot time.
Key Idea
Pretrain a single policy on two humans doing the task together with no robot in the loop, then finetune it on a small set of dual-robot teleoperation data.
Result
Human data is 5.4× faster to collect, and 60 human + 30 robot demonstrations match or beat baselines trained on 50 robot-only demonstrations across four contact-rich tasks.
Method
Two data streams, one policy: synchronized VR teleoperation supplies high-fidelity robot data, human-human capture supplies collaborative priors, and a two-stage ACT policy ties them together.
Two operators, one robot duo
Our synchronized VR teleoperation framework lets just two people command the duo: PICO VR interfaces drive the Unitree G1 humanoid through the SONIC whole-body tracking framework, and the Dexmate Vega1 mobile manipulator through General Motion Retargeting (GMR), while both robots stream real-time egocentric video and log synchronized data at 10 Hz.
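As a rough illustration of what logging two robots on a common 10 Hz timeline can involve, the sketch below resamples two independently timestamped streams onto a shared grid by nearest-timestamp lookup. The nearest-neighbor strategy, the function names `nearest` and `align`, and the sample values are assumptions made for illustration; the page does not describe how DUET synchronizes its logs.

```python
from bisect import bisect_left

def nearest(timestamps, t):
    """Index of the entry in sorted `timestamps` closest to time t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbor is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align(stream_a, stream_b, rate_hz=10.0):
    """Pair samples from two time-sorted (timestamp, sample) streams.

    Returns a list of (t, sample_a, sample_b) on a grid spaced 1/rate_hz
    apart, covering only the interval where both streams have data.
    """
    ts_a = [t for t, _ in stream_a]
    ts_b = [t for t, _ in stream_b]
    start = max(ts_a[0], ts_b[0])
    end = min(ts_a[-1], ts_b[-1])
    n_steps = int((end - start) * rate_hz + 1e-9) + 1
    paired = []
    for k in range(n_steps):
        t = start + k / rate_hz
        paired.append((t, stream_a[nearest(ts_a, t)][1],
                       stream_b[nearest(ts_b, t)][1]))
    return paired

# Hypothetical logs: timestamps in seconds, samples are placeholder labels.
g1_log = [(0.00, "g0"), (0.03, "g1"), (0.11, "g2"), (0.19, "g3"), (0.31, "g4")]
vega_log = [(0.02, "v0"), (0.12, "v1"), (0.20, "v2"), (0.33, "v3")]
synced = align(g1_log, vega_log)
```

In practice the samples would be full observation and action records rather than labels, and a real system might interpolate or rely on hardware-level synchronization instead.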
Learning from human collaboration
Teleoperation remains bottlenecked by time, cost, and operator effort. To scale up, two humans simply perform the collaborative task themselves: no robots in the loop. Three cameras record the scene: a head-mounted camera on each operator plus a fixed third-person master view. A tracking pipeline built on YOLO and SAM 3D Body lifts both people into metric-scale 3D, naturally encoding rich, transferable priors about human-human coordination.
One policy, two embodiments, two data streams
DUET trains a single policy for the whole duo, in two stages that share one architecture but differ in data source. To bridge the embodiment gap, human and robot data are projected into a shared pose space, with all poses expressed in head-relative coordinates for both humans and robots.
Stage 1: Pretrain on human collaboration. The policy backbone learns the tight spatial-temporal coupling of contact-rich collaboration from egocentric RGB streams and 3D keypoints of human-human demonstrations.
Stage 2: Finetune on robot data. The same architecture is then finetuned on a minimal set of dual-robot teleoperation trajectories.
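The head-relative pose space mentioned above can be illustrated with a minimal sketch that re-expresses a world-frame keypoint in the head frame. The rotation convention (a 3x3 matrix mapping head-frame vectors to world-frame vectors), the helper names, and the example values are assumptions for illustration, not details taken from the DUET implementation.

```python
import math

def transpose(R):
    """Transpose a 3x3 matrix stored as nested lists."""
    return [[R[j][i] for j in range(3)] for i in range(3)]

def matvec(R, v):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def to_head_frame(p_world, head_pos, head_rot):
    """Express a world-frame point in the head frame.

    head_rot maps head-frame vectors to world-frame vectors, so its
    transpose (the inverse of a rotation) maps world back to head.
    """
    offset = [p_world[i] - head_pos[i] for i in range(3)]
    return matvec(transpose(head_rot), offset)

def yaw(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical example: the head is turned 90 degrees to the left, and the
# wrist sits 0.5 m along the head's x-axis and 0.3 m below it.
head_pos = [1.0, 2.0, 1.5]
head_rot = yaw(math.pi / 2)
wrist_world = [1.0, 2.5, 1.2]
wrist_head = to_head_frame(wrist_world, head_pos, head_rot)
```

Because the same function applies whether the head pose comes from a tracked human or from a robot's kinematics, both data sources end up in one coordinate convention, which is the property the shared pose space relies on.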
Our policy builds on the Action Chunking Transformer (ACT), a transformer-based visuomotor architecture that predicts chunks of future actions. DUET shares one ACT backbone across both embodiments: observations from both robots are tokenized into a common sequence, and two parallel decoder heads output a shared pose component and a robot-only joint-space component. The framework is agnostic to this choice: the backbone can be substituted with alternative visuomotor architectures without changing the data collection pipeline.
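To make "predicts chunks of future actions" concrete, here is a minimal sketch of how per-timestep chunk targets could be built from a demonstration. The chunk size, the repeat-last-action padding, and the validity mask are assumptions chosen for illustration; the page does not specify DUET's chunk length or padding scheme.

```python
def make_action_chunks(actions, chunk_size):
    """Build one target chunk per timestep of a demonstration.

    For timestep t the target is actions[t : t + chunk_size]. Near the end
    of the episode the chunk is padded by repeating the final action, and a
    mask marks which entries are real (True) versus padding (False).
    """
    chunks, masks = [], []
    for t in range(len(actions)):
        chunk = list(actions[t:t + chunk_size])
        n_valid = len(chunk)
        chunk += [actions[-1]] * (chunk_size - n_valid)
        chunks.append(chunk)
        masks.append([True] * n_valid + [False] * (chunk_size - n_valid))
    return chunks, masks

# Hypothetical 4-step episode with 1-D actions and a chunk size of 3.
demo_actions = [[0.0], [0.1], [0.2], [0.3]]
chunks, masks = make_action_chunks(demo_actions, chunk_size=3)
```

A training loss would typically be computed only where the mask is true, so padded entries do not contribute.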
Dataset
Our benchmark dataset covers four collaborative tasks, each collected through both pipelines: dual-robot teleoperation and human-human demonstration.
| Task | Pipeline | Total Time (min) | Episodes |
|---|---|---|---|
| T1: Trash Collection | Teleoperation | 43.0 | 86 |
| | Human Demonstration | 35.7 | 107 |
| T2: Box Organization | Teleoperation | 71.0 | 71 |
| | Human Demonstration | 53.3 | 80 |
| T3: Board Tilting | Teleoperation | 22.0 | 66 |
| | Human Demonstration | 21.5 | 99 |
| T4: Doll Passing | Teleoperation | 31.5 | 63 |
| | Human Demonstration | 26.3 | 79 |
| Total | Teleoperation | 167.5 | 286 |
| | Human Demonstration | 136.8 | 365 |
Total data span and number of episodes per pipeline across the four benchmark tasks.
Amortized collection time (seconds per successful trajectory). Direct human demonstration provides a 5.4× acceleration on average over our optimized dual-robot teleoperation pipeline.
Results
Our heterogeneous duo pairs a Unitree G1 humanoid with a Dexmate Vega1 mobile manipulator, evaluated on four contact-rich collaborative tasks. Each policy is evaluated over 10 independent hardware trials per task, and every trial is scored from 0.0 to 1.0 through two sequential milestones worth 0.5 points each (e.g., for Doll Passing: the initial grasp by Vega1, then the dual-robot handover). Points are the cumulative score over the 10 trials (out of a maximum of 10), capturing partial progress; the success rate counts only trials where all milestones are completed. Across all four tasks, pretraining on 60 human demonstrations and finetuning on just 30 robot trajectories achieves equal or better points and success rates than baselines trained on 50 robot-only demonstrations, while requiring less total data collection effort. Qualitatively, the human-pretrained policy executes more smoothly and adheres better to object affordance priors implicitly captured during human data collection.
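The scoring scheme described above can be sketched as follows. The trial outcomes in the example are made up purely to exercise the code and are not reported results; treating a later milestone as unreachable once an earlier one fails is our reading of "sequential".

```python
def score_trial(milestones):
    """Score one trial: 0.5 per milestone, stopping at the first failure."""
    score = 0.0
    for completed in milestones:
        if not completed:
            break
        score += 0.5
    return score

def summarize(trials):
    """Return (points, success_rate) over a list of trials.

    points is the cumulative score across trials; success_rate counts only
    trials in which every milestone was completed.
    """
    points = sum(score_trial(m) for m in trials)
    successes = sum(1 for m in trials if all(m))
    return points, successes / len(trials)

# Hypothetical outcomes for 10 trials with two milestones each:
# 6 full successes, 3 trials that only reach the first milestone, 1 failure.
example_trials = [(True, True)] * 6 + [(True, False)] * 3 + [(False, False)]
points, success_rate = summarize(example_trials)
```

With these made-up outcomes the policy earns 7.5 points out of 10 but a success rate of 0.6, which shows how points capture partial progress that the success rate ignores.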
| Method | T1 Trash Collection | T2 Box Organization | T3 Board Tilting | T4 Doll Passing |
|---|---|---|---|---|
| DUET (Ours) | 200.95 | 206.50 | 111.65 | 106.20 |
| Robot-only (50) | 268.42 | 236.83 | 139.92 | 116.00 |
| Robot-only (30) | 161.05 | 142.10 | 83.95 | 69.60 |
Data collection effort in minutes. Effort is the total time expected to collect each policy's training dataset: the number of human and robot demonstrations used, each weighted by its amortized collection time. DUET achieves equal or better performance while requiring less collection effort than policies trained on 50 robot-only demonstrations.
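The effort figures can be reproduced with a small sketch, assuming effort is the linear sum the caption describes and that DUET uses 60 human plus 30 robot demonstrations per task. The per-demonstration times computed here are back-calculated from the table, not values reported on this page, and the names are illustrative.

```python
# Effort values in minutes, copied from the table above.
DUET_EFFORT = {"T1": 200.95, "T2": 206.50, "T3": 111.65, "T4": 106.20}
ROBOT_ONLY_50 = {"T1": 268.42, "T2": 236.83, "T3": 139.92, "T4": 116.00}
ROBOT_ONLY_30 = {"T1": 161.05, "T2": 142.10, "T3": 83.95, "T4": 69.60}

def effort(n_human, n_robot, t_human, t_robot):
    """Total collection effort: demo counts weighted by per-demo time."""
    return n_human * t_human + n_robot * t_robot

def implied_times(task):
    """Back out per-demo times (minutes) implied by the table for a task.

    Assumes the robot-only (50) row is 50 * t_robot and the DUET row is
    60 * t_human + 30 * t_robot.
    """
    t_robot = ROBOT_ONLY_50[task] / 50
    t_human = (DUET_EFFORT[task] - 30 * t_robot) / 60
    return t_human, t_robot

# Consistency check: the robot-only (30) row should equal 30 * t_robot.
recomputed_30 = {
    task: effort(0, 30, *implied_times(task)) for task in ROBOT_ONLY_30
}
# Minutes saved by DUET relative to the 50-demo robot-only baseline.
savings = {task: ROBOT_ONLY_50[task] - DUET_EFFORT[task] for task in DUET_EFFORT}
```

For T1, the back-calculated times are about 5.37 min per robot demonstration and about 0.66 min per human demonstration, and the recomputed robot-only (30) values agree with the table to within rounding.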
Zero-shot generalizability
To evaluate robustness, we test DUET under out-of-distribution conditions with no additional training. Despite the severity of these distribution shifts, the policy still coordinates the duo to complete the task zero-shot.
BibTeX