DUET: Dual-Robot Understanding via Efficient Teaching

Abstract

Dual-robot collaboration enables tasks that exceed the reach and payload of a single robot, such as jointly transporting objects across environments and executing coordinated handovers. Data acquisition is the primary bottleneck for training these systems. To address it, we introduce DUET, a dual-robot learning framework for mobile manipulation. For efficient data collection, we build a unified, synchronized VR teleoperation system that collects in-domain data from two heterogeneous robot embodiments. We further develop a complementary tracking pipeline that records human-human coordination and collaborative mobile manipulation priors. For efficient learning, we introduce an architecture based on the Action Chunking Transformer (ACT) that first pretrains collaborative policies on quickly collected human-human demonstrations, then finetunes them on a minimal set of real-robot teleoperation trajectories. We develop a benchmark of four collaborative tasks to evaluate our framework using a Unitree G1 humanoid and a Dexmate Vega1 mobile manipulator. The results demonstrate that harnessing human priors not only yields superior task performance compared to baselines trained only on robot data, but also reduces the total human effort required for data collection. Our human data collection pipeline is 5.4× faster on average than teleoperation, and the resulting policies match or outperform policies trained only on robot data across all tasks.

01

The Problem

Collaborative tasks demand tight spatial-temporal coordination, but collecting demonstrations for a heterogeneous robot team requires complex teleoperation setups, multiple operators, and robot time.

02

Key Idea

Pretrain a single policy on two humans doing the task together with no robot in the loop, then finetune it on a small set of dual-robot teleoperation data.

03

Result

Human data is 5.4× faster to collect, and 60 human + 30 robot demonstrations match or beat baselines trained on 50 robot-only demonstrations across four contact-rich tasks.

Method

Two data streams, one policy: synchronized VR teleoperation supplies high-fidelity robot data, human-human capture supplies collaborative priors, and a two-stage ACT policy ties them together.

Two operators, one robot duo

**The DUET synchronized dual-robot VR teleoperation pipeline.**

Our synchronized VR teleoperation framework lets just two people command the duo: PICO VR interfaces drive the Unitree G1 humanoid through the SONIC whole-body tracking framework, and the Dexmate Vega1 mobile manipulator through General Motion Retargeting (GMR), while both robots stream real-time egocentric video and log synchronized data at 10 Hz.
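As a rough illustration of what logging synchronized data at 10 Hz can involve, the sketch below resamples two independently timestamped streams (one per robot) onto a shared 10 Hz timeline using nearest-timestamp lookup. The function names, the `(timestamp, sample)` stream layout, and the nearest-neighbor strategy are all assumptions made for illustration; the page does not describe how DUET's logger aligns the two robots.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Index of the timestamp closest to t (timestamps must be sorted)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbor is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def synchronize(stream_a, stream_b, rate_hz=10.0):
    """Pair samples from two streams on a common fixed-rate grid.

    Hypothetical helper: each stream is a time-sorted list of
    (timestamp_seconds, sample) tuples, e.g. one stream per robot.
    """
    ts_a = [t for t, _ in stream_a]
    ts_b = [t for t, _ in stream_b]
    start = max(ts_a[0], ts_b[0])  # keep only the overlapping window
    end = min(ts_a[-1], ts_b[-1])
    step = 1.0 / rate_hz
    frames = []
    k = 0
    while start + k * step <= end + 1e-9:
        t = start + k * step
        a = stream_a[nearest_index(ts_a, t)][1]
        b = stream_b[nearest_index(ts_b, t)][1]
        frames.append((round(t, 6), a, b))
        k += 1
    return frames
```

Nearest-timestamp pairing keeps the example short; a real logger might instead interpolate proprioceptive signals or trigger both robots from a shared clock.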

Learning from human collaboration

**Human-human data collection system.** Two operators with head/neck-mounted cameras plus a fixed third-person master view.

Teleoperation remains bottlenecked by time, cost, and operator effort. To scale up, two humans simply perform the collaborative task themselves: no robots in the loop. Three cameras record the scene: a head-mounted camera on each operator plus a fixed third-person master view. A tracking pipeline built on YOLO and SAM 3D Body lifts both people into metric-scale 3D, naturally encoding rich, transferable priors about human-human coordination.

One policy, two embodiments, two data streams

DUET trains a single policy for the whole duo, in two stages that share one architecture but differ in data source. To bridge the embodiment gap, human and robot data are projected into a shared pose space, with all poses expressed in head-relative coordinates for both humans and robots.

Stage 1: Pretrain on human collaboration. The policy backbone learns the tight spatial-temporal coupling of contact-rich collaboration from egocentric RGB streams and 3D keypoints of human-human demonstrations.

Stage 2: Finetune on robot data. The same architecture is then finetuned on a minimal set of dual-robot teleoperation trajectories.
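The head-relative projection mentioned above can be sketched as a standard rigid-body change of frame: a world-frame keypoint is translated by the head position and rotated by the inverse of the head orientation. This is a minimal sketch under assumptions; the function names and the rotation-matrix convention are illustrative, and the page does not specify DUET's exact pose representation.

```python
import math

def to_head_frame(point_world, head_position, head_rotation):
    """Express a world-frame 3D point in the head's local frame.

    Assumed convention: head_rotation is a 3x3 row-major rotation matrix
    whose columns are the head's x, y, z axes written in world
    coordinates, so the inverse transform is p_head = R^T (p_world - t).
    """
    d = [p - t for p, t in zip(point_world, head_position)]
    # Multiply by the transpose: component j is the dot product of d
    # with column j of the rotation matrix.
    return [sum(head_rotation[i][j] * d[i] for i in range(3)) for j in range(3)]

def yaw_rotation(yaw_rad):
    """Rotation about the world z-axis (hypothetical helper for the demo)."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Example: a head at (1, 2, 1.5) yawed by 90 degrees, and a wrist keypoint
# 0.5 m along the world y-axis and 0.3 m below the head.
wrist_in_head = to_head_frame(
    [1.0, 2.5, 1.2], [1.0, 2.0, 1.5], yaw_rotation(math.pi / 2)
)
```

In this example the wrist comes out at roughly (0.5, 0, -0.3) in the head frame. Applying the same transform to a human's tracked keypoints and to a robot's end-effector poses is what would place both in one coordinate system, regardless of where either agent stands in the room.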

**The DUET architecture.** A shared ACT backbone with parallel decoder heads for the shared pose space and robot-only joint space.

Our policy builds on the Action Chunking Transformer (ACT), a transformer-based visuomotor architecture that predicts chunks of future actions. DUET shares one ACT backbone across both embodiments: observations from both robots are tokenized into a common sequence, and two parallel decoder heads output a shared pose component and a robot-only joint-space component. The framework is agnostic to this choice: the backbone can be substituted with alternative visuomotor architectures without changing the data collection pipeline.
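One plausible way to train two parallel heads from two data sources is to supervise only the heads for which a sample has targets. The sketch below assumes that human-human samples provide shared-pose targets but no joint-space targets, and that an L1 regression loss is used on action chunks. Both are assumptions for illustration, as are the function names and the `joint_weight` parameter; the page does not state DUET's actual loss.

```python
def l1(pred, target):
    """Mean absolute error between two equal-length flat action chunks."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def chunk_loss(pred_pose, pred_joints, target_pose, target_joints=None,
               joint_weight=1.0):
    """Combine the losses of the shared-pose head and the joint-space head.

    Assumption: human-human samples carry no joint targets, so
    target_joints is None during Stage 1 and only the shared pose head
    is supervised. Stage 2 robot samples supervise both heads.
    """
    loss = l1(pred_pose, target_pose)
    if target_joints is not None:
        loss += joint_weight * l1(pred_joints, target_joints)
    return loss

# Stage 1 style sample (human data): pose target only.
stage1 = chunk_loss([0.0, 1.0], [0.5], [0.0, 0.0])
# Stage 2 style sample (robot data): pose and joint targets.
stage2 = chunk_loss([0.0, 1.0], [0.5], [0.0, 0.0], [1.0])
```

Under this scheme the backbone and the shared pose head see every sample from both stages, while the robot-only head is shaped only by teleoperation data.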

Dataset

Our benchmark dataset covers four collaborative tasks, each collected through both pipelines: dual-robot teleoperation and human-human demonstration.

Task	Pipeline	Total Time (min)	Episodes
T1: Trash Collection	Teleoperation	43.0	86
T1: Trash Collection	Human Demonstration	35.7	107
T2: Box Organization	Teleoperation	71.0	71
T2: Box Organization	Human Demonstration	53.3	80
T3: Board Tilting	Teleoperation	22.0	66
T3: Board Tilting	Human Demonstration	21.5	99
T4: Doll Passing	Teleoperation	31.5	63
T4: Doll Passing	Human Demonstration	26.3	79
Total	Teleoperation	167.5	286
Total	Human Demonstration	136.8	365

Total data span and number of episodes per pipeline across the four benchmark tasks.

Task	Teleoperation (s)	Human Demonstration (s)	Speedup
T1: Trash Collection	322.1	39.9	8.1× faster
T2: Box Organization	284.2	64.4	4.4× faster
T3: Board Tilting	167.9	27.7	6.1× faster
T4: Doll Passing	139.2	36.6	3.8× faster
Average	228.4	42.2	5.4× faster

Amortized collection time (seconds per successful trajectory). Direct human demonstration provides a 5.4× acceleration on average over our optimized dual-robot teleoperation pipeline.
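The speedup figures can be checked directly from the amortized times. The short script below recomputes the per-task ratios and shows that the reported 5.4× average corresponds to the ratio of the mean times (228.4 s / 42.2 s); averaging the four per-task ratios instead gives about 5.6×. The variable and function names are illustrative.

```python
# Amortized collection time in seconds per successful trajectory, as
# reported above: (teleoperation, human demonstration).
COLLECTION_TIME_S = {
    "T1: Trash Collection": (322.1, 39.9),
    "T2: Box Organization": (284.2, 64.4),
    "T3: Board Tilting": (167.9, 27.7),
    "T4: Doll Passing": (139.2, 36.6),
}

def speedups(times):
    """Per-task speedup of human demonstration over teleoperation."""
    return {task: teleop / human for task, (teleop, human) in times.items()}

def ratio_of_means(times):
    """Mean teleoperation time divided by mean human-demonstration time."""
    teleop_mean = sum(t for t, _ in times.values()) / len(times)
    human_mean = sum(h for _, h in times.values()) / len(times)
    return teleop_mean / human_mean

def mean_of_ratios(times):
    """Unweighted average of the per-task speedups."""
    ratios = list(speedups(times).values())
    return sum(ratios) / len(ratios)

for task, s in speedups(COLLECTION_TIME_S).items():
    print(f"{task}: {s:.1f}x faster")
print(f"ratio of means: {ratio_of_means(COLLECTION_TIME_S):.1f}x")
print(f"mean of ratios: {mean_of_ratios(COLLECTION_TIME_S):.1f}x")
```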

Results

Our heterogeneous duo pairs a Unitree G1 humanoid with a Dexmate Vega1 mobile manipulator, evaluated on four contact-rich collaborative tasks. Each policy is evaluated over 10 independent hardware trials per task, and every trial is scored from 0.0 to 1.0 through two sequential milestones worth 0.5 points each (e.g., for Doll Passing: the initial grasp by Vega1, then the dual-robot handover). Points are the cumulative score over the 10 trials (out of a maximum of 10), capturing partial progress; the success rate counts only trials where all milestones are completed. Across all four tasks, pretraining on 60 human demonstrations and finetuning on just 30 robot trajectories achieves equal or better points and success rates than baselines trained on 50 robot-only demonstrations, while requiring less total data collection effort. Qualitatively, the human-pretrained policy executes more smoothly and adheres better to object affordance priors implicitly captured during human data collection.
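The scoring scheme described above can be written down in a few lines. The sketch below is an illustration of the stated rules (two sequential milestones worth 0.5 each, points summed over trials, success only when both milestones are completed); the function names are hypothetical and the trial outcomes in the example are made up, not reported results.

```python
def score_trial(milestones_done):
    """Score one trial with 0.5 points per sequential milestone.

    milestones_done is a pair of booleans (first, second). Because the
    milestones are sequential, the second counts only if the first did.
    """
    first, second = milestones_done
    if not first:
        return 0.0
    return 1.0 if second else 0.5

def summarize(trials):
    """Return (points, success_rate) over a list of trials."""
    scores = [score_trial(t) for t in trials]
    points = sum(scores)  # cumulative score, max = number of trials
    success_rate = sum(1 for s in scores if s == 1.0) / len(scores)
    return points, success_rate

# Hypothetical outcomes for 10 trials: 7 full successes, 2 trials that
# reach only the first milestone, and 1 complete failure.
example_trials = [(True, True)] * 7 + [(True, False)] * 2 + [(False, False)]
points, success_rate = summarize(example_trials)
```

For these made-up outcomes the policy earns 8.0 points with a 70% success rate, which shows how points reward partial progress that the success rate ignores.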

**Points and success rates** across the four collaborative tasks, over 10 hardware trials per task.

Method	T1 Trash Collection	T2 Box Organization	T3 Board Tilting	T4 Doll Passing
DUET (Ours)	200.95	206.50	111.65	106.20
Robot-only (50)	268.42	236.83	139.92	116.00
Robot-only (30)	161.05	142.10	83.95	69.60

Data collection effort in minutes. Effort is the expected total time to collect each policy's training dataset: the number of human and robot demonstrations used, each weighted by its amortized collection time. DUET (60 human + 30 robot demonstrations) matches or exceeds the performance of policies trained on 50 robot-only demonstrations while requiring less collection effort.
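The effort values in the table follow from the amortized collection times and the demonstration counts. The sketch below reproduces them, assuming DUET uses 60 human and 30 robot demonstrations per task as stated in the results; the names `AMORTIZED_S` and `effort_minutes` are illustrative.

```python
# Amortized seconds per successful trajectory, as reported in the
# dataset section: (teleoperation, human demonstration).
AMORTIZED_S = {
    "T1": (322.1, 39.9),
    "T2": (284.2, 64.4),
    "T3": (167.9, 27.7),
    "T4": (139.2, 36.6),
}

def effort_minutes(task, n_human, n_robot):
    """Expected collection time in minutes for a training dataset."""
    teleop_s, human_s = AMORTIZED_S[task]
    return (n_human * human_s + n_robot * teleop_s) / 60.0

CONFIGS = {
    "DUET (Ours)": (60, 30),      # 60 human + 30 robot demonstrations
    "Robot-only (50)": (0, 50),
    "Robot-only (30)": (0, 30),
}

for name, (n_human, n_robot) in CONFIGS.items():
    row = [f"{effort_minutes(t, n_human, n_robot):.2f}" for t in AMORTIZED_S]
    print(name, *row, sep="\t")
```

For T1, for example, DUET's effort is (60 × 39.9 + 30 × 322.1) / 60 = 200.95 minutes, against 50 × 322.1 / 60 ≈ 268.42 minutes for the 50-demonstration robot-only baseline.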

Trash Collection. Asymmetric spatial-temporal coordination.

Box Organization. Collaborative vision and long-horizon execution.

Board Tilting. Collaborative manipulation and collective balance.

Doll Passing. Coordinated handovers and grasp transfer.

Zero-shot generalizability

T2 Box Organization, out-of-distribution. The standard boxes are replaced with a black foam box and a green-and-black box, altering visual cues and physical dynamics.

T3 Board Tilting, out-of-distribution. The white board is entirely covered by a black cover, drastically shifting the visual background.

To evaluate robustness, we test DUET under out-of-distribution conditions with no additional training. Despite these severe visual and physical variations, the policy still coordinates the duo to complete both tasks zero-shot.