Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions

Jianan Li1 Xiao Chen1 Tao Huang2,3 Tien-Tsin Wong4
1The Chinese University of Hong Kong 2Shanghai AI Laboratory
3Shanghai Jiao Tong University 4Monash University
Teaser: Source Video → 2D Motion → Simulated Character

Mimic2DM effectively learns character controllers for diverse motion types by directly imitating 2D motion sequences extracted from in-the-wild videos, without requiring any 3D motion data.

Abstract

Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters.

We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements.

Methodology

System Overview (Mimic2DM Pipeline)

1. Imitation as Reprojection Minimization

We unify motion reconstruction and physics-based imitation into a single reprojection-minimization task. The policy learns to control a simulated character so that its joints, projected into the reference camera view, match the 2D keypoints observed in the video.
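
A minimal sketch of this idea, not the paper's exact formulation: the simulated character's 3D joint positions are projected through the reference camera, and the imitation reward decays with the mean pixel-space reprojection error. The function names, the pinhole camera model, and the shaping constant `sigma` are illustrative assumptions.

```python
import numpy as np

def project_points(joints_3d, K, R, t):
    """Project simulated 3D joint positions (J, 3) into the image plane
    using a pinhole camera with intrinsics K and extrinsics (R, t)."""
    cam = joints_3d @ R.T + t          # world -> camera coordinates, (J, 3)
    uv = cam @ K.T                     # apply intrinsics, (J, 3)
    return uv[:, :2] / uv[:, 2:3]      # perspective divide -> pixel coords, (J, 2)

def reprojection_reward(sim_joints_3d, ref_keypoints_2d, K, R, t, sigma=25.0):
    """Imitation reward: exponentiated negative mean reprojection error (in pixels)."""
    proj = project_points(sim_joints_3d, K, R, t)
    err = np.linalg.norm(proj - ref_keypoints_2d, axis=-1).mean()
    return float(np.exp(-(err / sigma) ** 2))

# Toy usage with a random pose and an identity-extrinsics camera.
J = 24
joints = np.random.randn(J, 3) + np.array([0.0, 0.0, 3.0])   # character ~3 m in front of camera
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
ref = project_points(joints, K, R, t) + np.random.randn(J, 2) * 2.0  # noisy 2D reference
print(reprojection_reward(joints, ref, K, R, t))
```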

2. View-Agnostic Tracking

A general policy is trained to minimize reprojection error under arbitrary camera viewpoints. By aggregating features from multiple views, the policy implicitly learns 3D motion understanding without explicit 3D supervision.
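
One way such view aggregation could look in code (the module name, feature dimensions, camera encoding, and mean-pooling choice are assumptions, not the paper's architecture): a shared encoder embeds each view's 2D reference keypoints together with its camera parameters, and a permutation-invariant pooling fuses the per-view features so the same policy works with one view at training time and several views at inference.

```python
import torch
import torch.nn as nn

class MultiViewTracker(nn.Module):
    """Sketch of view aggregation: a shared encoder embeds each view's 2D reference
    keypoints plus camera parameters; mean pooling over views keeps the policy
    agnostic to how many viewpoints are available."""
    def __init__(self, n_joints=24, cam_dim=12, state_dim=128, act_dim=32, hidden=256):
        super().__init__()
        self.view_encoder = nn.Sequential(
            nn.Linear(n_joints * 2 + cam_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.policy_head = nn.Sequential(
            nn.Linear(hidden + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, keypoints_2d, cams, char_state):
        # keypoints_2d: (B, V, J, 2), cams: (B, V, cam_dim), char_state: (B, state_dim)
        per_view = torch.cat([keypoints_2d.flatten(2), cams], dim=-1)  # (B, V, J*2 + cam_dim)
        feats = self.view_encoder(per_view)                            # (B, V, hidden)
        fused = feats.mean(dim=1)                                      # pool over views
        return self.policy_head(torch.cat([fused, char_state], dim=-1))

# Single-view training and multi-view inference use the same network.
net = MultiViewTracker()
act = net(torch.randn(4, 3, 24, 2), torch.randn(4, 3, 12), torch.randn(4, 128))
print(act.shape)  # torch.Size([4, 32])
```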

3. Hierarchical Control

We integrate a transformer-based autoregressive 2D motion generator (using VQ-VAE tokens) to produce high-quality 2D reference trajectories that guide the tracking policy for generative tasks.
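
A hedged sketch of the generator side of this hierarchy: a causal Transformer predicts the next VQ-VAE motion-token index, and sampled token sequences would be decoded back into 2D keypoint trajectories for the tracking policy to follow. The vocabulary size, model dimensions, and sampling loop below are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class Motion2DGenerator(nn.Module):
    """Sketch of an autoregressive generator over VQ-VAE motion tokens.
    Token embeddings pass through a causal Transformer; the head predicts
    logits for the next codebook index."""
    def __init__(self, vocab=512, dim=256, layers=4, heads=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # tokens: (B, T) codebook indices
        T = tokens.shape[1]
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each step only attends to past tokens.
        mask = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.transformer(x, mask=mask))  # (B, T, vocab)

    @torch.no_grad()
    def sample(self, prompt, steps):
        """Autoregressively extend a token prompt; decoded 2D keypoints would
        then serve as reference motion for the tracking policy."""
        tokens = prompt
        for _ in range(steps):
            logits = self(tokens)[:, -1]
            nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens

gen = Motion2DGenerator()
print(gen.sample(torch.zeros(1, 1, dtype=torch.long), steps=8).shape)  # torch.Size([1, 9])
```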

Results & Demonstrations

⚽ Human-object Interaction (Soccer)

Acquiring complex ball-interaction dynamics using only 2D video data.

🐕 Non-human Characters (Quadruped robot)

Demonstrating the framework's versatility on quadruped robots.

💃 Whole-body Motion Tracking

Zero-shot 3D motion tracking via multi-view aggregation.

🔄 Generative Control

Synthesizing diverse dribbling styles and achieving seamless skill transitions via hierarchical control.

Comparison with Baselines

Comparison panels: Reference 2D Motion · Baseline (SFV*) · Ours (Mimic2DM)

Left: Reference 2D motion. Middle: Baseline methods relying on explicit 3D reconstruction often fail in complex interaction scenarios (e.g., ball penetration, floating). Right: Our method (Mimic2DM) minimizes reprojection error directly in simulation, ensuring physical validity and robust tracking.

Generated Motions