Learning Acrobatic Flight from Preferences

The Reward Ensemble under Confidence (REC) framework enables autonomous drones to master acrobatic maneuvers by learning reward functions directly from human preferences rather than from hand-coded rewards. This preference-based reinforcement learning approach achieves 88.4% of the performance of an agent trained with a perfect oracle reward, and its policies transfer zero-shot from simulation to real-world quadrotors for acrobatic flight. The approach targets a fundamental mismatch: human-engineered rewards agree with human judgment only 60.7% of the time.

New AI Framework Teaches Drones Acrobatics Using Human Preferences, Not Hand-Coded Rewards

A new reinforcement learning framework enables autonomous drones to master complex acrobatic maneuvers by learning directly from human preferences, bypassing the notoriously difficult task of manually designing reward functions. The system, called Reward Ensemble under Confidence (REC), uses a probabilistic model to learn what constitutes a "good" flight from human feedback, achieving performance close to that of a perfect, pre-defined reward. Critically, policies trained entirely in simulation with REC transferred to real-world quadrotors, performing flips and rolls zero-shot, with no real-world fine-tuning.

The Challenge of Rewarding Acrobatic Flight

Preference-based reinforcement learning (PbRL) is a growing field that trains AI agents using comparative human judgments (e.g., "Trajectory A is better than B") instead of numerical reward signals. This is vital for tasks where success is subjective or hard to quantify, such as artistic motion or acrobatics. Acrobatic drone flight presents an extreme challenge: its dynamics are complex, movements are rapid, and the margin for error is tiny. The research confirms that human-engineered reward functions are poorly aligned with human judgment, agreeing only 60.7% of the time.
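The article does not reproduce the paper's exact loss, but most PbRL systems turn such pairwise judgments into a training signal with a Bradley-Terry style model: each trajectory is scored by summing a learned per-step reward, and a softmax over the two scores gives the probability that the human prefers one over the other. A minimal sketch in PyTorch, where `reward_net` is a placeholder network mapping states to scalar rewards:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_net, traj_a, traj_b, label):
    """Bradley-Terry preference loss (standard PbRL, not REC-specific).

    traj_a, traj_b: (T, obs_dim) tensors of states along each trajectory.
    label: 0 if the human preferred trajectory A, 1 if B.
    """
    # Score each trajectory as the sum of predicted per-step rewards.
    score_a = reward_net(traj_a).sum()
    score_b = reward_net(traj_b).sum()
    # Softmax over the two scores models P(A preferred) vs. P(B preferred);
    # cross-entropy against the human label trains the reward network.
    logits = torch.stack([score_a, score_b]).unsqueeze(0)  # shape (1, 2)
    return F.cross_entropy(logits, torch.tensor([label]))
```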

"Hand-crafted rewards often fail to capture the nuanced qualities that matter for a graceful or precise maneuver," the authors note, highlighting the fundamental mismatch between programmer intuition and the true objective. This gap necessitates methods that learn the reward directly from the end-user's preferences.

How REC Models Uncertainty for Better Learning

The proposed REC framework introduces a key innovation: it explicitly models the uncertainty in the learned reward at every timestep. It does this by maintaining an ensemble of distributional reward models, each predicting a distribution over rewards rather than a single point estimate. Disagreement across the ensemble lets REC quantify how confident it is in its own reward predictions.
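The article does not give the architecture, so the following is only an illustrative sketch of what an ensemble of distributional reward models can look like: each member predicts a Gaussian over the per-step reward, and the spread of member means serves as the uncertainty signal. All names here (e.g. `GaussianRewardModel`) are hypothetical, not from the paper:

```python
import torch
import torch.nn as nn

class GaussianRewardModel(nn.Module):
    """One ensemble member: predicts a distribution (mean, std) over the
    per-step reward instead of a single point estimate."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, 1)
        self.log_std_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.body(obs)
        return self.mean_head(h), self.log_std_head(h).exp()

def ensemble_stats(members, obs):
    """Epistemic uncertainty: spread of the member means at each timestep.
    High spread means the ensemble is unsure what reward a state deserves."""
    means = torch.stack([m(obs)[0] for m in members])  # (K, T, 1)
    return means.mean(dim=0), means.std(dim=0)
```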

REC then propagates this uncertainty into two critical components of the learning process. First, it uses uncertainty to weight the preference loss, emphasizing comparisons where it is more confident. Second, it leverages areas of high model disagreement to guide exploration, actively seeking out trajectories that would reduce its uncertainty about human preferences. This creates a virtuous cycle of more efficient and robust learning.
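The precise weighting and exploration terms are the paper's contribution and are not specified in this article; purely as an assumed sketch, reusing the hypothetical `GaussianRewardModel` ensemble above, the two mechanisms might be wired up like this:

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(members, traj_a, traj_b, label, beta=1.0):
    """Preference loss scaled by ensemble confidence: pairs on which the
    members disagree about the ranking contribute less gradient."""
    scores_a = torch.stack([m(traj_a)[0].sum() for m in members])  # (K,)
    scores_b = torch.stack([m(traj_b)[0].sum() for m in members])
    # Confidence = low spread in the score gap across ensemble members.
    disagreement = (scores_a - scores_b).std()
    weight = torch.exp(-beta * disagreement).detach()
    logits = torch.stack([scores_a.mean(), scores_b.mean()]).unsqueeze(0)
    return weight * F.cross_entropy(logits, torch.tensor([label]))

def exploration_bonus(members, obs):
    """Intrinsic bonus added during rollouts: states where the members
    disagree most are the ones worth querying the human about."""
    means = torch.stack([m(obs)[0] for m in members])  # (K, T, 1)
    return means.std(dim=0)  # (T, 1) per-step bonus
```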

Superior Performance and Real-World Transfer

The results on acrobatic quadrotor control are striking. Evaluated in simulation, a policy trained with REC achieved 88.4% of the performance of an agent trained with a meticulously shaped oracle reward, a near-perfect benchmark that is typically unavailable in real applications. In contrast, a standard Preference PPO baseline reached only 55.2% of that performance, a substantial gap in REC's favor.

Most impressively, policies trained in simulation with REC were deployed zero-shot onto physical drones. Without any additional real-world fine-tuning, the drones successfully executed the learned acrobatic maneuvers, demonstrating that the framework produces robust, transferable policies. The study further validated REC's generality with strong results on standard continuous control benchmarks, indicating utility beyond aerial robotics.

Why This Matters for AI and Robotics

  • Democratizes Complex Skill Learning: REC reduces the need for deep reinforcement learning expertise to design rewards, allowing domain experts (e.g., pilots, artists) to teach robots through intuitive preference feedback.
  • Enables Learning of Subjective Tasks: It opens the door for AI to master tasks with inherently qualitative goals, such as stylistic motion, graceful manipulation, or safe and comfortable driving.
  • Improves Sample Efficiency and Robustness: By modeling uncertainty, REC learns more efficiently from limited human feedback and produces policies robust enough for sim-to-real transfer, a major hurdle in robotics.
  • Broad Applicability: The validation on non-robotics benchmarks suggests REC is a general-purpose PbRL algorithm applicable to finance, healthcare, and any domain where objectives are complex to specify.
