Learning Acrobatic Flight from Preferences

Researchers have developed the Reward Ensemble under Confidence (REC) framework, which enables an agent to learn complex acrobatic drone flight directly from human preferences, eliminating the need for flawed, hand-crafted reward functions. The method achieves 88.4% of the performance attainable with a perfectly shaped reward and successfully transfers simulation-trained policies to real drones zero-shot. This represents a major advance in preference-based reinforcement learning for robotics.

Preference-Based AI Masters Acrobatic Drone Flight, Surpassing Hand-Crafted Rewards

In a significant advance for robotics and reinforcement learning, researchers have developed a new AI framework that learns complex, acrobatic drone maneuvers directly from human preferences, eliminating the need for flawed, manually designed reward systems. The method, called Reward Ensemble under Confidence (REC), uses a probabilistic model to quantify uncertainty in reward learning, achieving performance levels close to those of idealized, pre-shaped rewards. Critically, policies trained entirely in simulation using only preference feedback were successfully deployed in the real world, performing intricate aerial stunts zero-shot.

This work, detailed in a paper on arXiv, tackles a core challenge in preference-based reinforcement learning (PbRL). PbRL is well suited to tasks with subjective or hard-to-define goals, such as the aesthetic quality of a drone flip, but existing methods often struggle with the high-dimensional, dynamic nature of acrobatic flight. The research also exposes a fundamental weakness of the traditional alternative: hand-crafted reward functions for such tasks align with human judgment only 60.7% of the time.
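For background, PbRL methods typically fit a learned reward $r_\psi$ to pairwise comparisons of trajectory segments using the standard Bradley-Terry model (a common formulation in the field; the paper's exact variant is not spelled out in this summary):

$$P(\sigma^1 \succ \sigma^2) = \frac{\exp\left(\sum_t r_\psi(s^1_t, a^1_t)\right)}{\exp\left(\sum_t r_\psi(s^1_t, a^1_t)\right) + \exp\left(\sum_t r_\psi(s^2_t, a^2_t)\right)}$$

where $\sigma^1$ and $\sigma^2$ are the two segments shown to the human rater, and the reward model is trained with cross-entropy against the human's choices.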

How REC Models Uncertainty to Improve Learning

The proposed REC framework introduces a novel probabilistic architecture: rather than learning a single reward function, it employs an ensemble of distributional reward models, which lets the system explicitly quantify uncertainty at each timestep of a task. That uncertainty is then propagated into the preference learning loss and used strategically to guide exploration, focusing the agent's attention on the parts of the task where the reward models are least confident.
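As a concrete illustration, the sketch below shows one way an ensemble of distributional reward models can produce a per-timestep reward estimate together with an uncertainty signal. It is a minimal PyTorch sketch under stated assumptions: the network sizes, the Gaussian parameterization, and the combination of ensemble disagreement with predicted spread are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of an ensemble of distributional reward models.
# Illustrative only: hyperparameters and architecture are assumptions,
# not the authors' published implementation.
import torch
import torch.nn as nn

class DistributionalRewardModel(nn.Module):
    """Predicts a Gaussian over the per-step reward rather than a point estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, 1)
        self.log_std_head = nn.Linear(hidden, 1)

    def forward(self, obs, act):
        h = self.trunk(torch.cat([obs, act], dim=-1))
        return self.mean_head(h), self.log_std_head(h).exp()

class RewardEnsemble(nn.Module):
    """Ensemble of distributional reward models. Ensemble disagreement
    (epistemic) plus each member's predicted spread (aleatoric) yields a
    per-timestep uncertainty signal."""
    def __init__(self, obs_dim: int, act_dim: int, n_members: int = 5):
        super().__init__()
        self.members = nn.ModuleList(
            DistributionalRewardModel(obs_dim, act_dim) for _ in range(n_members)
        )

    def forward(self, obs, act):
        means, stds = zip(*(m(obs, act) for m in self.members))
        means, stds = torch.stack(means), torch.stack(stds)
        reward = means.mean(dim=0)                         # point estimate for the RL algorithm
        uncertainty = means.std(dim=0) + stds.mean(dim=0)  # per-timestep confidence signal
        return reward, uncertainty
```

In a full pipeline, the mean reward would be handed to the policy-learning algorithm as the learned reward, while the uncertainty signal could drive exploration bonuses or select which segment pairs to query the human about next.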

This approach stands in contrast to standard methods like Preference PPO. On the challenging benchmark of acrobatic quadrotor control, REC achieved 88.4% of the performance attainable with a perfectly shaped, ground-truth reward. Preference PPO, by comparison, reached only 55.2%. The ability to learn from preferences and approach "oracle" reward performance is a major leap forward for data-efficient and human-aligned AI training.
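To make the contrast concrete: Preference PPO optimizes the plain Bradley-Terry cross-entropy shown earlier, whereas REC folds its uncertainty estimates into the objective. The sketch below shows one plausible way to do that, down-weighting timesteps where the ensemble is unsure; the specific weighting scheme is an assumption for illustration, not REC's published loss. It reuses the hypothetical `RewardEnsemble` from the previous sketch.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_preference_loss(ensemble, seg1, seg2, label):
    """Bradley-Terry cross-entropy over segment returns, with low-confidence
    timesteps down-weighted. The weighting here is an illustrative assumption.

    seg1, seg2: (obs, act) tensor pairs of shape (T, obs_dim) / (T, act_dim)
    label: LongTensor of shape (1,); 0 if seg1 was preferred, 1 if seg2 was.
    """
    r1, u1 = ensemble(*seg1)
    r2, u2 = ensemble(*seg2)
    w1 = 1.0 / (1.0 + u1)  # trust confident timesteps more
    w2 = 1.0 / (1.0 + u2)
    logits = torch.stack([(w1 * r1).sum(), (w2 * r2).sum()]).unsqueeze(0)
    return F.cross_entropy(logits, label)
```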

Real-World Validation and Broader Applicability

The most compelling demonstration involved transferring the simulation-trained policies to physical drones without any fine-tuning. The drones executed complex maneuvers learned purely from human preference judgments, validating the robustness and sim-to-real transfer capability of the REC framework. Furthermore, the researchers confirmed the generalizability of their approach by applying REC to standard continuous control benchmarks beyond aerial robotics, proving it is not a domain-specific solution.

Why This Matters for the Future of AI and Robotics

  • Eliminates Reward Engineering Bottlenecks: REC provides a powerful pathway to train AI for complex, subjective tasks where designing accurate reward functions by hand is impractical or impossible.
  • Enhances Human-AI Alignment: By learning directly from human preferences, the resulting AI behaviors are inherently more aligned with what humans actually value and find successful.
  • Improves Sim-to-Real Transfer: The success of zero-shot deployment from simulation to real drones accelerates robotics development, reducing costly and risky real-world trial-and-error.
  • Introduces a New Paradigm for Uncertainty: The framework's explicit modeling and utilization of reward uncertainty provide a blueprint for making PbRL more sample-efficient and robust across various domains.
