Deep Model Predictive Optimization

Abstract

A major challenge in robotics is to design robust policies that enable complex and agile behaviors in the real world. On one end of the spectrum, we have model-free reinforcement learning (MFRL), which is incredibly flexible and general but often results in brittle policies. In contrast, model predictive control (MPC) continually re-plans at each time step to remain robust to perturbations and model inaccuracies. However, despite its real-world successes, MPC often under-performs the optimal strategy. This is due to model quality, myopic behavior from short planning horizons, and approximations due to computational constraints. And even with a perfect model and enough compute, MPC can get stuck in bad local optima, depending heavily on the quality of the optimization algorithm. To this end, we propose Deep Model Predictive Optimization (DMPO), which learns the inner loop of an MPC optimization algorithm directly via experience, specifically tailored to the needs of the control problem. We evaluate DMPO on a real quadrotor agile trajectory tracking task, on which it improves performance over a baseline MPC algorithm for a given computational budget. With fewer samples, DMPO outperforms the best MPC algorithm by up to 27%, and it outperforms an end-to-end policy trained with MFRL by 19%. Moreover, because DMPO requires fewer samples, it can achieve these benefits with 4.3X less memory. When we subject the quadrotor to turbulent wind fields with an attached drag plate, DMPO can adapt zero-shot while still outperforming all baselines.
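To make the idea concrete, below is a minimal sketch (not the paper's implementation) of one step of a sampling-based MPC optimizer in the style of MPPI, alongside the kind of learned inner-loop update that DMPO would substitute for the hand-designed rule. The `dynamics`, `cost`, and `net` functions, and all constants, are illustrative placeholders rather than the actual method.

```python
# Hedged sketch: an MPPI-style sampling MPC step, plus the interface a learned
# (DMPO-style) update could take. All models and constants are placeholders.
import numpy as np

H, A, N = 20, 4, 512          # horizon, action dimension, number of sampled rollouts
SIGMA, LAMBDA = 0.3, 1.0      # exploration noise scale, MPPI temperature

def dynamics(x, u):           # placeholder single-step model
    return x + 0.1 * np.tanh(u)

def cost(x, u):               # placeholder running cost (track the origin)
    return np.sum(x**2) + 0.01 * np.sum(u**2)

def rollout_costs(x0, mean, noise):
    """Evaluate N perturbed action sequences under the model."""
    costs = np.zeros(N)
    for i in range(N):
        x = x0.copy()
        for t in range(H):
            u = mean[t] + noise[i, t]
            costs[i] += cost(x, u)
            x = dynamics(x, u)
    return costs

def mppi_update(mean, noise, costs):
    """Classic MPPI inner loop: exponentially weighted average of the perturbations."""
    w = np.exp(-(costs - costs.min()) / LAMBDA)
    w /= w.sum()
    return mean + np.einsum('i,ita->ta', w, noise)

def learned_update(mean, noise, costs, net):
    """DMPO-style idea (placeholder interface): a network trained from experience
    maps the sampled perturbations and their costs to the distribution update."""
    return mean + net(noise, costs)

# One MPC step with the hand-designed optimizer:
x0, mean = np.zeros(4), np.zeros((H, A))
noise = SIGMA * np.random.randn(N, H, A)
mean = mppi_update(mean, noise, rollout_costs(x0, mean, noise))
action = mean[0]              # apply the first action, then re-plan at the next step
```

In both cases only the first action of the optimized sequence is executed before re-planning, which is what gives MPC its robustness to perturbations; the learned update simply aims to extract more progress from each batch of samples than the fixed weighting rule does.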

Experimental Results

We perform all evaluations on a quadrotor trajectory tracking problem in which the desired trajectories are infeasible zig-zags, with and without yaw flips. The zig-zags linearly connect a series of random waypoints, and the yaw flips are a 180-degree change in the desired yaw at each waypoint. Baselines include Model Predictive Path Integral (MPPI) control and an end-to-end (E2E) policy trained with MFRL: a 3-layer MLP operating on states and conditioned on the desired trajectory.
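For reference, the sketch below shows roughly how such a zig-zag reference with optional 180-degree yaw flips could be generated. The waypoint count, workspace bounds, and segment timing are illustrative assumptions, not the exact experimental settings.

```python
# Hedged sketch of the reference used in these experiments: straight-line segments
# between random waypoints, with an optional 180-degree yaw flip at each waypoint.
import numpy as np

def zigzag_reference(n_waypoints=6, seg_steps=100, yaw_flips=False, seed=0):
    rng = np.random.default_rng(seed)
    waypoints = rng.uniform(-2.0, 2.0, size=(n_waypoints, 3))   # random xyz targets (assumed bounds)
    yaw = 0.0
    positions, yaws = [], []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        alphas = np.linspace(0.0, 1.0, seg_steps, endpoint=False)
        positions.append(a[None] * (1 - alphas[:, None]) + b[None] * alphas[:, None])
        yaws.append(np.full(seg_steps, yaw))
        if yaw_flips:                                            # flip desired yaw at the waypoint
            yaw = (yaw + np.pi) % (2 * np.pi)
    return np.concatenate(positions), np.concatenate(yaws)

pos_ref, yaw_ref = zigzag_reference(yaw_flips=True)
```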

Zig-zag trajectory tracking without additional disturbances

DMPO (512 samples) vs. MPPI (512 samples): In a sample-constrained regime, DMPO can successfully perform the zig-zag maneuver while MPPI consistently crashes with 512 or fewer samples.

DMPO (1024 samples) vs. MPPI (4096 samples): With 4X fewer samples, DMPO outperforms the best MPPI controller by over 7% in terms of total trajectory cost.

DMPO (1024 samples) vs. E2E: DMPO also outperforms an end-to-end policy trained with MFRL by over 14% in terms of total trajectory cost.

Zig-zag trajectory tracking under an unknown turbulent wind field and added cardboard drag plate

DMPO (512 samples) vs. MPPI (512 samples): Under these unknown perturbations in a sample-constrained regime, MPPI crashes while DMPO remains robust.

DMPO (4096 samples) vs. MPPI (8192 samples): Given more samples, DMPO still surpasses the best MPPI controller by over 6% in terms of total trajectory cost, despite these additional perturbations.

DMPO (1024 samples) vs. E2E: And despite the perturbations, DMPO only needs 1024 samples to outperform the end-to-end policy by over 14% in terms of total trajectory cost.

DMPO (4096 samples) vs. E2E: With even more samples, DMPO can outperform the end-to-end policy by over 25% in terms of total trajectory cost.

Zig-zag trajectory tracking with yaw flips (without additional disturbances)

DMPO (256 samples) vs. MPPI (256 samples): In this harder task, DMPO can succeed with as few as 256 samples, while MPPI consistently crashes with the same number of samples.

DMPO (256 samples) vs. MPPI (4096 samples): And with only 256 samples, DMPO outperforms the best MPPI controller by over 27% in terms of total trajectory cost. This illustrates that there is more room for improvement on this harder problem, which DMPO captures by improving both position and orientation tracking error.

Zig-zag trajectory tracking with yaw flips, under an unknown turbulent wind field and added cardboard drag plate

DMPO (512 samples) vs. MPPI (4096 samples): Under these perturbations, MPPI crashes with 4096 or fewer samples, while DMPO with only 512 samples again remains robust and successfully completes the task.

DMPO (512 samples) vs. MPPI (8192 samples): And despite the wind and drag plate, DMPO with 512 samples still outperforms the best MPPI controller with 8192 samples by over 7% in terms of total trajectory cost.