Learning Switching Criteria for Sim2Real Transfer


Satvik Sharma*, Ellen Novoseller*, Vainavi Viswanath, Zaynah Javed, Rishi Parikh, Ryan Hoque, Ashwin Balakrishna, Daniel Brown, Ken Goldberg

*equal contribution

ArXiv: [Link] Code: [Link]

Presentation at CASE 2022

Embedded video: Sharma CASE 2022 Switching Criteria 15-minute Video Presentation

Abstract

Simulation-to-reality transfer has emerged as a popular and highly successful method to train robotic control policies for a wide variety of tasks. However, it is often challenging to determine when policies trained in simulation are ready to be transferred to the physical world. Deploying policies that have been trained with very little simulation data can result in unreliable and dangerous behaviors on physical hardware. On the other hand, excessive training in simulation can cause policies to overfit to the visual appearance and dynamics of the simulator. In this work, we study strategies to automatically determine when policies trained in simulation can be reliably transferred to a physical robot. We specifically study these ideas in the context of robotic fabric manipulation, in which successful sim2real transfer is especially challenging due to the difficulties of precisely modeling the dynamics and visual appearance of fabric. Results in a fabric smoothing task suggest that our switching criteria correlate well with performance in real. In particular, our confidence-based switching criteria achieve average final fabric coverage of 87.2-93.7% within 55-60% of the total training budget.

Physical Experiment and Simulator Setup

This work studies sim-to-real switching of behavior cloning policies in a fabric smoothing task. We consider an environment (pictured below) consisting of an ABB YuMi robot with a single tweezer gripper, in which an overhead Photoneo PhoXi camera captures grayscale images. The manipulation workspace border is marked with blue tape and is designed to visually emulate the GymCloth simulator, shown on the left: the top two left-hand images show example starting and ending configurations from an oracle smoothing policy in GymCloth, while the bottom two left-hand images show the same observations processed to resemble the grayscale images taken by the Photoneo PhoXi camera. On the right is an example (normalized and blurred) fabric image taken by the Photoneo PhoXi camera.

Example Photoneo PhoXi camera image

System Overview

At each step, our algorithm pipeline collects a new batch of simulation data, performs a behavior cloning model update epoch, and then checks whether a switching condition is satisfied. If the switching criterion is met, the model is ready to be deployed in real. Otherwise, we continue collecting simulation data to further update the model. We test four switching criteria, which pair each of two evaluation metrics, based on a) reward when evaluated in simulation and b) epistemic uncertainty estimated via an ensemble of policy networks, with each of two stopping conditions, based on 1) absolute value thresholds and 2) gradients.
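To make the pipeline concrete, below is a minimal Python sketch of this loop under the stated settings (10 new demonstrations and one model update epoch per iteration, with a 200-iteration budget). The callables passed in (collect_demos, bc_update_epoch, evaluate_metric, stopping_condition_met) are hypothetical placeholders rather than the released code's API.

```python
def sim2real_switching_pipeline(collect_demos, bc_update_epoch, evaluate_metric,
                                stopping_condition_met, max_iterations=200):
    """Run the collect/update/check loop until a switching criterion fires.
    All four callables are hypothetical placeholders for the corresponding
    pipeline components described above."""
    replay_buffer = []
    metric_history = []  # simulation reward or ensemble confidence, one value per iteration
    for iteration in range(max_iterations):
        replay_buffer.extend(collect_demos(num_demos=10))   # new batch of simulated demonstrations
        bc_update_epoch(replay_buffer)                      # one behavior cloning model update epoch
        metric_history.append(evaluate_metric())            # a) sim reward or b) epistemic uncertainty
        if stopping_condition_met(metric_history):          # value- or gradient-based stopping check
            return iteration                                # policy deemed ready for physical deployment
    return max_iterations                                   # training budget exhausted without switching
```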

Example Trajectory Rollouts in Simulation and Physical Experiments

Above, the top row (left to right) depicts a sample trajectory in simulation, while the bottom row similarly depicts a sample physical robot trajectory.

Determining Stopping Points for Various Switching Criteria

The figure below depicts stopping points identified by the various switching criteria. For the two plots on the left, which depict the simulation performance switching criteria, we overlay the physical performance of several policy checkpoints.

On all graphs above, the dark blue curves are splines fit to the data to mitigate noise when evaluating the stopping conditions. Left Two Plots: The simulation reward comes from evaluating the policy in the GymCloth simulation environment and measuring the fabric coverage of the final configuration; curves are averaged over 5 episode rollouts in GymCloth. For comparison with real, the orange points correspond to the mean physical performance of the behavior cloning policy selected at that iteration, with error bars showing the standard error across four runs. The stopping points (red points) are at iteration 171 for the reward value stopping condition and 153 for the reward gradient stopping condition. Right Two Plots: The epistemic uncertainty is calculated at each iteration over five policy ensemble members and a holdout set of 200 demonstration episodes. The confidence value stopping condition determines the stopping point to be iteration 111, while the confidence gradient stopping condition determines it to be iteration 117.
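Below is a rough Python sketch of how these per-iteration quantities could be computed, assuming the ensemble disagreement is measured as the variance of the five members' predicted actions over the holdout set and that the smoothing is done with a SciPy smoothing spline; both choices are illustrative assumptions rather than the exact procedure in the paper.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def ensemble_uncertainty(ensemble_predict_fns, holdout_observations):
    """Disagreement of the policy ensemble members on the holdout set, measured here
    as the variance of their predicted actions (an assumed disagreement measure)."""
    # preds has shape (num_members, num_holdout_observations, action_dim)
    preds = np.stack([predict(holdout_observations) for predict in ensemble_predict_fns])
    return preds.var(axis=0).mean()

def smooth_metric_curve(metric_history, smoothing_factor=1.0):
    """Fit a smoothing spline to the noisy per-iteration metric curve,
    analogous to the dark blue curves in the plots above."""
    iterations = np.arange(len(metric_history))
    spline = UnivariateSpline(iterations, metric_history, s=smoothing_factor)
    return spline(iterations)
```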

Performance of Learned Policies in Physical Fabric Smoothing Experiments at Various Stopping Points

The four switching criteria---the two evaluation metrics (simulation performance and epistemic uncertainty) coupled with each of the two stopping conditions (value-based and gradient-based)---are evaluated on a fabric smoothing task, where we use behavior cloning to learn a policy from demonstrations.

Left: Final physical fabric coverage achieved for each of the four stopping conditions. Right: Final physical fabric coverage for the confidence value stopping condition compared with policy checkpoints trained for various fixed numbers of iterations. The stopping conditions are largely competitive with training for the full 200 iterations (the maximum iteration number considered) while requiring significantly less training, particularly for the confidence-based stopping criteria. Plots show mean +/- standard error over 4 episodes. Note that for episodes that reach the target coverage of 92% in fewer than 10 actions, we repeat the final achieved coverage for the remainder of the 10-action budget when plotting.

Here are two example trajectory rollouts given by the confidence value stopping condition (which halts at 111 training iterations) and by the final policy (which uses a maximum of 200 training iterations), respectively:

Confidence value stopping condition (111 training iterations)

Final policy (200 training iterations)

Our physical experiments evaluate four repetitions of each policy; for repeatability of results, all repetitions use approximately the same set of four initial fabric configurations, pictured below:

Hyperparameter Values

Behavior cloning: We use a maximum of 2,000 demonstrations, with an additional holdout set consisting of 200 demonstrations. Each demonstration is given by the oracle corner-pulling policy in the GymCloth simulator, which terminates upon reaching at least 92% coverage or after 10 actions, whichever occurs first. In each iteration of the pipeline, 10 demonstrations are added to the replay buffer and one model update epoch is performed. During each model update epoch, each of the 5 policy ensemble members receives a bootstrapped subsample of the data in the replay buffer. Each model update epoch consists of 400 gradient steps (per ensemble member); during each gradient step, a minibatch of size 64 is randomly sampled from the current bootstrap. We use an Adam optimizer with a learning rate of 2.5e-4 and L2 regularization of 1e-5.
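As an illustration, here is a PyTorch sketch of one model update epoch for a single ensemble member, following the stated hyperparameters (bootstrapped subsample, 400 gradient steps, minibatch size 64, Adam with learning rate 2.5e-4 and weight decay 1e-5 for the L2 term). The buffers are assumed to be NumPy arrays, and the MSE loss on oracle actions is an assumed stand-in for the paper's exact behavior cloning loss and architecture.

```python
import numpy as np
import torch

def bc_update_epoch(policy, optimizer, obs_buffer, act_buffer,
                    num_grad_steps=400, batch_size=64):
    """One behavior cloning update epoch for a single ensemble member, trained on a
    bootstrapped subsample of the replay buffer. The MSE loss on oracle actions is a
    simplifying assumption."""
    n = len(obs_buffer)
    bootstrap_idx = np.random.choice(n, size=n, replace=True)         # bootstrapped subsample
    for _ in range(num_grad_steps):
        batch_idx = np.random.choice(bootstrap_idx, size=batch_size)  # minibatch from the bootstrap
        obs = torch.as_tensor(obs_buffer[batch_idx], dtype=torch.float32)
        acts = torch.as_tensor(act_buffer[batch_idx], dtype=torch.float32)
        loss = torch.nn.functional.mse_loss(policy(obs), acts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Optimizer matching the stated hyperparameters (weight_decay implements the L2 term):
# optimizer = torch.optim.Adam(policy.parameters(), lr=2.5e-4, weight_decay=1e-5)
```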

Stopping conditions: Stopping condition hyperparameters were tuned using simulator data, without physical evaluation; we selected values that yield reasonable stopping points in the simulated evaluation metric curves. For the value-based stopping condition, the threshold A is 0.842 for Reward Value and 0.045 for Confidence Value. The gradient-based method has more hyperparameters, which are tuned on a cross-validation set in simulation: the epsilon, U, and V values are 0.002, 50, and 70, respectively, for Reward Gradient, and 0.001, 50, and 70 for Confidence Gradient. The finite differences are taken over the previous 5 points.
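For illustration, the sketch below implements only the two basic checks described on this page: the value-based condition compares the spline-smoothed metric against the threshold A, and the gradient-based condition takes finite differences over the previous 5 points and compares the slope magnitude against epsilon. The roles of U and V are not specified on this page and are therefore omitted from the sketch; consult the paper for the full conditions.

```python
import numpy as np

def value_condition_met(smoothed_metric, threshold_A):
    """Value-based stopping: switch once the spline-smoothed metric crosses the
    absolute threshold A. This sketch assumes a higher-is-better metric (reward);
    for the confidence (uncertainty) metric the comparison direction would flip."""
    return smoothed_metric[-1] >= threshold_A

def gradient_condition_met(smoothed_metric, epsilon, window=5):
    """Gradient-based stopping: switch once the finite-difference slope over the
    previous 5 points of the smoothed curve falls below epsilon in magnitude.
    The U and V hyperparameters mentioned above are not modeled in this sketch."""
    if len(smoothed_metric) < window + 1:
        return False
    recent = np.asarray(smoothed_metric[-(window + 1):])
    slope = float(np.mean(np.diff(recent)))   # average finite difference over the window
    return abs(slope) < epsilon
```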