Training Methodology
We are using Proximal Policy Optimization (PPO) as our RL algorithm, which has proven effective for robotics locomotion training and is integrated into Genesis extremely conveniently.
Our training utilizes a teacher-student setup: the "privileged" teacher model has access to full state information in simulation (exact joint positions, velocities, foot contact sensors, etc.), while the "unprivileged" student model only has access to the same limited sensor data that the real robot will have (IMU data, commanded joint positions). The teacher model helps guide the student's learning process, improving overall performance of the student policy given its ability to train its walking behavior from information it would otherwise not have.
Our reward function is designed to encourage efficient and stable locomotion given any arbitrary [speed, direction, rotation speed] command. This includes terms for linear and angular velocity tracking, survival, and smoothness of motion. We also incorporated a reward function for energy efficiency inspired by this paper and utilized many of the same domain randomization techniques described therein.
To ensure effective sim-to-real transfer, we employ extensive domain randomization including:
- - Random payload mass and position somewhere on the body
- - Servo PID parameters
- - Friction coefficients of the ground
- - Sensor noise characteristics
- - Gravity vector direction (to simulate inclines)
We also include a fixed command delay of one control step to simulate latency in the system.
Performance Results
Our training has shown promising results in simulation, with the robot learning to walk and navigate towards target locations effectively. Currently we are just working with the priveleged teacher model, which has been able to achieve stable locomotion at a variety of speeds and directions. The unprivileged student model is still in the process of being set up and connected, but we are optimistic about its potential given the success of the teacher. The energy efficiency rewards have also led to smoother and more natural transitions between types of gaits (like walking to trotting), which is encouraging for future work on more dynamic movement.