Terra Baselines provides a set of tools to train and evaluate RL policies on the Terra environment. This implementation makes it possible to train an agent capable of planning earthworks in trench and foundation environments in less than 1 minute on 8 Nvidia RTX 4090 GPUs.
- Train on multiple devices using PPO with `train.py` (based on XLand-MiniGrid)
- Generate metrics for your checkpoint with `eval.py`
- Visualize rollouts of your checkpoint with `visualize.py`
- Run a grid search on the hyperparameters with `train_sweep.py` (orchestrated with wandb)
Clone the repo and install the requirements with
```
pip install -r requirements.txt
```
Clone Terra in a different folder and install it with
```
pip install -e .
```
Lastly, install JAX.
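To verify that the JAX installation sees your accelerators before launching a multi-device run, a quick check is:

```python
import jax

# Should list one entry per visible accelerator (e.g. 8 GPUs), or a CPU fallback.
print(jax.devices())
```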
Set up your training job by configuring the `TrainConfig`:
```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    name: str
    num_devices: int = 0
    project: str = "excavator"
    group: str = "default"
    num_envs_per_device: int = 4096
    num_steps: int = 32
    update_epochs: int = 3
    num_minibatches: int = 32
    total_timesteps: int = 3_000_000_000
    lr: float = 3e-4
    clip_eps: float = 0.5
    gamma: float = 0.995
    gae_lambda: float = 0.95
    ent_coef: float = 0.01
    vf_coef: float = 5.0
    max_grad_norm: float = 0.5
    eval_episodes: int = 100
    seed: int = 42
    log_train_interval: int = 1
    log_eval_interval: int = 50
    checkpoint_interval: int = 50
    clip_action_maps = True
    local_map_normalization_bounds = [-16, 16]
    loaded_max = 100
    num_rollouts_eval = 300
```
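For reference, the config is a plain dataclass, so overriding defaults amounts to constructing it with keyword arguments (the values below are illustrative; how `train.py` actually parses CLI arguments may differ):

```python
# Illustrative override of the defaults above; only `name` has no default.
config = TrainConfig(
    name="tracked-dense",
    num_devices=8,
    num_envs_per_device=4096,
    lr=3e-4,
)
```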
Then, set up the curriculum in `config.py` in Terra (making sure the maps are saved to disk).
Run a training job with
```
DATASET_PATH=/path/to/dataset DATASET_SIZE=<num_maps_per_type> python train.py -d <num_devices>
```
and collect your weights in the `checkpoints/` folder.
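Since checkpoints are stored as `.pkl` files, a trained policy can presumably be loaded back with standard pickle tooling (a sketch; the structure of the saved object depends on how `train.py` writes it):

```python
import pickle

# Checkpoint name follows the files shipped in checkpoints/; yours will differ.
with open("checkpoints/tracked-dense.pkl", "rb") as f:
    checkpoint = pickle.load(f)

# Inspect what was saved before wiring it into eval/visualization code.
print(type(checkpoint))
```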
You can run a hyperparameter sweep over reward settings using Weights & Biases Sweeps. This lets you run an efficient grid or random search over reward parameters and compare the results.
The sweep configuration is defined in `sweep.py`. It includes a grid over reward parameters such as `existence`, `collision_move`, `move`, etc. The sweep uses the `TrainConfigSweep` dataclass, which extends the standard training config with sweepable reward parameters.
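As a rough idea of what such a configuration looks like, a wandb grid sweep over reward parameters can be declared as a dictionary like the following (parameter names taken from the list above; the values and metric name are placeholders, the real grid lives in `sweep.py`):

```python
# Illustrative wandb grid-sweep configuration; values are placeholders.
sweep_config = {
    "method": "grid",
    "metric": {"name": "eval/reward", "goal": "maximize"},
    "parameters": {
        "existence": {"values": [-0.05, -0.01]},
        "collision_move": {"values": [-0.2, -0.1]},
        "move": {"values": [-0.05, -0.01]},
    },
}
```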
To create a new sweep on wandb, run:
```
python sweep.py create
```
This will print a sweep ID (e.g., `abc123xy`). Copy this ID for the next step.
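Under the hood this step registers the configuration with Weights & Biases, which can also be done directly through the Python API (a sketch, assuming the dictionary above and the project name from `TrainConfig`):

```python
import wandb

# Registers the sweep and returns its ID, e.g. "abc123xy".
sweep_id = wandb.sweep(sweep=sweep_config, project="excavator")
print(sweep_id)
```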
You can launch multiple agents (workers) to run experiments in parallel. Each agent will pick up a different configuration from the sweep and start a training run.
To launch an agent, run:
```
wandb agent <SWEEP_ID>
```
You can run this command multiple times (e.g., in different terminals, or as background jobs in a cluster script) to parallelize the sweep.
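If you prefer launching workers from Python rather than the CLI, `wandb.agent` does the same thing (the training function below is a placeholder, not the repo's actual entry point):

```python
import wandb

def run_training():
    # Placeholder: in practice this would call the training entry point,
    # which reads the swept reward parameters from wandb.config.
    pass

# Each call pulls configurations from the sweep and runs them.
wandb.agent("<SWEEP_ID>", function=run_training, count=1)
```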
If you are using a cluster, you can use the provided `sweep_cluster.sh` script. Make sure to set the `SWEEP_ID` variable to your sweep ID:
```bash
# In sweep_cluster.sh
SWEEP_ID=<YOUR_SWEEP_ID>
wandb agent $SWEEP_ID &
wandb agent $SWEEP_ID &
wandb agent $SWEEP_ID &
wandb agent $SWEEP_ID &
wait
```
Evaluate your checkpoint with standard metrics using
```
DATASET_PATH=/path/to/dataset DATASET_SIZE=<num_maps_per_type> python eval.py -run <checkpoint_path> -n <num_environments> -steps <num_steps>
```
Visualize the rollout of your policy with
```
DATASET_PATH=/path/to/dataset DATASET_SIZE=<num_maps_per_type> python visualize.py -run <checkpoint_path> -nx <num_environments_x> -ny <num_environments_y> -steps <num_steps> -o <output_path.gif>
```
We train two models capable of solving both foundation and trench environments. They differ in the type of agent (wheeled or tracked) and in the curriculum used to train them (dense reward with a single level, or sparse reward with a curriculum). All models are trained on 64x64 maps and stored in the `checkpoints/` folder.
| Checkpoint | Map Type | | | | |
|---|---|---|---|---|---|
| `tracked-dense.pkl` | Foundations | 97% | 5.66 (1.51) | 19.06 (2.86) | 0.99 (0.04) |
| | Trenches | 94% | 7.09 (5.66) | 20.57 (5.26) | 0.99 (0.10) |
| `wheeled-dense.pkl` | Foundations | 99% | 11.43 (8.96) | 22.06 (3.65) | 1.00 (0.00) |
| | Trenches | 89% | 15.84 (25.10) | 21.12 (5.65) | 0.96 (0.14) |
The metrics follow the definitions from Terenzi et al.
All the models we train share the same structure. We encode the maps with a CNN, and the agent state and local maps with MLPs. The latent features are concatenated and shared by the two MLP heads of the model (value and action). In total, the model has ~130k parameters counting both value and action weights.
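A minimal sketch of this architecture in Flax is shown below; the layer sizes, feature dimensions, and action-space size are illustrative assumptions rather than the exact implementation.

```python
import jax.numpy as jnp
import flax.linen as nn


class ActorCritic(nn.Module):
    """Sketch: CNN map encoder + MLP encoders for agent state and local maps,
    with concatenated latents feeding separate action and value heads."""
    num_actions: int = 9  # placeholder action-space size

    @nn.compact
    def __call__(self, global_map, agent_state, local_map):
        # Encode the global map (B, H, W, C) with a small CNN.
        x = nn.relu(nn.Conv(features=16, kernel_size=(3, 3), strides=(2, 2))(global_map))
        x = nn.relu(nn.Conv(features=32, kernel_size=(3, 3), strides=(2, 2))(x))
        x = x.reshape((x.shape[0], -1))

        # Encode the agent state and local maps with MLPs.
        s = nn.relu(nn.Dense(32)(agent_state))
        l = nn.relu(nn.Dense(32)(local_map))

        # Concatenate into a shared latent used by both heads.
        z = jnp.concatenate([x, s, l], axis=-1)
        z = nn.relu(nn.Dense(64)(z))

        logits = nn.Dense(self.num_actions)(z)  # action head
        value = nn.Dense(1)(z)                  # value head
        return logits, value
```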
Here's a collection of rollouts for the models we trained.