Model performs poorly if its state is restored from checkpoint #3259
Hi,
I am learning JAX and Flax and want to build a simple Masked Autoencoder (MAE) model. I have built the model and trained it on the MNIST dataset; it can reconstruct images from masked inputs (mask ratio = 0.6).
I want to save the model state and restart training from the checkpoint. However, I found that the model performs poorly when trained from a restored state.
The image below shows the loss metric of the 4 training loops.
From the image, it is clear that the model trained in 50-epoch sessions failed to learn much during the 2nd and 3rd sessions, and its performance could not match the model trained for 200 epochs in a single run. The images reconstructed by the latter model also look better.
I used `orbax.checkpoint.CheckpointManager` to manage saving and restoring the checkpoints. This is how I created the object.
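Roughly like this (a minimal sketch following the Flax/Orbax checkpointing guide; the directory path and the `CheckpointManagerOptions` values are placeholders, not necessarily what I used):

```python
import orbax.checkpoint

# A PyTree checkpointer handles Flax TrainState objects.
orbax_checkpointer = orbax.checkpoint.PyTreeCheckpointer()
options = orbax.checkpoint.CheckpointManagerOptions(max_to_keep=3, create=True)
checkpoint_manager = orbax.checkpoint.CheckpointManager(
    '/tmp/mae_checkpoints',  # placeholder directory
    orbax_checkpointer,
    options,
)
```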
When restoring from a checkpoint, I first create a new `state` object using a typical `create_train_state` function, then call the `restore` function like this:
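(Again a sketch; the `create_train_state` arguments and the `{'model': ...}` wrapper are placeholders that match how I saved the state below.)

```python
# Build an "empty" state with the same structure as the saved one;
# Orbax uses it as the target PyTree for restoration.
empty_state = create_train_state(init_rng, model, config)  # placeholder args
step = checkpoint_manager.latest_step()  # step of the most recent checkpoint
restored = checkpoint_manager.restore(step, items={'model': empty_state})
state = restored['model']
```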
And this is how I saved the checkpoint:
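(Sketch; `step` is the current global training step.)

```python
from flax.training import orbax_utils

# Wrap the state so the saved PyTree matches the restore target above.
ckpt = {'model': state}
save_args = orbax_utils.save_args_from_target(ckpt)
checkpoint_manager.save(step, ckpt, save_kwargs={'save_args': save_args})
```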
I am not sure whether the state is saved and restored correctly by the code above. Another question is about the optimizer in the `create_train_state` function: do I need to adjust its parameters when restoring training from a checkpoint? I used a cosine warmup learning-rate schedule with the Adam optimizer.
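For context, my `create_train_state` looks roughly like this (the model input shape, schedule values, and argument names are placeholders):

```python
import jax.numpy as jnp
import optax
from flax.training import train_state

def create_train_state(rng, model, peak_lr, warmup_steps, total_steps):
    # Initialize parameters with a dummy MNIST-shaped batch.
    params = model.init(rng, jnp.ones((1, 28, 28, 1)))['params']
    # Cosine schedule with linear warmup; decay_steps is the total
    # schedule length, including the warmup portion.
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0,
        peak_value=peak_lr,
        warmup_steps=warmup_steps,
        decay_steps=total_steps,
    )
    tx = optax.adam(learning_rate=schedule)
    return train_state.TrainState.create(
        apply_fn=model.apply, params=params, tx=tx)
```

If I understand correctly, the schedule is a pure function of the optimizer's step count, which is part of the restored state, so the learning rate should continue where it left off as long as `warmup_steps` and `total_steps` are computed the same way in every session.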
Replies: 1 comment
-
nvm...some silly mistake with how I calculated the warm-up steps.