Edge case causes incorrect filesystem to be selected for finding cloud checkpoints #17912

@schmidt-ai

Bug description

When both of the following happen together:

  1. a logger is used with a cloud (e.g. s3:// or gcs:// protocol) save dir
  2. a ModelCheckpoint is used without passing a dirpath

The desired behaviors are:

  1. the checkpoint directory is resolved (via ModelCheckpoint.__resolve_ckpt_dir) to $logger.save_dir/$logger.name/$logger.version/checkpoints, and the ModelCheckpoint callback saves checkpoints there.
  2. ModelCheckpoint._find_last_checkpoints will find $logger.save_dir/$logger.name/$logger.version/checkpoints/last.ckpt. It will first check whether that path exists on the filesystem instantiated in ModelCheckpoint.__init_ckpt_dir.

Desired behavior 1 works, 2 does not. There are two bugs:

  1. ModelCheckpoint.__init_ckpt_dir selects the wrong filesystem when dirpath is None, so ModelCheckpoint._find_last_checkpoints cannot find the cloud filepaths.
  2. Even if the correct ModelCheckpoint._fs were used, _find_last_checkpoints returns a set of paths with their protocols stripped (due to the call to _fs.ls). This causes _CheckpointConnector._parse_ckpt_path to also select the wrong filesystem, resulting in no checkpoints found (see the sketch after this list).
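
To make bug 2 concrete, here is a minimal sketch (not Lightning code) of how fsspec's ls strips the protocol; the bucket and prefix are placeholders and assume s3fs is installed with valid credentials:

from s3fs import S3FileSystem

fs = S3FileSystem()
# ls() returns bucket-relative keys without the "s3://" prefix, so any code that
# later re-infers the filesystem from these paths falls back to the local one.
paths = fs.ls("s3://my-bucket/logs/my_model/version_0/checkpoints")
print(paths)  # e.g. ['my-bucket/logs/my_model/version_0/checkpoints/last.ckpt']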

What version are you seeing the problem on?

v2.0, but likely also present in other versions

How to reproduce the bug

  1. Use a logger with a cloud save dir
  2. Create some cloud checkpoint, e.g. s3://.../logger_name/logger_version/checkpoints/last.ckpt
  3. From a new job, try to resume training using ckpt_path="last"
  4. A warning is emitted saying that Lightning could not find the checkpoint (a full sketch of this setup follows below)
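
For reference, a minimal sketch of such a setup, assuming s3fs is installed; the bucket name, MyLightningModule, and MyDataModule are placeholders:

import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import TensorBoardLogger

# Logger with a cloud save dir; ModelCheckpoint without a dirpath.
logger = TensorBoardLogger(save_dir="s3://my-bucket/logs", name="logger_name")
ckpt_cb = ModelCheckpoint(save_last=True)

# First job: trains and writes .../logger_name/<version>/checkpoints/last.ckpt to S3.
trainer = L.Trainer(logger=logger, callbacks=[ckpt_cb], max_epochs=1)
trainer.fit(MyLightningModule(), datamodule=MyDataModule())

# New job: resuming from "last" emits the warning below instead of loading the checkpoint.
trainer = L.Trainer(logger=logger, callbacks=[ckpt_cb], max_epochs=2)
trainer.fit(MyLightningModule(), datamodule=MyDataModule(), ckpt_path="last")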

Error messages and logs

UserWarning: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.

Environment

Current environment
#- Lightning Component: ModelCheckpoint
#- PyTorch Lightning Version: 2.0.3
#- Lightning App Version: N/A
#- PyTorch Version: 2.0.1
#- Python version: 3.10.11
#- OS: Linux
#- CUDA/cuDNN version: 11.7
#- GPU models and configuration: 1x T4
#- How you installed Lightning: `pip`
#- Running environment of LightningApp (e.g. local, cloud): AWS Sagemaker

More info

Here is my current workaround for S3 checkpoints:

from typing import Any

import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint
from s3fs import S3FileSystem


class S3ModelCheckpoint(ModelCheckpoint):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        # Bug 1: force the S3 filesystem instead of the one inferred when dirpath is None.
        self._fs = S3FileSystem()

    def _find_last_checkpoints(self, trainer: "L.Trainer") -> set[str]:
        # Bug 2: re-attach the protocol that _fs.ls() strips from the returned paths.
        return {"s3://" + ckpt for ckpt in super()._find_last_checkpoints(trainer)}
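
The subclass is a drop-in replacement: pass it to the Trainer's callbacks in place of ModelCheckpoint, and resuming with ckpt_path="last" should then locate the S3 checkpoint.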

The Universal Pathlib project fixes the behavior of cloud paths so that the protocols aren't stripped off. It could be worth looking into to prevent these sorts of edge cases.
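
As a quick illustration (assuming the universal_pathlib and s3fs packages are installed; the bucket path is a placeholder), UPath keeps the protocol when converted back to a string:

from upath import UPath

ckpt_dir = UPath("s3://my-bucket/logs/logger_name/version_0/checkpoints")
print(str(ckpt_dir / "last.ckpt"))  # 's3://my-bucket/logs/logger_name/version_0/checkpoints/last.ckpt'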

cc @awaelchli
