### Bug description
When both of the following happen together:
- a logger is used with a cloud (e.g. `s3://` or `gcs://` protocol) save dir
- a `ModelCheckpoint` is used without passing a `dirpath`
The desired behaviors are:
1. The checkpoint directory is resolved (via `ModelCheckpoint.__resolve_ckpt_dir`) to `$logger.save_dir/$logger.name/$logger.version/checkpoints`, and the `ModelCheckpoint` callback saves checkpoints there.
2. `ModelCheckpoint._find_last_checkpoints` finds `$logger.save_dir/$logger.name/$logger.version/checkpoints/last.ckpt`. It will first check whether that path exists on the filesystem instantiated in `ModelCheckpoint.__init_ckpt_dir`.
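As a rough illustration of desired behavior 1, the resolution amounts to joining the logger attributes with `/`. The helper below is hypothetical (not Lightning's actual implementation); it just mirrors the path layout described above:

```python
def resolve_ckpt_dir(save_dir: str, name: str, version: str) -> str:
    # Cloud URLs must be joined with plain "/" rather than os.path.join,
    # which could mangle the "s3://" prefix on some platforms.
    return "/".join([save_dir.rstrip("/"), name, str(version), "checkpoints"])

print(resolve_ckpt_dir("s3://bucket/logs", "my_exp", "0"))
# s3://bucket/logs/my_exp/0/checkpoints
```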
Desired behavior 1 works, but 2 does not. There are two bugs:
- `ModelCheckpoint.__init_ckpt_dir` will select the wrong filesystem when `dirpath` is `None`, causing `ModelCheckpoint._find_last_checkpoints` to not find the cloud filepaths.
- Even if the correct `ModelCheckpoint._fs` were used, `_find_last_checkpoints` returns a set of paths with their protocols stripped (due to the call to `_fs.ls`). This causes `_CheckpointConnector._parse_ckpt_path` to then also select the wrong filesystem, resulting in no checkpoints being found.
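A minimal, self-contained sketch of the second failure mode. The two helpers below are hypothetical stand-ins that mimic fsspec-style behavior (they are not the actual library code): once `ls` strips the protocol, any later protocol-based filesystem selection falls back to the local filesystem.

```python
from urllib.parse import urlsplit

def strip_protocol(path: str) -> str:
    # Mimics how fs.ls() returns paths without their protocol prefix
    parts = urlsplit(path)
    return parts.netloc + parts.path if parts.scheme else path

def infer_protocol(path: str) -> str:
    # Mimics protocol-based filesystem selection; no scheme -> local
    return urlsplit(path).scheme or "file"

ckpt = "s3://bucket/logs/name/0/checkpoints/last.ckpt"
stripped = strip_protocol(ckpt)  # "bucket/logs/name/0/checkpoints/last.ckpt"

print(infer_protocol(ckpt))      # s3
print(infer_protocol(stripped))  # file  <- wrong filesystem, checkpoint not found
```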
### What version are you seeing the problem on?
v2.0, but likely also present in others
### How to reproduce the bug
- Use a logger with a cloud save dir.
- Create some cloud checkpoint, e.g. `s3://.../logger_name/logger_version/checkpoints/last.ckpt`.
- From a new job, try to resume training using `ckpt_path="last"`.
- A warning will be emitted about how Lightning couldn't find the checkpoint.
### Error messages and logs

```
UserWarning: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.
```
### Environment
- Lightning Component: ModelCheckpoint
- PyTorch Lightning Version: 2.0.3
- Lightning App Version: N/A
- PyTorch Version: 2.0.1
- Python version: 3.10.11
- OS: Linux
- CUDA/cuDNN version: 11.7
- GPU models and configuration: 1x T4
- How you installed Lightning: `pip`
- Running environment of LightningApp (e.g. local, cloud): AWS Sagemaker
### More info
Here is my current workaround for S3 checkpoints:
```python
from typing import Any

import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint
from s3fs import S3FileSystem


class S3ModelCheckpoint(ModelCheckpoint):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        # Override the local filesystem that __init_ckpt_dir wrongly selects
        self._fs = S3FileSystem()

    def _find_last_checkpoints(self, trainer: "L.Trainer") -> set[str]:
        # _fs.ls() strips the "s3://" protocol, so add it back
        return {"s3://" + ckpt for ckpt in super()._find_last_checkpoints(trainer)}
```
The Universal Pathlib project fixes the behavior of cloud paths so that the protocols aren't stripped off. It could be worth looking into to prevent these sorts of edge cases from occurring.
cc @awaelchli