Edge case causes incorrect filesystem to be selected for finding cloud checkpoints #17912

@schmidt-ai

Bug description

When both of the following happen together:

  1. a logger is used with a cloud (e.g. s3:// or gcs:// protocol) save dir
  2. a ModelCheckpoint is used without passing a dirpath

The desired behaviors are:

  1. the checkpoint directory is resolved (via ModelCheckpoint.__resolve_ckpt_dir) to $logger.save_dir/$logger.name/$logger.version/checkpoints, and the ModelCheckpoint callback saves checkpoints there.
  2. ModelCheckpoint._find_last_checkpoints will find $logger.save_dir/$logger.name/$logger.version/checkpoints/last.ckpt. It will first check whether that path exists on the filesystem instantiated in ModelCheckpoint.__init_ckpt_dir.

Desired behavior 1 works, 2 does not. There are two bugs:

  1. ModelCheckpoint.__init_ckpt_dir selects the wrong filesystem when dirpath is None, so ModelCheckpoint._find_last_checkpoints cannot find the cloud filepaths.
  2. Even if the correct ModelCheckpoint._fs were used, _find_last_checkpoints returns a set of paths with their protocols stripped (due to the call to _fs.ls). This causes _CheckpointConnector._parse_ckpt_path to also select the wrong filesystem, resulting in no checkpoints found (see the sketch after this list).
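
To make bug 2 concrete, here is a minimal sketch (not Lightning code) of how fsspec's ls strips the protocol; the bucket and prefix are placeholders and assume s3fs is installed with valid credentials:

from s3fs import S3FileSystem

fs = S3FileSystem()
# ls() returns bucket-relative keys without the "s3://" prefix, so any code that
# later re-infers the filesystem from these paths falls back to the local one.
paths = fs.ls("s3://my-bucket/logs/my_model/version_0/checkpoints")
print(paths)  # e.g. ['my-bucket/logs/my_model/version_0/checkpoints/last.ckpt']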

What version are you seeing the problem on?

v2.0, but likely also present in other versions

How to reproduce the bug

  1. Use a logger with a cloud save dir
  2. Create some cloud checkpoint, e.g. s3://.../logger_name/logger_version/checkpoints/last.ckpt
  3. From a new job, try to resume training using ckpt_path="last"
  4. A warning is emitted saying that Lightning could not find the checkpoint (a full sketch of this setup follows below)
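
For reference, a minimal sketch of such a setup, assuming s3fs is installed; the bucket name, MyLightningModule, and MyDataModule are placeholders:

import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import TensorBoardLogger

# Logger with a cloud save dir; ModelCheckpoint without a dirpath.
logger = TensorBoardLogger(save_dir="s3://my-bucket/logs", name="logger_name")
ckpt_cb = ModelCheckpoint(save_last=True)

# First job: trains and writes .../logger_name/<version>/checkpoints/last.ckpt to S3.
trainer = L.Trainer(logger=logger, callbacks=[ckpt_cb], max_epochs=1)
trainer.fit(MyLightningModule(), datamodule=MyDataModule())

# New job: resuming from "last" emits the warning below instead of loading the checkpoint.
trainer = L.Trainer(logger=logger, callbacks=[ckpt_cb], max_epochs=2)
trainer.fit(MyLightningModule(), datamodule=MyDataModule(), ckpt_path="last")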

Error messages and logs

UserWarning: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.

Environment

Current environment
#- Lightning Component: ModelCheckpoint
#- PyTorch Lightning Version: 2.0.3
#- Lightning App Version: N/A
#- PyTorch Version: 2.0.1
#- Python version: 3.10.11
#- OS: Linux
#- CUDA/cuDNN version: 11.7
#- GPU models and configuration: 1x T4
#- How you installed Lightning: `pip`
#- Running environment of LightningApp (e.g. local, cloud): AWS Sagemaker

More info

Here is my current workaround for S3 checkpoints:

from typing import Any

import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint
from s3fs import S3FileSystem


class S3ModelCheckpoint(ModelCheckpoint):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        # Bug 1: force the S3 filesystem instead of the one inferred when dirpath is None.
        self._fs = S3FileSystem()

    def _find_last_checkpoints(self, trainer: "L.Trainer") -> set[str]:
        # Bug 2: re-attach the protocol that _fs.ls() strips from the returned paths.
        return {"s3://" + ckpt for ckpt in super()._find_last_checkpoints(trainer)}
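
The subclass is a drop-in replacement: pass it to the Trainer's callbacks in place of ModelCheckpoint, and resuming with ckpt_path="last" should then locate the S3 checkpoint.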

The Universal Pathlib project fixes the behavior of cloud paths so that the protocols aren't stripped off. It could be worth looking into to prevent these sorts of edge cases.
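
As a quick illustration (assuming the universal_pathlib and s3fs packages are installed; the bucket path is a placeholder), UPath keeps the protocol when converted back to a string:

from upath import UPath

ckpt_dir = UPath("s3://my-bucket/logs/logger_name/version_0/checkpoints")
print(str(ckpt_dir / "last.ckpt"))  # 's3://my-bucket/logs/logger_name/version_0/checkpoints/last.ckpt'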

cc @awaelchli
