---
title: "DataStates-LLM Checkpointing Engine"
tags: asynchronous checkpointing for minimizing I/O overheads.
---
This tutorial shows how to use [DataStates-LLM](https://github.com/DataStates/datastates-llm) for asynchronous checkpointing with DeepSpeed. DataStates-LLM introduces a lazy asynchronous checkpointing mechanism tailored for LLMs that minimizes I/O overhead and improves training efficiency.

## Overview of DataStates-LLM

DataStates-LLM is designed to address the challenges of frequent checkpointing in LLM training by introducing a lazy asynchronous multi-level approach. It leverages the immutability of model parameters and optimizer states during the forward and backward passes to perform non-blocking data transfers, thereby reducing interference with the training process. This method has demonstrated up to 48x faster checkpointing and 2.2x faster end-to-end training compared to traditional approaches, as described in [DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models](https://arxiv.org/abs/2406.10707).

## Prerequisites

Before integrating DataStates-LLM with DeepSpeed, ensure the following:

- **DeepSpeed Installation**: DeepSpeed should be installed in your environment. If not, refer to the [DeepSpeed Getting Started Guide](https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/getting-started.md) for installation instructions.

- **DataStates-LLM Repository**: Access the DataStates-LLM source code from its [GitHub repository](https://github.com/DataStates/datastates-llm) and follow the installation instructions provided therein.

## Configuring DeepSpeed for DataStates-LLM

To enable DataStates-LLM's asynchronous checkpointing within DeepSpeed, modify the `deepspeed_config.json` file to include a `datastates_ckpt` section. Below is an example configuration:

```json
{
    // ... other DeepSpeed configuration options
    "datastates_ckpt": {
        "host_cache_size": 16,
        "parser_threads": 8
    }
}
```
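If you build your DeepSpeed configuration programmatically, the same section can be emitted from Python. The sketch below is illustrative, not an official API: the keys outside `datastates_ckpt` are placeholders for your usual DeepSpeed options.

```python
import json

# Illustrative sketch: assemble the config as a dict and write it out.
# "train_batch_size" is a placeholder for your existing DeepSpeed options.
config = {
    "train_batch_size": 32,
    "datastates_ckpt": {
        "host_cache_size": 16,  # GB of pinned host memory for async flushes
        "parser_threads": 8,    # threads parsing checkpoint requests in parallel
    },
}

# Basic sanity checks before writing the file.
ckpt = config["datastates_ckpt"]
assert ckpt["host_cache_size"] > 0, "host_cache_size must be positive"
assert ckpt["parser_threads"] >= 1, "need at least one parser thread"

with open("deepspeed_config.json", "w") as f:
    json.dump(config, f, indent=4)
```

Note that, unlike the commented example above, a file written this way is strict JSON with no `//` comments, so any JSON parser can read it back.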

### Configuration Parameters

- **`host_cache_size`**: Specifies the amount of pinned host memory (in gigabytes) reserved for asynchronous checkpoint flushing. Adjust this value based on your system's memory capacity and the size of your model checkpoints.

- **`parser_threads`**: Determines the number of threads dedicated to parsing checkpoint file requests in parallel. Increasing this value can enhance parsing throughput but may also increase CPU utilization.

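To pick a starting value for `host_cache_size`, one rough heuristic (an assumption of this tutorial, not official guidance) is to budget for the in-memory footprint of one checkpoint plus headroom. The sketch assumes roughly 16 bytes per parameter, as in fp16 weights plus fp32 optimizer state in mixed-precision training; under ZeRO stage 1 each rank only flushes its shard, so treat the result as an upper bound.

```python
# Rough sizing heuristic for host_cache_size (GB). The 16 bytes/param
# figure assumes fp16 weights + fp32 optimizer state; this is an
# illustrative assumption, not a DataStates-LLM requirement.
def estimate_host_cache_gb(num_params, bytes_per_param=16, headroom=1.25):
    """Suggest a pinned-host-cache size in GB with safety headroom."""
    raw_bytes = num_params * bytes_per_param * headroom
    return max(1, round(raw_bytes / 1024**3))

print(estimate_host_cache_gb(1_000_000_000))  # ~19 GB for a 1B-parameter model
```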
## Implementing DataStates-LLM in Your Training Script

After enabling DataStates checkpointing in `deepspeed_config.json`, configure the checkpointing frequency with the `--save-interval` command-line parameter, which specifies the number of iterations between checkpoints.

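The cadence driven by `--save-interval` can be sketched as the loop below. `DummyEngine` is a stand-in so the example runs without DeepSpeed installed; a real DeepSpeed engine exposes a `save_checkpoint(save_dir, tag=...)` method, which returns quickly when DataStates asynchronous checkpointing is enabled and flushes to storage in the background.

```python
# Sketch of checkpointing every `save_interval` iterations.
# DummyEngine is a stand-in for a DeepSpeed engine; it only records
# that a checkpoint was requested instead of snapshotting state.
class DummyEngine:
    def __init__(self):
        self.saved_tags = []

    def save_checkpoint(self, save_dir, tag):
        self.saved_tags.append(tag)

def train(engine, total_steps, save_interval):
    for step in range(1, total_steps + 1):
        # ... forward, backward, and engine.step() would go here ...
        if save_interval > 0 and step % save_interval == 0:
            engine.save_checkpoint("checkpoints/", tag=f"step_{step}")

engine = DummyEngine()
train(engine, total_steps=10, save_interval=3)
print(engine.saved_tags)  # ['step_3', 'step_6', 'step_9']
```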
## Performance Results

The checkpoint acceleration achieved by DataStates-LLM for various models is shown in the figures below.

{: .align-center}

{: .align-center}


## Limitations and Ongoing Work

1. DataStates-LLM currently supports only the CUDA runtime on NVIDIA GPUs.

2. DataStates-LLM has only been tested with ZeRO stage 1, without offloading to any other tiers.

3. While the checkpoint layout of DataStates matches Hugging Face's [safetensors](https://huggingface.co/docs/safetensors/) format, it is not yet fully compatible with the safetensors library because of the pickled objects DeepSpeed requires during restart.

4. DataStates-LLM does not yet support universal or elastic checkpointing.

## Questions and Support

Please use the [DataStates-LLM GitHub repository](https://github.com/DataStates/datastates-llm) for any questions, issues, or feature requests.