Feature request
This RFC proposes integrating a lossless compression method called ZipNN into Hugging Face Transformers to reduce latency and traffic for downloading models. ZipNN is specifically designed for AI models, offering a model size reduction of 17% to over 50%, depending on the model format and compressibility. Additionally, it significantly reduces time for the user due to its fast decompression speed, allowing compressed models to be ready for use almost immediately without impacting model accuracy.
Motivation
According to an August 2024 LinkedIn post by Julien Chaumond, Hugging Face hosts 1.3M models with a cumulative storage footprint of 12 PB, and serves 1 billion daily requests, amounting to roughly 6 PB of network traffic per day.
Downloading large models from Hugging Face can be time-consuming: for example, downloading Llama-3.1-405B (812 GB) takes nearly a day on a 10 MB/s home connection (812 GB ÷ 10 MB/s ≈ 22.5 hours) and nearly 2 hours on a 125 MB/s high-bandwidth connection (≈ 1.8 hours). ZipNN could reduce this time by up to 33%, saving roughly 7.5 hours and 36 minutes, respectively.
Model Comparison Table
We took the 20 most downloaded models on Hugging Face as of late October 2024 (compression measured on 1 GB taken from the middle of each model):
17% Savings: 9 models
33% Savings: 3 models
50% or Greater Savings: 8 models
| Model Name | Format | Size | ZipNN Compression Remaining (%) |
|---|---|---|---|
| BAAI/bge-base-en-v1.5 | FP32 | 0.4GB | 42.2% |
| sentence-transformers/all-mpnet-base-v2 | FP32 | 0.4GB | 83% |
| nesaorg/benchmark_v0 | FP32 | 1.35GB | 82.38% |
| google-bert/bert-base-uncased | FP32 | 0.4GB | 83.17% |
| sentence-transformers/all-MiniLM-L6-v2 | FP32 | 0.09GB | 82.07% |
| Qwen/Qwen2.5-1.5B-Instruct | BF16 | 3GB | 66.86% |
| openai/whisper-large-v2 | FP32 | 6.1GB | 42.8% |
| FacebookAI/xlm-roberta-large | FP32 | 2.2GB | 42.9% |
| 1231czx/llama3_it_ultra_list_and_bold500 | BF16 | 16GB | 66.77% |
| openai/clip-vit-base-patch32 | FP32 | 0.6GB | 43% |
| jonatasgrosman/wav2vec2-large-xlsr-53-english | FP32 | 1.26GB | 82.96% |
| openai/clip-vit-base-patch16 | FP32 | 0.6GB | 51.24% |
| google/vit-base-patch16-224-in21k | FP32 | 0.4GB | 84% |
| FacebookAI/roberta-base | FP32 | 0.5GB | 43.9% |
| nesaorg/fc_8 | FP32 | 0.13GB | 82.52% |
| nesaorg/fc_6 | FP32 | 0.1GB | 82.2% |
| BAAI/bge-small-en-v1.5 | FP32 | 0.13GB | 42.9% |
| openai/clip-vit-large-patch14 | FP32 | 1.71GB | 42.97% |
| timm/resnet50.a1_in1k | FP32 | 0.1GB | 83.51% |
| meta-llama/Llama-3.1-405B | BF16 | 812GB | 66% |
Your contribution
ZipNN
ZipNN (the NN stands for Neural Network) is a lossless compression library tailored to neural networks. ZipNN compresses models by targeting the skewed distribution of the exponent bits in floating-point parameters, which is highly compressible. By isolating the exponents and applying entropy encoding with Huffman codes, ZipNN achieves efficient compression without the overhead of multi-byte repetition algorithms such as Lempel-Ziv. It further optimizes speed by skipping non-compressible segments and adapting its strategy to the model’s characteristics.
ZipNN repository: https://github.com/zipnn/zipnn
ZipNN arXiv paper: ZipNN: Lossless Compression for AI Models
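To make the intuition concrete, here is a small illustration (not ZipNN's actual implementation) of why the exponent bytes of floating-point weights compress so much better than the mantissa bytes; it uses zlib as a stand-in entropy coder and random BF16 values as a stand-in for trained weights:

```python
import zlib

import torch

# Stand-in for model parameters; real trained weights have an even more skewed exponent distribution.
weights = torch.randn(1_000_000).to(torch.bfloat16)
raw = weights.view(torch.uint8).reshape(-1, 2).numpy()  # two bytes per BF16 value

# On a little-endian machine, byte 1 holds the sign bit and most of the exponent bits.
mantissa_bytes = raw[:, 0].tobytes()  # mostly mantissa bits, close to random
exponent_bytes = raw[:, 1].tobytes()  # sign + exponent bits, highly skewed

for name, stream in [("mantissa bytes", mantissa_bytes), ("exponent bytes", exponent_bytes)]:
    ratio = len(zlib.compress(stream, 6)) / len(stream)
    print(f"{name}: compressed to {ratio:.0%} of original size")
```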
Comparing Speed and Compression ratio of different compression methods:
(Compression measured on 1 GB taken from the middle of each model.)
| Model Name | Format | Compression Method | Compression Remaining (%) | Compression Speed (GB/s) | Decompression Speed (GB/s) |
|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | BF16 | Zstd | 77.7% | 0.71 | 1.02 |
| meta-llama/Llama-3.1-8B-Instruct | BF16 | ZipNN | 66.4% | 1.15 | 1.65 |
| allenai/OLMo-1B-0724-hf | FP32 | Zstd | 92.3% | 0.97 | 1.02 |
| allenai/OLMo-1B-0724-hf | FP32 | ZipNN | 83.2% | 1.64 | 2.48 |
| FacebookAI/xlm-roberta-large | FP32 | Zstd | 57.4% | 0.18 | 0.77 |
| FacebookAI/xlm-roberta-large | FP32 | ZipNN | 42.9% | 0.83 | 1.41 |
User benefits
Figure 10 in the arXiv paper shows download and upload timings for three models, comparing the original and compressed versions and including decompression and compression times. Network speed is the primary factor affecting download and upload durations, and even for the less compressible models, users see reduced total latency once decompression and compression are included.
Link to Figure 10 from the arXiv paper
Usage
To get started, you can install the library directly from PyPI, call ZipNN directly from its Python API, or use the provided command-line wrapper scripts for single-file compression and decompression. Note: all ZipNN-compressed files use the ".znn" extension.
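As a rough sketch of a round trip through the Python API (hedged: only ZipNN(is_streaming=True) and decompress() appear verbatim elsewhere in this proposal; the default constructor, the compress() call, and the file name are assumptions):

```python
from zipnn import ZipNN

zn = ZipNN()  # default configuration; exact options are assumptions

with open("model.safetensors", "rb") as f:  # placeholder file name
    original = f.read()

compressed = zn.compress(original)    # lossless compression of the raw bytes
restored = zn.decompress(compressed)  # decompression must restore the bytes exactly
assert restored == original
```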
Hugging Face Plugin and compressed Models stored on Hugging Face
Plugin Usage
ZipNN has a plugin for the Hugging Face transformers library that can handle ZipNN-compressed models.
Using the default plugin behavior, the user keeps the compressed model in local storage; loading then includes a fast decompression phase on the CPU while the model stays compressed on disk.
What this means: each time the user loads the model, less data is transferred to the GPU cluster, with decompression happening on the CPU.
```python
from zipnn import zipnn_hf

zipnn_hf()
```
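For example, once the plugin is active, loading a ZipNN-compressed checkpoint through transformers can look like loading any other model (sketch only; the repository id below is a placeholder, not a real compressed checkpoint):

```python
from transformers import AutoModelForCausalLM
from zipnn import zipnn_hf

zipnn_hf()  # patch transformers so .znn checkpoint shards are decompressed on load

# Placeholder repository id, for illustration only.
model = AutoModelForCausalLM.from_pretrained("some-org/some-model-ZipNN-Compressed")
```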
Alternatively, to avoid future decompression, the user can save the model uncompressed in local storage, so that subsequent loads skip the decompression phase.
To compress and decompress files manually, simply run the scripts from the ZipNN repository (Link to scripts).
There are a few models compressed by ZipNN hosted on Hugging Face, for example:
compressed FacebookAI/roberta-base
compressed meta-llama/Llama-3.2-11B-Vision-Instruct
And a usage example: Usage Example Llama-3.2-11B
Upload compressed models to Hugging Face:
Download the scripts for compressing/decompressing AI models, compress the safetensors files, and push them with Git LFS:

```bash
wget -i https://raw.githubusercontent.com/zipnn/zipnn/main/scripts/scripts.txt && rm scripts.txt
python3 zipnn_compress_path.py safetensors --path .
git lfs install --force --local && # this reinstalls the LFS hooks
huggingface-cli lfs-enable-largefiles . && # needed if some files are bigger than 5GB
git push --force origin main
```
Current status
The code is ready for use with single-threaded compression and decompression on the CPU, and ZipNN already has a few users. The next version will support multi-threading on the CPU, with a future milestone targeting GPU implementation.
Proposed change:
Decompress any shard of a model that was previously compressed with ZipNN. This commit only extends the functionality of load_state_dict(), making sure to load the model and decompress it as efficiently as possible by decompressing in chunks and by avoiding unnecessary I/O requests.
In modeling_utils.load_state_dict():
```python
checkpoint_bytes = b""
if checkpoint_file.endswith(".znn"):
    output_file = checkpoint_file.replace(".znn", "")
    if not os.path.exists(output_file):
        # No decompressed copy on disk yet: decompress the .znn shard in memory.
        try:
            from zipnn import ZipNN
        except ImportError:
            raise ImportError("To load a zipped checkpoint file, you need to install zipnn.")
        znn = ZipNN(is_streaming=True)
        with open(checkpoint_file, "rb") as infile:
            chunk = infile.read()
            checkpoint_bytes += znn.decompress(chunk)
    else:
        # A decompressed copy already exists: read it directly and skip decompression.
        with open(output_file, "rb") as infile:
            checkpoint_bytes += infile.read()
```
This is a proof of concept, currently supporting only sharded models whose index.json has been modified to use .znn suffixes (as seen in this ZipNN-compressed Llama 3.2 example on Hugging Face), whether safetensors or any other file format. Support for all single files can readily be added by adding individual checks in modeling_utils.PreTrainedModel.from_pretrained() or by changing utils.hub.cached_file() to check for a .znn filepath.
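A hedged sketch (not actual Transformers code; the helper name is hypothetical) of the kind of check that from_pretrained() or cached_file() could perform for single, non-sharded files:

```python
import os

def resolve_possibly_compressed(checkpoint_file: str) -> str:
    """Prefer an existing .znn sibling so the patched load_state_dict() can decompress it."""
    znn_file = checkpoint_file + ".znn"
    if not os.path.exists(checkpoint_file) and os.path.exists(znn_file):
        return znn_file
    return checkpoint_file
```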
A working version that handles all edge cases can be found in ZipNN's zipnn_hf() plugin.
Additionally, to let users decompress only once, the plugin has a flag, zipnn_hf(replace_local_file=True), that saves the decompressed model locally in the cache, reorders the symlinks, and fixes any index.json accordingly if there is one. Equivalent functionality could be provided by adding a flag to from_pretrained().