
Commit 5b2de08

Binh Tang authored
Add scripts to convert and load Metaseq checkpoints with FasterTransformer (#671)
Co-authored-by: Binh Tang <[email protected]>
1 parent cb06c97 commit 5b2de08

5 files changed: 423 additions & 0 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ The OPT models are now supported in the [Colossal-AI](https://github.com/hpcaite

The OPT 125M--66B models can be executed with [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models. The project integrates the [SmoothQuant](https://github.com/mit-han-lab/smoothquant) technique to allow 8-bit quantization of OPT models. See the [usage example](https://opennmt.net/CTranslate2/guides/transformers.html#opt) to get started.

### Using OPT with FasterTransformer

The OPT models can be served with [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), a highly optimized inference framework written and maintained by NVIDIA. We provide instructions to convert OPT checkpoints into FasterTransformer format and [a usage example](docs/faster-transformer.md) with some benchmark results.
## Getting Started in Metaseq
Follow [setup instructions here](docs/setup.md) to get started.

docs/faster-transformer.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@

## FasterTransformer

As an alternative to the [API](api.md) provided by Metaseq, you can serve models locally using [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), an inference framework written and maintained by NVIDIA. The library includes more advanced inference optimizations and is compatible with models trained with Metaseq (e.g. OPT).

### Run FasterTransformer with Metaseq Checkpoints

We provide [a script](https://github.com/facebookresearch/metaseq/blob/main/metaseq/scripts/convert_metaseq_ft.py) to convert OPT checkpoints directly into FasterTransformer format. The script expects the input checkpoints to contain unflattened, FSDP-consolidated model weights (see the script [`reshard_fsdp.py`](https://github.com/facebookresearch/metaseq/blob/main/metaseq/scripts/reshard_fsdp.py) for related information) and maps each Metaseq model parallel part to its FasterTransformer counterpart.
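
After conversion (see the command further below), each model parallel part is a standalone `part-{i}.pt` file that `interactive_ft.py` loads with `torch.load`, expecting a `weights` list of tensors and, for INT8 serving, optional `int8_weights`/`int8_scales` lists. The following is only a minimal sanity-check sketch; the checkpoint path is an example and should be adjusted to wherever the conversion script wrote its output.

```python
import torch

# Hypothetical path; point this at the output of convert_metaseq_ft.py.
part = torch.load("checkpoints/opt-125m-ft-mp2/part-0.pt", map_location="cpu")

# interactive_ft.py reads a list of tensors under "weights", plus optional
# "int8_weights"/"int8_scales" lists when INT8 mode is enabled.
print(type(part["weights"]), len(part["weights"]))
for name in ("int8_weights", "int8_scales"):
    if name in part:
        print(name, len(part[name]))
print(sum(w.numel() for w in part["weights"]), "parameters in this model parallel part")
```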

We also include [an interactive script](https://github.com/facebookresearch/metaseq/blob/main/metaseq/cli/interactive_ft.py) that demonstrates how to run FasterTransformer with the converted checkpoints. Please see the detailed instructions below.

```bash
# Clone metaseq and download metaseq checkpoints
SRC_DIR="${HOME}/metaseq"
git clone https://github.com/facebookresearch/metaseq.git "${SRC_DIR}"

CKPT_DIR="${HOME}/checkpoints"
mkdir -p "${CKPT_DIR}/opt-125m"
wget https://github.com/facebookresearch/metaseq/raw/main/projects/OPT/assets/gpt2-merges.txt -P "${CKPT_DIR}"
wget https://github.com/facebookresearch/metaseq/raw/main/projects/OPT/assets/gpt2-vocab.json -P "${CKPT_DIR}"
for i in {0..1}; do wget "https://dl.fbaipublicfiles.com/opt/v1_20220502/125m/reshard-model_part-${i}.pt" -P "${CKPT_DIR}/opt-125m"; done

# Install FasterTransformer
nvidia-docker run -tid --rm --shm-size 5g --name ft \
    -w "${HOME}" -e SRC_DIR="${SRC_DIR}" -e CKPT_DIR="${CKPT_DIR}" \
    -v "${SRC_DIR}:${SRC_DIR}" -v "${CKPT_DIR}:${CKPT_DIR}" \
    nvcr.io/nvidia/pytorch:22.09-py3 bash
nvidia-docker exec -ti ft bash
git clone -b v5.3 https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build && cd FasterTransformer/build
git submodule init && git submodule update
# Set -DSM to the compute capability of your GPUs (e.g. 80 for A100).
cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -DBUILD_MIXED_GEMM=ON ..
make -j"$(grep -c ^processor /proc/cpuinfo)"

# Convert metaseq checkpoints
pip install fire tokenizers
python "${SRC_DIR}/metaseq/scripts/convert_metaseq_ft.py" \
    --input "${CKPT_DIR}/opt-125m/reshard-no-os/reshard-model_part-*.pt" \
    --output "${CKPT_DIR}/opt-125m-ft-mp2/part-{i}.pt" --dtype fp16

# Run interactive script
export FT_PATH="lib/libth_transformer.so"
mpirun -n 2 --allow-run-as-root python "${SRC_DIR}/metaseq/cli/interactive_ft.py" \
    --num-layers 12 --num-heads 12 --embed-size 768 --vocab-size 50272 \
    --vocab-file "${CKPT_DIR}/gpt2-vocab.json" --merges-file "${CKPT_DIR}/gpt2-merges.txt" \
    --weight-path "${CKPT_DIR}/opt-125m-ft-mp2" --dtype fp16 \
    --output-length 128 --top-k 20 --top-p 0.95 --temperature 0.7 --repetition-penalty 1.2
```

### Benchmark Results with OPT Models

We benchmark FasterTransformer with OPT models using two common metrics: latency (milliseconds per generated token) and throughput (queries per second). The following plots show latency and throughput as we generate sequences of 256 tokens given prompts of 4 tokens each on [p4de.24xlarge](https://aws.amazon.com/ec2/instance-types/p4/) nodes (A100 80GB GPUs). The batch sizes are powers of two, ranging from 1 to 1024 (or the largest batch size that fits before running out of memory). A timing sketch based on the `measure_time` helper in `interactive_ft.py` follows the figure.

![](./images/opt-30b-175b.png)
<p align="center">Throughput and latency of OPT-30B and OPT-175B when served with FasterTransformer using 2-way (MP2) or 8-way (MP8) model parallelism with or without per-channel weight-only INT8 quantization <a href="https://arxiv.org/abs/2109.12948">(Bondarenko et al., 2021)</a>.</p>
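
The latency and throughput numbers above are derived from raw generation timings; `interactive_ft.py` ships a small `measure_time` helper that returns CUDA-event time in milliseconds for exactly this purpose. The sketch below only illustrates how the two metrics relate to such a timing; the helper functions and example numbers are illustrative assumptions, not the benchmarking harness used for the plots.

```python
# Illustrative only: derive the two reported metrics from a raw generation time.
# `total_ms` would come from something like measure_time(generate, prompts, output_length=256).


def latency_ms_per_token(total_ms: float, output_length: int = 256) -> float:
    # Latency: milliseconds spent per generated token, independent of batch size.
    return total_ms / output_length


def throughput_qps(total_ms: float, batch_size: int) -> float:
    # Throughput: completed queries (sequences) per second for the whole batch.
    return batch_size / (total_ms / 1000.0)


# Example with assumed numbers: a batch of 32 prompts finishing in 4,000 ms
# yields 15.625 ms/token and 8 queries/second.
print(latency_ms_per_token(4000.0), throughput_qps(4000.0, 32))
```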

docs/images/opt-30b-175b.png

229 KB

metaseq/cli/interactive_ft.py

Lines changed: 215 additions & 0 deletions
@@ -0,0 +1,215 @@
import argparse
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from tokenizers import Tokenizer, ByteLevelBPETokenizer
from typing import Any, List, Optional

# The FasterTransformer PyTorch op is provided by a shared library whose path
# must be supplied through the FT_PATH environment variable.
try:
    torch.classes.load_library(os.environ.get("FT_PATH"))
except Exception:
    raise ImportError(
        "Please install FasterTransformer and provide a path to the binary "
        "`libth_transformer.so` via the environment variable `FT_PATH`."
    )

model = None
tokenizer = None
device = None

BOS_TOKEN = 0
PAD_TOKEN = 1
EOS_TOKEN = 2
UNK_TOKEN = 3


@torch.inference_mode()
def generate(
    inputs: List[List[int]],
    output_length: int,
    beam_width: int = 1,
    top_k: Optional[int] = 0,
    top_p: Optional[float] = 1.0,
    diversity_rate: Optional[float] = None,
    temperature: Optional[float] = 1.0,
    len_penalty: Optional[float] = None,
    repetition_penalty: Optional[float] = 1.0,
    presence_penalty: Optional[float] = None,
    random_seed: Optional[int] = 0,
    min_length: Optional[int] = None,
    bad_words_list: Optional[torch.Tensor] = None,
    return_cum_log_probs: Optional[int] = 0,
) -> List[Any]:
    # OPT uses the EOS token (id 2) as the start-of-sequence token.
    inputs = [[EOS_TOKEN] + toks for toks in inputs]
    inputs = [torch.tensor(toks, dtype=torch.int32, device=device) for toks in inputs]
    lengths = torch.tensor([len(t) for t in inputs], dtype=torch.int32, device=device)
    inputs = nn.utils.rnn.pad_sequence(inputs, True, padding_value=PAD_TOKEN)

    # The FasterTransformer op expects sampling hyperparameters as tensors.
    if top_k is not None:
        top_k = torch.tensor([top_k], dtype=torch.int32)
    if top_p is not None:
        top_p = torch.tensor([top_p], dtype=torch.float32)
    if diversity_rate is not None:
        diversity_rate = torch.tensor([diversity_rate], dtype=torch.float32)
    if temperature is not None:
        temperature = torch.tensor([temperature], dtype=torch.float32)
    if len_penalty is not None:
        len_penalty = torch.tensor([len_penalty], dtype=torch.float32)
    if repetition_penalty is not None:
        repetition_penalty = torch.tensor([repetition_penalty], dtype=torch.float32)
    if presence_penalty is not None:
        presence_penalty = torch.tensor([presence_penalty], dtype=torch.float32)
    if random_seed is not None:
        random_seed = torch.tensor([random_seed], dtype=torch.int64)
    if min_length is not None:
        min_length = torch.tensor([min_length], dtype=torch.int64)

    outputs, output_lengths = model.forward(
        inputs,
        lengths,
        output_length,
        beam_width,
        top_k,
        top_p,
        diversity_rate,
        temperature,
        len_penalty,
        repetition_penalty,
        presence_penalty,
        min_length,
        random_seed,
        bad_words_list,
        return_cum_log_probs,
    )

    results = []
    beam_idx = 0
    special = outputs.new_tensor([BOS_TOKEN, PAD_TOKEN, EOS_TOKEN, UNK_TOKEN])
    for output, output_len in zip(outputs, output_lengths):
        # Keep tokens up to the first special token generated after the start token.
        mask = ~torch.isin(output[beam_idx], special)
        mask[1:] = mask[1:].cummin(dim=0)[0]

        tokens = output[beam_idx][1 : output_len[beam_idx]]
        tokens = tokens[mask[1 : output_len[beam_idx]]]
        results.append({"text": tokenizer.decode(tokens.tolist())})
    return [results]


def main(args: argparse.Namespace) -> None:
    global model, tokenizer, device
    # One process per model parallel partition; ranks are pinned to local GPUs.
    dist.init_process_group(backend="mpi")
    world_size = dist.get_world_size()
    rank = dist.get_rank() % world_size
    device = torch.device(f"cuda:{dist.get_rank() % torch.cuda.device_count()}")
    torch.cuda.set_device(device)

    if args.tokenizer_file is not None:
        tokenizer = Tokenizer.from_file(args.tokenizer_file)
    else:
        tokenizer = ByteLevelBPETokenizer(args.vocab_file, args.merges_file)

    torch_dtypes = {"fp16": torch.half, "bf16": torch.bfloat16, "fp32": torch.float}
    dtype = torch_dtypes[args.dtype]

    # Each rank loads its own converted model parallel partition.
    state_dict = torch.load(f"{args.weight_path}/part-{rank}.pt")
    weights = [w.to(device, dtype) for w in state_dict["weights"]]
    int8_weights, int8_scales = [], []
    if args.int8_mode != 0 and {"int8_weights", "int8_scales"} <= state_dict.keys():
        int8_weights = [w.to(device=device) for w in state_dict["int8_weights"]]
        int8_scales = [w.to(device=device) for w in state_dict["int8_scales"]]

    kwargs = {
        "head_num": args.num_heads,
        "size_per_head": args.embed_size // args.num_heads,
        "inter_size": 4 * args.embed_size,
        "layer_num": args.num_layers,
        "expert_num": 0,
        "moe_k": 0,
        "moe_layer_index": [],
        "vocab_size": args.vocab_size,
        "start_id": 2,
        "end_id": 2,
        "tensor_para_size": world_size,
        "pipeline_para_size": 1,
        "int8_mode": args.int8_mode,
        "layernorm_eps": 1e-5,
        "layernorm_type": "pre_layernorm",
        "activation_type": "Relu",
        "has_positional_encoding": True,
        "has_pre_decoder_layernorm": False,
        "has_post_decoder_layernorm": True,
        "has_adapters": False,
        "adapter_inter_size": 0,
        "use_attention_linear_bias": False,
        "weights": weights,
        "int8_weights": int8_weights,
        "scale": int8_scales,
        "shared_contexts_ratio": 1.0,
    }
    model = torch.classes.FasterTransformer.ParallelGptOp(*kwargs.values())

    object = [None]
    while True:
        # Rank 0 reads the prompt and broadcasts the tokenized input to all ranks.
        if torch.distributed.get_rank() == 0:
            prompt = input("\033[32mPrompt: \033[0;1m").rstrip()
            if not prompt:
                continue
            object = [[tokenizer.encode(prompt).ids]]

        dist.broadcast_object_list(object, src=0)
        output = generate(
            object[0],
            output_length=args.output_length,
            beam_width=args.beam_width,
            top_k=args.top_k,
            top_p=args.top_p,
            diversity_rate=args.diversity_rate,
            temperature=args.temperature,
            len_penalty=args.len_penalty,
            repetition_penalty=args.repetition_penalty,
            random_seed=0,
        )
        if torch.distributed.get_rank() == 0:
            print(f"Output: {output[0][0]['text']}")


def measure_time(func, *args, **kwargs):
    # Returns elapsed GPU time in milliseconds measured with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    func(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-layers", type=int, default=12)
    parser.add_argument("--num-heads", type=int, default=12)
    parser.add_argument("--embed-size", type=int, default=768)
    parser.add_argument("--vocab-size", type=int, default=50272)

    parser.add_argument("--vocab-file", type=str)
    parser.add_argument("--merges-file", type=str)
    parser.add_argument("--tokenizer-file", type=str, default=None)
    parser.add_argument("--weight-path", type=str)
    parser.add_argument("--dtype", choices=["fp32", "fp16", "bf16"], default="fp16")
    parser.add_argument("--int8-mode", type=int, default=0)

    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--output-length", type=int, default=256)
    parser.add_argument("--beam-width", type=int, default=1)
    parser.add_argument("--top-k", type=int, default=20)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--len-penalty", type=float, default=0.0)
    parser.add_argument("--diversity-rate", type=float, default=0.0)
    parser.add_argument("--repetition-penalty", type=float, default=1.2)
    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    main(args)
