| input tokens or prompt | use "--input-tokens" to set a fixed input prompt size, with <INPUT_LENGTH> in [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 32768, 130944]; if "--input-tokens" is not used, use "--prompt" to provide any other string as the input |
| output tokens | default: 32, use "--max-new-tokens" to choose any other size |
| batch size | default: 1, use "--batch-size" to choose any other size |
| token latency | enable "--token-latency" to print out the first or next token latency |
| generation iterations | use "--num-iter" and "--num-warmup" to control the repeated iterations of generation, default: 100-iter/10-warmup |
| streaming mode output | greedy search only (work with "--greedy"), use "--streaming" to enable the streaming generation output |
| KV Cache dtype | default: auto, use "--kv-cache-dtype=fp8_e5m2" to enable e5m2 KV Cache. More information refer to [vLLM FP8 E5M2 KV Cache](https://docs.vllm.ai/en/v0.6.6/quantization/fp8_e5m2_kvcache.html)|
| input mode | default: 0, use "--input-mode" to choose input mode for multimodal models. 0: language; 1: vision; 2: speech; 3: vision and speech |
| input images | default: None, use "--image-url" to choose the image file address for vision-text tasks |
| input audios | default: None, use "--audio" to choose the audio file address for speech tasks |
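
For instance, a single-instance benchmarking invocation combining several of these arguments might look like the following sketch (the model name and the concrete argument values are illustrative assumptions):

```bash
# Sketch only: single-instance benchmarking with a fixed-size input prompt.
# The model and the concrete values are assumptions; adjust them to your setup.
python run.py --benchmark -m meta-llama/Llama-2-7b-hf --dtype bfloat16 --ipex \
  --input-tokens 1024 --max-new-tokens 32 --batch-size 1 --token-latency \
  --num-iter 100 --num-warmup 10
```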
*Note:* You may need to log in to your Hugging Face account to access the model files. Please refer to [HuggingFace login](https://huggingface.co/docs/huggingface_hub/quick-start#login).

There are some model-specific requirements to be aware of, as follows:
- For Llava models from remote hub, additional setup is required, i.e., `bash ./tools/prepare_llava.sh`.
## 2.3 Instructions for Running Multimodal LLMs
Multimodal LLMs are large language models capable of processing multiple types of inputs, like images and audio, in addition to text prompts.
We have optimized the performance of some popular multimodal LLMs like `microsoft/Phi-4-multimodal-instruct` and `meta-llama/Llama-3.2-11B-Vision-Instruct`, which can be showcased with the provided `run.py` script.
In the commands, the additional arguments that need to be specified are highlighted here:

| Special args for multimodal | Notes |
|---|---|
| input mode | Use "--input-mode" to choose the input mode for multimodal models. 0: language; 1: vision; 2: speech; 3: vision and speech |
| input image | Use "--image-url" to specify the image link address or local path for vision-text tasks |
| input audio | Use "--audio" to specify the local path of the audio file for speech tasks |

Meanwhile, for multimodal tasks we need to set the text prompt and bind it with the input image/audio.
The binding is realized with special tokens, like the image tag `<|image|>` and the audio tag `<|audio|>`.
We provide the following example commands to showcase the argument settings in detail.
### 2.3.1 Phi-4-multimodal-instruct
You can download the sample image and audio to your local folder beforehand.
Also, the `peft` package is required for running the model.
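
As a minimal preparation sketch (the download URLs are assumptions chosen to match the `australia.jpg` and `Trailer.wav` file names used in the commands below; any local image and audio file will work), you could run:

```bash
# Install the peft dependency required by Phi-4-multimodal-instruct.
pip install peft

# Fetch a sample image and a sample audio clip into the current folder.
# The URLs are assumptions; replace them with any image/audio of your choice.
wget https://www.ilankelman.org/stopsigns/australia.jpg
wget https://voiceage.com/wbsamples/in_mono/Trailer.wav
```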
We provide example commands running in BF16 precision for all the input modes.
The OMP thread number and `numactl` setup parts are omitted.

Example command for pure text input
```bash
python run.py --input-mode 0 --benchmark -m microsoft/Phi-4-multimodal-instruct --ipex --token-latency --greedy --dtype bfloat16 --max-new-tokens 128 --prompt "<|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>"
```
Example command for image comprehension
```bash
python run.py --input-mode 1 --benchmark -m microsoft/Phi-4-multimodal-instruct --ipex --token-latency --greedy --dtype bfloat16 --max-new-tokens 128 --prompt "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>" --image-url australia.jpg
```
Example command for speech comprehension
```bash
python run.py --input-mode 2 --benchmark -m microsoft/Phi-4-multimodal-instruct --ipex --token-latency --greedy --dtype bfloat16 --max-new-tokens 128 --prompt "<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>" --audio Trailer.wav
```
Example command for image and speech comprehension
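
Following the same pattern as the commands above, a combined vision-and-speech invocation could look like the sketch below (the prompt wording is an illustrative assumption; adapt it to your own task):

```bash
# Sketch only: combined image + audio input (--input-mode 3); the prompt text is an assumption.
python run.py --input-mode 3 --benchmark -m microsoft/Phi-4-multimodal-instruct --ipex --token-latency --greedy --dtype bfloat16 --max-new-tokens 128 --prompt "<|user|><|image_1|><|audio_1|>Answer the question in the audio based on the image.<|end|><|assistant|>" --image-url australia.jpg --audio Trailer.wav
```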
### 2.3.2 Llama-3.2-11B-Vision-Instruct

Example command for distributed image comprehension with DeepSpeed AutoTP and INT8 weight-only quantization

```bash
deepspeed --bind_cores_to_rank run.py --input-mode 1 --benchmark -m meta-llama/Llama-3.2-11B-Vision-Instruct --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --prompt "<|image|>Describe the contents of this image." --image-url australia.jpg --autotp --shard-model
```
## 2.4 Instructions for Running LLM with Intel® Xeon® CPU Max Series
Intel® Xeon® CPU Max Series are equipped with high bandwidth memory (HBM), which further accelerates LLM inference. For the common case that HBM and DDR are both installed in a Xeon® CPU Max Series server, the memory mode can be configured to Flat Mode or Cache Mode.
Details about memory modes can be found at Section 3.1 in [the Xeon® CPU Max Series Configuration Guide](https://cdrdv2-public.intel.com/769060/354227-intel-xeon-cpu-max-series-configuration-and-tuning-guide.pdf).
### 2.4.1 Single Instance Inference with Xeon® CPU Max Series
#### 2.4.1.1 Cache Mode HBM
In cache mode, only DDR address space is visible to software and HBM functions as a transparent memory-side cache for DDR.
Therefore the usage is the same as [the common usage](#221-run-generation-with-one-instance).
#### 2.4.1.2 Flat Mode HBM
In flat mode, HBM and DDR are exposed to software as separate address spaces.
Therefore, we need to check the `HBM_NODE_INDEX` of interest with commands like `lscpu`, and then bind the LLM inference command to that HBM node.
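
A minimal sketch of such an invocation (the thread count, core list, and model name are assumptions; `<HBM_NODE_INDEX>` is the HBM NUMA node identified above):

```bash
# Sketch only: prefer allocations on the HBM NUMA node in flat mode.
# Thread count, core list, and model are assumptions; adapt them to your system.
OMP_NUM_THREADS=56 numactl -p <HBM_NODE_INDEX> -C 0-55 python run.py --benchmark -m meta-llama/Llama-2-7b-hf --dtype bfloat16 --ipex --token-latency
```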