
Commit 0e0786a

update llama.cpp related quickstart with rebased llama.cpp (#12996)
* update doc with rebased llama.cpp
* revert table of contents
* update demo output log
1 parent 7a86dd0 commit 0e0786a

4 files changed: +314 -266 lines changed

docs/mddocs/Quickstart/llama_cpp_quickstart.md

+116 -102
@@ -12,9 +12,9 @@
 > For installation on Intel Arc B-Series GPU (such as **B580**), please refer to this [guide](./bmg_quickstart.md).

 > [!NOTE]
-> Our latest version is consistent with [3f1ae2e](https://github.com/ggerganov/llama.cpp/commit/3f1ae2e32cde00c39b96be6d01c2997c29bae555) of llama.cpp.
+> Our latest version is consistent with [d7cfe1f](https://github.com/ggml-org/llama.cpp/commit/d7cfe1ffe0f435d0048a6058d529daf76e072d9c) of llama.cpp.
 >
-> `ipex-llm[cpp]==2.2.0b20241204` is consistent with [a1631e5](https://github.com/ggerganov/llama.cpp/commit/a1631e53f6763e17da522ba219b030d8932900bd) of llama.cpp.
+> `ipex-llm[cpp]==2.2.0b20250320` is consistent with [ba1cb19](https://github.com/ggml-org/llama.cpp/commit/ba1cb19cdd0d92e012e0f6e009e0620f854b6afd) of llama.cpp.

 See the demo of running LLaMA2-7B on Intel Arc GPU below.

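The note above pairs each ipex-llm[cpp] build with the llama.cpp commit it was rebased on. As a minimal sketch (assuming the standard pip extras syntax used elsewhere in this quickstart, with the nightly version string taken verbatim from the updated note), pinning the matching build looks like:

```bash
# Pin the ipex-llm[cpp] nightly quoted in the note above; adjust the
# version string if the note is updated again for a newer llama.cpp rebase.
pip install --pre --upgrade "ipex-llm[cpp]==2.2.0b20250320"
```
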
@@ -158,7 +158,7 @@ Before running, you should download or copy community GGUF model to your current
 - For **Linux users**:

 ```bash
-./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color
+./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
 ```

 > **Note**:
@@ -170,7 +170,7 @@ Before running, you should download or copy community GGUF model to your current
 Please run the following command in Miniforge Prompt.

 ```cmd
-llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color
+llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
 ```

 > **Note**:
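Both the Linux and Windows commands above gain a `-no-cnv` flag. In the rebased llama.cpp, `llama-cli` defaults to an interactive conversation (chat) mode, and `-no-cnv` switches it back to one-shot completion of the supplied `--prompt`, which is what the sample output below assumes. A minimal sketch of the distinction (GGUF path and prompt are placeholders):

```bash
# One-shot completion: print a continuation of --prompt and exit, as in this quickstart.
./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -ngl 99 -n 32 -no-cnv \
  --prompt "Once upon a time"

# Omitting -no-cnv on the rebased build drops llama-cli into interactive chat mode instead.
```
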
@@ -179,11 +179,10 @@ Before running, you should download or copy community GGUF model to your current

 #### Sample Output
 ```
-Log start
-main: build = 1 (6f4ec98)
-main: built with MSVC 19.39.33519.0 for
-main: seed = 1724921424
-llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from D:\gguf-models\mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
+main: llama backend init
+main: load the model and apply lora adapter, if any
+llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
+llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/arda/ruonan/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 llama_model_loader: - kv 0: general.architecture str = llama
 llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.1
@@ -208,108 +207,123 @@ llama_model_loader: - kv 19: general.quantization_version u32
 llama_model_loader: - type f32: 65 tensors
 llama_model_loader: - type q4_K: 193 tensors
 llama_model_loader: - type q6_K: 33 tensors
-llm_load_vocab: special tokens cache size = 3
-llm_load_vocab: token to piece cache size = 0.1637 MB
-llm_load_print_meta: format = GGUF V2
-llm_load_print_meta: arch = llama
-llm_load_print_meta: vocab type = SPM
-llm_load_print_meta: n_vocab = 32000
-llm_load_print_meta: n_merges = 0
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 32768
-llm_load_print_meta: n_embd = 4096
-llm_load_print_meta: n_layer = 32
-llm_load_print_meta: n_head = 32
-llm_load_print_meta: n_head_kv = 8
-llm_load_print_meta: n_rot = 128
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_embd_head_k = 128
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 4
-llm_load_print_meta: n_embd_k_gqa = 1024
-llm_load_print_meta: n_embd_v_gqa = 1024
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-05
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 14336
-llm_load_print_meta: n_expert = 0
-llm_load_print_meta: n_expert_used = 0
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = linear
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 1
-llm_load_print_meta: n_ctx_orig_yarn = 32768
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: ssm_dt_b_c_rms = 0
-llm_load_print_meta: model type = 7B
-llm_load_print_meta: model ftype = Q4_K - Medium
-llm_load_print_meta: model params = 7.24 B
-llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
-llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.1
-llm_load_print_meta: BOS token = 1 '<s>'
-llm_load_print_meta: EOS token = 2 '</s>'
-llm_load_print_meta: UNK token = 0 '<unk>'
-llm_load_print_meta: LF token = 13 '<0x0A>'
-llm_load_print_meta: max token length = 48
-ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
-ggml_sycl_init: SYCL_USE_XMX: yes
-ggml_sycl_init: found 1 SYCL devices:
-llm_load_tensors: ggml ctx size = 0.27 MiB
-llm_load_tensors: offloading 32 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 33/33 layers to GPU
-llm_load_tensors: SYCL0 buffer size = 4095.05 MiB
-llm_load_tensors: CPU buffer size = 70.31 MiB
-..............................................................................................
-llama_new_context_with_model: n_ctx = 512
-llama_new_context_with_model: n_batch = 512
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 1
-[SYCL] call ggml_check_sycl
-ggml_check_sycl: GGML_SYCL_DEBUG: 0
-ggml_check_sycl: GGML_SYCL_F16: no
-found 1 SYCL devices:
+print_info: file format = GGUF V2
+print_info: file type = Q4_K - Medium
+print_info: file size = 4.07 GiB (4.83 BPW)
+load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+load: special tokens cache size = 3
+load: token to piece cache size = 0.1637 MB
+print_info: arch = llama
+print_info: vocab_only = 0
+print_info: n_ctx_train = 32768
+print_info: n_embd = 4096
+print_info: n_layer = 32
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-05
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: n_ff = 14336
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 0
+print_info: rope scaling = linear
+print_info: freq_base_train = 10000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 32768
+print_info: rope_finetuned = unknown
+print_info: ssm_d_conv = 0
+print_info: ssm_d_inner = 0
+print_info: ssm_d_state = 0
+print_info: ssm_dt_rank = 0
+print_info: ssm_dt_b_c_rms = 0
+print_info: model type = 7B
+print_info: model params = 7.24 B
+print_info: general.name = mistralai_mistral-7b-instruct-v0.1
+print_info: vocab type = SPM
+print_info: n_vocab = 32000
+print_info: n_merges = 0
+print_info: BOS token = 1 '<s>'
+print_info: EOS token = 2 '</s>'
+print_info: UNK token = 0 '<unk>'
+print_info: LF token = 13 '<0x0A>'
+print_info: EOG token = 2 '</s>'
+print_info: max token length = 48
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 32 repeating layers to GPU
+load_tensors: offloading output layer to GPU
+load_tensors: offloaded 33/33 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 70.31 MiB
+load_tensors: SYCL0 model buffer size = 4095.05 MiB
+.................................................................................................
+llama_init_from_model: n_seq_max = 1
+llama_init_from_model: n_ctx = 1024
+llama_init_from_model: n_ctx_per_seq = 1024
+llama_init_from_model: n_batch = 1024
+llama_init_from_model: n_ubatch = 1024
+llama_init_from_model: flash_attn = 0
+llama_init_from_model: freq_base = 10000.0
+llama_init_from_model: freq_scale = 1
+llama_init_from_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
+Running with Environment Variables:
+GGML_SYCL_DEBUG: 0
+GGML_SYCL_DISABLE_OPT: 1
+Build with Macros:
+GGML_SYCL_FORCE_MMQ: no
+GGML_SYCL_F16: no
+Found 1 SYCL devices:
 | | | | |Max | |Max |Global | |
 | | | | |compute|Max work|sub |mem | |
 |ID| Device Type| Name|Version|units |group |group|size | Driver version|
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
-| 0| [level_zero:gpu:0]| Intel Arc Graphics| 1.3| 112| 1024| 32| 13578M| 1.3.27504|
-llama_kv_cache_init: SYCL0 KV buffer size = 64.00 MiB
-llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
-llama_new_context_with_model: SYCL_Host output buffer size = 0.12 MiB
-llama_new_context_with_model: SYCL0 compute buffer size = 81.00 MiB
-llama_new_context_with_model: SYCL_Host compute buffer size = 9.01 MiB
-llama_new_context_with_model: graph nodes = 902
-llama_new_context_with_model: graph splits = 2
-
-system_info: n_threads = 8 / 18 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
-sampling:
+| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.31294.120000|
+SYCL Optimization Feature:
+|ID| Device Type|Reorder|
+|--|-------------------|-------|
+| 0| [level_zero:gpu:0]| Y|
+llama_kv_cache_init: kv_size = 1024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
+llama_kv_cache_init: SYCL0 KV buffer size = 128.00 MiB
+llama_init_from_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
+llama_init_from_model: SYCL_Host output buffer size = 0.12 MiB
+llama_init_from_model: SYCL0 compute buffer size = 164.01 MiB
+llama_init_from_model: SYCL_Host compute buffer size = 20.01 MiB
+llama_init_from_model: graph nodes = 902
+llama_init_from_model: graph splits = 2
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 1024
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+main: llama threadpool init, n_threads = 8
+
+system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
+
+sampler seed: 403565315
+sampler params:
 repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
+dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 1024
+top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
-sampling order:
-CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
-generate: n_ctx = 512, n_batch = 2048, n_predict = 32, n_keep = 1
+sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
+generate: n_ctx = 1024, n_batch = 4096, n_predict = 32, n_keep = 1

+Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. But sometimes, she found it hard to find friends who shared her interests.

-Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. She lived in a small village where there weren't many opportunities for adventures, but that didn't stop her. She would often read
-llama_print_timings: load time = xxxx ms
-llama_print_timings: sample time = x.xx ms / 32 runs ( xx.xx ms per token, xx.xx tokens per second)
-llama_print_timings: prompt eval time = xx.xx ms / 31 tokens ( xx.xx ms per token, xx.xx tokens per second)
-llama_print_timings: eval time = xx.xx ms / 31 runs ( xx.xx ms per token, xx.xx tokens per second)
-llama_print_timings: total time = xx.xx ms / 62 tokens
-Log end
+One day, she decided to take matters into her own

+llama_perf_sampler_print: sampling time = x.xx ms / 63 runs ( x.xx ms per token, xx.xx tokens per second)
+llama_perf_context_print: load time = xx.xx ms
+llama_perf_context_print: prompt eval time = xx.xx ms / 31 tokens ( xx.xx ms per token, xx.xx tokens per second)
+llama_perf_context_print: eval time = xx.xx ms / 31 runs ( xx.xx ms per token, xx.xx tokens per second)
+llama_perf_context_print: total time = xx.xx ms / 62 tokens
 ```

 ### Troubleshooting
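
As a quick sanity check on a run like the sample above, the `llama_model_load_from_file_impl: using device SYCL0 ...` and `load_tensors: offloaded 33/33 layers to GPU` lines are the ones that confirm the Arc GPU is actually being used. A hedged sketch for spotting them when the output scrolls past (the log file name is illustrative):

```bash
# Capture the run and pull out the lines that confirm SYCL device selection and layer offload.
./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 -ngl 99 -no-cnv \
  --prompt "Once upon a time" 2>&1 | tee run.log
grep -E "SYCL0|offloaded" run.log
```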
