
# Moonlight

In this directory, you will find examples of how to apply IPEX-LLM INT4 optimizations to the Moonlight model on Intel GPUs. For illustration purposes, we use moonshotai/Moonlight-16B-A3B-Instruct as the reference Moonlight model.

## 0. Requirements & Installation

To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to here (Windows or Linux) for more information.

### 0.1 Installation

Visit Install IPEX-LLM on Intel GPU with PyTorch 2.6, and follow Install ipex-llm (Windows or Linux).

Then, install the other dependencies for the Moonlight model with IPEX-LLM optimizations:

```bash
conda activate llm-pt26

pip install transformers==4.45.0
pip install accelerate==0.33.0
pip install "trl<0.12.0"

pip install tiktoken blobfile
```
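
After installation, you can optionally run a quick sanity check to confirm that the PyTorch 2.6 build can see your Intel GPU (this assumes the torch.xpu backend shipped with the PyTorch 2.6 installation above):

```python
# Sanity check: confirm PyTorch 2.6 detects the Intel GPU (XPU) device.
import torch

print(torch.__version__)         # expect a 2.6.x build
print(torch.xpu.is_available())  # True if an Intel GPU is visible
```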

### 0.2 Runtime Configuration

Visit Install IPEX-LLM on Intel GPU with PyTorch 2.6, and follow Runtime Configurations (Windows or Linux).
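
As a rough illustration, on Linux the runtime configuration typically comes down to exporting a few environment variables before launching the scripts; the exact set depends on your GPU and OS, so treat the following as an assumption and follow the linked guide for your device:

```bash
# Illustrative only -- the linked Runtime Configurations guide is authoritative.
source /opt/intel/oneapi/setvars.sh  # set up the oneAPI environment, if required for your install
export SYCL_CACHE_PERSISTENT=1       # persist compiled GPU kernels across runs
```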

## 1. Download & Convert Model

To run the Moonlight model with IPEX-LLM optimizations, it must first be downloaded and converted so that it can be successfully loaded by transformers.

### 1.1 Download Model

To download moonshotai/Moonlight-16B-A3B-Instruct from Hugging Face, you can use download.py:

```bash
python download.py --repo-id moonshotai/Moonlight-16B-A3B-Instruct --commit-id 95583251e616c46a80715897a705cd38659afc27
```

By default, Moonlight-16B-A3B-Instruct will be downloaded to the current folder. You can also specify the download folder through --download-dir-path DOWNLOAD_DIR_PATH.

> [!TIP]
> Refer to here for alternative methods to download models from Hugging Face.
>
> For moonshotai/Moonlight-16B-A3B-Instruct, please make sure to use its revision/commit id 95583251e616c46a80715897a705cd38659afc27.
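
For reference, an equivalent of download.py using the standard huggingface_hub API might look like the sketch below; the local_dir value is a hypothetical choice, while the pinned revision matches the required commit id above:

```python
# Minimal sketch: download the pinned revision with huggingface_hub
# instead of download.py.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="moonshotai/Moonlight-16B-A3B-Instruct",
    revision="95583251e616c46a80715897a705cd38659afc27",  # required commit id
    local_dir="Moonlight-16B-A3B-Instruct",               # hypothetical download dir
)
```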

### 1.2 Convert Model

Next, convert the downloaded model with convert.py:

```bash
python convert.py --model-path DOWNLOAD_DIR_PATH
```

The converted model will be saved at `<DOWNLOAD_DIR_PATH>-converted`.

## 2. Example: Predict Tokens using generate() API

In the example generate.py, we show a basic use case of a Moonlight model predicting the next N tokens using the generate() API, with IPEX-LLM INT4 optimizations on Intel GPUs.
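
Conceptually, such a script loads the converted model through IPEX-LLM's 4-bit loading path and then calls generate(); the following is a minimal sketch under that assumption (simplified relative to the actual generate.py, with a hypothetical model path):

```python
# Minimal sketch of INT4 loading + generation on an Intel GPU; the actual
# generate.py in this directory may differ in details.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Moonlight-16B-A3B-Instruct-converted"  # i.e. <DOWNLOAD_DIR_PATH>-converted

# load_in_4bit=True applies IPEX-LLM INT4 optimizations at load time
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [{"role": "user", "content": "What is AI?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=False))
```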

### 2.1 Running example

```bash
python generate.py --converted-model-path <DOWNLOAD_DIR_PATH>-converted --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:

- `--converted-model-path CONVERTED_MODEL_PATH`: argument defining the path to the model converted by convert.py.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to 'What is AI?'.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to 32.

### 2.2 Sample Outputs

```log
Inference time: xxxx s
-------------------- Prompt --------------------
Is 123 a prime?
-------------------- Output --------------------
<|im_system|>system<|im_middle|>You are a helpful assistant provided by Moonshot-AI.<|im_end|><|im_user|>user<|im_middle|>Is 123 a prime?<|im_end|><|im_assistant|>assistant<|im_middle|>No, 123 is not a prime number. A prime number is a number greater than 1 that has no positive divisors other than 1 and itself
```