# feat(containers): experimentation with hugging face models #84
**Dockerfile** (new file)
```dockerfile
FROM python:3.12-slim-bookworm

ARG MODEL_DOWNLOAD_SOURCE

RUN apt-get update && apt-get install -y wget

WORKDIR /app

RUN pip install --upgrade pip

COPY requirements.txt .

RUN pip install -r requirements.txt

RUN pip install llama-cpp-python==0.2.62 \
    --no-cache-dir \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

RUN wget $MODEL_DOWNLOAD_SOURCE

COPY main.py .

CMD ["uvicorn", "main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "80"]
```
**README.md** (new file)
# Hugging Face Models

> issue(docs): this README has a lot of good technical detail, but no high-level explanation of what the example does. We need to explain what the example does, what SCW resources it uses, and link to Hugging Face and the models used (and any interesting Python libraries we use too).
### Deploy models in Serverless Containers

> issue(structure): our examples should all use the standard README format included in the top-level of the repo.
- Export these variables:

```bash
export SCW_ACCESS_KEY="access-key" SCW_SECRET_KEY="secret-key" SCW_PROJECT_ID="project-id" REGION="fr-par"
```

> suggestion(tfvars): make these Terraform variables and give `region` a default of `fr-par`.
- Add or remove Hugging Face models (with the `.gguf` extension) in the `terraform/hf-models.json` file.

- Run the script to deploy multiple Hugging Face models using Terraform workspaces:

```bash
cd terraform && bash terraform.sh -a
```

> issue(docs): can you add a section on how to call one of the inference endpoints? If you add the endpoints as a Terraform output, you can write a command that you can copy-paste using `terraform output`. There should be a command to call the "hello" endpoint to check they are working, then ideally a command for how to get an inference decision.
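A minimal sketch of what those commands could look like, assuming the container URL is exposed as a Terraform output named `endpoint` (no such output exists in this PR yet) and using the request shape of the `Message` model in `main.py`:

```bash
# Assumption: the current workspace exposes the container URL as an output named "endpoint".
ENDPOINT=$(terraform output -raw endpoint)

# "Hello" check: GET / returns a message naming the model being served.
curl "$ENDPOINT"

# Inference: POST a body matching the Message model ({"content": "..."}).
curl -X POST "$ENDPOINT" \
  -H "Content-Type: application/json" \
  -d '{"content": "What is the difference between an elephant and an ant?"}'
```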
### Benchmark models

Check that your models were deployed in the console, copy your container endpoints into the `terraform/hf-models.json` file, then run the following command:

```bash
python benchmark-models.py
```

This generates a box plot to analyze response time per model family, and a `csv` file containing the textual responses from each model.
### Destroy Terraform resources for all models

```bash
bash terraform.sh -d
```
**main.py** (new file)
```python
import os

from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel


class Message(BaseModel):
    content: str


MODEL_FILE_NAME = os.environ["MODEL_FILE_NAME"]

app = FastAPI()

print("loading model starts", flush=True)

llm = Llama(model_path=MODEL_FILE_NAME)

print("loading model successfully ends", flush=True)


@app.get("/")
def hello():
    """Get info of inference server"""

    return {
        "message": "Hello, this is the inference server! Serving model {model_name}".format(
            model_name=MODEL_FILE_NAME
        )
    }


@app.post("/")
def infer(message: Message):
    """Post a message and receive a response from inference server"""

    print("inference endpoint is called", flush=True)

    output = llm(prompt=message.content, max_tokens=200)

    print("output is successfully inferred", flush=True)

    print(output, flush=True)

    return output
```

> nit: An alternative to adding `flush=True` to every `print` call
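As a quick sanity check of both routes (for instance against the local `docker run` above, which mapped port 8080), the hypothetical commands below can be used; the URL is a placeholder, and `jq` is only used to pull the generated text out of the completion object, whose `choices[0].text` field is the same one `benchmark-models.py` reads:

```bash
URL="http://localhost:8080"   # placeholder: replace with a deployed container endpoint

# GET /: returns a hello message that includes the served model name
curl "$URL"

# POST /: the body must match the Message model (a single "content" field)
curl -s -X POST "$URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "Write one sentence about serverless containers."}' | jq -r '.choices[0].text'
```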
**requirements.txt** (new file)

```text
fastapi==0.104.1
uvicorn==0.24.0.post1
```
**benchmark-models.py** (new file)
```python
import csv
import json

import matplotlib.pyplot as plt
import pandas
import requests


class Benchmark:
    _model_families = ["llama", "mistral", "phi"]
    _endpoints = {}

    def __init__(
        self, models_file: str, benchmark_file: str, results_figure: str, message: str
    ) -> None:
        self.models_file = models_file
        self.benchmark_file = benchmark_file
        self.message = message
        self.results_figure = results_figure

    def get_container_endpoints_from_json_file(self) -> None:
        if self.models_file == "":
            raise Exception("file name is empty")

        with open(self.models_file, "r") as models_file:
            json_data = json.load(models_file)

        for family in self._model_families:
            self._endpoints[family] = []
            for model in json_data[family]:
                self._endpoints[family].append(
                    {"model": model["file"], "endpoint": model["ctn_endpoint"]}
                )

    def analyze_results(self) -> None:
        benchmark_results = pandas.read_csv(self.benchmark_file)
        benchmark_results.boxplot(column="Total Response Time", by="Family").plot()
        plt.ylabel("Total Response Time in seconds")
        plt.savefig(self.results_figure)

    def benchmark_models(self, num_samples: int) -> None:
        self.get_container_endpoints_from_json_file()

        fields = ["Model", "Family", "Total Response Time", "Response Message"]
        benchmark_data = []

        for family in self._model_families:
            for endpoint in self._endpoints[family]:
                if endpoint["endpoint"] == "":
                    raise Exception("model endpoint is empty")

                for _ in range(num_samples):
                    try:
                        print(
                            "Calling model {model} on endpoint {endpoint} with message {message}".format(
                                model=endpoint["model"],
                                endpoint=endpoint["endpoint"],
                                message=self.message,
                            )
                        )

                        # The request body must match the Message model in main.py,
                        # which expects a "content" field.
                        rsp = requests.post(
                            endpoint["endpoint"], json={"content": self.message}
                        )

                        response_text = rsp.json()["choices"][0]["text"]

                        print(
                            "The model {model} responded with: {response_text}".format(
                                model=endpoint["model"], response_text=response_text
                            )
                        )

                        benchmark_data.append(
                            [
                                endpoint["model"],
                                family,
                                rsp.elapsed.total_seconds(),
                                response_text,
                            ]
                        )
                    except:
                        pass

        with open(self.benchmark_file, "w") as results_file:
            wrt = csv.writer(results_file)
            wrt.writerow(fields)
            wrt.writerows(benchmark_data)

        self.analyze_results()


if __name__ == "__main__":
    benchmark = Benchmark(
        models_file="hf-models.json",
        benchmark_file="benchmark-results.csv",
        results_figure="results-plot.png",
        message="What is the difference between an elephant and an ant?",
    )

    benchmark.benchmark_models(num_samples=50)
```

> issue(syntax): use f-strings by default.
**Terraform: container namespace and container** (new file)
resource "scaleway_container_namespace" "main" { | ||
name = "ifr-${lower(replace(var.hf_model_file_name, "/[.]|[_]/", "-"))}-${random_string.random_suffix.result}" | ||
description = "Inference using Hugging Face models" | ||
} | ||
|
||
resource "scaleway_container" "inference-hugging-face" { | ||
name = "inference" | ||
description = "Inference serving API using a Hugging Face model" | ||
namespace_id = scaleway_container_namespace.main.id | ||
registry_image = docker_image.inference.name | ||
environment_variables = { | ||
"MODEL_FILE_NAME" = var.hf_model_file_name | ||
} | ||
port = 80 | ||
cpu_limit = 2240 | ||
memory_limit = 4096 | ||
min_scale = 1 | ||
max_scale = 1 | ||
deploy = true | ||
} |
**terraform/hf-models.json** (new file)
```json
{
  "llama": [
    {
      "file": "llama-2-7b.Q2_K.gguf",
      "source": "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q2_K.gguf",
      "size_gb": "2.83",
      "ctn_endpoint": "paste container endpoint here"
    },
    {
      "file": "llama-2-7b.Q3_K_L.gguf",
      "source": "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q3_K_L.gguf",
      "size_gb": "3.6",
      "ctn_endpoint": "paste container endpoint here"
    }
  ],

  "mistral": [
    {
      "file": "mistral-7b-instruct-v0.2.Q2_K.gguf",
      "source": "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q2_K.gguf",
      "size_gb": "3.08",
      "ctn_endpoint": "paste container endpoint here"
    },
    {
      "file": "mistral-7b-instruct-v0.2.Q3_K_L.gguf",
      "source": "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q3_K_L.gguf",
      "size_gb": "3.82",
      "ctn_endpoint": "paste container endpoint here"
    }
  ],

  "phi": [
    {
      "file": "phi-2.Q2_K.gguf",
      "source": "https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q2_K.gguf",
      "size_gb": "1.17",
      "ctn_endpoint": "paste container endpoint here"
    },
    {
      "file": "phi-2.Q5_K_M.gguf",
      "source": "https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q5_K_M.gguf",
      "size_gb": "2.07",
      "ctn_endpoint": "paste container endpoint here"
    }
  ]
}
```

> issue(improvement): we can template this file with Terraform as part of the deployment. You can see an example of templating container URLs in this example where we template a shell script.
**Terraform: registry namespace and Docker image build** (new file)
resource "scaleway_registry_namespace" "main" { | ||
name = "ifr-${lower(replace(var.hf_model_file_name, "/[.]|[_]/", "-"))}-${random_string.random_suffix.result}" | ||
region = var.region | ||
project_id = var.project_id | ||
} | ||
|
||
resource "docker_image" "inference" { | ||
name = "${scaleway_registry_namespace.main.endpoint}/inference-with-huggingface:${var.image_version}" | ||
build { | ||
context = "${path.cwd}/../" | ||
no_cache = true | ||
build_args = { | ||
MODEL_DOWNLOAD_SOURCE : var.hf_model_download_source | ||
} | ||
} | ||
|
||
provisioner "local-exec" { | ||
command = "docker push ${docker_image.inference.name}" | ||
} | ||
} |
**Terraform: providers** (new file)
provider "scaleway" { | ||
region = var.region | ||
access_key = var.access_key | ||
secret_key = var.secret_key | ||
project_id = var.project_id | ||
Comment on lines
+2
to
+5
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: IMO this is unnecessary, the default behavior of the provider is to use your config file or the environment to get its configuration, so I would leave it blank |
||
} | ||
|
||
provider "docker" { | ||
host = "unix:///var/run/docker.sock" | ||
|
||
registry_auth { | ||
address = scaleway_registry_namespace.main.endpoint | ||
username = "nologin" | ||
password = var.secret_key | ||
} | ||
} |
**terraform/terraform.sh** (new file)
```bash
#!/bin/bash

set -e

# Common environment variables
export TF_VAR_access_key=${SCW_ACCESS_KEY} \
    TF_VAR_secret_key=${SCW_SECRET_KEY} \
    TF_VAR_project_id=${SCW_PROJECT_ID}

# Associative list of models to deploy using json data
declare -A hf_models
eval "$(jq -r '.[]|.[]|"hf_models[\(.file)]=\(.source)"' hf-models.json)"

# Login to docker Scaleway's registry
docker login "rg.$REGION.scw.cloud" -u nologin --password-stdin <<< "$SCW_SECRET_KEY"

# Initialize, plan, and deploy each model in a Terraform workspace
apply() {
    terraform init
    for model_file_name in "${!hf_models[@]}";
    do
        terraform workspace select -or-create $model_file_name
        export TF_VAR_hf_model_file_name=$model_file_name \
            TF_VAR_hf_model_download_source=${hf_models[$model_file_name]}
        terraform plan
        terraform apply -auto-approve
    done
}

# Destroy resources of each Terraform workspace
destroy() {
    for model_file_name in "${!hf_models[@]}";
    do
        terraform workspace select $model_file_name
        export TF_VAR_hf_model_file_name=$model_file_name \
            TF_VAR_hf_model_download_source=${hf_models[$model_file_name]}
        terraform destroy -auto-approve
    done
}

# Script actions
while getopts "ad" option; do
    case $option in
        a)
            echo "deploying models"
            apply
            ;;
        d)
            echo "destroying models"
            destroy
            ;;
        *)
            echo "flag is not provided"
            exit 1
    esac
done
```

> issue(scripting): These kinds of commands should be managed in a

> suggestion(simplify) (on the `docker login`): We don't need to log into the repo every time, this can be a one-off step at the start (and listed in the README).

> question(terraform) (on the for-loops): Instead of using bash for-loops here, can we use
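To see what the `jq`/`eval` combination builds before running a full deploy, the associative array can be printed on its own. This is just an illustrative dry run of the same expression used by the script, executed from the `terraform` directory:

```bash
# Dry run: print the model -> download-source map that terraform.sh iterates over.
declare -A hf_models
eval "$(jq -r '.[]|.[]|"hf_models[\(.file)]=\(.source)"' hf-models.json)"
for model_file_name in "${!hf_models[@]}"; do
  echo "${model_file_name} -> ${hf_models[$model_file_name]}"
done
```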
**Terraform: random suffix** (new file)
resource "random_string" "random_suffix" { | ||
length = 3 | ||
upper = false | ||
special = false | ||
} |
> nit: you can also include it in the requirements.txt directly: