A FastAPI-based load balancer for serving vLLM models with RunPod integration. Provides OpenAI-compatible APIs with streaming and non-streaming text generation.
Before you begin, make sure you have:
- A RunPod account (sign up at runpod.io)
- RunPod API key (available in your RunPod dashboard)
- Basic understanding of REST APIs and HTTP requests
- `curl` or a similar tool for testing API endpoints
Use the pre-built Docker image: `runpod/vllm-loadbalancer:dev`
Configure these environment variables in your RunPod endpoint:
| Variable | Required | Description | Default | Example |
|---|---|---|---|---|
| `MODEL_NAME` | Yes | HuggingFace model identifier | None | `microsoft/DialoGPT-medium` |
| `TENSOR_PARALLEL_SIZE` | No | Number of GPUs for model parallelism | `1` | `2` |
| `DTYPE` | No | Model precision type | `auto` | `float16` |
| `TRUST_REMOTE_CODE` | No | Allow remote code execution | `true` | `false` |
| `MAX_MODEL_LEN` | No | Maximum sequence length | None (auto) | `2048` |
| `GPU_MEMORY_UTILIZATION` | No | GPU memory usage ratio | `0.9` | `0.8` |
| `ENFORCE_EAGER` | No | Disable CUDA graphs | `false` | `true` |
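To make the table concrete, here is a minimal sketch of how these variables might be collected into vLLM engine arguments at startup. The function name and the exact argument mapping are assumptions for illustration; the image's actual handler may differ.

```python
import os

def engine_args_from_env(env=os.environ):
    """Collect vLLM-style engine arguments from environment variables.

    Defaults mirror the table above. Illustrative sketch only; the
    mapping used by the real image is not guaranteed to match.
    """
    model = env.get("MODEL_NAME")
    if not model:
        raise ValueError("MODEL_NAME is required")
    args = {
        "model": model,
        "tensor_parallel_size": int(env.get("TENSOR_PARALLEL_SIZE", "1")),
        "dtype": env.get("DTYPE", "auto"),
        "trust_remote_code": env.get("TRUST_REMOTE_CODE", "true").lower() == "true",
        "gpu_memory_utilization": float(env.get("GPU_MEMORY_UTILIZATION", "0.9")),
        "enforce_eager": env.get("ENFORCE_EAGER", "false").lower() == "true",
    }
    max_len = env.get("MAX_MODEL_LEN")
    if max_len:  # left unset -> vLLM derives it from the model config
        args["max_model_len"] = int(max_len)
    return args
```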
- Create a new serverless endpoint
- Use the Docker image: `runpod/vllm-loadbalancer:dev`
- Set the required environment variable `MODEL_NAME` (e.g., `microsoft/DialoGPT-medium`)
- Optional: configure additional environment variables as needed
Non-streaming completion request:

```bash
curl -X POST "https://your-endpoint-id.api.runpod.ai/v1/completions" \
  -H "Authorization: Bearer YOUR_RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a story about a brave knight",
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
```
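The same request can be built from Python's standard library. The helper name is mine, and the endpoint ID and API key are placeholders; only the URL pattern and payload fields come from the example above.

```python
import json
import urllib.request

def completion_request(endpoint_id, api_key, prompt, max_tokens=100,
                       temperature=0.7, stream=False):
    """Build (but do not send) an OpenAI-style /v1/completions request."""
    url = f"https://{endpoint_id}.api.runpod.ai/v1/completions"
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it requires a live endpoint:
# with urllib.request.urlopen(completion_request("your-endpoint-id",
#         "YOUR_RUNPOD_API_KEY", "Write a story about a brave knight")) as r:
#     print(json.load(r)["choices"][0]["text"])
```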
Streaming completion request (the response is delivered incrementally as server-sent events):

```bash
curl -X POST "https://your-endpoint-id.api.runpod.ai/v1/completions" \
  -H "Authorization: Bearer YOUR_RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Tell me about artificial intelligence",
    "max_tokens": 200,
    "temperature": 0.8,
    "stream": true
  }'
```
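With `"stream": true`, each event arrives as a `data:` line carrying a JSON chunk, terminated by `data: [DONE]` in the usual OpenAI convention. A sketch of consuming such a stream, assuming that chunk shape (the sample payloads in the test are illustrative, not captured from this server):

```python
import json

def iter_sse_text(lines):
    """Yield text fragments from OpenAI-style SSE 'data:' lines.

    Assumes each event is a JSON completion chunk and that the stream
    ends with 'data: [DONE]'; the exact chunk format depends on the server.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        yield chunk["choices"][0].get("text", "")
```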
Chat completion request:

```bash
curl -X POST "https://your-endpoint-id.api.runpod.ai/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
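For multi-turn conversations, the `messages` array carries the whole history and the reply comes back under `choices[0].message.content` in the OpenAI chat format. Two small helpers sketch this; the function names are mine, and the response shape is the standard OpenAI one rather than anything verified against this server.

```python
def build_chat_payload(history, user_msg, max_tokens=50, temperature=0.7):
    """Append a new user turn to prior messages and build the request body."""
    messages = list(history) + [{"role": "user", "content": user_msg}]
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat_reply(response):
    """Pull the assistant's text out of an OpenAI-style chat response."""
    return response["choices"][0]["message"]["content"]
```

To continue a conversation, append the returned assistant message to `history` before the next `build_chat_payload` call.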
Health check:

```bash
curl -X GET "https://your-endpoint-id.api.runpod.ai/ping" \
  -H "Authorization: Bearer YOUR_RUNPOD_API_KEY"
```
Run the test script:

```bash
export ENDPOINT_ID="your-endpoint-id"
export RUNPOD_API_KEY="your-api-key"
python example.py
```