---
meta:
  title: Generative APIs FAQ
  description: Get answers to the most frequently asked questions about Scaleway Generative APIs.
content:
  h1: Generative APIs FAQ
dates:
  validation: 2025-02-12
category: ai-data
productIcon: GenerativeApiProductIcon
---

What are Scaleway Generative APIs?

Scaleway's Generative APIs provide access to pre-configured, serverless endpoints of leading AI models, hosted in European data centers. This allows you to integrate advanced AI capabilities into your applications without managing underlying infrastructure.

Which models are supported by Generative APIs?

Our Generative APIs support a range of popular models, including:

  • Chat / Text Generation models: Refer to our dedicated documentation for a list of supported chat models.
  • Vision models: Refer to our dedicated documentation for a list of supported vision models.
  • Embedding models: Refer to our dedicated documentation for a list of supported embedding models.

How does the free tier work?

The free tier allows you to process up to 1,000,000 tokens free of charge. After reaching this limit, you are charged per million tokens processed. Free tier usage is calculated by adding together all input and output tokens consumed across all models. For more information, refer to our pricing page, or check your bills broken down by token type and model in the billing section of the Scaleway console (both past bills and the provisional bill for the current month).

Note that once your consumption exceeds the free tier, you are billed for each additional token consumed, by model and token type. The minimum billing unit is 1 million tokens. Here are two examples of low-volume consumption:

Example 1: Free Tier only

| Model | Token type | Tokens consumed | Price | Bill |
|---|---|---|---|---|
| llama-3.3-70b-instruct | Input | 500k | 0.90€/million tokens | 0.00€ |
| llama-3.3-70b-instruct | Output | 200k | 0.90€/million tokens | 0.00€ |
| mistral-small-3.1-24b-instruct-2503 | Input | 100k | 0.15€/million tokens | 0.00€ |
| mistral-small-3.1-24b-instruct-2503 | Output | 100k | 0.35€/million tokens | 0.00€ |

Total tokens consumed: 900k. Total bill: 0.00€.

Example 2: Exceeding Free Tier

| Model | Token type | Tokens consumed | Price | Billed consumption | Bill |
|---|---|---|---|---|---|
| llama-3.3-70b-instruct | Input | 800k | 0.90€/million tokens | 1 million tokens | 0.00€ (Free Tier applied) |
| llama-3.3-70b-instruct | Output | 2,500k | 0.90€/million tokens | 3 million tokens | 2.70€ |
| mistral-small-3.1-24b-instruct-2503 | Input | 100k | 0.15€/million tokens | 1 million tokens | 0.15€ |
| mistral-small-3.1-24b-instruct-2503 | Output | 100k | 0.35€/million tokens | 1 million tokens | 0.35€ |

Total tokens consumed: 3.5 million. Total billed consumption: 6 million tokens. Total bill: 3.20€.

Note that in this example, the first line, where the free tier applies, will not appear under the model name in your current Scaleway bill; it will instead be listed under Generative APIs Free Tier - First 1M tokens for free.
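
To make the rounding rule concrete, here is a minimal Python sketch that reproduces the arithmetic of Example 2. It assumes, as in the example above, that consumption is rounded up per model and token type to the nearest million tokens, and that the free tier is applied line by line; refer to your actual bill for authoritative figures.

```python
import math

# Illustrative sketch only: reproduces the arithmetic of Example 2 above.
# Assumption: consumption is rounded up per model and token type to the
# nearest million tokens (minimum billing unit), and the free tier covers
# the first 1M billed tokens, applied line by line.
FREE_TIER_TOKENS = 1_000_000

# (model, token type, tokens consumed, price in € per million tokens)
usage = [
    ("llama-3.3-70b-instruct", "input", 800_000, 0.90),
    ("llama-3.3-70b-instruct", "output", 2_500_000, 0.90),
    ("mistral-small-3.1-24b-instruct-2503", "input", 100_000, 0.15),
    ("mistral-small-3.1-24b-instruct-2503", "output", 100_000, 0.35),
]

free_remaining = FREE_TIER_TOKENS
total = 0.0
for model, token_type, tokens, price in usage:
    billed_tokens = math.ceil(tokens / 1_000_000) * 1_000_000
    if free_remaining >= billed_tokens:
        free_remaining -= billed_tokens  # covered by the free tier
        continue
    total += billed_tokens / 1_000_000 * price

print(f"Total bill: {total:.2f}€")  # prints 3.20€ for the figures above
```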

What is a token and how are they counted?

A token is the minimum unit of content that a model sees and processes. Hence, the definition of a token depends on the input type:

  • For text, 1 token corresponds on average to ~4 characters, and thus to about 0.75 words (as words are on average five characters long).
  • For images, 1 token corresponds to a square of pixels. For example, the pixtral-12b-2409 model uses image tokens of 16x16 pixels (16 pixels high and 16 pixels wide, hence 256 pixels in total).

The exact token count and definition depend on the tokenizer used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance, in the pixtral-12b-2409 size limit documentation). Otherwise, when the model is open-weight, you can find this information in the model files on platforms such as Hugging Face, usually in the tokenizer_config.json file.
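
For open-weight models, you can reproduce a model's token count locally with its published tokenizer. Here is a minimal sketch using the Hugging Face transformers library; the model ID shown is only an example and may be gated, requiring you to accept its license on Hugging Face first.

```python
# Minimal sketch: count text tokens locally with an open-weight model's
# published tokenizer (pip install transformers). The model ID below is an
# example; exact counts are model-specific.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
text = "A token is the minimum unit of content processed by a model."
token_ids = tokenizer.encode(text)
print(f"{len(token_ids)} tokens for {len(text)} characters")
```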

How can I monitor my token consumption?

You can see your token consumption in Scaleway Cockpit. You can access it from the Scaleway console under the Metrics tab. Note that:

  • Cockpits are isolated by Projects, hence you first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for this Project (you can see the project_id in the Cockpit URL: https://{project_id}.dashboard.obs.fr-par.scw.cloud/).
  • Cockpit graphs can take up to 1 hour to update token consumption, see Troubleshooting for further details.

How can I give access to token consumption to my users outside of Scaleway?

If your users do not have a Scaleway account, you can still give them access to their Generative API usage consumption by either:

  • Providing them access to Grafana inside Cockpit. You can create dedicated Grafana users with read-only access (Viewer role). Note that these users will still have access to all other Cockpit dashboards for this Project.
  • Collecting consumption data from the Billing API and exposing it to your users. Consumption can be broken down by Project.
  • Collecting consumption data from Cockpit data sources and exposing it to your users. For example, you can retrieve consumption with the following query:
curl -G 'https://{data-source-id}.metrics.cockpit.fr-par.scw.cloud/prometheus/api/v1/query_range' \
    --data-urlencode 'query=generative_apis_tokens_total{resource_name=~".*",type=~"(input_tokens|output_tokens)"}' \
    --data-urlencode 'start=2025-03-15T20:10:51.781Z' \
    --data-urlencode 'end=2025-03-20T20:10:51.781Z' \
    --data-urlencode 'step=1h' \
    -H "Authorization: Bearer $COCKPIT_TOKEN" | jq

Make sure that you replace the following values:

  • {data-source-id}: the ID of the Cockpit data source storing your metrics
  • start and end: the time range you want to query
  • $COCKPIT_TOKEN: a Cockpit token with permission to query metrics
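
As an illustration, here is a hypothetical Python equivalent of the curl command above that prints the latest cumulative token count per resource and token type. It assumes the same metric name and labels as the query above, and requires the requests library.

```python
# Hypothetical Python equivalent of the curl command above
# (pip install requests). Assumes the same metric name and labels.
import os
import requests

DATA_SOURCE_ID = "your-data-source-id"  # placeholder: your data source ID

url = (
    f"https://{DATA_SOURCE_ID}.metrics.cockpit.fr-par.scw.cloud"
    "/prometheus/api/v1/query_range"
)
params = {
    "query": 'generative_apis_tokens_total{resource_name=~".*",'
             'type=~"(input_tokens|output_tokens)"}',
    "start": "2025-03-15T20:10:51.781Z",
    "end": "2025-03-20T20:10:51.781Z",
    "step": "1h",
}
headers = {"Authorization": f"Bearer {os.environ['COCKPIT_TOKEN']}"}

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()

# Each series carries its labels (resource name, token type) and
# timestamped samples; print the latest cumulative value per series.
for series in response.json()["data"]["result"]:
    labels = series["metric"]
    latest_value = series["values"][-1][1]
    print(labels.get("resource_name"), labels.get("type"), latest_value)
```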


Can I configure a maximum billing threshold?

Currently, you cannot configure a specific threshold after which your usage will be blocked. However:

  • You can configure billing alerts to ensure you are warned when you hit specific budget thresholds.
  • Your total bill remains limited by the number of tokens you can consume within the applicable rate limits.
  • If you want to ensure fixed billing, you can use Managed Inference, which provides the same set of OpenAI-compatible APIs and a wider range of models.

How can I access and use the Generative APIs?

Access is open to all Scaleway customers. You can start by using the Generative APIs Playground in the Scaleway console to experiment with different models. For integration into applications, you can use the OpenAI-compatible APIs provided by Scaleway. Detailed instructions are available in our Quickstart guide.

Where are the inference servers located?

All models are currently hosted in a secure data center located in Paris, France, operated by OPCORE. This ensures low latency for European users and compliance with European data privacy regulations.

Where can I find the privacy policy regarding Generative APIs?

You can find the privacy policy applicable to all use of Generative APIs here.

Can I use OpenAI libraries and APIs with Scaleway's Generative APIs?

Yes, Scaleway's Generative APIs are designed to be compatible with OpenAI libraries and SDKs, including the OpenAI Python client library and LangChain SDKs. This allows for seamless integration with existing workflows.
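
For example, here is a minimal sketch using the OpenAI Python client. The base URL shown is an assumption; check the Scaleway console for the exact endpoint for your Project, and use a Scaleway IAM secret key as the API key.

```python
# Minimal sketch using the OpenAI Python client (pip install openai).
# The base URL is an assumption; check the Scaleway console for the exact
# endpoint. The API key is a Scaleway IAM secret key.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # assumed endpoint
    api_key=os.environ["SCW_SECRET_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```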

What is the difference between Generative APIs and Managed Inference?

  • Generative APIs: A serverless service providing access to pre-configured AI models via API, billed per token usage.
  • Managed Inference: Allows deployment of curated or custom models with chosen quantization and instances, offering predictable throughput and enhanced security features like private network isolation and access control. Managed Inference is billed by hourly usage, whether provisioned capacity is receiving traffic or not.

How do I get started with Generative APIs?

To get started, explore the Generative APIs Playground in the Scaleway console. For application integration, refer to our Quickstart guide, which provides step-by-step instructions on accessing, configuring, and using a Generative APIs endpoint.

Are there any rate limits for API usage?

Yes, API rate limits define the maximum number of requests a user can make within a specific time frame, to ensure fair access and resource allocation between users. If you require increased rate limits (by a factor of 2 to 5), you can request them by creating a ticket. If you require even higher rate limits, especially to absorb infrequent peak loads, we recommend using Managed Inference with dedicated provisioned capacity instead. Refer to our dedicated documentation for more information on rate limits.
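
If your application occasionally hits a rate limit, a common pattern is to retry with exponential backoff. Here is a minimal sketch using the OpenAI Python client as in the example above; it assumes, as is conventional for OpenAI-compatible APIs, that rate-limited requests surface as RateLimitError (HTTP 429).

```python
# Minimal retry sketch: back off exponentially when the API signals a rate
# limit. Assumes the OpenAI Python client raises RateLimitError on HTTP 429.
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # assumed endpoint, as above
    api_key=os.environ["SCW_SECRET_KEY"],
)

def chat_with_retry(messages, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-instruct",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2  # exponential backoff
```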

What is the model lifecycle for Generative APIs?

Scaleway is dedicated to updating and offering the latest versions of generative AI models, ensuring improvements in capabilities, accuracy, and safety. As new versions of models are introduced, you can explore them through the Scaleway console. Learn more in our dedicated documentation.

What are the SLAs applicable to Generative APIs?

We are currently working on defining our SLAs for Generative APIs. We will provide more information on this topic soon.

What are the performance guarantees (vs Managed Inference)?

We are currently working on defining our performance guarantees for Generative APIs. We will provide more information on this topic soon.

Do model licenses apply when using Generative APIs?

Yes, you need to comply with each model's license when using Generative APIs. The applicable licenses are available for each model in our documentation and in the console Playground.