From aced750854ab920630c1f80187def2af7ebfa8ad Mon Sep 17 00:00:00 2001
From: tarsipella-cloud
Date: Sun, 16 Nov 2025 17:15:33 +0400
Subject: [PATCH] Update How_to_count_tokens_with_tiktoken.ipynb

---
 .../How_to_count_tokens_with_tiktoken.ipynb   | 815 +-----------------
 1 file changed, 1 insertion(+), 814 deletions(-)

diff --git a/examples/How_to_count_tokens_with_tiktoken.ipynb b/examples/How_to_count_tokens_with_tiktoken.ipynb
index 19ce768921..afaefae67d 100644
--- a/examples/How_to_count_tokens_with_tiktoken.ipynb
+++ b/examples/How_to_count_tokens_with_tiktoken.ipynb
@@ -1,814 +1 @@
-{
- "cells": [
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# How to count tokens with tiktoken\n",
-    "\n",
-    "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
-    "\n",
-    "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"cl100k_base\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
-    "\n",
-    "Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).\n",
-    "\n",
-    "\n",
-    "## Encodings\n",
-    "\n",
-    "Encodings specify how text is converted into tokens. Different models use different encodings.\n",
-    "\n",
-    "`tiktoken` supports four encodings used by OpenAI models:\n",
-    "\n",
-    "| Encoding name           | OpenAI models                                       |\n",
-    "|-------------------------|-----------------------------------------------------|\n",
-    "| `o200k_base`            | `gpt-4o`, `gpt-4o-mini`                             |\n",
-    "| `cl100k_base`           | `gpt-4-turbo`, `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large` |\n",
-    "| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|\n",
-    "| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |\n",
-    "\n",
-    "You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:\n",
-    "```python\n",
-    "encoding = tiktoken.encoding_for_model('gpt-4o-mini')\n",
-    "```\n",
-    "\n",
-    "Note that `p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
-    "\n",
-    "## Tokenizer libraries by language\n",
-    "\n",
-    "For `o200k_base`, `cl100k_base` and `p50k_base` encodings:\n",
-    "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
-    "- .NET / C#: [SharpToken](https://github.com/dmitry-brazhenko/SharpToken), [TiktokenSharp](https://github.com/aiqinxuancai/TiktokenSharp)\n",
-    "- Java: [jtokkit](https://github.com/knuddelsgmbh/jtokkit)\n",
-    "- Golang: [tiktoken-go](https://github.com/pkoukk/tiktoken-go)\n",
-    "- Rust: [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs)\n",
-    "\n",
-    "For `r50k_base` (`gpt2`) encodings, tokenizers are available in many languages.\n",
-    "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
-    "- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n",
-    "- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n",
-    "- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n",
- "- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n", - "- Golang: [tiktoken-go](https://github.com/pkoukk/tiktoken-go)\n", - "- Rust: [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs)\n", - "\n", - "(OpenAI makes no endorsements or guarantees of third-party libraries.)\n", - "\n", - "\n", - "## How strings are typically tokenized\n", - "\n", - "In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer), or the third-party [Tiktokenizer](https://tiktokenizer.vercel.app/) webapp." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 0. Install `tiktoken`\n", - "\n", - "If needed, install `tiktoken` with `pip`:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.2\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", - "Note: you may need to restart the kernel to use updated packages.\n", - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.2\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --upgrade tiktoken -q\n", - "%pip install --upgrade openai -q" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Import `tiktoken`" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import tiktoken" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. Load an encoding\n", - "\n", - "Use `tiktoken.get_encoding()` to load an encoding by name.\n", - "\n", - "The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "encoding = tiktoken.get_encoding(\"cl100k_base\")\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use `tiktoken.encoding_for_model()` to automatically load the correct encoding for a given model name." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "encoding = tiktoken.encoding_for_model(\"gpt-4o-mini\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. 
-    "\n"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The `.encode()` method converts a text string into a list of token integers."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[83, 8251, 2488, 382, 2212, 0]"
-      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "encoding.encode(\"tiktoken is great!\")"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Count tokens by counting the length of the list returned by `.encode()`."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def num_tokens_from_string(string: str, encoding_name: str) -> int:\n",
-    "    \"\"\"Returns the number of tokens in a text string.\"\"\"\n",
-    "    encoding = tiktoken.get_encoding(encoding_name)\n",
-    "    num_tokens = len(encoding.encode(string))\n",
-    "    return num_tokens"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "6"
-      ]
-     },
-     "execution_count": 7,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "num_tokens_from_string(\"tiktoken is great!\", \"o200k_base\")"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 4. Turn tokens into text with `encoding.decode()`"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "`.decode()` converts a list of token integers to a string."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "'tiktoken is great!'"
-      ]
-     },
-     "execution_count": 8,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "encoding.decode([83, 8251, 2488, 382, 2212, 0])"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries."
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[b't', b'ikt', b'oken', b' is', b' great', b'!']"
-      ]
-     },
-     "execution_count": 9,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "[encoding.decode_single_token_bytes(token) for token in [83, 8251, 2488, 382, 2212, 0]]\n"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "(The `b` in front of the strings indicates that the strings are byte strings.)"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 5. Comparing encodings\n",
-    "\n",
-    "Different encodings vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings.\n",
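-    "\n",
-    "For a quick spot check of a single encoding with the methods above (a minimal sketch; the helper below prints a fuller comparison):\n",
-    "```python\n",
-    "enc = tiktoken.get_encoding(\"cl100k_base\")\n",
-    "[enc.decode_single_token_bytes(t) for t in enc.encode(\"antidisestablishmentarianism\")]\n",
-    "# [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']\n",
-    "```"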
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def compare_encodings(example_string: str) -> None:\n",
-    "    \"\"\"Prints a comparison of four string encodings.\"\"\"\n",
-    "    # print the example string\n",
-    "    print(f'\\nExample string: \"{example_string}\"')\n",
-    "    # for each encoding, print the # of tokens, the token integers, and the token bytes\n",
-    "    for encoding_name in [\"r50k_base\", \"p50k_base\", \"cl100k_base\", \"o200k_base\"]:\n",
-    "        encoding = tiktoken.get_encoding(encoding_name)\n",
-    "        token_integers = encoding.encode(example_string)\n",
-    "        num_tokens = len(token_integers)\n",
-    "        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n",
-    "        print()\n",
-    "        print(f\"{encoding_name}: {num_tokens} tokens\")\n",
-    "        print(f\"token integers: {token_integers}\")\n",
-    "        print(f\"token bytes: {token_bytes}\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "Example string: \"antidisestablishmentarianism\"\n",
-      "\n",
-      "r50k_base: 5 tokens\n",
-      "token integers: [415, 29207, 44390, 3699, 1042]\n",
-      "token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
-      "\n",
-      "p50k_base: 5 tokens\n",
-      "token integers: [415, 29207, 44390, 3699, 1042]\n",
-      "token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
-      "\n",
-      "cl100k_base: 6 tokens\n",
-      "token integers: [519, 85342, 34500, 479, 8997, 2191]\n",
-      "token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']\n",
-      "\n",
-      "o200k_base: 6 tokens\n",
-      "token integers: [493, 129901, 376, 160388, 21203, 2367]\n",
-      "token bytes: [b'ant', b'idis', b'est', b'ablishment', b'arian', b'ism']\n"
-     ]
-    }
-   ],
-   "source": [
-    "compare_encodings(\"antidisestablishmentarianism\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "Example string: \"2 + 2 = 4\"\n",
-      "\n",
-      "r50k_base: 5 tokens\n",
-      "token integers: [17, 1343, 362, 796, 604]\n",
-      "token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
-      "\n",
-      "p50k_base: 5 tokens\n",
-      "token integers: [17, 1343, 362, 796, 604]\n",
-      "token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
-      "\n",
-      "cl100k_base: 7 tokens\n",
-      "token integers: [17, 489, 220, 17, 284, 220, 19]\n",
-      "token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']\n",
-      "\n",
-      "o200k_base: 7 tokens\n",
-      "token integers: [17, 659, 220, 17, 314, 220, 19]\n",
-      "token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']\n"
-     ]
-    }
-   ],
-   "source": [
-    "compare_encodings(\"2 + 2 = 4\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "Example string: \"お誕生日おめでとう\"\n",
-      "\n",
-      "r50k_base: 14 tokens\n",
-      "token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
-      "token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
-      "\n",
-      "p50k_base: 14 tokens\n",
-      "token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
-      "token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
-      "\n",
-      "cl100k_base: 9 tokens\n",
-      "token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]\n",
-      "token bytes: [b'\\xe3\\x81\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97\\xa5', b'\\xe3\\x81\\x8a', b'\\xe3\\x82\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8\\xe3\\x81\\x86']\n",
-      "\n",
-      "o200k_base: 8 tokens\n",
-      "token integers: [8930, 9697, 243, 128225, 8930, 17693, 4344, 48669]\n",
-      "token bytes: [b'\\xe3\\x81\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f\\xe6\\x97\\xa5', b'\\xe3\\x81\\x8a', b'\\xe3\\x82\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8\\xe3\\x81\\x86']\n"
-     ]
-    }
-   ],
-   "source": [
-    "compare_encodings(\"お誕生日おめでとう\")"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 6. Counting tokens for chat completions API calls\n",
-    "\n",
-    "ChatGPT models like `gpt-4o-mini` and `gpt-4` use tokens in the same way as older completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
-    "\n",
-    "Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo`, `gpt-4`, `gpt-4o` and `gpt-4o-mini`.\n",
-    "\n",
-    "Note that the exact way that tokens are counted from messages may change from model to model. Consider the counts from the function below an estimate, not a timeless guarantee.\n",
-    "\n",
-    "In particular, requests that use the optional functions input will consume extra tokens on top of the estimates calculated below."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def num_tokens_from_messages(messages, model=\"gpt-4o-mini-2024-07-18\"):\n",
-    "    \"\"\"Return the number of tokens used by a list of messages.\"\"\"\n",
-    "    try:\n",
-    "        encoding = tiktoken.encoding_for_model(model)\n",
-    "    except KeyError:\n",
-    "        print(\"Warning: model not found. Using o200k_base encoding.\")\n",
-    "        encoding = tiktoken.get_encoding(\"o200k_base\")\n",
-    "    if model in {\n",
-    "        \"gpt-3.5-turbo-0125\",\n",
-    "        \"gpt-4-0314\",\n",
-    "        \"gpt-4-32k-0314\",\n",
-    "        \"gpt-4-0613\",\n",
-    "        \"gpt-4-32k-0613\",\n",
-    "        \"gpt-4o-mini-2024-07-18\",\n",
-    "        \"gpt-4o-2024-08-06\"\n",
-    "        }:\n",
-    "        tokens_per_message = 3\n",
-    "        tokens_per_name = 1\n",
-    "    elif \"gpt-3.5-turbo\" in model:\n",
-    "        print(\"Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0125.\")\n",
-    "        return num_tokens_from_messages(messages, model=\"gpt-3.5-turbo-0125\")\n",
-    "    elif \"gpt-4o-mini\" in model:\n",
-    "        print(\"Warning: gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-mini-2024-07-18.\")\n",
-    "        return num_tokens_from_messages(messages, model=\"gpt-4o-mini-2024-07-18\")\n",
-    "    elif \"gpt-4o\" in model:\n",
-    "        print(\"Warning: gpt-4o and gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-2024-08-06.\")\n",
-    "        return num_tokens_from_messages(messages, model=\"gpt-4o-2024-08-06\")\n",
-    "    elif \"gpt-4\" in model:\n",
-    "        print(\"Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.\")\n",
-    "        return num_tokens_from_messages(messages, model=\"gpt-4-0613\")\n",
-    "    else:\n",
-    "        raise NotImplementedError(\n",
-    "            f\"\"\"num_tokens_from_messages() is not implemented for model {model}.\"\"\"\n",
-    "        )\n",
-    "    num_tokens = 0\n",
-    "    for message in messages:\n",
-    "        num_tokens += tokens_per_message\n",
-    "        for key, value in message.items():\n",
-    "            num_tokens += len(encoding.encode(value))\n",
-    "            if key == \"name\":\n",
-    "                num_tokens += tokens_per_name\n",
-    "    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>\n",
-    "    return num_tokens\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "gpt-3.5-turbo\n",
-      "Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0125.\n",
-      "129 prompt tokens counted by num_tokens_from_messages().\n",
-      "129 prompt tokens counted by the OpenAI API.\n",
-      "\n",
-      "gpt-4-0613\n",
-      "129 prompt tokens counted by num_tokens_from_messages().\n",
-      "129 prompt tokens counted by the OpenAI API.\n",
-      "\n",
-      "gpt-4\n",
-      "Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.\n",
-      "129 prompt tokens counted by num_tokens_from_messages().\n",
-      "129 prompt tokens counted by the OpenAI API.\n",
-      "\n",
-      "gpt-4o\n",
-      "Warning: gpt-4o and gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-2024-08-06.\n",
-      "124 prompt tokens counted by num_tokens_from_messages().\n",
-      "124 prompt tokens counted by the OpenAI API.\n",
-      "\n",
-      "gpt-4o-mini\n",
-      "Warning: gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-mini-2024-07-18.\n",
-      "124 prompt tokens counted by num_tokens_from_messages().\n",
-      "124 prompt tokens counted by the OpenAI API.\n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "# let's verify the function above matches the OpenAI API response\n",
-    "\n",
-    "from openai import OpenAI\n",
-    "import os\n",
-    "\n",
-    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"\"))\n",
-    "\n",
-    "example_messages = [\n",
-    "    {\n",
-    "        \"role\": \"system\",\n",
-    "        \"content\": \"You are a helpful, pattern-following assistant that translates corporate jargon into plain English.\",\n",
-    "    },\n",
-    "    {\n",
-    "        \"role\": \"system\",\n",
-    "        \"name\": \"example_user\",\n",
-    "        \"content\": \"New synergies will help drive top-line growth.\",\n",
-    "    },\n",
-    "    {\n",
-    "        \"role\": \"system\",\n",
-    "        \"name\": \"example_assistant\",\n",
-    "        \"content\": \"Things working well together will increase revenue.\",\n",
-    "    },\n",
-    "    {\n",
-    "        \"role\": \"system\",\n",
-    "        \"name\": \"example_user\",\n",
-    "        \"content\": \"Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.\",\n",
-    "    },\n",
-    "    {\n",
-    "        \"role\": \"system\",\n",
-    "        \"name\": \"example_assistant\",\n",
-    "        \"content\": \"Let's talk later when we're less busy about how to do better.\",\n",
-    "    },\n",
-    "    {\n",
-    "        \"role\": \"user\",\n",
-    "        \"content\": \"This late pivot means we don't have time to boil the ocean for the client deliverable.\",\n",
-    "    },\n",
-    "]\n",
-    "\n",
-    "for model in [\n",
-    "    \"gpt-3.5-turbo\",\n",
-    "    \"gpt-4-0613\",\n",
-    "    \"gpt-4\",\n",
-    "    \"gpt-4o\",\n",
-    "    \"gpt-4o-mini\"\n",
-    "    ]:\n",
-    "    print(model)\n",
-    "    # example token count from the function defined above\n",
-    "    print(f\"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().\")\n",
-    "    # example token count from the OpenAI API\n",
-    "    response = client.chat.completions.create(model=model,\n",
-    "                                               messages=example_messages,\n",
-    "                                               temperature=0,\n",
-    "                                               max_tokens=1)\n",
-    "    print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')\n",
-    "    print()\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 7. Counting tokens for chat completions with tool calls\n",
-    "\n",
-    "Next, we will look into how to apply these calculations to messages that may contain function calls. This is not immediately trivial, due to the formatting of the tools themselves.\n",
-    "\n",
-    "Below is an example function for counting tokens for messages that contain tools, passed to `gpt-3.5-turbo`, `gpt-4`, `gpt-4o` and `gpt-4o-mini`."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 16,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def num_tokens_for_tools(functions, messages, model):\n",
-    "    \n",
-    "    # Initialize function settings to 0\n",
-    "    func_init = 0\n",
-    "    prop_init = 0\n",
-    "    prop_key = 0\n",
-    "    enum_init = 0\n",
-    "    enum_item = 0\n",
-    "    func_end = 0\n",
-    "    \n",
-    "    if model in [\n",
-    "        \"gpt-4o\",\n",
-    "        \"gpt-4o-mini\"\n",
-    "    ]:\n",
-    "        \n",
-    "        # Set function settings for the above models\n",
-    "        func_init = 7\n",
-    "        prop_init = 3\n",
-    "        prop_key = 3\n",
-    "        enum_init = -3\n",
-    "        enum_item = 3\n",
-    "        func_end = 12\n",
-    "    elif model in [\n",
-    "        \"gpt-3.5-turbo\",\n",
-    "        \"gpt-4\"\n",
-    "    ]:\n",
-    "        # Set function settings for the above models\n",
-    "        func_init = 10\n",
-    "        prop_init = 3\n",
-    "        prop_key = 3\n",
-    "        enum_init = -3\n",
-    "        enum_item = 3\n",
-    "        func_end = 12\n",
-    "    else:\n",
-    "        raise NotImplementedError(\n",
-    "            f\"\"\"num_tokens_for_tools() is not implemented for model {model}.\"\"\"\n",
-    "        )\n",
-    "    \n",
-    "    try:\n",
-    "        encoding = tiktoken.encoding_for_model(model)\n",
-    "    except KeyError:\n",
-    "        print(\"Warning: model not found. Using o200k_base encoding.\")\n",
-    "        encoding = tiktoken.get_encoding(\"o200k_base\")\n",
-    "    \n",
-    "    func_token_count = 0\n",
-    "    if len(functions) > 0:\n",
-    "        for f in functions:\n",
-    "            func_token_count += func_init  # Add tokens for start of each function\n",
-    "            function = f[\"function\"]\n",
-    "            f_name = function[\"name\"]\n",
-    "            f_desc = function[\"description\"]\n",
-    "            if f_desc.endswith(\".\"):\n",
-    "                f_desc = f_desc[:-1]\n",
-    "            line = f_name + \":\" + f_desc\n",
-    "            func_token_count += len(encoding.encode(line))  # Add tokens for set name and description\n",
-    "            if len(function[\"parameters\"][\"properties\"]) > 0:\n",
-    "                func_token_count += prop_init  # Add tokens for start of each property\n",
-    "                for key in list(function[\"parameters\"][\"properties\"].keys()):\n",
-    "                    func_token_count += prop_key  # Add tokens for each set property\n",
-    "                    p_name = key\n",
-    "                    p_type = function[\"parameters\"][\"properties\"][key][\"type\"]\n",
-    "                    p_desc = function[\"parameters\"][\"properties\"][key][\"description\"]\n",
-    "                    if \"enum\" in function[\"parameters\"][\"properties\"][key].keys():\n",
-    "                        func_token_count += enum_init  # Add tokens if property has enum list\n",
-    "                        for item in function[\"parameters\"][\"properties\"][key][\"enum\"]:\n",
-    "                            func_token_count += enum_item\n",
-    "                            func_token_count += len(encoding.encode(item))\n",
-    "                    if p_desc.endswith(\".\"):\n",
-    "                        p_desc = p_desc[:-1]\n",
-    "                    line = f\"{p_name}:{p_type}:{p_desc}\"\n",
-    "                    func_token_count += len(encoding.encode(line))\n",
-    "        func_token_count += func_end\n",
-    "    \n",
-    "    messages_token_count = num_tokens_from_messages(messages, model)\n",
-    "    total_tokens = messages_token_count + func_token_count\n",
-    "    \n",
-    "    return total_tokens"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 17,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "gpt-3.5-turbo\n",
-      "Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0125.\n",
-      "105 prompt tokens counted by num_tokens_for_tools().\n",
-      "105 prompt tokens counted by the OpenAI API.\n",
-      "\n",
-      "gpt-4\n",
-      "Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.\n",
-      "105 prompt tokens counted by num_tokens_for_tools().\n",
-      "105 prompt tokens counted by the OpenAI API.\n",
-      "\n",
-      "gpt-4o\n",
-      "Warning: gpt-4o and gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-2024-08-06.\n",
-      "101 prompt tokens counted by num_tokens_for_tools().\n",
-      "101 prompt tokens counted by the OpenAI API.\n",
-      "\n",
-      "gpt-4o-mini\n",
-      "Warning: gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-mini-2024-07-18.\n",
-      "101 prompt tokens counted by num_tokens_for_tools().\n",
-      "101 prompt tokens counted by the OpenAI API.\n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "tools = [\n",
-    "    {\n",
-    "        \"type\": \"function\",\n",
-    "        \"function\": {\n",
-    "            \"name\": \"get_current_weather\",\n",
-    "            \"description\": \"Get the current weather in a given location\",\n",
-    "            \"parameters\": {\n",
-    "                \"type\": \"object\",\n",
-    "                \"properties\": {\n",
-    "                    \"location\": {\n",
-    "                        \"type\": \"string\",\n",
-    "                        \"description\": \"The city and state, e.g. San Francisco, CA\",\n",
-    "                    },\n",
-    "                    \"unit\": {\"type\": \"string\",\n",
-    "                             \"description\": \"The unit of temperature to return\",\n",
-    "                             \"enum\": [\"celsius\", \"fahrenheit\"]},\n",
-    "                },\n",
-    "                \"required\": [\"location\"],\n",
-    "            },\n",
-    "        }\n",
-    "    }\n",
-    "]\n",
-    "\n",
-    "example_messages = [\n",
-    "    {\n",
-    "        \"role\": \"system\",\n",
-    "        \"content\": \"You are a helpful assistant that can answer to questions about the weather.\",\n",
-    "    },\n",
-    "    {\n",
-    "        \"role\": \"user\",\n",
-    "        \"content\": \"What's the weather like in San Francisco?\",\n",
-    "    },\n",
-    "]\n",
-    "\n",
-    "for model in [\n",
-    "    \"gpt-3.5-turbo\",\n",
-    "    \"gpt-4\",\n",
-    "    \"gpt-4o\",\n",
-    "    \"gpt-4o-mini\"\n",
-    "    ]:\n",
-    "    print(model)\n",
-    "    # example token count from the function defined above\n",
-    "    print(f\"{num_tokens_for_tools(tools, example_messages, model)} prompt tokens counted by num_tokens_for_tools().\")\n",
-    "    # example token count from the OpenAI API\n",
-    "    response = client.chat.completions.create(model=model,\n",
-    "                                               messages=example_messages,\n",
-    "                                               tools=tools,\n",
-    "                                               temperature=0)\n",
-    "    print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')\n",
-    "    print()"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.11.7"
-  },
-  "vscode": {
-   "interpreter": {
-    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
-   }
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
+tiktoken/core.py