
Incorrect token count (usage_metadata) in streaming mode #30429

Open
5 tasks done
andrePankraz opened this issue Mar 22, 2025 · 8 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature) · investigate (Flagged for investigation)

Comments

@andrePankraz

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Any LLM call with streaming.

The aggregated token usage is totally wrong and much too high.

See this method:

def add_ai_message_chunks(left, *others):  # excerpt from langchain_core.messages.ai
    ...
    # Token usage
    if left.usage_metadata or any(o.usage_metadata is not None for o in others):
        usage_metadata: Optional[UsageMetadata] = left.usage_metadata
        for other in others:
            usage_metadata = add_usage(usage_metadata, other.usage_metadata)
    else:
        usage_metadata = None

For streaming we get usage_metadata for each token, e.g.

'input_tokens' = 713
'output_tokens' = 1
'total_tokens' = 714

output_tokens is always 1, so it sums up correctly.
input_tokens is always 713 for every chunk of the LLM token stream, so it sums up to input_tokens * count(chunks) (and total_tokens, starting at 714, behaves the same way).

This aggregation just adds the counts up to huge (totally useless) numbers.
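
To make the arithmetic concrete, here is a minimal illustration (plain dicts rather than the actual LangChain UsageMetadata type) of what happens when every chunk repeats the full input_tokens count and the chunks are simply summed:

# Illustrative only: mimics adding per-chunk usage the way the stream chunks are merged.
chunks = [{"input_tokens": 713, "output_tokens": 1, "total_tokens": 714}] * 10

aggregated = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
for usage in chunks:
    for key in aggregated:
        aggregated[key] += usage[key]

print(aggregated)
# {'input_tokens': 7130, 'output_tokens': 10, 'total_tokens': 7140}
# The correct values for this stream would be input_tokens=713, total_tokens=723.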

What is the strategy here? Should the LLM not report per-token usage_metadata and only report it in the final chunk? In that case langchain-openai would have to change this for that call:

def _create_usage_metadata(oai_token_usage: dict) -> UsageMetadata:
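
For context, a rough sketch of what this helper does conceptually (the field names below are the standard OpenAI usage keys; this is not the verbatim implementation):

# Sketch only: maps the provider's usage block onto LangChain's UsageMetadata shape.
def create_usage_metadata_sketch(oai_token_usage: dict) -> dict:
    input_tokens = oai_token_usage.get("prompt_tokens", 0)
    output_tokens = oai_token_usage.get("completion_tokens", 0)
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": oai_token_usage.get("total_tokens", input_tokens + output_tokens),
    }

The open question is whether this conversion should be attached to every streamed chunk or only to the final one.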

Error Message and Stack Trace (if applicable)

No response

Description

  • I'm trying to get sane token usage numbers for streaming with usage_metadata.
  • I get hugely inflated total_tokens and input_tokens (because they are multiplied by the number of output chunks).
  • Define a strategy and either adapt the token aggregation in langchain_core.messages.add_ai_message_chunks or report usage only in the final chunk in langchain_openai.chat_models.base._create_usage_metadata.

System Info

totally not relevant

@langcarl langcarl bot added the investigate Flagged for investigation. label Mar 22, 2025
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Mar 22, 2025
@le-codeur-rapide

Hello @andrePankraz,
When I run a basic stream from an OpenAI chat model while tracking the stream usage:

from langchain_openai import ChatOpenAI

model = ChatOpenAI()
prompt = "write the recipe of tiramisu"
response = model.stream(prompt, stream_usage=True)
for s in response:
    print(s)

I get the expected usage_metadata at the end of the stream.

[...] previous stream chunks
content=' Enjoy' additional_kwargs={} response_metadata={} id='run-2466888b-00dc-4720-8572-00056703fc67'
content='!' additional_kwargs={} response_metadata={} id='run-2466888b-00dc-4720-8572-00056703fc67'
content='' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'gpt-3.5-turbo-0125'} id='run-2466888b-00dc-4720-8572-00056703fc67'
content='' additional_kwargs={} response_metadata={} id='run-2466888b-00dc-4720-8572-00056703fc67' usage_metadata={'input_tokens': 14, 'output_tokens': 293, 'total_tokens': 307, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}

Can you please share your code?

@Yogesh-Dubey-Ayesavi

I'm facing the same issue. In my case I don't get output_tokens or total_tokens in the usage metadata, and with ChatOpenAI not even input_tokens. This happens for:

  1. ChatOpenAI.
  2. ChatGoogleGenerativeAI.

Note

I have enabled stream_usage and streaming in runnable configurations.

Code

const stream = await this.graph?.stream(graphInput, {
  configurable: this.config?.configurable,
  streamMode: "messages",
});

for await (const [msg] of stream!) {
  // Write message to JSON file
  const logPath = `./logs/stream_${this.config?.configurable.thread_id}.json`;
  const logData = {
    timestamp: new Date().toISOString(),
    message: msg,
  };

  fs.mkdirSync("./logs", { recursive: true });

  let existingData = [];
  if (fs.existsSync(logPath)) {
    existingData = JSON.parse(fs.readFileSync(logPath, "utf8"));
  }

  existingData.push(logData);
  fs.writeFileSync(logPath, JSON.stringify(existingData, null, 2));
}

Logs for ChatOpenAI

[
  {
    "timestamp": "2025-03-22T11:45:17.791Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.794Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "Hi",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.814Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": " there",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.815Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "!",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.843Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": " What's",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.844Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": " your",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.854Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": " name",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.855Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "?",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:45:17.859Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "",
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "id": "chatcmpl-BDrZtqqSJCZyyZjuY4eL8CtcBwGW6",
        "response_metadata": {
          "usage": {}
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  }
]

Logs for ChatGoogleGenerativeAI

[
  {
    "timestamp": "2025-03-22T11:39:34.645Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "Hi",
        "tool_calls": [],
        "invalid_tool_calls": [],
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "response_metadata": {},
        "id": "c5020587-7cb7-437c-b835-462eb5831e79"
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:39:34.649Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": " there! I",
        "tool_calls": [],
        "invalid_tool_calls": [],
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "response_metadata": {},
        "id": "3ddd7076-4b61-4eb4-a4bf-016943cb252b"
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:39:34.699Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "'m Mira, your AI career advisor. I can offer guidance on choosing a",
        "tool_calls": [],
        "invalid_tool_calls": [],
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "response_metadata": {},
        "id": "2e9d873f-6db0-4591-bc25-ad6fdd17a7a1"
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:39:34.785Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": " tech career, suggest skills to learn, give job search tips, and share industry",
        "tool_calls": [],
        "invalid_tool_calls": [],
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "response_metadata": {},
        "id": "7ba57b54-1e4f-4bfd-a2ed-4241d3c2cd64"
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:39:34.870Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": " insights. What kind of tech work excites you?\n",
        "tool_calls": [],
        "invalid_tool_calls": [],
        "tool_call_chunks": [],
        "additional_kwargs": {},
        "response_metadata": {},
        "id": "68e76c02-7b86-4401-b655-80a46e29b670"
      }
    }
  },
  {
    "timestamp": "2025-03-22T11:39:34.885Z",
    "message": {
      "lc": 1,
      "type": "constructor",
      "id": [
        "langchain_core",
        "messages",
        "AIMessageChunk"
      ],
      "kwargs": {
        "content": "Hi there! I'm Mira, your AI career advisor. I can offer guidance on choosing a tech career, suggest skills to learn, give job search tips, and share industry insights. What kind of tech work excites you?\n",
        "additional_kwargs": {},
        "response_metadata": {},
        "tool_call_chunks": [],
        "id": "run-8fca304b-c25b-4efa-9e89-71601a744918",
        "usage_metadata": {
          "input_tokens": 590,
          "output_tokens": null,
          "total_tokens": null
        },
        "tool_calls": [],
        "invalid_tool_calls": []
      }
    }
  }
]

@ccurme
Collaborator

ccurme commented Mar 22, 2025

@Yogesh-Dubey-Ayesavi could you open a separate issue in https://github.com/langchain-ai/langchainjs?

@Yogesh-Dubey-Ayesavi

Yes @ccurme Sure.

@andrePankraz
Author

Hi, thank you for looking into this.
I've got a minimal example that shows the problem.

It only happens in astream mode with a callback handler, a pattern many chatbots use for step tracking when the LLM call is embedded into bigger agent graph structures.

The final usage_metadata is wrong:

import asyncio

from langchain_core.callbacks import AsyncCallbackHandler
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.runnables import RunnableConfig

# llm is a chat model instance with streaming usage enabled (stream_usage=True)


async def astream():

    class MyCallbackHandler(AsyncCallbackHandler):
        async def on_llm_end(self, response, **kwargs):
            # This method is called when the LLM stream ends and contains the wrong final usage_metadata
            print("LLM ended:", response)

    async for token in llm.astream(
        [
            SystemMessage(content="Answer briefly and concisely, follow the instructions exactly."),
            HumanMessage(content="Repeat the word 'TEST' 10 times"),
        ],
        config=RunnableConfig(callbacks=[MyCallbackHandler()]),
    ):
        # Correct usage data per token
        print(token)


asyncio.run(astream())

The problematic parts are called as described in the original issue description; the usage_metadata aggregation in those methods works incorrectly.

But it only shows up in the final on_llm_end callback (with the aggregated generation message); otherwise astream() delivers the original usage_metadata per token (yield chunk.message).
See here:

https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/language_models/chat_models.py#L456
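
For reference, a minimal sketch of how the (mis)aggregated numbers can be inspected in that callback; the class name is illustrative and the attribute path follows the LLMResult structure printed by the handler above:

from langchain_core.callbacks import AsyncCallbackHandler

class UsageLoggingHandler(AsyncCallbackHandler):
    async def on_llm_end(self, response, **kwargs):
        # response.generations holds one list per prompt; its first entry carries
        # the merged AIMessageChunk with the aggregated usage_metadata.
        message = response.generations[0][0].message
        print("aggregated usage_metadata:", message.usage_metadata)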

@le-codeur-rapide

Hello @andrePankraz
That is weird, I tested your minimal code example and still get the expected behaviour:
usage_metadata={'input_tokens': 32, 'output_tokens': 11, 'total_tokens': 43,

content=' TEST' additional_kwargs={} response_metadata={} id='run-ba481c1a-e135-4d3a-b860-a3659694c58a'
content=' TEST' additional_kwargs={} response_metadata={} id='run-ba481c1a-e135-4d3a-b860-a3659694c58a'
content=' TEST' additional_kwargs={} response_metadata={} id='run-ba481c1a-e135-4d3a-b860-a3659694c58a'
content='' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_6ec83003ad'} id='run-ba481c1a-e135-4d3a-b860-a3659694c58a'
content='' additional_kwargs={} response_metadata={} id='run-ba481c1a-e135-4d3a-b860-a3659694c58a' usage_metadata={'input_tokens': 32, 'output_tokens': 11, 'total_tokens': 43, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}
LLM ended: generations=[[ChatGenerationChunk(text='TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST', generation_info={'finish_reason': 'stop', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_6ec83003ad'}, message=AIMessageChunk(content='TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST', additional_kwargs={}, response_metadata={'finish_reason': 'stop', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_6ec83003ad'}, id='run-ba481c1a-e135-4d3a-b860-a3659694c58a', usage_metadata={'input_tokens': 32, 'output_tokens': 11, 'total_tokens': 43, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}))]] llm_output=None run=None type='LLMResult'

Note that to get this usage metadata I had to pass the stream_usage=True argument to the llm.astream() call, so I wonder how you get usage metadata without it.

Maybe it is due to the LLM then; what model are you using?

@andrePankraz
Author

I have passed stream_usage=True at LLM initialization, and that works too.

But comparing with your log I can see the problem; here is mine:

content='' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 0, 'total_tokens': 35, 'input_token_details': {}, 'output_token_details': {}}
content='TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content=' TEST' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content='' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'Qwen/Qwen2.5-72B-Instruct'} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}
content='' additional_kwargs={} response_metadata={} id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d' usage_metadata={'input_tokens': 35, 'output_tokens': 11, 'total_tokens': 46, 'input_token_details': {}, 'output_token_details': {}}
LLM ended: generations=[[ChatGenerationChunk(text='TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST', generation_info={'finish_reason': 'stop', 'model_name': 'Qwen/Qwen2.5-72B-Instruct'}, message=AIMessageChunk(content='TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST', additional_kwargs={}, response_metadata={'finish_reason': 'stop', 'model_name': 'Qwen/Qwen2.5-72B-Instruct'}, id='run-6c4cfc6e-191d-46fb-8be9-2d2f7acb9d0d', usage_metadata={'input_tokens': 455, 'output_tokens': 22, 'total_tokens': 477, 'input_token_details': {}, 'output_token_details': {}}))]] llm_output=None run=None type='LLMResult'

As you can see, my OpenAI-compatible (!) API reports usage_metadata per token (!) with output_tokens = 1:
usage_metadata={'input_tokens': 35, 'output_tokens': 1, 'total_tokens': 36, 'input_token_details': {}, 'output_token_details': {}}

I use the vLLM inference server with its OpenAI-compatible API.
So I could try a different version and see whether they have changed this slightly different behaviour. They are not wrong to send usage_metadata with output_tokens=1 and the already-known input_tokens on every chunk, but judging by your log it is a bit different from what OpenAI does.

In the end, the usage_metadata aggregation code in

def add_ai_message_chunks(

does not make sense in general: you cannot aggregate input_tokens and total_tokens like this in a stream. It kind of works for output_tokens, but even then the final message roughly doubles the expected output_tokens number through this addition.

It only works for the original OpenAI API because they forward usage_metadata once, in the "final output token" (which is more of a synthetic technical final message than a real token; the stop token comes earlier in the stream).
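
A possible consumer-side workaround, just a sketch and not a LangChain API: if the backend repeats cumulative usage on every chunk, ignore the additive merge and keep only the last usage_metadata it reports, which in the log above already holds the per-request totals.

import asyncio

async def last_reported_usage(llm, messages):
    # Keep only the most recent usage report instead of summing them;
    # assumes the backend's final report is cumulative for the whole request.
    final_usage = None
    async for chunk in llm.astream(messages):
        if chunk.usage_metadata is not None:
            final_usage = chunk.usage_metadata  # last report wins
    return final_usage

# usage = asyncio.run(last_reported_usage(llm, messages))  # llm/messages as in the example above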

@le-codeur-rapide

Ahh ok, I see. Yes, you are right, this add_ai_message_chunks does not look like it works in the general case.
