VarSuren (Contributor)

Description
Requests with gzip compression were failing with EOF errors (the code tried to decompress incomplete chunks as if they were complete gzip streams).

Example:

Example of a request that failed prior to this change, where gcp-claude.json is any Claude request body.

gcp-claude.json:

{
  "model": "gcp.claude-3-5-haiku",
  "messages": [
    {"role": "user", "content": "Give me long story about love of NYC"}
  ],
  "temperature": 0.1, "max_tokens": 512, "stream": true
}

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" -H "accept-encoding: br, gzip, deflate" \
  -d @gcp-claude.json \
  "http://localhost:8080/anthropic/v1/messages" -v --compressed




Current approach:

Add a new function in utils for buffered streaming and call it once when the stream is done, so token usage is not counted twice across intermediate chunks.

Incomplete chunks are passed through as-is but appended to a buffer for subsequent checks.

Anthropic sends usage in the first and last chunks.


Test:

Tested both compressed and non-compressed requests, in both streaming and non-streaming modes.

@VarSuren requested a review from a team as a code owner September 29, 2025 21:44
@VarSuren changed the title from "buffer gzip data for anthropic messages" to "/fix buffer gzip data for anthropic messages" Sep 29, 2025
@VarSuren changed the title from "/fix buffer gzip data for anthropic messages" to "fix: buffer gzip data for anthropic messages" Sep 29, 2025
@codecov-commenter commented Sep 29, 2025

Codecov Report

❌ Patch coverage is 90.58824% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.21%. Comparing base (0d43423) to head (eb5d1fa).
⚠️ Report is 30 commits behind head on main.

Files with missing lines Patch % Lines
internal/extproc/messages_processor.go 82.60% 3 Missing and 1 partial ⚠️
...ernal/extproc/translator/anthropic_gcpanthropic.go 60.00% 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1247   +/-   ##
=======================================
  Coverage   79.21%   79.21%           
=======================================
  Files          97       97           
  Lines       11125    11178   +53     
=======================================
+ Hits         8813     8855   +42     
- Misses       1919     1930   +11     
  Partials      393      393           


// Try to decompress the accumulated buffer
if len(*gzipBuffer) > 0 {
	gzipReader, err := gzip.NewReader(bytes.NewReader(*gzipBuffer))
	if err != nil {
Contributor
Do we want to check for a specific error like "unexpected EOF"?

Contributor (Author)
Not sure; I'm not that familiar with gzip errors, so I don't know what the other errors are or whether we can buffer through them.

@yuzisun
Contributor

yuzisun commented Sep 30, 2025

@mathetake do you remember why we do not use the decompression and compression filter?

 // ResponseBody implements [AnthropicMessagesTranslator.ResponseBody] for Anthropic to GCP Anthropic.
 // This is essentially a passthrough since both use the same Anthropic response format.
-func (a *anthropicToGCPAnthropicTranslator) ResponseBody(_ map[string]string, body io.Reader, endOfStream bool) (
+func (a *anthropicToGCPAnthropicTranslator) ResponseBody(_ map[string]string, body io.Reader, isStreaming bool) (
Contributor
Why rename this? endOfStream is different from isStreaming.

Contributor (Author)
Good question, but I do mean isStreaming rather than endOfStream. We basically have two cases:

  1. translation for streaming
  2. translation for regular requests

This boolean tells which one to use.

Since you've raised the question, let me elaborate: previously we did the translation while streaming; now, because we accumulate, we no longer care about endOfStream. Let me know if that makes sense.

Contributor
Can you leave a comment so other maintainers are aware of the difference?

@sukumargaonkar
Contributor

lgtm

@mathetake (Member) left a comment

Thanks. My main concern is the divergence between this implementation and the one in chat/completions. At the end of the day, the exact same logic should be applicable, so I would like to see the same logic appear here as well.

In particular, special-casing gzip and buffering on the processor side feels confusing when the underlying issue lives inside translator/anthropic_gcpanthropic.go, compared to the other chat/completions streaming code.

Let me push a fix to bring this in line with the others.


Edited: I am getting to understand the underlying problem more. I think I need more time to think about this one.

@mathetake
Member

mathetake commented Oct 1, 2025

Sorry, a couple of things I need to know:

  • Does this mean that Anthropic returns the entire giant event stream in a single gzipped response body containing potentially hundreds of events? How is that acceptable when it is supposed to return a "streaming" response?
  • If the above is true, how is the Anthropic SDK supposed to parse the response body in a streaming way? I see the current code waits until the end of the stream, but if the Anthropic SDK can perform the event parsing in a streaming way, then the current code looks incorrect to me. What happens when, hypothetically, we want to do the translation later into another input format (likely OpenAI)? Shouldn't we be able to parse event-by-event instead of waiting for the end of the stream?

@mathetake
Member

Had a quick offline chat with Dan and I'm getting more of the context. I think the problem here is bigger than Anthropic-specific, and we should be able to handle it at a fundamental level. Let me dig into this one more tomorrow.

@mathetake
Member

My comment in the chat with Dan:

Oh, I think we can use the decompressor filter in Envoy, since I now realize that response handling happens at the router level, not the upstream level, so we can use the decompress filter for sure.
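For context, Envoy's decompressor HTTP filter would handle the gunzipping before the response reaches the ext_proc processor. A rough sketch of what such a filter chain entry might look like, using the Envoy v3 API type names; the exact placement within the AI Gateway's generated configuration is an assumption:

```yaml
http_filters:
- name: envoy.filters.http.decompressor
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.decompressor.v3.Decompressor
    decompressor_library:
      name: gzip
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.compression.gzip.decompressor.v3.Gzip
```

With this in place, the processor would see plaintext SSE bytes and the gzip buffering in the translator would become unnecessary, which is the long-term direction discussed below.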

@VarSuren
Contributor (Author)

VarSuren commented Oct 2, 2025

Let's be clear: having buffering and gzip handling in the current code doesn't look right to me either. If compression/decompression filters work, I'd go that way, but we already have some gzip logic and I don't want to mess with the other processors. Additionally, there is time pressure to get this done; several teams are now blocked by this.

@VarSuren
Contributor (Author)

VarSuren commented Oct 2, 2025

> Sorry a couple of things I need to know
>
>   • Does this mean that Anthropic returns the entire giant event stream in a single gzipped response body containing potentially hundreds of events? How is that acceptable when it is supposed to return "streaming" response?
>   • If the above is true, how the Anthropic SDK is supposed to parse the response body in a streaming way? I see the current code is waiting until the end of stream but if the anthropic SDK can perform the event parsing in a streaming way, then the current code looks not correct to me? What happens when we want to do the translation later into another input format (openai likely) hypothetically? shouldn't we be able to parse event-by-event instead of waiting for the end of the stream?

Anthropic doesn't send all events in one response; they arrive one by one. But you never know when you will have a complete chunk to decompress, therefore we buffer. You could skip waiting until the end of the stream, but because I didn't want to accidentally double-parse usage in intermediate chunks, I do it at the end of the stream once the chunks are over. The other way would be: once a chunk is complete (can be decompressed), you translate it immediately.
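The event-by-event alternative mentioned here could look roughly like this: once a buffered chunk decompresses cleanly, split it on the SSE blank-line delimiter and translate each event individually. This is an illustrative sketch, not the PR's code; `splitSSE` and the sample events are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// splitSSE is a bufio.SplitFunc that yields one server-sent event per token,
// splitting on the blank line ("\n\n") that separates SSE events.
func splitSSE(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if i := bytes.Index(data, []byte("\n\n")); i >= 0 {
		return i + 2, data[:i], nil // complete event found
	}
	if atEOF && len(data) > 0 {
		return len(data), data, nil // final event without trailing delimiter
	}
	return 0, nil, nil // incomplete event; wait for more data
}

func main() {
	// Decompressed bytes from one or more buffered chunks (illustrative payload).
	decompressed := []byte("event: message_start\ndata: {\"usage\":{\"input_tokens\":10}}\n\n" +
		"event: message_delta\ndata: {\"usage\":{\"output_tokens\":42}}\n\n")

	sc := bufio.NewScanner(bytes.NewReader(decompressed))
	sc.Split(splitSSE)
	for sc.Scan() {
		// Each token is one SSE event that could be translated independently.
		fmt.Printf("event: %q\n", sc.Text())
	}
}
```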

@yuzisun
Contributor

yuzisun commented Oct 2, 2025

Synced with @mathetake offline, will get this fix in first and then explore a proper long term fix using the decompression filter. @VarSuren can you help create an issue to track that?

@VarSuren
Contributor (Author)

VarSuren commented Oct 2, 2025

> Synced with @mathetake offline, will get this fix in first and then explore a proper long term fix using the decompression filter. @VarSuren can you help create an issue to track that?

Sure, do we want an internal ticket or an open-source one?

@yuzisun
Contributor

yuzisun commented Oct 2, 2025

> > Synced with @mathetake offline, will get this fix in first and then explore a proper long term fix using the decompression filter. @VarSuren can you help create an issue to track that?
>
> Sure, do we want an internal ticket or an open-source one?

issues here https://github.com/envoyproxy/ai-gateway/issues
