Skip to content

Conversation

@codefromthecrypt
Copy link
Contributor

What this PR does / why we need it:

This commit fixes a data race in the host infrastructure where proxyContextMap was being accessed concurrently without synchronization.

The race occurred between:

  • runEnvoy() writing to the map
  • Close() reading from the map
  • stopEnvoy() reading and deleting from the map

Which issue(s) this PR fixes:

Race found in envoyproxy/ai-gateway#1314

@codecov
Copy link

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 71.12%. Comparing base (afaa7e3) to head (02aeb1a).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
internal/infrastructure/host/proxy_infra.go 93.75% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7176      +/-   ##
==========================================
+ Coverage   71.10%   71.12%   +0.01%     
==========================================
  Files         228      228              
  Lines       40678    40686       +8     
==========================================
+ Hits        28926    28937      +11     
+ Misses      10059    10054       -5     
- Partials     1693     1695       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@codefromthecrypt codefromthecrypt force-pushed the fix/proxy-context-map-race branch from ef4c378 to 415abb5 Compare October 9, 2025 03:36
@codefromthecrypt codefromthecrypt force-pushed the fix/proxy-context-map-race branch from dab5d78 to c73790d Compare October 9, 2025 15:48
@codefromthecrypt codefromthecrypt requested a review from jukie October 9, 2025 15:49
@jukie
Copy link
Contributor

jukie commented Oct 9, 2025

Code LGTM and if you fix the linting/import issues I'll approve

@codefromthecrypt codefromthecrypt deleted the fix/proxy-context-map-race branch October 9, 2025 23:04
@codefromthecrypt codefromthecrypt restored the fix/proxy-context-map-race branch October 10, 2025 17:18
@codefromthecrypt
Copy link
Contributor Author

weird no idea why this closed.. sorry about that.

@zirain zirain force-pushed the fix/proxy-context-map-race branch from 0678aae to 02aeb1a Compare October 11, 2025 00:50
jukie
jukie previously approved these changes Oct 11, 2025
zirain
zirain previously approved these changes Oct 11, 2025
@jukie
Copy link
Contributor

jukie commented Oct 12, 2025

@codefromthecrypt thanks for adding this. Could you please add more test coverage to make the bots happy?

This commit fixes a data race in the host infrastructure where
proxyContextMap was being accessed concurrently without synchronization.

The race occurred between:
- runEnvoy() writing to the map (line 92)
- Close() reading from the map (line 37)
- stopEnvoy() reading and deleting from the map (lines 125, 129)

Added a sync.RWMutex to protect all accesses to proxyContextMap:
- Use RLock for reads in Close() and CreateOrUpdateProxyInfra()
- Use Lock for writes in runEnvoy() and stopEnvoy()
- Extract map keys before iteration in Close() to avoid holding
  the lock while calling stopEnvoy()

This ensures thread-safe access to the proxy context map during
concurrent infrastructure operations.

Signed-off-by: Adrian Cole <[email protected]>
Signed-off-by: Adrian Cole <[email protected]>
Signed-off-by: Adrian Cole <[email protected]>
Signed-off-by: Adrian Cole <[email protected]>
Signed-off-by: Adrian Cole <[email protected]>
Signed-off-by: Adrian Cole <[email protected]>
Signed-off-by: Adrian Cole <[email protected]>
@codefromthecrypt
Copy link
Contributor Author

ok sure.. will add more coverage

Signed-off-by: Adrian Cole <[email protected]>
@codefromthecrypt codefromthecrypt dismissed stale reviews from zirain and jukie via 8bbb4f1 October 13, 2025 09:33
@codefromthecrypt codefromthecrypt force-pushed the fix/proxy-context-map-race branch from c28e20b to 8bbb4f1 Compare October 13, 2025 09:33
@codefromthecrypt
Copy link
Contributor Author

at least locally coverage is good now. 🤞

@jukie
Copy link
Contributor

jukie commented Oct 14, 2025

/retest

Copy link
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks

@arkodg arkodg requested review from jukie and zirain October 15, 2025 01:02
@jukie
Copy link
Contributor

jukie commented Oct 15, 2025

Looks like the new tests are failing @codefromthecrypt

--- FAIL: TestNewInfra (0.00s)
    proxy_infra_test.go:494: 
        	Error Trace:	/home/runner/work/gateway/gateway/internal/infrastructure/host/proxy_infra_test.go:494
        	Error:      	Received unexpected error:
        	            	failed to stat dir: lstat /tmp/envoy-gateway/certs/envoy: no such file or directory
        	Test:       	TestNewInfra

@codefromthecrypt
Copy link
Contributor Author

I'll close this and re-open it after #7225 is merged because the /tmp tech debt is haunting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants