Commit efafe9c

fix: adds upgrade testing & graceful termination workaround (envoyproxy#1248)
**Description**

This adds end-to-end upgrade tests covering two scenarios: a simple rolling upgrade and a control plane upgrade. Both keep sending requests throughout the upgrade and verify that no requests are dropped.

As reported in envoyproxy#1241, extproc can terminate slightly before Envoy has stopped receiving requests, so users might see requests dropped during an upgrade. The fundamental fix is to run extproc as a sidecar container in the k8s API sense, but that is only available by default from k8s v1.33. This commit therefore adds a common workaround: sleep before the context cancellation. The workaround is verified to work in the newly added e2e upgrade tests.

After this is merged, two follow-ups will be necessary: k8s-version detection that enables the sidecar by default automatically, and a backport of the fix to v0.3. After that, the control-plane upgrade variant of the test case, which is disabled in this commit, can be enabled.

**Related Issues/PRs (if applicable)**

Closes envoyproxy#1241
Closes envoyproxy#1060

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
1 parent c3321f9 commit efafe9c

File tree

10 files changed, +580 -79 lines changed


.codespell.skip

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@
 go.mod
 go.sum
 ./tests/e2e/logs
+./tests/e2e-inference-extension/logs
+./tests/e2e-upgrade/logs
 *_for_tests.yaml
 ./tests/extproc/testdata/server.*
 ./tests/internal/testopenai/cassettes/*.yaml

.github/workflows/build_and_test.yaml

Lines changed: 26 additions & 0 deletions
@@ -242,6 +242,31 @@ jobs:
           TEST_GEMINI_API_KEY: ${{ secrets.ENVOY_AI_GATEWAY_GEMINI_API_KEY }}
         run: make test-e2e
 
+  test_e2e_upgrade:
+    needs: changes
+    if: ${{ needs.changes.outputs.code == 'true' }}
+    name: E2E Test for Upgrades
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-go@v5
+        with:
+          cache: false
+          go-version-file: go.mod
+      - uses: actions/cache@v4
+        with:
+          path: |
+            ~/.cache/go-build
+            ~/.cache/golangci-lint
+            ~/go/pkg/mod
+            ~/go/bin
+          key: e2e-test-${{ hashFiles('**/go.mod', '**/go.sum', '**/Makefile') }}
+      - uses: docker/setup-buildx-action@v3
+      - run: make test-e2e-upgrade
+        env:
+          # We only need to test the upgrade from the latest stable version of EG.
+          EG_VERSION: v1.5.0
+
   test_e2e_inference_extension:
     needs: changes
     if: ${{ needs.changes.outputs.code == 'true' }}
@@ -312,6 +337,7 @@ jobs:
       - test_controller
       - test_extproc
       - test_e2e
+      - test_e2e_upgrade
       - test_e2e_inference_extension
     # We need this to run always to force-fail (and not skip) if any needed
     # job has failed. Otherwise, a skipped job will not fail the workflow.

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -11,7 +11,9 @@ out/
 # This is the placeholder for the access log file during extproc tests.
 ACCESS_LOG_PATH
 
-tests/e2e/logs
+tests/e2e/logs/
+tests/e2e-inference-extension/logs/
+tests/e2e-upgrade/logs/
 
 # Files and directories to ignore in the site directory
 # dependencies

Makefile

Lines changed: 6 additions & 0 deletions
@@ -182,6 +182,12 @@ test-e2e-inference-extension: build-e2e ## Run the end-to-end tests with a local
 	@echo "Run E2E tests for inference extension"
 	@go test -v ./tests/e2e-inference-extension/... $(GO_TEST_ARGS) $(GO_TEST_E2E_ARGS)
 
+# This runs the end-to-end upgrade tests for the controller and extproc with a local kind cluster.
+.PHONY: test-e2e-upgrade
+test-e2e-upgrade: build-e2e
+	@echo "Run E2E upgrade tests"
+	@go test -v ./tests/e2e-upgrade/... $(GO_TEST_ARGS) $(GO_TEST_E2E_ARGS)
+
 ##@ Common
 
 # This builds a binary for the given command under the internal/cmd directory.

cmd/extproc/main.go

Lines changed: 15 additions & 0 deletions
@@ -11,6 +11,7 @@ import (
 	"os"
 	"os/signal"
 	"syscall"
+	"time"
 
 	"github.com/envoyproxy/ai-gateway/cmd/extproc/mainlib"
 )
@@ -21,6 +22,20 @@ func main() {
 	signal.Notify(signalsChan, syscall.SIGINT, syscall.SIGTERM)
 	go func() {
 		<-signalsChan
+		log.Printf("signal received, shutting down...")
+		// Give some time for graceful shutdown. Right after the sigterm is issued for this pod,
+		// Envoy's health checking endpoint starts returning 503, but there's a gap between
+		// actual stop of the traffic to Envoy and the time when Envoy receives the SIGTERM since
+		// the propagation of the readiness info to the load balancer takes some time.
+		// We need to keep the extproc alive until after Envoy stops receiving traffic.
+		// https://gateway.envoyproxy.io/docs/tasks/operations/graceful-shutdown/
+		//
+		// This is a workaround for older k8s versions that don't support sidecar feature.
+		// This can be removed after the floor of supported k8s versions is larger than 1.32.
+		//
+		// 15s should be enough to propagate the readiness info to the load balancer for most cases.
+		time.Sleep(15 * time.Second)
+		log.Printf("shutting down the server now")
 		cancel()
 	}()
 	if err := mainlib.Main(ctx, os.Args[1:], os.Stderr); err != nil {

tests/e2e-inference-extension/e2e_inference_extension_test.go

Lines changed: 1 addition & 1 deletion
@@ -12,5 +12,5 @@ import (
 )
 
 func TestMain(m *testing.M) {
-	e2elib.TestMain(m, nil, true)
+	e2elib.TestMain(m, e2elib.AIGatewayHelmOption{}, true, false)
 }
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+# Copyright Envoy AI Gateway Authors
+# SPDX-License-Identifier: Apache-2.0
+# The full text of the Apache license is available in the LICENSE file at
+# the root of the repo.
+
+apiVersion: gateway.networking.k8s.io/v1
+kind: GatewayClass
+metadata:
+  name: upgrade-test
+spec:
+  controllerName: gateway.envoyproxy.io/gatewayclass-controller
+---
+apiVersion: gateway.networking.k8s.io/v1
+kind: Gateway
+metadata:
+  name: upgrade-test
+  namespace: default
+spec:
+  gatewayClassName: upgrade-test
+  listeners:
+    - name: http
+      protocol: HTTP
+      port: 80
+  infrastructure:
+    parametersRef:
+      group: gateway.envoyproxy.io
+      kind: EnvoyProxy
+      name: upgrade-test
+---
+apiVersion: gateway.envoyproxy.io/v1alpha1
+kind: EnvoyProxy
+metadata:
+  name: upgrade-test
+  namespace: default
+spec:
+  provider:
+    type: Kubernetes
+    kubernetes:
+      envoyDeployment:
+        container:
+          # Clear the default memory/cpu requirements for local tests.
+          resources: {}
+---
+apiVersion: aigateway.envoyproxy.io/v1alpha1
+kind: AIGatewayRoute
+metadata:
+  name: upgrade-test
+  namespace: default
+spec:
+  parentRefs:
+    - name: upgrade-test
+      kind: Gateway
+      group: gateway.networking.k8s.io
+  rules:
+    - matches:
+        - headers:
+            - type: Exact
+              name: x-ai-eg-model
+              value: some-cool-model
+      backendRefs:
+        - name: upgrade-test-cool-model-backend
+      timeouts:
+        request: 120s
+---
+apiVersion: aigateway.envoyproxy.io/v1alpha1
+kind: AIServiceBackend
+metadata:
+  name: upgrade-test-cool-model-backend
+  namespace: default
+spec:
+  schema:
+    name: OpenAI
+  backendRef:
+    name: testupstream
+    kind: Backend
+    group: gateway.envoyproxy.io
+---
+apiVersion: gateway.envoyproxy.io/v1alpha1
+kind: Backend
+metadata:
+  name: testupstream
+  namespace: default
+spec:
+  endpoints:
+    - fqdn:
+        hostname: testupstream.default.svc.cluster.local
+        port: 80
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: testupstream
+  namespace: default
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: testupstream
+  template:
+    metadata:
+      labels:
+        app: testupstream
+    spec:
+      containers:
+        - name: testupstream
+          image: docker.io/envoyproxy/ai-gateway-testupstream:latest
+          imagePullPolicy: IfNotPresent
+          ports:
+            - containerPort: 8080
+          env:
+            - name: TESTUPSTREAM_ID
+              value: whatever
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 8080
+            initialDelaySeconds: 1
+            periodSeconds: 1
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: testupstream
+  namespace: default
+spec:
+  selector:
+    app: testupstream
+  ports:
+    - protocol: TCP
+      port: 80
+      targetPort: 8080
+  type: ClusterIP
