Description
datadog-agent release 7.62.0 and later cannot communicate with datadoghq through AWS Network Firewall. The root cause of this regression is the upgrade to Go 1.23, which enables X25519Kyber768Draft00 by default. Most existing firewall deployments do not support this draft. Any Docker setup that tracks a later release, such as "image": "datadog/agent:latest"
or "image": "public.ecr.aws/datadog/agent:7"
will be silenced: the agent keeps running but can no longer reach Datadog.
The issue with Go 1.23 is described in detail in hashicorp/terraform-provider-aws#39311. My report is that description adapted to datadog-agent:
datadog-agent 7.62.0 is upgraded to Go 1.23.0, which introduced a minor change to the crypto/tls standard library package:
The experimental post-quantum key exchange mechanism X25519Kyber768Draft00 is now enabled by default when Config.CurvePreferences is nil. The default can be reverted by adding tlskyber=0 to the GODEBUG environment variable.
This additional key exchange mechanism causes the length of the TLS ClientHello message to increase. The increased message length leads to AWS Network Firewall dropping the message.
AWS Network Firewall drops the message (causing the TLS handshake to time out) because its stateful rule capability currently uses Suricata version 6.0.9, and this version of Suricata is known to drop TLS packets beyond a certain length.
Test 1 using public.ecr.aws/datadog/agent:7.63.0
datadog-agent logs
2025-02-21 10:46:38 UTC | PROCESS | ERROR | (comp/forwarder/defaultforwarder/transaction/transaction.go:116 in 4) | TLS Handshake failure: net/http: TLS handshake timeout
2025-02-21 10:46:40 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/transaction/transaction.go:116 in 4) | TLS Handshake failure: net/http: TLS handshake timeout
2025-02-21 10:46:40 UTC | CORE | ERROR | (pkg/config/remote/service/service.go:593 in pollOrgStatus) | [Remote Config] Could not refresh Remote Config: failed to issue org data request: Get "https://config.datadoghq.eu/api/v0.1/status": net/http: TLS handshake timeout
2025-02-21 10:52:30 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/worker.go:222 in process) | Error while processing transaction: error while sending transaction, rescheduling it: Post "https://7-63-0-app.agent.datadoghq.eu/intake/": net/http: TLS handshake timeout
DNS lookups (from tcpdump) match those of previous agent versions, so my AWS Network Firewall domain whitelist is not the cause:
7-63-0-app.agent.datadoghq.eu
api.datadoghq.eu.
config.datadoghq.eu
instrumentation-telemetry-intake.datadoghq.eu
process.datadoghq.eu
trace.agent.datadoghq.eu.
TLS handshakes (from tcpdump) are dropped by the firewall: the Client Hello is retransmitted and never answered
12 8.515165 10.5.16.138 → 34.107.178.244 TLSv1 157 Client Hello
[...]
584 528.207660 10.5.16.138 → 34.107.178.244 TLSv1 157 Client Hello
Firewall egress alerts
{"firewall_name":"dev-egress-firewall","availability_zone":"eu-west-1b","event_timestamp":"1740143757","event":{"app_proto":"tls","src_ip":"10.5.16.138","src_port":59340,"event_type":"alert","alert":{"severity":3,"signature_id":6,"rev":0,"signature":"","action":"blocked","category":""},"flow_id":1538496652483912,"dest_ip":"34.107.178.244","proto":"TCP","verdict":{"action":"drop"},"tls":{"version":"UNDETERMINED"},"dest_port":443,"pkt_src":"geneve encapsulation","timestamp":"2025-02-21T13:15:57.099958+0000","direction":"to_server"}}
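When there are many alert records, the ones that look like this one can be picked out programmatically. The sketch below decodes a record and flags blocked TLS flows whose version is "UNDETERMINED" and which carry no SNI. Treating that combination as "the firewall could not parse the ClientHello" is my interpretation, the struct covers only the fields used here, the "sni" key is taken from the Suricata EVE format rather than from this record, and isUnparsedTLSDrop is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alertRecord models only the fields of a firewall alert record that this
// sketch needs; all other fields are ignored by encoding/json.
type alertRecord struct {
	FirewallName string `json:"firewall_name"`
	Event        struct {
		AppProto string `json:"app_proto"`
		SrcIP    string `json:"src_ip"`
		DestIP   string `json:"dest_ip"`
		DestPort int    `json:"dest_port"`
		Alert    struct {
			Action string `json:"action"`
		} `json:"alert"`
		TLS struct {
			Version string `json:"version"`
			SNI     string `json:"sni"` // absent in the record from this report
		} `json:"tls"`
	} `json:"event"`
}

// isUnparsedTLSDrop reports whether the record is a blocked TLS flow with
// an undetermined TLS version and no SNI.
func isUnparsedTLSDrop(line []byte) (bool, error) {
	var rec alertRecord
	if err := json.Unmarshal(line, &rec); err != nil {
		return false, err
	}
	ev := rec.Event
	return ev.AppProto == "tls" &&
		ev.Alert.Action == "blocked" &&
		ev.TLS.Version == "UNDETERMINED" &&
		ev.TLS.SNI == "", nil
}

func main() {
	// Trimmed copy of the alert above, keeping only the fields that are read.
	sample := []byte(`{"firewall_name":"dev-egress-firewall","event":{"app_proto":"tls","src_ip":"10.5.16.138","alert":{"action":"blocked"},"dest_ip":"34.107.178.244","tls":{"version":"UNDETERMINED"},"dest_port":443}}`)
	dropped, err := isUnparsedTLSDrop(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("unparsed TLS drop:", dropped)
}
```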
Test 2 using public.ecr.aws/datadog/agent:7.63.0 and environment variable GODEBUG="tlskyber=0"
datadog-agent communication with datadoghq.eu works OK. No firewall drops.