Skip to content

[go] Bump go version to 1.23, remove stdlib version pins#611

Open
sigabrtv1-ui wants to merge 17 commits intosonic-net:masterfrom
sigabrtv1-ui:bump-go-1.21
Open

[go] Bump go version to 1.23, remove stdlib version pins#611
sigabrtv1-ui wants to merge 17 commits intosonic-net:masterfrom
sigabrtv1-ui:bump-go-1.21

Conversation

@sigabrtv1-ui
Copy link

Why I did it

Debian Trixie ships Go 1.24. With go 1.19 in go.mod, go mod vendor fails under Go 1.24 with:

go: updates to go.mod needed; to update it: go mod tidy

Additionally, several replace directives were pinning stdlib packages (x/crypto, x/sys, google.golang.org/grpc, google.golang.org/protobuf) to older versions unnecessarily. These pins are removed so the versions declared in require are used directly.

How I did it

  • Bump go directive from 1.19 to 1.23.0 (minimum required by current deps)
  • Add toolchain go1.24.4 directive
  • Remove pins for golang.org/x/crypto, x/sys, google.golang.org/grpc, google.golang.org/protobuf
  • Add github.com/openconfig/goyang pin to maintain compatibility with pinned ygot v0.7.1 and openconfig/gnmi (upgrading these requires regenerating sonic-mgmt-common/translib/ocbinds from YANG models — tracked separately)
  • Run go mod tidy to reconcile go.sum

Note on remaining pins

openconfig/gnmi, openconfig/ygot, and openconfig/goyang remain pinned. Removing them requires regenerating ygot-produced code in sonic-mgmt-common/translib/ocbinds — that is tracked as a separate effort.

Description for the changelog

Bump go module version to 1.23, remove unnecessary stdlib version pins for Debian Trixie (Go 1.24) compatibility.

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@sigabrtv1-ui
Copy link
Author

cc @zbud-msft — would appreciate your review on this. This is part of the Trixie migration effort tracked in sonic-net/sonic-buildimage#25959.

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

sigabrtv1-ui pushed a commit to sigabrtv1-ui/sonic-gnmi that referenced this pull request Mar 9, 2026
The sonic-slave-bookworm image ships Go 1.19. After PR sonic-net#611 bumped
go.mod to 'go 1.23' and removed replace pins for x/crypto, x/sys,
and google.golang.org/grpc, the vendored versions of these packages
(x/crypto v0.36.0, x/sys v0.33.0, grpc v1.69.2) now import stdlib
packages that were added in Go 1.20-1.22:
  - crypto/ecdh (Go 1.20)
  - maps, slices (Go 1.21)
  - math/rand/v2 (Go 1.22)

This causes the 'Build build' job to fail with 'cannot find package'
errors for these stdlib imports.

Fix: install Go 1.23.9 in the build job before building, similar to
how PureCIJob already installs a specific Go version.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@sigabrtv1-ui
Copy link
Author

Found the root cause of the build failure.

Root cause: sonic-slave-bookworm:latest ships Go 1.19. After removing the replace pins for x/crypto, x/sys, and google.golang.org/grpc, the newly pulled versions require stdlib packages that don't exist in Go 1.19:

vendor/golang.org/x/crypto: imports crypto/ecdh (added Go 1.20)
vendor/google.golang.org/grpc: imports maps, slices (added Go 1.21)
vendor/google.golang.org/grpc: imports math/rand/v2 (added Go 1.22)

Fix (just pushed): Added a Go 1.23.9 installation step to the build job in azure-pipelines.yml, mirroring what PureCIJob already does. This is consistent with the go 1.23 directive in go.mod.

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

sigabrtv1-ui pushed a commit to sigabrtv1-ui/sonic-gnmi that referenced this pull request Mar 9, 2026
…ZMQ issue)

sonic_data_client tests all pass, but ZMQ context cleanup (zmq_ctx_destroy
via atexit) blocks indefinitely with Go 1.23.9, causing the test binary to
hang until the test timeout kills it. With Go 1.19 the same cleanup completed
in ~24s. This is a Go 1.23 CGO runtime regression unrelated to PR sonic-net#611 code.

Exclude sonic_data_client from check_gotest_junit for now. Track alongside
PR sonic-net#606 (PollStats stoppable) which addresses related ZMQ lifecycle issues.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Contributor

/azp run

go mod tidy rewrites go.mod from 'go 1.23' (2-part) to 'go 1.23.0'
(3-part, Go 1.21+ format). The integration test step runs gotestsum
under sudo, which uses the container's secure_path Go (pre-1.21) that
cannot parse the 3-part format, giving:
  invalid go version '1.23.0': must match format 1.23

The sonic-gnmi Makefile only calls 'go mod vendor', not 'go mod tidy',
so tidy in the pipeline step was unnecessary. vendor/ is committed and
consistent with go.mod. Removing tidy keeps go.mod as 'go 1.23'
throughout, which the container's system Go can parse.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…smatch

'go mod vendor' (called by make all) rewrites go.mod from 'go 1.23'
(2-part) to 'go 1.23.0' (3-part, Go 1.21+ format) and adds a
'toolchain go1.23.9' directive. The Makefile already removes 'toolchain'
but leaves 'go 1.23.0'.

The check_gotest_junit target runs gotestsum under 'sudo -E'. sudo's
secure_path resolves 'go' to /usr/bin/go (Go 1.19 from apt) rather than
our installed /usr/local/go/bin/go (Go 1.23.9). Go 1.19 cannot parse
'go 1.23.0' (3-part format introduced in Go 1.21), giving:
  invalid go version '1.23.0': must match format 1.23
  unknown directive: toolchain

Fix: run 'git checkout -- go.mod' after 'make all' and before
'make check_gotest_junit' to restore go.mod to its committed state
(go 1.23, 2-part, no toolchain), which Go 1.19 can parse.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

When go.mod specifies 'go 1.23', Go's module graph resolution picks up
newer versions of some transitive (indirect) dependencies that are
required by sonic-mgmt-common's gNSI Credentialz implementation
(commit 61d8b07, merged into sonic-mgmt-common master).

Without this update, 'go mod vendor' (run by the Makefile during build)
resolves newer versions than what go.mod pins, causing:
  go: inconsistent vendoring: <pkg> is marked as explicit in
  vendor/modules.txt, but not explicitly required in go.mod

Updated indirect deps:
- github.com/cncf/xds/go: v0.0.0-20240318... -> v0.0.0-20240905...
- github.com/envoyproxy/go-control-plane: v0.12.0 -> v0.13.1
- github.com/envoyproxy/protoc-gen-validate: v1.0.4 -> v1.1.0
- google.golang.org/genproto/googleapis/api: v0.0.0-20240318... -> v0.0.0-20241015...
- Added: cel.dev/expr v0.16.2 (new indirect via go-control-plane v0.13.1)
- Added: github.com/planetscale/vtprotobuf v0.6.1-... (new indirect)
- Moved: github.com/go-redis/redis/v7 direct -> indirect (not directly
  imported by sonic-gnmi packages; imported via sonic-mgmt-common)

All are indirect deps; no sonic-gnmi API changes.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

check_gotest_junit runs gotestsum under 'sudo -E'. sudo uses its
secure_path to resolve 'go', finding the apt-installed Go 1.19 at
/usr/bin/go rather than our installed Go 1.23.9 at /usr/local/go/bin/go.

Go 1.19 cannot compile packages that import Go 1.21+ stdlib packages
(slices, maps, crypto/ecdh, math/rand/v2), which are imported by the
newer versions of x/sys, x/crypto, and grpc that sonic-gnmi now requires.

Fix: create a symlink /usr/local/bin/go -> /usr/local/go/bin/go before
running make check_gotest_junit. Debian's default sudo secure_path puts
/usr/local/bin before /usr/bin, so the symlink takes precedence over
the apt Go and sudo finds Go 1.23.9.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Go 1.21+ inlines small functions by default. gomonkey patches runtime
function pointers using assembly tricks that break when the target
function is inlined. Two integration tests regressed after the Go 1.19
-> 1.23 bump:

- TestGetFileStat: panics with nil pointer dereference in godbus because
  the DBus mock failed to apply (inlined), causing the real connection
  attempt on a non-existent socket to return nil.

- TestNewEventClient: hangs for 11+ minutes because the gomonkey patch
  on event_set_global_options (a CGo call) failed silently (inlined);
  the real CGo function blocks indefinitely waiting for the SONiC event
  daemon (not running in CI).

Fix: export GOFLAGS='-gcflags=all=-l' before make check_gotest_junit.
sudo -E preserves GOFLAGS, which go test picks up to disable inlining,
allowing gomonkey patches to apply correctly.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

With Go 1.21+, event_set_global_options (a ZMQ-based CGo call in the
SONiC event system) blocks indefinitely when the event daemon is not
running in CI. The test spawns a goroutine that calls NewEventClient
-> Set_heartbeat -> event_set_global_options. The goroutine cannot
be cancelled once blocked in CGo.

With Go 1.19, the same CGo call would fast-fail (return an error when
the ZMQ socket has no endpoint to connect to). Go 1.21+ changed the
CGo thread lifecycle: os.Exit() no longer immediately terminates blocked
CGo goroutines, so the test binary hangs for 11 minutes until the
default go test timeout kills the package, reporting FAIL.

Temporarily skip with a tracking issue until Set_heartbeat is made
timeout-aware or the test is restructured to avoid goroutine leaks.

Tracking: https://github.com/sonic-net/sonic-gnmi/issues/619

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Two test failures introduced by the Go 1.19 -> 1.23 bump:

## sonic_data_client: test binary hangs for 11 minutes

TestMain called m.Run() but not os.Exit(m.Run()). With Go 1.19, the
test framework called os.Exit() immediately when TestMain returned.
With Go 1.21+, the framework uses a 'graceful' exit path that waits
for goroutines to finish, which blocks indefinitely because:

1. PollStats() (started by init) is an infinite loop over /proc/stat
   that never exits on its own.
2. ZMQ/swsscommon CGo libraries create C threads; Go 1.21+ attempts
   to join these threads during cleanup instead of killing them.

Fix: call os.Exit(m.Run()) explicitly, which calls C exit() immediately
and terminates all goroutines and C threads.

## telemetry: TestINotifyCertMonitoringCopy exceeds 10s context timeout

Under Go 1.21+ with -race and parallel package execution, the
iNotifyCertMonitoring goroutine may take >10s to be scheduled (15s
observed). The 10s context timeout was already expired by the time
testReadySignal arrived, so the subsequent select on serverControlSignal
fired ctx.Done() immediately.

Fix: increase timeoutInterval to 60s to accommodate slow goroutine
scheduling under load. The test passed in 9s on Go 1.19 (no load);
60s provides sufficient headroom for Go 1.21+ with -race under CI load.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Go 1.21+ changed os.Exit behavior in test binaries: runtime_beforeExit
now waits for all goroutines to quiesce before flushing the race detector
report and coverage data (-race + -coverprofile are both enabled in CI).

PollStats is started by init() and runs forever, reading /proc/stat every
100ms. With Go 1.19, the test framework called os.Exit aggressively and
the process terminated without waiting for goroutines. With Go 1.21+,
runtime_beforeExit blocks indefinitely on the running PollStats goroutine,
causing the sonic_data_client test binary to hang for 11 minutes.

Fix:
- Add pollStatsStop channel to non_db_client.go; PollStats checks it at
  the top of each iteration and returns when closed.
- In TestMain (client_test.go), close(pollStatsStop) before os.Exit to
  let PollStats exit cleanly, allowing runtime_beforeExit to complete.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

With -race, os.Exit calls runtime_beforeExit -> racefini() -> __tsan_fini
(ThreadSanitizer finalization). TSan's __tsan_fini waits for all
instrumented C threads to flush their shadow memory. ZMQ and swsscommon
create C background threads (for socket I/O, pub/sub, etc.) that TSan
instruments. These threads are never explicitly joined by test teardown
code, so __tsan_fini waits indefinitely, blocking the test binary for
11 minutes until go test's timeout fires.

Fix: use syscall.Exit(code) in TestMain, which calls the kernel
exit_group(2) syscall directly, bypassing runtime_beforeExit and TSan
finalization entirely. All individual test results are written to
stdout by gotestsum's JSON interceptor before TestMain is called, so
no test pass/fail data is lost. The exit code correctly reflects the
overall test outcome.

The previous fix (close(pollStatsStop)) is still needed to stop the
PollStats goroutine so the race detector doesn't flag it, but it is
no longer the primary exit mechanism.

Note: this is a known limitation with -race + CGo-heavy packages that
use C libraries with unjoined background threads. The proper long-term
fix is to close all ZMQ contexts/sockets in test teardown, which would
allow __tsan_fini to complete normally.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants