@dt dt commented Oct 7, 2025

runtime/pprof: add debug=3 goroutine profile with goid and labels

This adds a new goroutine profile debug mode, debug=3.

This mode emits in the same binary proto output format as debug=0, with
the only difference being that it does not aggregate matching
stack/label combinations into a single count, and instead emits a sample
per goroutine with additional synthesized labels to communicate some of
the details of each goroutine, specifically:

  • go::goroutine: the goroutine's ID
  • go::goroutine_created_by: ID of the goroutine's creator (if any)
  • go::goroutine_state: current state of the goroutine (e.g. runnable)
  • go::goroutine_wait_minutes: approximate minutes goroutine has spent
    waiting (if applicable)
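
As a rough illustration of the label list above, the sketch below merges synthesized labels with user-set ones. The `goroutineInfo` type and the `synthesizedLabels` function are hypothetical names invented for this example, not code from the patch, and every value is rendered as a string for simplicity (the patch may well encode IDs and minutes as numeric labels).

```go
package main

import (
	"fmt"
	"strconv"
)

// goroutineInfo is a hypothetical stand-in for the per-goroutine details
// the runtime reports; it is not a type from the patch.
type goroutineInfo struct {
	ID          uint64
	CreatorID   uint64 // 0 means no known creator
	State       string
	WaitMinutes int64 // 0 means not applicable
}

// synthesizedLabels copies the user-set labels and adds the go::-prefixed
// labels listed above, omitting the optional ones when they do not apply.
func synthesizedLabels(g goroutineInfo, user map[string]string) map[string]string {
	out := make(map[string]string, len(user)+4)
	for k, v := range user {
		out[k] = v
	}
	out["go::goroutine"] = strconv.FormatUint(g.ID, 10)
	if g.CreatorID != 0 {
		out["go::goroutine_created_by"] = strconv.FormatUint(g.CreatorID, 10)
	}
	out["go::goroutine_state"] = g.State
	if g.WaitMinutes > 0 {
		out["go::goroutine_wait_minutes"] = strconv.FormatInt(g.WaitMinutes, 10)
	}
	return out
}

func main() {
	g := goroutineInfo{ID: 42, CreatorID: 1, State: "chan receive", WaitMinutes: 3}
	labels := synthesizedLabels(g, map[string]string{"tenant": "a"})
	fmt.Println(len(labels), labels["go::goroutine"], labels["go::goroutine_state"])
}
```

Each goroutine would then contribute one sample carrying its own label set, rather than being folded into a count with other goroutines that share its stack.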

Previously the debug=2 mode was the only way to get this kind of
per-goroutine information, which is sometimes vital to understanding the
state of a process. However debug=2 has two major drawbacks:

  1. its collection incurs a lengthy and disruptive stop-the-world pause and
  2. it does not include user-set labels alongside per-goroutine details in the same profile.

This new debug=3 mode uses the same concurrent collection mechanism used
to produce debug=0 and debug=1 profiles, meaning it has the same minimal
stop-the-world penalty. At the same time, it includes the per-goroutine
details like status and wait time that make debug=2 so useful, providing
a "best-of-both-worlds" option.
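
For context on the user-set labels mentioned above, here is a minimal sketch using only the existing `runtime/pprof` API: it labels a goroutine with `pprof.Do` and collects a debug=1 goroutine profile, whose text output carries a `# labels:` line. The helper names `profileWithLabel` and `hasLabel` are made up for this example. The sketch sticks to debug=1 because debug=3 is only meaningful with this patch applied.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/pprof"
	"strings"
)

// profileWithLabel starts a labeled goroutine, collects a debug=1 goroutine
// profile while that goroutine is parked, and returns the profile text.
func profileWithLabel(key, value string) (string, error) {
	ready := make(chan struct{})
	done := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		pprof.Do(context.Background(), pprof.Labels(key, value), func(context.Context) {
			close(ready)
			<-done // stay parked until the profile has been collected
		})
	}()
	<-ready

	var buf bytes.Buffer
	err := pprof.Lookup("goroutine").WriteTo(&buf, 1)
	close(done)
	<-finished
	return buf.String(), err
}

// hasLabel reports whether the text profile has a labels line that mentions
// both the quoted key and the quoted value.
func hasLabel(profile, key, value string) bool {
	return strings.Contains(profile, "# labels:") &&
		strings.Contains(profile, `"`+key+`"`) &&
		strings.Contains(profile, `"`+value+`"`)
}

func main() {
	out, err := profileWithLabel("worker", "example")
	if err != nil {
		panic(err)
	}
	fmt.Println("label present in debug=1 profile:", hasLabel(out, "worker", "example"))
}
```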

A new mode is introduced, rather than changing the implementation of the
debug=2 format in-place, as it is not clear that debug=2 can utilize a
concurrent collection mechanism while maintaining the correctness of its
existing output, which includes argument values in its printed stacks.

The difference in STW latency observed by running goroutines during
profile collection is demonstrated by an included benchmark which spawns
a number of goroutines to be profiled and then measures the latency of a
short timer while collecting goroutine profiles.
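
The measurement idea described above might look roughly like the following. This is a simplified sketch, not the benchmark included in the patch: the function name `maxTimerLatency`, the 100µs interval, and the goroutine counts are assumptions chosen for illustration, and it compares debug=1 with debug=2 because those modes exist without the patch.

```go
package main

import (
	"fmt"
	"io"
	"runtime/pprof"
	"sync"
	"time"
)

// maxTimerLatency parks n goroutines, then collects `rounds` goroutine
// profiles at the given debug level while another goroutine repeatedly
// sleeps for a short interval and records the worst overshoot it sees.
func maxTimerLatency(n, rounds, debug int) (time.Duration, error) {
	stop := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			<-stop // parked goroutine to be profiled
		}()
	}

	const interval = 100 * time.Microsecond
	quit := make(chan struct{})
	result := make(chan time.Duration)
	go func() {
		var worst time.Duration
		for {
			select {
			case <-quit:
				result <- worst
				return
			default:
			}
			start := time.Now()
			time.Sleep(interval)
			if over := time.Since(start) - interval; over > worst {
				worst = over
			}
		}
	}()

	var err error
	for i := 0; i < rounds && err == nil; i++ {
		err = pprof.Lookup("goroutine").WriteTo(io.Discard, debug)
	}
	close(quit)
	worst := <-result
	close(stop)
	wg.Wait()
	return worst, err
}

func main() {
	for _, debug := range []int{1, 2} {
		d, err := maxTimerLatency(1000, 5, debug)
		if err != nil {
			panic(err)
		}
		fmt.Printf("debug=%d max timer latency: %v\n", debug, d)
	}
}
```

Absolute numbers from a sketch like this depend heavily on the machine and scheduler noise, so only the relative difference between modes is of interest.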

BenchmarkGoroutineProfileLatencyImpact

                       │        debug=2  │                   debug=3             │
                       │ max_latency_ns  │ max_latency_ns  vs base               │
goroutines=100x3-14         422.2k ± 13%   190.3k ±   38%  -54.93% (p=0.002 n=6)
goroutines=100x10-14        619.7k ± 10%   171.1k ±   43%  -72.38% (p=0.002 n=6)
goroutines=100x50-14       1423.6k ±  7%   174.3k ±   44%  -87.76% (p=0.002 n=6)
goroutines=1000x3-14       2424.8k ±  8%   298.6k ±  106%  -87.68% (p=0.002 n=6)
goroutines=1000x10-14      7378.4k ±  2%   268.2k ±  146%  -96.36% (p=0.002 n=6)
goroutines=1000x50-14     23372.5k ± 10%   330.1k ±  173%  -98.59% (p=0.002 n=6)
goroutines=10000x3-14      42.802M ± 47%   1.991M ±  105%  -95.35% (p=0.002 n=6)
goroutines=10000x10-14    36668.2k ± 95%   743.1k ±   72%  -97.97% (p=0.002 n=6)
goroutines=10000x50-14   120639.1k ±  2%   188.2k ± 2582%  -99.84% (p=0.002 n=6)
geomean                     6.760M         326.2k          -95.18%
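
For readers checking the "vs base" column, it is the relative change from the debug=2 value to the debug=3 value. The small sketch below recomputes two rows from the rounded figures in the table; the helper names are made up for this example, and because the table reports rounded values, recomputing other rows this way may differ in the last digit.

```go
package main

import "fmt"

// pctChange returns the relative change from base to next, in percent.
func pctChange(base, next float64) float64 {
	return (next - base) / base * 100
}

// approxEqual reports whether a and b differ by at most tol.
func approxEqual(a, b, tol float64) bool {
	d := a - b
	if d < 0 {
		d = -d
	}
	return d <= tol
}

func main() {
	// Nanosecond values from the goroutines=100x3 and goroutines=1000x50 rows.
	fmt.Printf("%.2f%%\n", pctChange(422.2e3, 190.3e3))   // prints -54.93%
	fmt.Printf("%.2f%%\n", pctChange(23372.5e3, 330.1e3)) // prints -98.59%
}
```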

The per-goroutine details are included in the profile as labels,
alongside any user-set labels. While the pprof format allows for multi-valued
labels, so a collision with a user-set label would preserve both values,
it also discourages them, thus the 'go::' namespace prefix is used to
minimize collisions with user-set labels. The form 'go::' follows the
convention established in the pprof format, which reserves 'pprof::'.
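
To make the collision argument concrete, the sketch below models labels as a multi-valued map, similar in shape to the `map[string][]string` used for sample labels by the github.com/google/pprof/profile package. The `labelSet` type, the `build` helper, and the "draining" user label are hypothetical and exist only for this example.

```go
package main

import "fmt"

// labelSet is a hypothetical multi-valued label container: one key may
// carry several values, as the pprof format permits.
type labelSet map[string][]string

// add appends value under key, preserving any values already present.
func (l labelSet) add(key, value string) {
	l[key] = append(l[key], value)
}

// multiValued returns the number of keys that hold more than one value.
func (l labelSet) multiValued() int {
	n := 0
	for _, vals := range l {
		if len(vals) > 1 {
			n++
		}
	}
	return n
}

// build adds a user-set label and then a synthesized state label whose key
// is formed with the given prefix.
func build(prefix string) labelSet {
	l := labelSet{}
	l.add("goroutine_state", "draining")        // user-set label (hypothetical)
	l.add(prefix+"goroutine_state", "runnable") // runtime-synthesized label
	return l
}

func main() {
	fmt.Println(build("").multiValued())     // 1: both values share one key
	fmt.Println(build("go::").multiValued()) // 0: the keys stay distinct
}
```

Without the prefix both values survive under one key, which is legal but discouraged; with the prefix each key keeps a single value.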

Fixes golang#74954.

Change-Id: If90eb01887ae3f35be8acc3d239b88dc29d338a8

dt commented Oct 7, 2025

This PR is against master rather than the CRDB branch, so for now I mostly want review of content and approach, to arrive at a patch we think could be eligible for submission as a CL upstream, whether or not we actually submit it.

If/when we are happy that this is good for current/future Go versions, I'll go back and rebase it against our current Go fork branches. There is some churn in the area to handle (a recent change added a new 'leak' profile upstream), and I'd rather do that only once, after we arrive at what we think is the correct patch going forward/for upstream.

Copilot AI left a comment

Pull Request Overview

This PR introduces a new debug=3 mode for goroutine profiles that provides detailed per-goroutine information (ID, creator, state, wait time) in binary proto format while using the efficient concurrent collection mechanism. The key advantage is dramatically reduced latency impact on running goroutines compared to the existing debug=2 mode.

Key changes:

  • Adds debug=3 goroutine profile mode with per-goroutine metadata labels
  • Refactors goroutine profiling to support both aggregated and non-aggregated collection
  • Improves waitsince timing precision by delaying reset until after profile collection

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/runtime/traceback.go: Extracts goroutine status/wait time logic into reusable functions for profile consumption
  • src/runtime/proc.go: Adjusts waitsince field reset timing to preserve accurate wait duration during profiling
  • src/runtime/pprof/proto.go: Adds pbLabelNum helper and updates parameter naming for label encoding
  • src/runtime/pprof/pprof_test.go: Adds comprehensive tests and benchmarks for the new debug=3 mode
  • src/runtime/pprof/pprof.go: Implements debug=3 mode logic and refactors profile interfaces to support goroutine metadata
  • src/runtime/mprof.go: Updates goroutine profiling to collect and store per-goroutine metadata
  • src/internal/profilerecord/profilerecord.go: Adds GoroutineRecord type and interface methods for unified profile handling

@stevendanna stevendanna left a comment

Overall I really like this idea. I gave the code an initial read through, but I'm not familiar enough with the go pprof internals to offer much more than superficial comments yet.

Is there anything you think deserves special attention and/or a second look?

dt commented Oct 7, 2025

> Is there anything you think deserves special attention and/or a second look?

If I had to pick two things:
a) I moved waitsince around a little in exitsyscall. A second pair of eyes on the function, in its entirety, would be nice to double-check that my change is sound and sufficient.
b) Am I outputting enough/the right stuff in printCountProfile? I'm not actually emitting creation location, just creator goid, since I don't know what I'd need creation location for, but it is in debug=2. Similarly I think there are more fields in the goroutine header for debug=2 than just status/waitreason these days (scan, leaked, bubble, etc.), but I don't know if those are "status" and if we need them here. I think they're mostly there for traceback to show in a crash and only make it into "goroutine profiles" since debug=2 shares code with traceback.

@dt dt force-pushed the debug3 branch 3 times, most recently from c1c325a to e43f204 (October 26, 2025 13:54)
Development

Successfully merging this pull request may close these issues.

proposal: runtime/pprof: add faster per-goroutine debug=3 goroutine profile

2 participants