Description
Inspecting individual goroutines, their creators, and their states can be a critical debugging tool. However, this per-goroutine information is not surfaced in the aggregated debug=0 or debug=1 goroutine profiles, and the STW pauses of the debug=2 / runtime.Stack(all=true) profiles that do include it can be extremely disruptive in processes with large numbers of goroutines.
This proposal recommends adding a new debug=3 goroutine profile mode based on debug=0 that provides details on each goroutine, rather than just counts by unique stacks, by adding per-goroutine attributes such as ID, creator, state and wait-duration as additional labels.
This mode would thus make available information previously only accessible via debug=2, while dramatically reducing the latency impact of collecting it, thanks to the low-pause, concurrent snapshot mechanism introduced in CL 387415 and already used for debug=0 and debug=1.
Background
The /debug/pprof/goroutine endpoint supports two main human-readable formats:
- debug=1: A collapsed textual profile consisting of counts of goroutines with matching top function frame and label values, followed by a representative stack trace. It does not show individual goroutines or attributes such as their state or wait times. Aggregation by top frame means that not all goroutine stack traces are represented.
- debug=2: A full, panic-style dump of the complete stack of all goroutines individually, with additional metadata useful for debugging, including status, scheduling state, and stack creation information (but not pprof labels).
CL 387415 introduced a new mechanism for low-latency goroutine stack profiling: a brief STW enables profiling, and the collection itself then runs concurrently once the world is restarted. This approach is now used internally by pprof.Lookup("goroutine") and is significantly less disruptive than the full stop-the-world scan.
However, debug=2 continues to use the previous STW approach (runtime.Stack(all=true)), which scales poorly in systems with high goroutine counts. In real-world systems with tens or hundreds of thousands of goroutines, this can result in STW pauses of tens or hundreds of milliseconds, making debug=2 disruptive to use in production debugging workflows.
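As a rough illustration of the two existing modes, the sketch below captures the goroutine profile at debug=1 and debug=2 through the public pprof.Lookup("goroutine").WriteTo API. The helper names (goroutineProfile, parkWorkers, countGoroutineHeaders) are invented for this example, and the header matching relies on the usual "goroutine N [state]:" layout of the panic-style dump rather than on any documented contract.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
	"sync"
)

// goroutineProfile renders the goroutine profile at the given debug level.
func goroutineProfile(debug int) (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, debug); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// parkWorkers starts n goroutines that block on the same channel receive
// and returns a function that releases them and waits for them to exit.
func parkWorkers(n int) (release func()) {
	stop := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-stop
		}()
	}
	started.Wait()
	return func() {
		close(stop)
		done.Wait()
	}
}

// countGoroutineHeaders counts lines shaped like "goroutine N [state]:",
// which is how each per-goroutine record starts in the debug=2 dump.
func countGoroutineHeaders(dump string) int {
	n := 0
	for _, line := range strings.Split(dump, "\n") {
		if strings.HasPrefix(line, "goroutine ") && strings.HasSuffix(line, ":") {
			n++
		}
	}
	return n
}

func main() {
	release := parkWorkers(3)
	defer release()

	agg, err := goroutineProfile(1) // aggregated text profile
	if err != nil {
		fmt.Println("debug=1 failed:", err)
		return
	}
	full, err := goroutineProfile(2) // per-goroutine dump, collected under STW
	if err != nil {
		fmt.Println("debug=2 failed:", err)
		return
	}

	// The first line of debug=1 output carries only a total count.
	fmt.Println(strings.SplitN(agg, "\n", 2)[0])
	fmt.Println("debug=2 goroutine records:", countGoroutineHeaders(full))
}
```

The debug=1 text groups the parked workers under a shared stack with a count, while the debug=2 dump lists each of them separately along with its state.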
Alternatives Considered
- Reduce debug=2 STW in-place: debug=2 already has this per-goroutine information (except labels), so modifying it in place to just reduce the STW pause associated with collecting it would be appealing. However, the implementation could be a challenge: debug=2 includes function call argument values, has slightly different length-limiting controls, and including per-goroutine labels (proposal: runtime/pprof: add individual goroutine profile with labels #74964) would likely require a breaking change to the output format.
- Add to debug=0 and debug=1 in-place: These profiles already have minimal collection latency impact, and their format includes labels, which could be extended to carry the extra information. However, switching from an aggregated count to per-goroutine granularity would likely make existing profiles much larger, and is a breaking behavior change. This suggests the addition of labels to these formats should be opt-in instead (e.g. via a new debug value).
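To make the size concern concrete, here is a toy model (not the runtime's actual aggregation code) of how the number of profile entries changes once a unique per-goroutine attribute becomes part of the aggregation key. The sample type and helper names are hypothetical.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one goroutine's profile record.
type sample struct {
	stack string // stands in for the stack and label values used for grouping
	goid  int    // unique per goroutine
}

// makeSamples builds n samples spread round-robin across the given number of stacks.
func makeSamples(n, stacks int) []sample {
	out := make([]sample, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, sample{stack: fmt.Sprintf("stack-%d", i%stacks), goid: i + 1})
	}
	return out
}

// aggregatedEntries counts entries when samples are grouped by stack only,
// as in a count-per-unique-stack profile.
func aggregatedEntries(samples []sample) int {
	counts := make(map[string]int)
	for _, s := range samples {
		counts[s.stack]++
	}
	return len(counts)
}

// perGoroutineEntries counts entries when the goroutine ID is part of the key,
// so every goroutine becomes its own entry.
func perGoroutineEntries(samples []sample) int {
	counts := make(map[sample]int)
	for _, s := range samples {
		counts[s]++
	}
	return len(counts)
}

func main() {
	samples := makeSamples(100000, 25)
	fmt.Println("entries grouped by stack:", aggregatedEntries(samples))
	fmt.Println("entries with goroutine ID in key:", perGoroutineEntries(samples))
}
```

In this model the entry count grows from the number of distinct stacks to the number of goroutines, which is why the extra attributes are better kept behind an opt-in mode.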
Compatibility
- Adding a new debug=3 format while leaving existing debug=0,1,2 behavior unchanged avoids breaking changes for users of the existing profile modes, exposing the new functionality strictly to those who opt in by using the new format.
Implementation Sketch
The existing runtime.goroutineProfileWithLabelsConcurrent function already supports concurrent, low-pause stack trace collection for aggregated profiles.
To support the richer, per-goroutine output required by a debug=3 format, it can be extended to also collect the additional fields from each g, such as goid, parentGoid and waitsince, alongside the labels and StackRecords it already collects. This implementation would reuse the same synchronization and collection infrastructure already proven for debug=1-style profiles, with minimal added runtime complexity.
The printing/proto encoding of the collected information in runtime/pprof would then add these fields to its output depending on the debug mode: debug=3 (and potentially also debug=4, if we want to offer both binary proto and textual options) would cause these extra fields to be included as pprof labels when formatting the profile (in countProfile).
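One way to picture the formatting step is the sketch below, which merges per-goroutine attributes into a goroutine's existing label set. Everything here is an assumption made for illustration: the goroutineInfo type, the syntheticLabels and renderLabels helpers, and the "goroutine.*" label keys are invented and do not reflect the names or encoding the runtime or runtime/pprof would actually use.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// goroutineInfo is a hypothetical stand-in for the extra fields collected per g.
type goroutineInfo struct {
	ID        uint64            // stands in for goid
	CreatorID uint64            // stands in for parentGoid
	State     string            // e.g. "chan receive"
	WaitSince int64             // nanosecond timestamp when blocking began; 0 if not waiting
	Labels    map[string]string // user-supplied pprof labels
}

// syntheticLabels returns the user labels plus per-goroutine attributes.
// The label keys are invented for this sketch.
func syntheticLabels(g goroutineInfo, now int64) map[string]string {
	out := make(map[string]string, len(g.Labels)+4)
	for k, v := range g.Labels {
		out[k] = v
	}
	out["goroutine.id"] = strconv.FormatUint(g.ID, 10)
	out["goroutine.created_by"] = strconv.FormatUint(g.CreatorID, 10)
	out["goroutine.state"] = g.State
	if g.WaitSince != 0 {
		waitMillis := (now - g.WaitSince) / 1_000_000
		out["goroutine.wait_ms"] = strconv.FormatInt(waitMillis, 10)
	}
	return out
}

// renderLabels formats labels as sorted "key=value" pairs for stable output.
func renderLabels(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ", ")
}

func main() {
	g := goroutineInfo{
		ID:        42,
		CreatorID: 7,
		State:     "chan receive",
		WaitSince: 500_000_000,
		Labels:    map[string]string{"tenant": "a"},
	}
	fmt.Println(renderLabels(syntheticLabels(g, 2_000_000_000)))
}
```

Because the goroutine ID is unique, each goroutine ends up with a distinct label set, so a count-by-unique-stack-and-labels formatter would naturally emit one entry per goroutine.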
A proof of concept of this approach seems to produce promising benchmark results.
Prior Work and Related Proposals
- Issue #50794: explored alternatives to full stack dumps for large numbers of goroutines, including sampling.
- CL 387415: introduced the barrier-based, low-pause stack snapshot mechanism now used by pprof.Lookup("goroutine").
- CL 574795: abandoned CL for stack size profiling that also altered this implementation, related to Issue #66566.