@brenns10
Contributor

From the commit message:

Add helpers to detect stack traces in kernel logs, and convert them to drgn stack traces. Since drgn typically has DWARF debuginfo available, it can provide filenames, line numbers, and inline function frames, which make it much easier to use the logged stack traces.

The helper is currently tested only on x86_64. It's easy to try out with something like the following:

    sudo sh -c 'echo l>/proc/sysrq-trigger'
    sudo drgn contrib/logged_stack_traces.py

I was recently working on a bug and realized that the logged stack traces were not nearly as useful as drgn's. So I cobbled together the first version of this, which takes advantage of Program.stack_trace_from_pcs(). I'd like to get it working on all the officially supported drgn architectures, which is why I'm filing this as a draft for now. But I'd appreciate any feedback at this stage.
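For readers following along, here is a minimal sketch of the idea, not the code in this PR. It assumes the usual x86_64 log layout (`symbol+0xoffset/0xsize`, with an optional `? ` prefix on entries the kernel's unwinder could not confirm and an optional `[module]` suffix). The names `LoggedFrame`, `FRAME_RE`, `parse_frames`, and `to_drgn_trace` are hypothetical. The only drgn calls used are `Program.symbol()` and `Program.stack_trace_from_pcs()`.

```python
import re
from typing import List, NamedTuple, Optional


class LoggedFrame(NamedTuple):
    reliable: bool          # False when the line carried the "? " marker
    symbol: str
    offset: int
    size: int
    module: Optional[str]


# Assumed x86_64 frame line: optional "[ timestamp]" prefix, optional "? ",
# then symbol+0xOFFSET/0xSIZE and an optional "[module]" suffix.
FRAME_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?"
    r"\s*(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]\s]+)[^\]]*\])?\s*$"
)


def parse_frames(log_lines: List[str]) -> List[LoggedFrame]:
    """Return every line that looks like a logged stack frame."""
    frames = []
    for line in log_lines:
        m = FRAME_RE.match(line)
        if m:
            frames.append(
                LoggedFrame(
                    reliable=m["unreliable"] is None,
                    symbol=m["symbol"],
                    offset=int(m["offset"], 16),
                    size=int(m["size"], 16),
                    module=m["module"],
                )
            )
    return frames


def to_drgn_trace(prog, frames: List[LoggedFrame]):
    """Turn parsed frames into a drgn stack trace.

    `prog` is expected to be a drgn.Program. Looking symbols up by name is a
    simplification: static functions that share a name are ambiguous.
    """
    pcs = [
        prog.symbol(f.symbol).address + f.offset
        for f in frames
        if f.reliable  # skip "? " entries the kernel could not confirm
    ]
    return prog.stack_trace_from_pcs(pcs)
```

In a drgn session this would be used as `to_drgn_trace(prog, parse_frames(lines))`, where `lines` is the kernel log split into lines.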

Signed-off-by: Stephen Brennan <[email protected]>
@osandov
Owner

osandov commented Jul 16, 2025

Oh this is cool. It has some overlap with this thing that I wrote for getting a "stack trace" of an oops register dump: https://gist.github.com/osandov/00b1ceec909af87bae063f5d77807be1 (it can only "unwind" inline calls, but it lets you access local variables that were in registers, which is really cool). I never published that because raising FaultError didn't actually work until you fixed it recently, and it probably needs #463 in order to not be a total hack.

Anyways, I like the general idea. One caveat of stack_trace_from_pcs() is that it creates all frames with interrupted = False. I think that's fine for this except for interrupts. Depending on how far we want to go with this, we probably want an API for constructing stack traces with full control over register contents, interrupted, etc., which would allow us to stitch together register dumps and traces across interrupt boundaries.

Even without all of that, I'm okay with something like this as long as it supports the other architectures. Otherwise, I'd be happy to take it in contrib as-is.

@brenns10
Contributor Author

Awesome, I'll play around with the different architectures in vmtest and see what I can get. I do think I may need to break it down into some per-architecture regexes and come up with a more general way of detecting the end of the trace, but nothing too wild.
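One way the per-architecture split and end-of-trace detection could look is sketched below. This is an illustration, not the PR's code. The two frame formats are assumptions based on typical x86_64 and arm64 logs, and `FRAME_PATTERNS`, `HEADER_RE`, `MARKER_RE`, and `extract_traces` are hypothetical names. The end-of-trace rule here is simply "the first line after the header that is neither a frame nor a context marker such as `<TASK>`".

```python
import re
from typing import Dict, List, Optional, Tuple

_SYM = r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"

# Assumed frame formats; other architectures would add their own entries.
FRAME_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "x86_64": re.compile(r"^\s*(?P<unreliable>\?\s+)?" + _SYM),
    "aarch64": re.compile(r"^\s*" + _SYM),
}

TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")
HEADER_RE = re.compile(r"^\s*Call trace:\s*$", re.IGNORECASE)
MARKER_RE = re.compile(r"^\s*</?[A-Z#]+>\s*$")  # e.g. <TASK>, </IRQ>, <NMI>

Frame = Tuple[str, int]  # (symbol, offset)


def extract_traces(lines: List[str], arch: str) -> List[List[Frame]]:
    """Group logged frames into traces, one list per "Call trace:" header."""
    frame_re = FRAME_PATTERNS[arch]
    traces: List[List[Frame]] = []
    current: Optional[List[Frame]] = None  # None means "not inside a trace"

    def finish() -> None:
        nonlocal current
        if current:  # drop traces with no usable frames
            traces.append(current)
        current = None

    for raw in lines:
        line = TIMESTAMP_RE.sub("", raw.rstrip("\n"))
        if HEADER_RE.match(line):
            finish()
            current = []
            continue
        if current is None:
            continue
        m = frame_re.match(line)
        if m:
            # Skip entries marked "? " (only the x86_64 pattern has the group).
            if not m.groupdict().get("unreliable"):
                current.append((m["symbol"], int(m["offset"], 16)))
        elif not MARKER_RE.match(line):
            finish()  # anything else ends the trace
    finish()
    return traces
```

Keeping the header and marker patterns shared and only the frame pattern per architecture keeps the table small, at the cost of assuming every architecture announces a trace with some spelling of "Call trace:".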

> It has some overlap with this thing that I wrote for getting a "stack trace" of an oops register dump: https://gist.github.com/osandov/00b1ceec909af87bae063f5d77807be1 (it can only "unwind" inline calls, but it lets you access local variables that were in registers, which is really cool). I never published that because raising FaultError didn't actually work until you fixed it recently, and it probably needs #463 in order to not be a total hack.

Heck yes, getting variables from the logged registers is so cool!

> Depending on how far we want to go with this, we probably want an API for constructing stack traces with full control over register contents, interrupted, etc., which would allow us to stitch together register dumps and traces across interrupt boundaries.

I've definitely got some code in drgn-tools that works with List[StackFrame] rather than StackTrace in order to patch up cases where we couldn't unwind through an interrupt boundary. So I do really like the idea of being able to create a fully custom trace.

One thing that I was hoping to explore with this (though not in this PR... in some hacky code off to the side) is some way of validating the stack trace with the DWARF info after it's constructed, as a confirmation that you've created a valid stack trace. As far as I can tell there is DWARF metadata for which functions call which other functions, and from which PCs. Of course indirect calls would need to be handled (and I guess that gets weirder with retpolines).
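The validation idea can be sketched as a toy check. It assumes the call-site information has already been extracted into a plain dictionary mapping each return PC to the name of the function called there, with `None` for indirect calls. DWARF describes call sites with `DW_TAG_call_site` entries carrying `DW_AT_call_return_pc` and `DW_AT_call_origin`, but extracting them is out of scope here. `check_trace` and the data layout are hypothetical, and inlined frames and retpolines are ignored.

```python
from typing import Dict, List, Optional, Tuple

# (function name, pc), innermost frame first. For every frame except the
# innermost one, pc is the return address of the call into the frame below it.
Frame = Tuple[str, int]


def check_trace(
    frames: List[Frame], call_sites: Dict[int, Optional[str]]
) -> List[str]:
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    for callee, caller in zip(frames, frames[1:]):
        callee_name, _ = callee
        caller_name, return_pc = caller
        if return_pc not in call_sites:
            problems.append(
                f"{caller_name}: no known call returns to {return_pc:#x}"
            )
            continue
        target = call_sites[return_pc]
        if target is None:
            continue  # indirect call: the target cannot be checked statically
        if target != callee_name:
            problems.append(
                f"{caller_name}: call before {return_pc:#x} targets "
                f"{target}, not {callee_name}"
            )
    return problems
```

A trace that passes this check is not proven correct; it only means no adjacent pair of frames contradicts the recorded call sites.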
