Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port span correlation #2251

Merged
merged 34 commits into from
Apr 3, 2025
Merged

Port span correlation #2251

merged 34 commits into from
Apr 3, 2025

Conversation

tduncan
Copy link
Contributor

@tduncan tduncan commented Mar 27, 2025

This PR add span correlation snapshot profiling stack traces. Span correlation is done by intercepting OpenTelemetry context changes via a ContextStorage that maps trace IDs to SpanContexts as new Contexts are attached and detached.

tduncan added 27 commits March 19, 2025 05:09
…tests to reliably inject the snapshot profiling active span tracker into the OpenTelemetry SDK runtime.
…tting class in favor of direct use of OpenTelemetry's IdGenerator.
…ace instances so to always have a trace ID associated even if span tracking is offline.
@tduncan tduncan requested review from a team as code owners March 27, 2025 16:46
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class primarily exists to ease testing since ContextStorage.addWrapper must be called prior to the ContextStorage being initialized. Putting this behind a Supplier offers tests a way to hook in (via the SettableContextStorageProvider) without trying to coordinate the ContextStorage initialization across every single test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early designs had ActiveSpanTracker as a singleton object. That worked but was quite intricate to use correctly. Pushing the singleton behavior into a provider removes a lot of the complexities with a minor inconvenience in ScheduledExecutorStackTraceSampler where the SpanTracker is needed.

}

private boolean doNotTrack(SpanContext spanContext) {
return !spanContext.isSampled() || !registry.isRegistered(spanContext);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps should also add !spanContext.isValid()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test for invalid spans but it did not require adding a explicit check. I think the isSampled check doubles as a check for isValid. I'll leave the test to make that scenario documented.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, Span.wrap(SpanContext.getInvalid()) creates a span where sampled flag is not set. I thought a bit more about this and since for registry.isRegistered(spanContext) to pass span must be registered and since the register is called from a span processor it should be impossible for an invalid span to pass this method. Since SnapshotProfilingSpanProcessor calls register only for the root span only root spans will be reported as active spans. Is this true? If it is perhaps it would make sense to spell this out somewhere.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that the trace registry only checks for the trace id.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private final Map<String, SpanContext> traceIds = new ConcurrentHashMap<>();
could be a set as the map value is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you answered your question, however:

Since SnapshotProfilingSpanProcessor calls register only for the root span only root spans will be reported as active spans. Is this true?

No. Its true that only a root span will trigger the trace registration process but once a trace is registered the ActiveSpanTracker will keep track of child spans as they are added to the OpenTelemetry Context.

private final Map<String, SpanContext> traceIds = new ConcurrentHashMap<>();

could be a set as the map value is not used

Yes it can! I will submit a separate PR to update it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you answered your question, however:

Since SnapshotProfilingSpanProcessor calls register only for the root span only root spans will be reported as active spans. Is this true?

No. Its true that only a root span will trigger the trace registration process but once a trace is registered the ActiveSpanTracker will keep track of child spans as they are added to the OpenTelemetry Context.

One minor -- but important -- detail I left out: trace selection only happens for root spans (see SnapshotVolumePropagator) but trace registration happens for service entry spans (see SnapshotProfilingSpanProcessor).

@@ -88,13 +91,19 @@ public void run() {
try {
Instant now = Instant.now();
ThreadInfo threadInfo = threadMXBean.getThreadInfo(threadId, Integer.MAX_VALUE);
StackTrace stackTrace = StackTrace.from(now, samplingPeriod, traceId, threadInfo);
SpanContext spanContext = retrieveActiveSpan(threadId);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is race between taking the thread dump and asking the active span it is possible that the stack trace and the span won't match (stack trace won't include the method where the span is active). Is that an issue?

Copy link
Contributor Author

@tduncan tduncan Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, yes, but in practice I would suspect the risk is likely quite low. Low enough that I don't think we would want to block the trace thread(s) every time the snapshot profiler wants to take a sample.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine for now. Continuous profiler has the same issue, there the dump is taken by jfr and we can't really do anything to synchronize with that. If blocking is of concern then keep in mind that, as far as I know, taking the stack trace suspends all threads. You could reduce the blocking by having only one scheduled task that takes the dump for all threads that you care about in one operation.

Comment on lines 121 to 130
var executor = Executors.newFixedThreadPool(2);
try (var scope1 = executor.submit(attach(span1)).get();
var scope2 = executor.submit(attach(span2)).get()) {
assertEquals(
Optional.of(span1.getSpanContext()), spanTracker.getActiveSpan(scope1.getThreadId()));
assertEquals(
Optional.of(span2.getSpanContext()), spanTracker.getActiveSpan(scope2.getThreadId()));
} finally {
executor.shutdown();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it hard to reason why both actions should be executed on separate threads. Looks line an implementation detail of ThreadPoolExecutor. I guess it is fine as long as it works reliably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ActiveSpanTracker records the thread ID using Thread.currentThread().getId() so if both spans are attached to the OpenTelemetry Context on the same thread only one would be considered the active thread.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me rephrase. If you change Executors.newFixedThreadPool(2) to Executors.newFixedThreadPool(1) the test will fail. As far as I can tell the fact that when using Executors.newFixedThreadPool(2) both of the actions run on separate threads is an implementation detail of ThreadPoolExecutor not a specified behavior.

Copy link
Contributor Author

@tduncan tduncan Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand now. You are technically correct, there aren't any documented guarantees that the two operations will actually be performed on separate threads so the test is relying on the first task not finishing prior to the second task actually being submitted. I've used this technique quite a few times over the years and it hasn't caused issues but I can add an explicit release step so the test ensures that the first submitted operation doesn't end to quickly to ensure that both threads in the thread pool much be used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the test is relying on the first task not finishing prior to the second task actually being submitted

I think this is not necessary. Looking at ThreadPoolExecutor I got the impression that it creates new threads for tasks as long as the thread pool has reached the requested size. So even when the first task is completed it will create a new thread for the second task to reach the pool size.

tduncan added 5 commits April 1, 2025 09:25
…eak reference keys in ActiveSpanTracker to ensure than map entries are automatically removed if there are a thread terminates without the OpenTelemetry scope being closed properly.
…tasks to ensure those tasks execute on multiple threads.
@laurit laurit merged commit c05c5b3 into signalfx:main Apr 3, 2025
26 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 3, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants