feat(profiling): add sample count to internal payload #15177
Conversation
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary: the average import time from this PR is 251 ± 4 ms; the average import time from base is 257 ± 3 ms; the import time difference between this PR and base is -6.3 ± 0.1 ms.

Import time breakdown: the following import paths have shrunk:
Performance SLOs

Comparing candidate kowalski/feat-profiling-add-sample-count-to-internal-payload (de09d48) with baseline main (73f2611).

📈 Performance Regressions (1 suite): telemetryaddmetric – 30/30

| Benchmark | Time | Time SLO | Time vs baseline | Memory | Memory SLO | Memory vs baseline |
| --- | --- | --- | --- | --- | --- | --- |
| 1-count-metric-1-times | ✅ 3.215µs | <20.000µs (📉 -83.9%) | 📈 +11.0% | ✅ 34.505MB | <35.500MB (-2.8%) | +4.7% |
| 1-count-metrics-100-times | ✅ 198.126µs | <220.000µs (-9.9%) | -0.4% | ✅ 34.485MB | <35.500MB (-2.9%) | +4.6% |
| 1-distribution-metric-1-times | ✅ 3.411µs | <20.000µs (📉 -82.9%) | +3.2% | ✅ 34.505MB | <35.500MB (-2.8%) | +5.0% |
| 1-distribution-metrics-100-times | ✅ 214.003µs | <220.000µs (-2.7%) | -0.7% | ✅ 34.524MB | <35.500MB (-2.7%) | +4.8% |
| 1-gauge-metric-1-times | ✅ 2.154µs | <20.000µs (📉 -89.2%) | ~same | ✅ 34.524MB | <35.500MB (-2.7%) | +5.0% |
| 1-gauge-metrics-100-times | ✅ 136.548µs | <150.000µs (-9.0%) | ~same | ✅ 34.544MB | <35.500MB (-2.7%) | +4.6% |
| 1-rate-metric-1-times | ✅ 3.089µs | <20.000µs (📉 -84.6%) | +0.9% | ✅ 34.465MB | <35.500MB (-2.9%) | +4.7% |
| 1-rate-metrics-100-times | ✅ 211.668µs | <250.000µs (📉 -15.3%) | -0.6% | ✅ 34.485MB | <35.500MB (-2.9%) | +4.8% |
| 100-count-metrics-100-times | ✅ 20.314ms | <22.000ms (-7.7%) | +0.8% | ✅ 34.505MB | <35.500MB (-2.8%) | +5.0% |
| 100-distribution-metrics-100-times | ✅ 2.293ms | <2.300ms (🟡 -0.3%) | ~same | ✅ 34.603MB | <35.500MB (-2.5%) | +5.4% |
| 100-gauge-metrics-100-times | ✅ 1.409ms | <1.550ms (-9.1%) | -0.4% | ✅ 34.544MB | <35.500MB (-2.7%) | +4.9% |
| 100-rate-metrics-100-times | ✅ 2.222ms | <2.550ms (📉 -12.9%) | +1.6% | ✅ 34.485MB | <35.500MB (-2.9%) | +4.9% |
| flush-1-metric | ✅ 4.651µs | <20.000µs (📉 -76.7%) | -0.3% | ✅ 34.524MB | <35.500MB (-2.7%) | +5.1% |
| flush-100-metrics | ✅ 175.751µs | <250.000µs (📉 -29.7%) | ~same | ✅ 34.485MB | <35.500MB (-2.9%) | +4.8% |
| flush-1000-metrics | ✅ 2.105ms | <2.500ms (📉 -15.8%) | -0.7% | ✅ 35.291MB | <36.500MB (-3.3%) | +4.9% |

🟡 Near SLO Breach (1 suite): otelspan – 22/22

| Benchmark | Time | Time SLO | Time vs baseline | Memory | Memory SLO | Memory vs baseline |
| --- | --- | --- | --- | --- | --- | --- |
| add-event | ✅ 38.313ms | <47.150ms (📉 -18.7%) | -0.9% | ✅ 38.903MB | <47.000MB (📉 -17.2%) | +4.6% |
| add-metrics | ✅ 257.661ms | <344.800ms (📉 -25.3%) | +0.6% | ✅ 43.215MB | <47.500MB (-9.0%) | +4.7% |
| add-tags | ✅ 315.348ms | <321.000ms (🟡 -1.8%) | ~same | ✅ 43.234MB | <47.500MB (-9.0%) | +4.7% |
| get-context | ✅ 78.795ms | <92.350ms (📉 -14.7%) | +0.4% | ✅ 39.294MB | <46.500MB (📉 -15.5%) | +4.8% |
| is-recording | ✅ 36.060ms | <44.500ms (📉 -19.0%) | ~same | ✅ 38.746MB | <47.500MB (📉 -18.4%) | +5.4% |
| record-exception | ✅ 57.003ms | <67.650ms (📉 -15.7%) | -0.2% | ✅ 39.466MB | <47.000MB (📉 -16.0%) | +5.0% |
| set-status | ✅ 42.578ms | <50.400ms (📉 -15.5%) | +0.8% | ✅ 38.833MB | <47.000MB (📉 -17.4%) | +4.8% |
| start | ✅ 35.351ms | <43.450ms (📉 -18.6%) | +0.2% | ✅ 38.894MB | <47.000MB (📉 -17.2%) | +5.6% |
| start-finish | ✅ 81.989ms | <88.000ms (-6.8%) | -0.1% | ✅ 36.667MB | <46.500MB (📉 -21.1%) | +4.9% |
| start-finish-telemetry | ✅ 83.511ms | <89.000ms (-6.2%) | ~same | ✅ 36.687MB | <46.500MB (📉 -21.1%) | +4.9% |
| update-name | ✅ 36.874ms | <45.150ms (📉 -18.3%) | -0.5% | ✅ 38.994MB | <47.000MB (📉 -17.0%) | +5.1% |
Force-pushed from b7a5bfb to 6c2b440
```cpp
        std::cerr << "Error writing to output file " << filename << ": " << strerror(errno) << std::endl;
        return false;
    }
```
NIT: what version of C++ are we on? I don't know if we could be using std::filesystem::path
```cpp
#include <filesystem>
#include <fstream>
#include <string_view>
#include <system_error>

bool write_internal_metadata(std::string_view filename, std::string_view json)
{
    namespace fs = std::filesystem;
    auto path = fs::path(filename).concat(".internal_metadata.json");
    std::error_code ec;
    std::filesystem::create_directories(path.parent_path(), ec); // probably not needed ?
    std::ofstream(path, std::ios::binary).write(json.data(), static_cast<std::streamsize>(json.size()));
    return !ec;
}
```
this is mostly for my curiosity, feel free to ignore.
We are on C++17 in dd-trace-py. DSN asked why a few days ago and I believe the answer was "for no specific reason" – we could be on C++20 if we wanted to, I think.
That said, I remember reading that using std::filesystem was not OK in dd-trace-py (from a document written by David). I don't know if that's still the case – here it is for reference: https://github.com/datadog/dd-trace-py/blob/acd8a7daef303f4a294a6dce1559162214b868d0/ddtrace/internal/datadog/profiling/docs/Design.md?plain=1#L51-L54
We're on C++20:
`target_compile_features(${target} PUBLIC cxx_std_20)`
Does that mean David's comment/note is outdated? If so, I can update it and use std::filesystem here (in another PR).
r1viollet left a comment:
LGTM
Just a thought around reducing usage of statics
Force-pushed from 7c5292f to a2453d8
r1viollet left a comment:
LGTM
Force-pushed from 8120980 to d60b3b4
We're still having issues with the availability of instances for benchmarks to run; I probably won't be able to merge today.
Force-pushed from 2f53db7 to de09d48
Description
https://datadoghq.atlassian.net/browse/PROF-12975
This PR adds data to the `internal` payload we send to `libdatadog`. In this `internal` payload, we can push custom metrics that are not exposed to customers but that we can access for analytics.

For the time being, I only added the number of Samples taken (one per thread) and Sampling Events (one for each time we run `for_each_interp`) for the current Profile. We may want to add more in the short term (e.g. the number of times adaptive sampling was adjusted, the history of adaptive sampling intervals, etc.).

In order to implement this feature in a way that keeps clear semantics around locking the Profile object, I refactored the code to borrow not only the single `Profile` but both the `Profile` and the `ProfilerStats` (in which we store our metrics). To keep locking clear and explicit, locking returns a `ProfileBorrow` object which gives access to both the `Profile` and the `ProfilerStats` and is RAII-compatible (it automatically unlocks when it goes out of scope).

Note the PR does add some "lock contention", because we now need to take the Profile lock in two new places (one for each metric we increment) – it should be the same order of magnitude as what we already do, though (in number of lock acquisitions).
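For illustration only, here is a minimal sketch of what such a borrowing scheme can look like. `ProfilerStats` and `ProfileBorrow` are the names used in this PR; the `Profile` stand-in, the member names, and the `record_sampling_event` helper are hypothetical and only serve to show the RAII locking idea, not the actual dd_wrapper code.

```cpp
#include <cstdint>
#include <mutex>
#include <string>

struct Profile
{
    std::string name; // stand-in for the real libdatadog-backed profile
};

struct ProfilerStats
{
    // Internal-only counters pushed into the "internal" payload.
    std::uint64_t samples_taken = 0;   // one per sampled thread
    std::uint64_t sampling_events = 0; // one per for_each_interp pass
};

class ProfileState
{
  public:
    // RAII borrow: holds the lock for its lifetime and exposes both the
    // Profile and the ProfilerStats, so neither can be touched without
    // owning the lock that protects both.
    class ProfileBorrow
    {
      public:
        explicit ProfileBorrow(ProfileState& state)
          : lock_(state.mutex_)
          , profile_(state.profile_)
          , stats_(state.stats_)
        {
        }
        Profile& profile() { return profile_; }
        ProfilerStats& stats() { return stats_; }

      private:
        std::unique_lock<std::mutex> lock_; // unlocks automatically on scope exit
        Profile& profile_;
        ProfilerStats& stats_;
    };

    ProfileBorrow borrow() { return ProfileBorrow(*this); }

  private:
    std::mutex mutex_;
    Profile profile_;
    ProfilerStats stats_;
};

// One of the two new lock acquisitions mentioned above, as a free function:
void record_sampling_event(ProfileState& state, std::uint64_t threads_sampled)
{
    auto borrowed = state.borrow();
    borrowed.stats().sampling_events += 1;
    borrowed.stats().samples_taken += threads_sampled;
} // lock released here, when `borrowed` goes out of scope
```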
We plan to refactor Stack V2 in the short term to avoid having to hold the Profile / ProfilerStats lock during the upload, by using double buffering or by releasing the lock earlier (after the Profile data has been serialised). This should improve performance and reduce lost Samples.
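To make the double-buffering option concrete, here is a rough, hypothetical sketch (not the planned Stack V2 implementation, and all names are made up): sampler threads keep appending to the active buffer, while the exporter swaps the buffer out under the lock and then serialises/uploads it without holding the lock.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

struct Sample
{
    std::string thread_name;
    std::uint64_t weight = 1;
};

class DoubleBufferedProfile
{
  public:
    // Called from sampler threads: short critical section, append only.
    void add_sample(Sample sample)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        active_->push_back(std::move(sample));
    }

    // Called by the exporter: swap buffers under the lock, then serialise and
    // upload the returned snapshot outside of it, so samplers never wait for
    // the duration of the upload.
    std::vector<Sample> take_snapshot()
    {
        auto fresh = std::make_unique<std::vector<Sample>>();
        std::unique_lock<std::mutex> guard(mutex_);
        active_.swap(fresh);
        guard.unlock();
        return std::move(*fresh);
    }

  private:
    std::mutex mutex_;
    std::unique_ptr<std::vector<Sample>> active_ = std::make_unique<std::vector<Sample>>();
};
```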
The PR also adds an integration test ensuring that the generated JSON string is correct (we can't guarantee exact numbers, but we can check that the numbers we have are meaningful).
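The test itself is only described here, so the following is a purely hypothetical sketch (in C++ for consistency with the rest of this thread, even though the actual integration test lives in the Python test suite) of what "meaningful numbers" can mean in practice: the counter keys exist and parse to positive integers. The key names `samples_taken` and `sampling_events` are assumptions, not the real schema.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Naive counter extraction, good enough for a smoke check on a small JSON blob.
long extract_counter(const std::string& json, const std::string& key)
{
    std::regex pattern("\"" + key + "\"\\s*:\\s*(\\d+)");
    std::smatch match;
    if (!std::regex_search(json, match, pattern)) {
        return -1; // key missing
    }
    return std::stol(match[1].str());
}

int main()
{
    // In the real test this JSON would come from the ".internal_metadata.json"
    // file written next to the profile; here it is inlined for illustration.
    const std::string internal_json = R"({"samples_taken": 1234, "sampling_events": 120})";

    assert(extract_counter(internal_json, "samples_taken") > 0);
    assert(extract_counter(internal_json, "sampling_events") > 0);
    return 0;
}
```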
Note I marked the PR as `no-changelog` as this feature isn't public/visible to customers.

Open questions:
This is an example of querying through the Events UI.