
Additional Thread Metrics #13483

Open
akats7 opened this issue Mar 10, 2025 · 15 comments
Labels
enhancement New feature or request

Comments

@akats7
Contributor

akats7 commented Mar 10, 2025

Is your feature request related to a problem? Please describe.

The current scope of thread metrics appears to be limited to thread count; there are other thread-based metrics that are rather critical, such as thread CPU time and metrics based on thread state.

Describe the solution you'd like

Add additional thread metrics for:

jvm.thread.cpu_time
jvm.thread.user_time

Describe alternatives you've considered

Using the JMX Gatherer

Additional context

No response

@akats7 akats7 added enhancement New feature or request needs triage New issue that requires triage labels Mar 10, 2025
@steverao
Contributor

Could you clarify whether the metrics jvm.thread.blocked, jvm.thread.waiting and jvm.thread.timed_waiting represent the number of threads in the corresponding states?

@steverao steverao added needs author feedback Waiting for additional feedback from the author and removed needs triage New issue that requires triage labels Mar 10, 2025
@akats7
Contributor Author

akats7 commented Mar 10, 2025

Yeah, that is what I meant, but I now see that there's a count metric emitted per state. In that case, just CPU and user time seem to be the gap.

@github-actions github-actions bot removed the needs author feedback Waiting for additional feedback from the author label Mar 10, 2025
@trask
Member

trask commented Mar 10, 2025

hi @akats7!

what attributes would you propose on jvm.thread.cpu_time / jvm.thread.user_time?

@akats7
Contributor Author

akats7 commented Mar 10, 2025

Hey @trask,

I'd have to dig a bit into the internals of the runtime metric modules, but one approach could be to just support this for JMX.

@akats7
Contributor Author

akats7 commented Mar 13, 2025

@trask can we add CPU time to runtime metrics through the thread MBean obtained via ManagementFactory? We would then rely on MBean operations to get the time values. I get that there may be cardinality concerns, since thread name/pool name would have to be an attribute, so it could be disabled by default.
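For reference, a minimal sketch of what reading these values in-process could look like (class name hypothetical; `ThreadMXBean` and its `getThreadCpuTime`/`getThreadUserTime` methods are standard `java.lang.management` API, and CPU-time measurement is optional per the spec, so the support check matters):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCpuTimeSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // CPU-time measurement is an optional JVM feature, so guard on support.
        if (threads.isThreadCpuTimeSupported() && threads.isThreadCpuTimeEnabled()) {
            long id = Thread.currentThread().getId();
            long cpuNs = threads.getThreadCpuTime(id);   // total CPU time, nanoseconds
            long userNs = threads.getThreadUserTime(id); // user-mode portion, nanoseconds
            System.out.println("cpu=" + cpuNs + "ns user=" + userNs + "ns");
        }
    }
}
```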

@trask
Member

trask commented Mar 13, 2025

@SylvainJuge @robsunday I'm hesitant for people to add new JMX metrics in the middle of your convergence effort, so would like to defer to you here

@akats7
Contributor Author

akats7 commented Mar 14, 2025

Thanks @trask! I do want to point out that these are rather important metrics. We've had a lot of internal requests for this from users who are migrating from vendor products that supported this out of the box.

@SylvainJuge
Contributor

To expand a bit on the "convergence effort" context: with #13392 we are currently trying to add JVM metrics in a YAML descriptor. This YAML will NOT be used directly by instrumentation, but will in the future be used by jmx-scraper, a CLI program replacing JMX Gatherer that uses the same JMX implementation as instrumentation (and thus inherits its YAML support).

What we are currently focusing on for JVM metrics in YAML is the ability to capture them in a way that is compliant with semantic conventions, which the instrumentation/runtime-telemetry modules already do, but in code.

The instrumentation/runtime-telemetry modules use code and JMX listeners that can't be replicated with YAML, so some of the metrics we can capture with YAML can't be exactly replicated.
For example, with jvm.thread.count we can't capture the jvm.thread.state or jvm.thread.daemon attributes. In short, depending on how the metric is captured, we may or may not be able to provide the expected details and attributes, which then makes those attributes optional/recommended only.
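As an illustration of why state-tagged counts need in-process code rather than a single scraped attribute, here is a hypothetical sketch deriving per-state thread counts from `ThreadInfo` (the actual runtime-telemetry implementation may differ):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCounts {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getThreadInfo requires a per-thread call/operation, not a plain attribute read.
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null) { // a thread may have died since the id snapshot
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        System.out.println(counts);
    }
}
```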

I think we can add new metrics even if the current work is still in-progress, I would suggest to do that in a few steps:

  • experiment with yaml to see if and how those could be captured remotely through JMX with jmx-scraper
  • discuss their definition here
  • add them to runtime-telemetry modules to validate we can capture them as expected
  • contribute their definitions to semantic conventions as experimental (this could mean having to change the implementations done previously).
  • add their semconv-compliant definitions to the jvm.yaml being added in jmx add jvm metrics yaml #13392, which hopefully will have been merged in the meantime.

As a temporary workaround, if you are able to capture those with YAML configuration, you should be able to provide a YAML file for them. However, this is not a great OOTB experience and could easily break if the metric definition changes when it is added to semconv.

@akats7
Contributor Author

akats7 commented Mar 14, 2025

Hey @SylvainJuge, thanks for the context. So part of the issue is that, I believe, the jmx-scraper is only able to scrape attributes and not execute operations, which would be required for these metrics.

In regards to the experimentation, I've already done this with the JMX Gatherer, since it allows you to interact directly with the MBeans when using a custom script. However, since the Gatherer's instruments also only allow the use of attributes, I had to rely on transformation closures to overwrite other MBeans, which is not ideal.

Also, if possible, part of this ask is to be able to move away from the remote approach. I might be missing something, but is there a reason it's preferable to interact with a JMX server vs. just scraping directly, since the javaagent runs in the same JVM?

@SylvainJuge
Contributor

Also, if possible, part of this ask is to be able to move away from the remote approach. I might be missing something, but is there a reason it's preferable to interact with a JMX server vs. just scraping directly, since the javaagent runs in the same JVM?

Ideally, we should not force users to deploy an instrumentation agent to capture runtime metrics if those could be obtained externally with JMX scraper or gatherer.

However, we already have the case of some metrics that can't be captured without instrumentation and explicit code, as they can only be captured from within the JVM, either because they require advanced JMX features or rely on JFR events. So this is something we can already do, but it adds more constraints on users; for example, the JVM metrics are not exactly the same on Java 17 as on Java 8, which could lead to user confusion or unmet expectations.

If I understand it correctly, those metrics would be more in the "runtime-telemetry only" category and would be very unlikely to be supported through YAML due to needing some post-processing, is that correct?

Also, could you elaborate a bit on their definitions/attributes and on which MBean attributes they would be captured from?

@akats7
Contributor Author

akats7 commented Mar 14, 2025

Yep, that's exactly right. For example, to get cpu_time we'd likely need to get the AllThreadIds attribute and then call getThreadCpuTime, as well as getThreadInfo for attributes such as the thread name.
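A rough sketch of that enumeration against the local ThreadMXBean (class name hypothetical; over remote JMX the same data would come from the AllThreadIds attribute plus the corresponding MBean operations):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class PerThreadCpuTime {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return; // per-thread CPU time is an optional JVM feature
        }
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            long cpuNs = threads.getThreadCpuTime(id);
            // Either call can report "gone" if the thread died after getAllThreadIds().
            if (info == null || cpuNs < 0) {
                continue;
            }
            // info.getThreadName() would become the high-cardinality attribute
            // discussed above (hence the suggestion to disable this by default).
            System.out.printf("%s cpu=%dns user=%dns%n",
                info.getThreadName(), cpuNs, threads.getThreadUserTime(id));
        }
    }
}
```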

And I understand that this utility should still exist for users who want these metrics but don't need the other functionality of the agent. But the situation we find ourselves in is that the majority of our teams already leveraging the agent for instrumentation also need these metrics, so it would be ideal not to have to configure a JMX server and an additional scraping process when the agent is already in place.

@SylvainJuge
Contributor

I agree with you @akats7, this is probably a use case for which we could either document, or provide a dedicated config option for, capturing only runtime metrics (or JMX metrics) and sending them over OTLP, without any instrumentation or tracing involved.

For JMX metrics that are defined in YAML, this could help provide details on JVM runtime metrics while still allowing metrics defined in YAML to be captured; for example, if you run a Kafka broker or cluster, it would be relevant to capture both by adding the agent to the JVM.

@akats7
Contributor Author

akats7 commented Mar 14, 2025

So just to clarify, is there a path forward to add these as experimental out-of-the-box JVM metrics? I'd be happy to contribute this.

@SylvainJuge
Contributor

If those new metrics are only captured through code, their implementation is part of instrumentation/runtime-telemetry and would not have to be replicated with YAML at all, which makes things a bit simpler.

In order to add/change things in semconv, we need to have at least an experimental implementation to validate that what is being added to semconv is correct and technically achievable. That creates a kind of chicken-and-egg problem, so you have to work on both sides at the same time.

I would suggest to do the following:

  1. add definition of any new experimental metrics in semconv in a draft PR
  2. add support for those in instrumentation in a draft PR; the implementation would live in instrumentation/runtime-telemetry and would likely require adding variants for Java 8 (JMX) and Java 17 (JFR/JMX)
  3. update the semconv draft PR to match the implementation, make it non-draft, and start discussing details like metric names and attributes if needed; if it reuses existing attributes and fits existing metrics, I don't expect this to raise many discussions.
  4. Once the semconv PR is merged, update the instrumentation PR with any semconv changes, then mark it as ready for review. This last step should be quite quick, as most of the discussion should have happened in the semconv PR.

@akats7
Contributor Author

akats7 commented Mar 17, 2025

@SylvainJuge That sounds like a plan to me, thanks!
