Unified jmx metrics definition and their evolution #13238

Closed
SylvainJuge opened this issue Feb 6, 2025 · 6 comments
@SylvainJuge
Contributor

This is more of a brain-dump/discussion-starter to gather feedback than a real issue; sorry if it gets a bit long.

With the addition of the jmx-scraper in contrib, we can reuse the YAML-based JMX metrics capture from instrumentation; however, the metric definitions themselves are still distinct and spread across multiple places:

  • JMX Gatherer relies on Groovy metric definitions and is expected to be replaced by JMX Scraper
  • the JMX Insight feature in instrumentation provides a YAML-configured JMX implementation with embedded YAML definitions for supported systems here.
  • JMX Scraper is a replacement for JMX Gatherer that relies on the JMX Insight implementation, with a currently distinct set of supported systems here.
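For context, JMX Insight and JMX Scraper consume the same kind of YAML rule. A minimal sketch of such a definition, following the format documented in the jmx-metrics instrumentation README (the bean and metric names here are illustrative):

```yaml
rules:
  - bean: java.lang:type=Threading
    mapping:
      ThreadCount:
        metric: my.app.jvm.thread.count
        type: updowncounter
        desc: The current number of threads
        unit: "{thread}"
```

Unifying the definitions would mean a single set of files like this, shared by both consumers.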

With the steps described in open-telemetry/opentelemetry-java-contrib#1362 covering the supported systems, the goal is to provide YAML metric definitions almost equivalent to the ones currently shipped with JMX Gatherer, enabling a smooth migration path.

However, even when the migration from Groovy-defined metrics is complete, we will still have two distinct sets of metrics, in instrumentation/jmx-metrics and in JMX Scraper, and we should aim to merge them for the following expected benefits:

  • minimize maintenance efforts and keep things aligned
  • capture identical metrics from within the JVM with instrumentation or from outside of the JVM with JMX Scraper
  • allow building common consumers of those metrics, for example dashboards that do not depend on how the metric was captured.

One of the downsides of having "one set of JMX metrics to rule them all" is that current users of JMX Scraper might not have any control over the version or stability of those metrics, for example:

  • let's assume that the reference YAML definitions are now part of instrumentation/jmx-metrics artifact
  • on every release of instrumentation, a new set of metrics will be included in JMX scraper with the dependency upgrade
  • as a result, changing the version of JMX scraper (for example when used through the otel jmxreceiver) would result in different metrics being captured, which could lead to unexpected behavior

For the "equivalent to groovy" legacy metric definitions that allow migration from JMX Gatherer, we can still embed them directly into JMX Scraper and then provide a config option to use them.

However, whenever the JMX definitions get enhanced/modified, the same compatibility issues can arise, and I wonder if and how we could iterate over the metrics definitions without breaking user expectations.

I think we could explore the following ideas to help provide some stability:

  • keep it simple: make jmx-scraper always use the latest version provided by instrumentation/jmx-metrics, thus giving the end-user control by selecting which version to use through the version of jmx-scraper.
  • keep it simple + legacy: same as previous but with an extra copy of the "legacy definitions"
  • if stability is needed on the user side, leverage the custom metrics YAML definition to manually download them from GitHub and use local copies.
  • package the YAML metrics definitions in a separate repository/artifact, which could then be reused by instrumentation and jmx-scraper; switching from one version to another would be done through a simple config option
  • on every release of new JMX metrics rules, embed a copy directly into every consumer, which has the downside of ever-growing size.

I think that if JMX metrics were defined as part of semantic conventions, we would probably have a "use latest version" approach, so the "keep it simple + legacy" option is probably the best compromise here. Using a local copy of previous or custom definitions is always possible, for example to use the latest version of jmx-scraper with older definitions.
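As a sketch of the "local copies" option: assuming the otel.jmx.config property accepts local YAML files (as it does in the jmx-metrics instrumentation), a user pinning definitions could do something like the following (the URL path and jar name are placeholders, not real paths):

```shell
# Pin a specific revision of a metric definition file (URL is a placeholder);
# review the downloaded file before use.
curl -fsSL -o tomcat.yaml \
  "https://raw.githubusercontent.com/open-telemetry/opentelemetry-java-instrumentation/<pinned-sha>/<path-to>/tomcat.yaml"

# Point the scraper at the pinned local copy instead of the bundled definitions
# (property name follows the jmx-metrics instrumentation; jar name is a placeholder).
java -Dotel.jmx.config=tomcat.yaml -jar jmx-scraper.jar
```

This keeps the captured metrics stable across scraper upgrades, at the cost of manual updates.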

In addition to all of that, from the perspective of the consumers, I think we need a way to know which version of the metrics has been used. If the metrics are embedded in instrumentation/jmx-metrics, we should probably reuse that artifact version in the sent data, and when using custom yaml files we should probably allow setting an explicit value per yaml file for later identification. I am not very familiar with the OTLP protocol for metrics, so maybe this is not something doable.

@SylvainJuge
Contributor Author

After discussing this a bit on Slack and live with @robsunday today, I think we should aim to do the following:

  • make instrumentation, scraper and gatherer aligned
  • for instrumentation and scraper this can be achieved by using a single location for the yaml files, for the gatherer it will be done by updating the groovy definitions
  • keeping the gatherer aligned helps to reduce the gap between scraper and gatherer, hence keeping the migration to scraper an easy step
  • changes should be done incrementally, one system at a time
  • only include the "latest version" of the metrics definitions, so updating the jmx-gatherer or jmx-scraper version will imply different metrics being captured
  • stability or compatibility with existing metrics can always be provided by using jmx-scraper and previous versions of the yaml files.

We plan to start with the following target systems:

  • JVM metrics, which are defined in semconv runtime metrics; the instrumentation implementation does not rely on yaml but is compliant with semconv.
  • Tomcat, which is not defined in semconv but is covered in instrumentation, jmx-gatherer and jmx-scraper.

@SylvainJuge
Contributor Author

SylvainJuge commented Feb 21, 2025

Following yesterday's Java SIG meeting, here is an update on the approach we will take to move this forward:

  • keep jmx-gatherer as-is including its metrics definitions
  • keep the current YAML definitions of jmx-scraper as-is so they can provide a "compatibility mode" that will 99% match the current jmx-gatherer implementation
  • in instrumentation, move the yaml files so they are included in the jmx insight library to allow reusing them with jmx-scraper (and maybe others)
  • add an option (disabled by default) in jmx-scraper to use the "legacy" yaml definitions that are almost equivalent to the jmx-gatherer ones; the default will use the ones from instrumentation
  • once the yaml files have been moved, we can start aligning/modifying them in instrumentation.
  • changes to metrics definitions in instrumentation will be propagated to jmx-scraper when instrumentation is released and the dependency is updated. Making sure that the jmx-scraper changelog explicitly includes those changes will be important for end-users.

For the JVM metrics:

  • we currently can't exactly replicate semconv definitions due to jmx metrics with yaml: need ability to post-process some attribute values #13361
  • we will create a jvm.yaml in instrumentation even if it's not used there directly, as the current implementation in instrumentation/runtime-telemetry/ is already available and aligned with semconv; there are also a few metrics (I think one or two) that can't be captured with JMX, so those can't be replicated with YAML. This YAML file will only be used by jmx-scraper to capture those metrics.

For other metrics

@SylvainJuge
Contributor Author

SylvainJuge commented Feb 25, 2025

We discussed time unit conversion with @robsunday today and found that there is probably an extra challenge in aligning JMX metrics with semconv recommendations regarding the metric type, worth discussing during the next SIG meeting.

  • http.server.request.duration in semconv is defined as a histogram with seconds as the unit (with explicit boundaries).
  • histograms are relevant when all individual values can be collected, in this case the duration of every HTTP request processed
  • tomcat exposes a "time spent serving requests" value whose definition is close to http.server.request.duration, but it is a cumulative sum of milliseconds for which we don't have the individual values
  • histograms are not currently supported in yaml metrics
  • with JMX, the metrics captured are always sampled when querying MBean attributes; only JMX notifications (which very probably can't be supported with YAML) would give access to some individual values.
  • When collecting "time spent serving requests in ms", we can use a counter metric type because we rely on the fact that this value will always increase. Switching the unit to seconds would require capturing it as a gauge because it becomes a floating-point number, and the consumer would need to be aware that this value always increases, which was previously implicitly conveyed by the counter type. UPDATE: we have DoubleCounter so converting to a gauge is not needed.
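The precision concern above can be made concrete: converting a cumulative millisecond value to whole seconds as a long discards the fractional part, while reporting it as a double (which DoubleCounter can record) keeps it. A minimal plain-Java sketch (the class and variable names are illustrative):

```java
public class DurationUnitConversion {

    // Convert a cumulative "time spent" value from milliseconds to seconds,
    // keeping the fractional part by using a double.
    static double msToSeconds(long milliseconds) {
        return milliseconds / 1000.0;
    }

    public static void main(String[] args) {
        long totalMs = 1250;                  // e.g. a cumulative processing time MBean attribute
        long truncated = totalMs / 1000;      // integer division: 1, loses 0.25 s
        double exact = msToSeconds(totalMs);  // 1.25, suitable for a DoubleCounter
        System.out.println(truncated + " " + exact); // prints: 1 1.25
    }
}
```

With a double-valued counter, the always-increasing semantics of the source value are preserved even after the unit conversion.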

There are a few likely implications:

  • alignment with (or at least close approximation of) semconv can't be followed: we can't easily create a tomcat.http.server.request.duration metric that is similar to http.server.request.duration, and if we do so we need to properly document the differences; the tomcat. namespace here would be relevant to provide some clarity.
  • metrics captured through JMX and defined in YAML are limited to sampling, so they don't allow capturing histograms.
  • converting time-based units to seconds might not be something we should always do, because it means either losing accuracy (keeping counters but discarding the decimal part) or capturing values as gauges and losing some properties of the metric. UPDATE: unlike what I initially thought, we can have DoubleCounter so this is very probably not an issue.

I think this pushes us to capture JMX metrics "as they are exposed" rather than attempt to always fit the semantic conventions that could apply.

This does not apply to metrics like jvm.cpu.time (captured in nanoseconds), which is defined in semconv in seconds, as the conversion should be covered by #13369.


EDIT: it seems that it is possible to have a "counter for doubles", as seen in the current example of jvm.cpu.time metric capture in instrumentation, so changing the value from a long to a double does not seem to prevent us from capturing similar metrics as fractions of a second. However, the question remains open whether it's a relevant choice for metrics that are not part of semconv.

@trask
Member

trask commented Feb 26, 2025

Switching the unit to seconds would require to capture it as a gauge because it's now a float number

Just checking that you've seen DoubleCounter?

@SylvainJuge
Contributor Author

Switching the unit to seconds would require to capture it as a gauge because it's now a float number

Just checking that you've seen DoubleCounter?

Yes, I wasn't aware of it when I wrote that but found it in the meantime; I'll update my comment to better reflect that.

@SylvainJuge
Contributor Author

I think we can close this issue as the plan is now clear; changing metrics definitions will be handled using the following strategy:

  • jmx-scraper provides a set of yaml definitions that are very close to JMX gatherer
  • "unified and aligned" yaml metrics definitions will be added to instrumentation, for example with jmx add jvm metrics yaml #13392
  • for JMX Scraper, priority between the latest metrics definitions and the legacy gatherer-equivalent ones can be set using the otel.jmx.target.source configuration (see jmx reuse instrumentation metrics opentelemetry-java-contrib#1782 for details); by default the definitions from instrumentation are used, with a fallback on the legacy ones if not yet available in instrumentation.
  • metrics captured with YAML have some inherent limitations
    • there is currently no support for histograms
    • some metrics can't be captured at all through JMX, or might have missing attributes (for example some from the JVM can only be captured with JFR).
  • for the ability to filter out negative values, see related issues/PRs:
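As a sketch of how the otel.jmx.target.source switch could be used (the property name comes from the discussion above; the instrumentation and legacy values and the jar name are assumptions — the exact accepted values are defined in opentelemetry-java-contrib#1782):

```shell
# Assumed default: use the definitions bundled from instrumentation,
# falling back to the legacy ones when a system is not yet covered there.
java -Dotel.jmx.target.source=instrumentation -jar jmx-scraper.jar

# Force the legacy, gatherer-equivalent definitions for compatibility.
java -Dotel.jmx.target.source=legacy -jar jmx-scraper.jar
```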
