@aaronc
Member

@aaronc aaronc commented Oct 29, 2025

Description

This PR:

  • adds out-of-the-box global OpenTelemetry declarative configuration support to the telemetry/ package via https://pkg.go.dev/go.opentelemetry.io/contrib/otelconf/v0.3.0
  • deprecates all the existing methods in telemetry/ that are based on https://pkg.go.dev/github.com/hashicorp/go-metrics, which is under-maintained and requires mutex locks and map lookups on every telemetry call
  • provides default routing of go-metrics telemetry data to OpenTelemetry when OpenTelemetry is enabled
  • instruments BaseApp with OpenTelemetry spans and with block and tx counter metrics
  • integrates OpenTelemetry shutdown into server and provides a TestingMain function so that tests can export telemetry data
  • configures the log/slog default logger to send logs to OpenTelemetry and allows the otelslog bridge to be used for logging. NOTE: this leaves a bit of a disconnect with the SDK's existing logging infrastructure, which currently just writes to stdout. This can either be dealt with in this PR or in a follow-up

The OpenTelemetry Go libraries are very actively maintained, most vendors in the space are adding OpenTelemetry support, and the industry generally seems to be headed in this direction. Much of our existing telemetry code exists only to configure basic telemetry exporting, but with otelconf declarative config we don't need to maintain any of this ourselves, and the out-of-the-box experience is quite simple, even for use in testing.
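To make the per-call cost mentioned above concrete, here is a minimal sketch contrasting a name-keyed counter (a lock plus a map lookup on every increment) with an instrument handle that is created once and then reused. This is a simplified model for illustration only: it is not go-metrics' actual implementation, and the type and function names are made up for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// keyedCounters models the "look up by name on every call" style:
// each increment takes a lock and performs a map lookup.
// Simplified illustration, not go-metrics' real code.
type keyedCounters struct {
	mu     sync.Mutex
	counts map[string]int64
}

func newKeyedCounters() *keyedCounters {
	return &keyedCounters{counts: make(map[string]int64)}
}

// Incr locks the registry, finds the counter by name, and bumps it.
func (k *keyedCounters) Incr(name string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.counts[name]++
}

// Value returns the current count for name (0 if never incremented).
func (k *keyedCounters) Value(name string) int64 {
	k.mu.Lock()
	defer k.mu.Unlock()
	return k.counts[name]
}

// counterHandle models the "create the instrument once, keep the handle"
// style: the hot path is a single atomic add with no lock or lookup.
type counterHandle struct {
	n atomic.Int64
}

func (c *counterHandle) Incr()        { c.n.Add(1) }
func (c *counterHandle) Value() int64 { return c.n.Load() }

func main() {
	keyed := newKeyedCounters()
	handle := &counterHandle{}
	for i := 0; i < 3; i++ {
		keyed.Incr("tx.count") // lock + map lookup each time
		handle.Incr()          // atomic add only
	}
	fmt.Println(keyed.Value("tx.count"), handle.Value())
}
```

Both styles count the same thing; the difference is only in what happens on the hot path, which matters for code that runs on every block or transaction.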

@github-actions github-actions bot removed the C:log label Oct 31, 2025
@aaronc aaronc changed the title from "feat: add tracing api and instrument BaseApp" to "feat: OpenTelemetry configuration and BaseApp instrumentation" Nov 7, 2025
@aaronc aaronc requested a review from technicallyty November 10, 2025 22:19
return s.listener.Close()
}

// Deprecated
Contributor

before merge, let's give the deprecation notices some more context, and encourage users to move to otel

Comment on lines 57 to 61
cfgJson, err := json.Marshal(cfg)
if err != nil {
return fmt.Errorf("failed to marshal telemetry config file: %w", err)
}
fmt.Printf("\nInitializing telemetry with config:\n%s\n\n", cfgJson)
Contributor

i assume this is just for testing?

Member Author

Yes. But I think that it is somewhat useful to have some debugging indicating whether opentelemetry is getting initialized or not. Maybe just a single line without dumping the config?
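A minimal sketch of what that single-line message could look like, in place of marshalling and printing the whole config. The function name `telemetryInitLine`, the message wording, and the `otel.yaml` path are all hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"os"
)

// telemetryInitLine builds a one-line status message instead of dumping
// the full config. Name and message format are hypothetical.
func telemetryInitLine(enabled bool, configPath string) string {
	if !enabled {
		return "opentelemetry: disabled (no config file set)"
	}
	return fmt.Sprintf("opentelemetry: initialized from %s", configPath)
}

func main() {
	// Write to stderr so the message does not mix with command output.
	fmt.Fprintln(os.Stderr, telemetryInitLine(true, "otel.yaml"))
}
```

This keeps the "is otel on or not" signal while avoiding printing config values that may include endpoints or headers.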

}
}

func doInit() error {
Contributor

quite a lot going on here - do you see this simplifying in the future?

Member Author

Upstream, I think otelconf will add the file exporters, so most of the cosmos extra will go away, although we probably want to keep the runtime instrumentation stuff. We may also want some other cosmos extra config options.

@github-actions github-actions bot added the C:log label Nov 11, 2025
@github-actions github-actions bot removed the C:log label Nov 14, 2025
return sdkerrors.QueryResult(errorsmod.Wrap(sdkerrors.ErrUnknownRequest, "no query path provided"), app.trace), nil
}

// TODO: propagate context with span into the sdk.Context used for queries so that we can trace queries properly
Contributor

do we have a tracking issue for this?

# The fallback is the db_backend value set in CometBFT's config.toml.
app-db-backend = "{{ .BaseConfig.AppDBBackend }}"
###############################################################################
Contributor

wouldn't this actually break existing telemetry users? like, without this it's not exactly deprecated but fully broken?

Member Author

I guess it breaks the testnet command which is using this to generate config files. So I can revert. I was just thinking that we shouldn't encourage users to use it by generating this in the config file.

Contributor

maybe we can just put a deprecation notice comment here as well

Contributor

did more testing, if we started a new app with this removed, then we'd lose access to all cosmos SDK metrics currently wired with the legacy go-metrics system.

@technicallyty
Contributor

technicallyty commented Nov 14, 2025

if you don't have the grafana/otel-lgtm stack running, and you run the SDK with a valid otel configuration, there is quite a delay between pressing ctrl-c and the app actually exiting.

edit: and it displays a nil pointer exception error
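One possible mitigation, sketched below, is to bound the telemetry shutdown with a timeout and to tolerate a missing shutdown hook. This assumes the delay comes from an exporter retrying against an unreachable collector and that the nil pointer comes from an unset hook; neither cause is confirmed in this thread. The names `shutdownFunc`, `shutdownWithTimeout`, and `hangingShutdown` are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shutdownFunc is the usual shape of an SDK shutdown hook.
type shutdownFunc func(ctx context.Context) error

// shutdownWithTimeout runs shutdown with a deadline and treats a nil hook
// as a no-op, so a slow or missing exporter cannot block process exit.
func shutdownWithTimeout(shutdown shutdownFunc, timeout time.Duration) error {
	if shutdown == nil {
		return nil // telemetry was never initialized
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// Buffered so the goroutine can finish even if we stop waiting.
	done := make(chan error, 1)
	go func() { done <- shutdown(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("telemetry shutdown: %w", ctx.Err())
	}
}

// hangingShutdown simulates a hook that ignores its context, for example
// an exporter stuck retrying against a collector that is not running.
func hangingShutdown(ctx context.Context) error {
	time.Sleep(500 * time.Millisecond)
	return nil
}

func main() {
	// A nil hook is simply skipped.
	fmt.Println(shutdownWithTimeout(nil, time.Second))

	// A hanging hook is abandoned once the deadline passes.
	err := shutdownWithTimeout(hangingShutdown, 50*time.Millisecond)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

The trade-off is that telemetry still buffered at exit may be dropped when the deadline is hit, so the timeout should be long enough for a healthy collector to flush.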

@gjermundgaraba
Contributor

Some metrics I would like to have (maybe some of them already exist):

  • Number of peers (and ideally a breakdown of
  • Block creation times
  • Number of transactions in blocks
  • Consensus status (basically things that you get today from the consensus status endpoint)
  • RPC Query latency (mostly for EVM though)
  • Account counter
  • Contract counter (EVM)

As we keep debugging, my assumption is that we'll find more things, so I would also like good docs on how to contribute after this is merged in.

@aaronc
Member Author

aaronc commented Nov 17, 2025

Thanks for that input @gjermundgaraba! So generally, for each metric type, it would be good to know what sort of data type we want. In otel, we can basically choose from:

  • traces - not pure metrics, but in grafana we can analyze them like metrics when they're stored in tempo and they provide us with start/end/duration timing info
  • pure metrics like in prometheus:
    • counters
    • gauges
    • histograms

In this PR, I've leaned towards trace/span instrumentation of the ABCI lifecycle because we get a detailed execution trace and we can also analyze all the timing and count info from that. But we can really do any types of metrics you specify.

To contribute, I'm suggesting that we follow the official otel go examples: https://opentelemetry.io/docs/languages/go/instrumentation/. Note, though, that we don't need to do any provider/exporter setup (this PR covers that); we just need to do the instrumentation. The roll dice example from the otel docs looks like a perfect minimal example:

package main

import (
	"fmt"
	"io"
	"log"
	"math/rand"
	"net/http"
	"strconv"

	"go.opentelemetry.io/contrib/bridges/otelslog"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

const name = "go.opentelemetry.io/otel/example/dice"

var (
	tracer = otel.Tracer(name)
	meter  = otel.Meter(name)
	logger = otelslog.NewLogger(name)
	rollCnt metric.Int64Counter
)

func init() {
	var err error
	rollCnt, err = meter.Int64Counter("dice.rolls",
		metric.WithDescription("The number of rolls by roll value"),
		metric.WithUnit("{roll}"))
	if err != nil {
		panic(err)
	}
}

func rolldice(w http.ResponseWriter, r *http.Request) {
	ctx, span := tracer.Start(r.Context(), "roll")
	defer span.End()

	roll := 1 + rand.Intn(6)

	var msg string
	if player := r.PathValue("player"); player != "" {
		msg = fmt.Sprintf("%s is rolling the dice", player)
	} else {
		msg = "Anonymous player is rolling the dice"
	}
	logger.InfoContext(ctx, msg, "result", roll)

	rollValueAttr := attribute.Int("roll.value", roll)
	span.SetAttributes(rollValueAttr)
	rollCnt.Add(ctx, 1, metric.WithAttributes(rollValueAttr))

	resp := strconv.Itoa(roll) + "\n"
	if _, err := io.WriteString(w, resp); err != nil {
		log.Printf("Write failed: %v\n", err)
	}
}

4 participants