OpenTelemetry in nais #612

Merged
merged 15 commits into from
Mar 4, 2024
Binary file removed docs/assets/envoy-tracing.png
Binary file not shown.
Binary file removed docs/assets/example-trace.png
Binary file not shown.
Binary file added docs/assets/grafana-tempo-logs.png
Binary file added docs/assets/grafana-tempo-query-builder.png
Binary file added docs/assets/grafana-tempo-trace-view.png
Binary file added docs/assets/grafana-tempo.png
Binary file removed docs/assets/kiali-400-sample.gif
Binary file not shown.
Binary file removed docs/assets/kiali-sample.gif
Binary file not shown.
Binary file removed docs/assets/logging_overview.png
Binary file not shown.
Binary file removed docs/assets/prometheus_alertmanager_overview.png
Binary file not shown.
Binary file removed docs/assets/trace-span-ids.png
Binary file not shown.
Binary file added docs/assets/tracing.png
44 changes: 28 additions & 16 deletions docs/explanation/observability/README.md
@@ -22,17 +22,27 @@ The tree pillars of observability are:
2. **Metrics** - Metrics are a numerical measurement of something in your application. They are useful for understanding the performance of your application and are generally more scalable than logs, both in terms of storage and querying, since they are structured data.
3. **Traces** - Traces are a record of the path a request takes through your application. They are useful for understanding how a request is processed in your application.

<center>

```mermaid
graph
A[Application] --> B((Logs))
A --> C((Metrics))
A --> D((Traces))
A[Application] --> B(Logs)
A --> C(Metrics)
A --> D(Traces)

click B "#logs"
click C "#metrics"
click D "#traces"
```

</center>

## Automatic observability

NAIS can instrument your application automatically. By enabling auto-instrumentation, you can collect traces, metrics and logs without writing any code, so it requires little to no effort on the part of the team developing the application.

[:bulb: Get started with auto-instrumentation](../../how-to-guides/observability/auto-instrumentation.md)

## Metrics

Metrics are a way to measure the state of your application. Metrics are usually numerical values that can be aggregated and visualized. Metrics are often used to create alerts and dashboards.
@@ -41,7 +51,7 @@ We use the [OpenMetrics][openmetrics] format for metrics. This is a text-based f

[openmetrics]: https://openmetrics.io/

[:octicons-arrow-right-24: Get started with metrics](./metrics.md)
[:bulb: Get started with metrics](./metrics.md)

### Prometheus

@@ -57,13 +67,13 @@ graph LR
Prometheus --GET /metrics--> Application
```

[:octicons-arrow-right-24: Access Prometheus here](./metrics.md#prometheus-environments)
[:simple-prometheus: Access Prometheus here](./metrics.md#prometheus-environments)

### Grafana

[Grafana][grafana] is a tool for visualizing metrics. It is used to create dashboards that can be used to monitor your application. Grafana is used by many open source projects and is the de facto standard for metrics in the cloud native world.

[:octicons-arrow-right-24: Access Grafana here][nais-grafana]
[:simple-grafana: Access Grafana here][nais-grafana]

[grafana]: https://grafana.com/
[nais-grafana]: <<tenant_url("grafana")>>
@@ -82,25 +92,24 @@ graph LR
Router --> C[Elastic / Kibana]
```

[:octicons-arrow-right-24: Configure your logs](./logging.md)
[:bulb: Configure your logs](./logging.md)

## Traces

With tracing, we can get application performance monitoring (APM). Tracing gives deep insight into the execution of your application. For instance, you can use tracing to see whether parallel functions actually run in parallel,
or how much time your application spends in a given function.

Traces from NAIS applications are collected using the [OpenTelemetry](https://opentelemetry.io/) standard. Performance metrics are stored and queried from the [Tempo](https://grafana.com/oss/tempo/) component.
Traces from NAIS applications can be collected using the [OpenTelemetry](https://opentelemetry.io/) standard. Performance metrics are stored and queried from the [Tempo](https://grafana.com/oss/tempo/) component.

Visualization of traces can be done in [Grafana](https://grafana.<<tenant()>>.cloud.nais.io),
using the `*-tempo` data sources (one for each environment).
Visualization of traces can be done in [Grafana](https://grafana.<<tenant()>>.cloud.nais.io), using the `*-tempo` data sources (one for each environment).

```mermaid
graph LR
Application --gRPC--> Tempo
Tempo --> Grafana
```

[:octicons-arrow-right-24: Read more about tracing](./tracing.md)
[:bulb: Read more about tracing](./tracing.md)

## Alerts

@@ -117,16 +126,19 @@ graph LR
Alertmanager --> Slack
```

[:octicons-arrow-right-24: Read more about alerts](./alerting.md)
[:bulb: Read more about alerts](./alerting.md)

## Learning more

Observability is a very broad topic and there is a lot more to learn. Here are some resources that you can use to learn more about observability:

- [:octicons-video-24: Monitoring, the Prometheus Way][youtube-prometheus]
- [:octicons-book-24: SRE Book - Monitoring distributed systems][sre-book-monitoring]
- [:octicons-book-24: SRE Workbook - Monitoring][sre-workbook-monitoring]
- [:octicons-book-24: SRE Workbook - Alerting][sre-workbook-alerting]
[:octicons-video-24: Monitoring, the Prometheus Way][youtube-prometheus]

[:octicons-book-24: SRE Book - Monitoring distributed systems][sre-book-monitoring]

[:octicons-book-24: SRE Workbook - Monitoring][sre-workbook-monitoring]

[:octicons-book-24: SRE Workbook - Alerting][sre-workbook-alerting]

[sre-book-monitoring]: https://sre.google/sre-book/monitoring-distributed-systems/
[sre-workbook-monitoring]: https://sre.google/workbook/monitoring/
4 changes: 2 additions & 2 deletions docs/explanation/observability/frontend.md
@@ -211,8 +211,8 @@ Instrumenting mounts and unmounts can be quite data intensive, take due care.

Navigate your web browser to the new Grafana at <https://grafana.<<tenant()>>.cloud.nais.io>.

Traces are available from the `dev-gcp-tempo` and `prod-gcp-tempo` data sources, whereas
logs and metrics are available from the `dev-gcp-loki` and `prod-gcp-loki` data sources.
Traces are available from the data sources ending with `-tempo`, whereas
logs and metrics are available from data sources ending with `-loki`.

Use the "Explore" tab under either the Loki or Tempo tab and run queries.

89 changes: 76 additions & 13 deletions docs/explanation/observability/tracing.md
@@ -1,25 +1,88 @@
---
description: >-
Application Performance Monitoring or tracing using Grafana Tempo on NAIS.
tags: [explanation]
tags: [explanation, tracing]
---

# Tracing
# Distributed Tracing

[Traces](https://en.wikipedia.org/wiki/Observability_(software)#Distributed_traces) are a record of the path a request takes through your application. They
are useful for understanding how a request is processed in your application.
Tracing is a way to track a request as it passes through the various services needed to handle it. This is especially useful in a microservices architecture, where a single user action often results in a series of calls to different services.

NAIS does not collect trace data automatically. If you want tracing integration,
you must first instrument your application to collect traces, and then configure
the tracing library to send it to the correct place.
Tracing allows developers to understand the entire journey of a request, making it easier to identify bottlenecks, latency issues, or failures that can impact user experience.

Traces from NAIS applications are collected using the [OpenTelemetry](https://opentelemetry.io/) standard.
Performance metrics are stored and queried from the [Tempo](https://grafana.com/oss/tempo/) component.
## How tracing works

## Visualizing application performance
When a request is made to your application, a Trace is started. The Trace serves as a container for all the work done for that request.

Visualization of traces can be done in [the new Grafana installation](https://grafana.<<tenant()>>.cloud.nais.io).
![Tracing](../../assets/tracing.png)

You can use the **Explore** feature of Grafana with the _prod-gcp-tempo_ and _dev-gcp-tempo_ data sources.
<small>Trace visualization by Logshero licensed under Apache License 2.0</small>

There are no ready-made dashboards at this point, but feel free to make one yourself and contribute to this page.
The work done by individual services (or components of a single service) is captured in Spans. A span represents a single unit of work in a trace, like a SQL query or a call to an external service.

Spans can be nested and form a trace tree. The Trace is the root of the tree, and each Span is a node that represents a specific operation in your application. The tree of spans captures the causal relationships between the operations in your application (i.e., which operations caused others to occur).

Each Span carries a Context that includes metadata about the trace (like a unique trace identifier and span identifier) and any other data you choose to include. This context is propagated across process boundaries, allowing all the work that's part of a single trace to be linked together, even if it spans multiple services.

By analyzing the data captured in traces and spans, you can gain a deep understanding of how requests flow through your system, where time is being spent, and where problems might be occurring. This can be invaluable for debugging, performance optimization, and understanding the overall health of your system.
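
The relationship between a trace, its spans, and the propagated identifiers can be sketched in a few lines of Python. This is an illustrative model only: the names `Span`, `start_span` and `render` are hypothetical and are not the OpenTelemetry SDK, which has its own span types. The ID lengths (16-byte trace ID, 8-byte span ID) are assumed to follow the W3C Trace Context format.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work. Illustrative only, not the OpenTelemetry SDK type."""
    name: str
    trace_id: str                 # shared by every span in the same trace
    span_id: str                  # unique per span
    parent_id: str | None = None  # None marks the root span
    children: list[Span] = field(default_factory=list)


def start_span(name: str, parent: Span | None = None) -> Span:
    """Create a span, inheriting the trace ID from the parent if there is one."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    span = Span(name, trace_id, secrets.token_hex(8), parent_id)
    if parent:
        parent.children.append(span)
    return span


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the span tree as indented lines, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# A hypothetical request that touches a database and another service.
root = start_span("GET /orders")
db = start_span("SELECT orders", parent=root)
http = start_span("call payment-service", parent=root)
auth = start_span("verify token", parent=http)

print("\n".join(render(root)))
```

Every span shares the root's `trace_id`, while `parent_id` records which operation caused which. Those two fields are what a tracing backend needs to reassemble the tree, even when the spans are reported by different services.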

## OpenTelemetry

OpenTelemetry, a project under the Cloud Native Computing Foundation (CNCF), has become the standard for tracing and application telemetry due to its unified APIs for tracing and metrics, which simplify instrumentation and data collection from applications.

It supports a wide range of programming languages, including Java, JavaScript, Python, Go, and more, allowing for consistent tooling across different parts of a tech stack.

OpenTelemetry also provides automatic instrumentation for popular frameworks and libraries, enabling the collection of traces and metrics without the need for modifying application code.

It's vendor-neutral, allowing telemetry data export to any backend, providing the flexibility to switch between different analysis tools as needs change. Backed by leading companies in the cloud and software industry, and with a vibrant community, OpenTelemetry ensures project longevity and continuous improvement.

[:octicons-link-external-24: Learn more about OpenTelemetry][open-telemetry]

## Tracing in NAIS

NAIS does not collect application trace data automatically, but it provides the infrastructure to do so using OpenTelemetry, Grafana Tempo for storage and querying, and easy-to-use configuration options.

### The easy way: Auto-instrumentation

The preferred way to get started with tracing is to enable auto-instrumentation for your application. This will automatically collect traces and send them to the correct place using the OpenTelemetry Agent.

This is the easiest way to get started with tracing, as it requires little to no effort on the part of the team developing the application and provides instrumentation for popular libraries, frameworks and external services such as PostgreSQL, Redis, Kafka and HTTP clients.

[:bulb: Get started with auto-instrumentation](../../how-to-guides/observability/auto-instrumentation.md)

### The hard way: Manual instrumentation

If you want more control over how your application is instrumented, you can manually instrument your application using the OpenTelemetry SDK for your programming language.

To get the correct configuration, you can still use the auto-instrumentation configuration, but set the `runtime` to `sdk`. This will only set up the OpenTelemetry configuration, without injecting the OpenTelemetry Agent.

[:bulb: Get started with manual instrumentation](../../how-to-guides/observability/auto-instrumentation.md#enable-auto-instrumentation-for-other-applications)

### OpenTelemetry SDKs

OpenTelemetry provides SDKs for a wide range of programming languages:

* [:fontawesome-brands-java: OpenTelemetry Java][otel-java]
* [:fontawesome-brands-js: OpenTelemetry JavaScript][otel-node]
* [:fontawesome-brands-python: OpenTelemetry Python][otel-python]
* [:fontawesome-brands-golang: OpenTelemetry Go][otel-go]

## Visualizing traces in Grafana Tempo

Visualizing and querying traces is done in Grafana using Grafana Tempo. Tempo is an open-source, easy-to-use, high-scale, and cost-effective distributed tracing backend that stores and queries traces.

The easiest way to get started with Tempo is to use the [Explore view in Grafana][grafana-explore], which provides a user-friendly interface for querying and visualizing traces.

[:octicons-link-external-24: Open Grafana Explore][grafana-explore]

[:bulb: Get started with Grafana Tempo](../../how-to-guides/observability/tracing/tempo.md)

![Grafana Tempo](../../assets/grafana-tempo.png)

[open-telemetry]: https://opentelemetry.io/
[otel-java]: https://opentelemetry.io/docs/languages/java/
[otel-node]: https://opentelemetry.io/docs/languages/js/
[otel-python]: https://opentelemetry.io/docs/languages/python/
[otel-go]: https://opentelemetry.io/docs/languages/go/
[grafana]: <<tenant_url("grafana")>>
[grafana-explore]: <<tenant_url("grafana", "explore")>>
67 changes: 67 additions & 0 deletions docs/how-to-guides/observability/auto-instrumentation.md
@@ -0,0 +1,67 @@
---
description: Get started with auto-instrumentation for your applications with OpenTelemetry data for Tracing, Metrics and Logs using the OpenTelemetry Agent.
tags: [guide, tracing]
---
# Get started with auto-instrumentation

This guide explains how to get started with auto-instrumenting your applications with OpenTelemetry data for [Tracing](../../explanation/observability/tracing.md), [Metrics](../../explanation/observability/metrics.md) and [Logs](../../explanation/observability/logging.md) using the OpenTelemetry Agent.

The main benefit of auto-instrumentation is that it requires little to no effort on the part of the team developing the application while providing insight into popular libraries, frameworks and external services such as PostgreSQL, Redis, Kafka and HTTP clients.

Auto-instrumentation is the preferred way to get started with tracing in NAIS, and can also be used for metrics and logs collection. This type of instrumentation is available for Java, Node.js and Python applications, but can also be used for other runtimes in `sdk` mode, where it will only set up the OpenTelemetry configuration.

!!! info

    :new: Auto-instrumentation is a new feature and is only available for nais applications running in GCP.

## Enable auto-instrumentation for Java/Kotlin applications

```yaml
...
spec:
  observability:
    autoInstrumentation:
      enabled: true
      runtime: java
```

## Enable auto-instrumentation for Node.js applications

```yaml
...
spec:
  observability:
    autoInstrumentation:
      enabled: true
      runtime: node
```

## Enable auto-instrumentation for Python applications

```yaml
...
spec:
  observability:
    autoInstrumentation:
      enabled: true
      runtime: python
```

## Enable auto-instrumentation for other applications

If your application runtime is not one of the supported runtimes, or you want to instrument your application yourself, you can still benefit from the auto-instrumentation configuration.

This will only set up the OpenTelemetry configuration for the application, but it will not inject the OpenTelemetry Agent into the application.

```yaml
...
spec:
  observability:
    autoInstrumentation:
      enabled: true
      runtime: sdk
```
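
In `sdk` mode the application has to pick up the injected configuration itself. The sketch below is a start-up sanity check, assuming the platform exposes the configuration through the standard OpenTelemetry environment variables `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT`. Which variables are actually set is an assumption here, so check the configuration reference. The function name `read_otel_env` and the example values are hypothetical.

```python
import os

# Standard OpenTelemetry variable names. Whether the platform sets exactly
# these is an assumption; consult the configuration reference.
REQUIRED = ("OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT")


def read_otel_env(env):
    """Return the required OpenTelemetry settings, or fail fast if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing OpenTelemetry settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


if __name__ == "__main__":
    try:
        print(read_otel_env(os.environ))
    except RuntimeError as err:
        print(err)
```

OpenTelemetry SDKs generally read these variables on their own, so a check like this is only useful for failing early with a clear message when the configuration has not been injected.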

## Resources

[:bulb: OpenTelemetry Auto-Instrumentation Configuration Reference](../../reference/observability/auto-config.md)
18 changes: 18 additions & 0 deletions docs/how-to-guides/observability/tracing/context-propagation.md
@@ -0,0 +1,18 @@
---
description: Learn how to propagate trace context across process boundaries in a few common scenarios.
tags: [guide, tracing]
---
# Trace context propagation

Each Span carries a Context that includes metadata about the trace (like a unique trace identifier and span identifier) and any other data you choose to include. This context is propagated across process boundaries, allowing all the work that's part of a single trace to be linked together, even if it spans multiple services.

This guide explains how to propagate trace context across process boundaries in a few common scenarios. If you are using [auto-instrumentation](../auto-instrumentation.md), trace context propagation is already handled for you.

[:octicons-link-external-24: OpenTelemetry Context Propagation](https://opentelemetry.io/docs/concepts/context-propagation/)

## Propagate trace context in HTTP requests

When a service makes an HTTP request to another service, it should include the trace context in the request headers. The receiving service can then use this context to create a new Span that's part of the same trace. By default, OpenTelemetry propagates trace context in HTTP requests using the [W3C Trace Context](https://www.w3.org/TR/trace-context/) standard, which defines the `traceparent` and `tracestate` headers.

* [OpenTelemetry Setup in Spring Boot Application](https://opentelemetry.io/docs/languages/java/automatic/spring-boot)
* [OpenTelemetry Setup in Ktor Application](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/ktor/ktor-2.0/library)
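
To make the header format concrete, here is a minimal sketch of writing and reading the W3C `traceparent` header by hand. The helper names `inject` and `extract` are hypothetical, and the sketch only handles version `00` of the format and ignores `tracestate`. In a real application the OpenTelemetry propagators do this for you.

```python
import re
import secrets

# Version "00": 32 hex chars of trace ID, 16 hex chars of parent span ID,
# and 2 hex chars of trace flags, separated by dashes.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers, trace_id, span_id, sampled=True):
    """Add a traceparent header describing the calling span to outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Return (trace_id, parent_span_id, sampled), or None if missing or invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Calling service: attach the current trace and span IDs to the request.
outgoing = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

# Receiving service: continue the same trace with a fresh span ID.
trace_id, parent_id, sampled = extract(outgoing)
child_span_id = secrets.token_hex(8)
print(outgoing["traceparent"], "->", trace_id, child_span_id)
```

The receiving side keeps the trace ID, records the incoming span ID as its parent, and generates a new span ID for its own work. HTTP header names are case-insensitive, so a real implementation would look the header up accordingly; the plain dictionary here is a simplification.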
83 changes: 83 additions & 0 deletions docs/how-to-guides/observability/tracing/correlate-traces-logs.md
@@ -0,0 +1,83 @@
---
description: Learn how to correlate traces with logs in Grafana Tempo.
tags: [guide, tracing]
---
# Correlate traces and logs

This guide will explain how to correlate traces with logs in Grafana Tempo.

## Step 1: Configure Tracing

First you need to configure OpenTelemetry tracing in your application. The easiest way to get started with tracing is to enable auto-instrumentation for your application. This will automatically collect traces and send them to the correct place using the OpenTelemetry Agent.

[:bulb: Get started with auto-instrumentation](../auto-instrumentation.md)

## Step 2: Configure Logging

If you are using auto-instrumentation for logs they are automatically correlated with traces. If you are not using auto-instrumentation for logs, you need to configure your log output to include trace information.


=== "log4j"

    Add the [opentelemetry-javaagent-log4j-context-data-2.17](https://mvnrepository.com/artifact/io.opentelemetry.javaagent.instrumentation/opentelemetry-javaagent-log4j-context-data-2.17) package to your `pom.xml` or `build.gradle` to include trace information in your logs:

    ```
    io.opentelemetry.instrumentation:opentelemetry-log4j-context-data-2.17-autoconfigure:2.1.0-alpha
    ```

    Add the following pattern to your log4j configuration to include trace information in your logs:

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration status="WARN">
      <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
          <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} traceId: %X{trace_id} spanId: %X{span_id} - %msg%n" />
        </Console>
      </Appenders>
      <Loggers>
        <Root level="All" >
          <AppenderRef ref="Console"/>
        </Root>
      </Loggers>
    </Configuration>
    ```

=== "logback"

    Add the [opentelemetry-logback-mdc-1.0](https://mvnrepository.com/artifact/io.opentelemetry.instrumentation/opentelemetry-logback-mdc-1.0) package to your `pom.xml` or `build.gradle` to include trace information in your logs:

    ```
    io.opentelemetry.instrumentation:opentelemetry-logback-mdc-1.0:2.1.0-alpha
    ```

    Add the following pattern to your logback configuration to include trace information in your logs:

    ```xml
    <?xml version="1.0" encoding="UTF-8" ?>
    <configuration>
      <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
          <pattern><![CDATA[%date{HH:mm:ss.SSS} [%thread] %-5level %logger{15}#%line %X{req.requestURI} traceId: %X{trace_id} spanId: %X{span_id} %msg\n]]></pattern>
        </encoder>
      </appender>

      <!-- Wrap the console appender so trace_id and span_id are added to the MDC -->
      <appender name="OTEL" class="io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender">
        <appender-ref ref="STDOUT" />
      </appender>

      <!-- Log through the wrapping OTEL appender, not STDOUT directly -->
      <root>
        <level value="DEBUG" />
        <appender-ref ref="OTEL" />
      </root>

    </configuration>
    ```
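
The configurations above are for JVM logging frameworks, but the underlying idea is the same in any runtime: put the trace and span identifiers on every log line. As a rough sketch, assuming a Python application that does not use auto-instrumentation for logs, the standard `logging` module can do this with a filter. `TraceContextFilter` and `current_context` are hypothetical names, and the hard-coded IDs stand in for whatever the tracing library reports as the active span.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to each record so the formatter can print them."""

    def __init__(self, get_context):
        super().__init__()
        self._get_context = get_context  # callable returning (trace_id, span_id)

    def filter(self, record):
        record.trace_id, record.span_id = self._get_context()
        return True  # never drop the record, only enrich it


def current_context():
    # Hypothetical stand-in: a real application would ask its tracing
    # library for the IDs of the currently active span.
    return "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"


stream = io.StringIO()  # captured in memory here; normally stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s traceId: %(trace_id)s spanId: %(span_id)s - %(message)s"))
handler.addFilter(TraceContextFilter(current_context))

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output to this handler only

logger.info("order created")
print(stream.getvalue(), end="")
```

Attaching the filter to the handler rather than the logger means records that reach the handler from child loggers are enriched as well.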

## Step 3: Profit

Now that you have tracing and logging set up, you can use Grafana Tempo to correlate traces and logs. When you view a trace in Grafana Tempo, you can see the logs that are associated with that trace. This makes it easy to understand what happened in your application and troubleshoot issues.

![Correlate traces and logs](../../../assets/grafana-tempo-logs.png)

[:arrow_backward: Back to the list of guides](../index.md)