nais · Starefossen · Mar 4, 2024 · Feb 22, 2024 · Mar 2, 2024 · Mar 2, 2024
diff --git a/docs/assets/envoy-tracing.png b/docs/assets/envoy-tracing.png
diff --git a/docs/assets/example-trace.png b/docs/assets/example-trace.png
diff --git a/docs/assets/grafana-tempo-logs.png b/docs/assets/grafana-tempo-logs.png
diff --git a/docs/assets/grafana-tempo-query-builder.png b/docs/assets/grafana-tempo-query-builder.png
diff --git a/docs/assets/grafana-tempo-trace-view.png b/docs/assets/grafana-tempo-trace-view.png
diff --git a/docs/assets/grafana-tempo.png b/docs/assets/grafana-tempo.png
diff --git a/docs/assets/kiali-400-sample.gif b/docs/assets/kiali-400-sample.gif
diff --git a/docs/assets/kiali-sample.gif b/docs/assets/kiali-sample.gif
diff --git a/docs/assets/logging_overview.png b/docs/assets/logging_overview.png
diff --git a/docs/assets/prometheus_alertmanager_overview.png b/docs/assets/prometheus_alertmanager_overview.png
diff --git a/docs/assets/trace-span-ids.png b/docs/assets/trace-span-ids.png
diff --git a/docs/assets/tracing.png b/docs/assets/tracing.png
diff --git a/docs/explanation/observability/README.md b/docs/explanation/observability/README.md
@@ -22,17 +22,27 @@ The tree pillars of observability are:
 2. **Metrics** - Metrics are a numerical measurement of something in your application. They are useful for understanding the performance of your application and is generally more scalable than logs both in terms of storage and querying since they are structured data.
 3. **Traces** - Traces are a record of the path a request takes through your application. They are useful for understanding how a request is processed in your application.
 
+<center>
+
 ```mermaid
 graph
-  A[Application] --> B((Logs))
-  A --> C((Metrics))
-  A --> D((Traces))
+  A[Application] --> B(Logs)
+  A --> C(Metrics)
+  A --> D(Traces)
 
   click B "#logs"
   click C "#metrics"
   click D "#traces"
 ```
 
+</center>
+
+## Automatic observability
+
+NAIS provides a new way to get started with observability: auto-instrumentation. Enabling it lets you collect telemetry without writing any code, so it requires little to no effort on the part of the team developing the application.
+
+[:bulb: Get started with auto-instrumentation](../../how-to-guides/observability/auto-instrumentation.md)
+
 ## Metrics
 
 Metrics are a way to measure the state of your application. Metrics are usually numerical values that can be aggregated and visualized. Metrics are often used to create alerts and dashboards.
@@ -41,7 +51,7 @@ We use the [OpenMetrics][openmetrics] format for metrics. This is a text-based f
 
 [openmetrics]: https://openmetrics.io/
 
-[:octicons-arrow-right-24: Get started with metrics](./metrics.md)
+[:bulb: Get started with metrics](./metrics.md)
 
 ### Prometheus
 
@@ -57,13 +67,13 @@ graph LR
   Prometheus --GET /metrics--> Application
 ```
 
-[:octicons-arrow-right-24: Access Prometheus here](./metrics.md#prometheus-environments)
+[:simple-prometheus: Access Prometheus here](./metrics.md#prometheus-environments)
 
 ### Grafana
 
 [Grafana][grafana] is a tool for visualizing metrics. It is used to create dashboards that can be used to monitor your application. Grafana is used by many open source projects and is the de facto standard for metrics in the cloud native world.
 
-[:octicons-arrow-right-24: Access Grafana here][nais-grafana]
+[:simple-grafana: Access Grafana here][nais-grafana]
 
 [grafana]: https://grafana.com/
 [nais-grafana]: <<tenant_url("grafana")>>
@@ -82,25 +92,24 @@ graph LR
   Router --> C[Elastic / Kibana]
 ```
 
-[:octicons-arrow-right-24: Configure your logs](./logging.md)
+[:bulb: Configure your logs](./logging.md)
 
 ## Traces
 
 With tracing, we can get application performance monitoring (APM). Tracing gives deep insight into the execution of your application. For instance, you can use tracing to see if parallel function are actually run in parallel,
 or what amount of time your application spends in a given function.
 
-Traces from NAIS applications are collected using the [OpenTelemetry](https://opentelemetry.io/) standard.  Performance metrics are stored and queried from the [Tempo](https://grafana.com/oss/tempo/) component.
+Traces from NAIS applications can be collected using the [OpenTelemetry](https://opentelemetry.io/) standard. Trace data is stored in and queried from the [Tempo](https://grafana.com/oss/tempo/) component.
 
-Visualization of traces can be done in [Grafana](https://grafana.<<tenant()>>.cloud.nais.io),
-using the `*-tempo` data sources (one for each environment).
+Visualization of traces can be done in [Grafana](https://grafana.<<tenant()>>.cloud.nais.io), using the `*-tempo` data sources (one for each environment).
 
 ```mermaid
 graph LR
   Application --gRPC--> Tempo
   Tempo --> Grafana
 ```
 
-[:octicons-arrow-right-24: Read more about tracing](./tracing.md)
+[:bulb: Read more about tracing](./tracing.md)
 
 ## Alerts
 
@@ -117,16 +126,19 @@ graph LR
   Alertmanager --> Slack
 ```
 
-[:octicons-arrow-right-24: Read more about alerts](./alerting.md)
+[:bulb: Read more about alerts](./alerting.md)
 
 ## Learning more
 
 Observability is a very broad topic and there is a lot more to learn. Here are some resources that you can use to learn more about observability:
 
-- [:octicons-video-24: Monitoring, the Prometheus Way][youtube-prometheus]
-- [:octicons-book-24: SRE Book - Monitoring distributed systems][sre-book-monitoring]
-- [:octicons-book-24: SRE Workbook - Monitoring][sre-workbook-monitoring]
-- [:octicons-book-24: SRE Workbook - Alerting][sre-workbook-alerting]
+[:octicons-video-24: Monitoring, the Prometheus Way][youtube-prometheus]
+
+[:octicons-book-24: SRE Book - Monitoring distributed systems][sre-book-monitoring]
+
+[:octicons-book-24: SRE Workbook - Monitoring][sre-workbook-monitoring]
+
+[:octicons-book-24: SRE Workbook - Alerting][sre-workbook-alerting]
 
 [sre-book-monitoring]: https://sre.google/sre-book/monitoring-distributed-systems/
 [sre-workbook-monitoring]: https://sre.google/workbook/monitoring/

diff --git a/docs/explanation/observability/frontend.md b/docs/explanation/observability/frontend.md
@@ -211,8 +211,8 @@ Instrumenting mounts and unmounts can be quite data intensive, take due care.
 
 Navigate your web browser to the new Grafana at <https://grafana.<<tenant()>>.cloud.nais.io>.
 
-Traces are available from the `dev-gcp-tempo` and `prod-gcp-tempo` data sources, whereas
-logs and metrics are available from the `dev-gcp-loki` and `prod-gcp-loki` data sources.
+Traces are available from the data sources ending with `-tempo`, whereas
+logs and metrics are available from the data sources ending with `-loki`.
 
 Use the "Explore" tab under either the Loki or Tempo tab and run queries.
 

diff --git a/docs/explanation/observability/tracing.md b/docs/explanation/observability/tracing.md
@@ -1,25 +1,88 @@
 ---
 description: >-
   Application Performance Monitoring or tracing using Grafana Tempo on NAIS.
-tags: [explanation]
+tags: [explanation, tracing]
 ---
 
-# Tracing
+# Distributed Tracing
 
-[Traces](https://en.wikipedia.org/wiki/Observability_(software)#Distributed_traces) are a record of the path a request takes through your application. They
-are useful for understanding how a request is processed in your application.
+Tracing is a way to track a request as it passes through the various services needed to handle it. This is especially useful in a microservices architecture, where a single user action often results in a series of calls to different services.
 
-NAIS does not collect trace data automatically. If you want tracing integration,
-you must first instrument your application to collect traces, and then configure
-the tracing library to send it to the correct place.
+Tracing allows developers to understand the entire journey of a request, making it easier to identify bottlenecks, latency issues, or failures that can impact user experience.
 
-Traces from NAIS applications are collected using the [OpenTelemetry](https://opentelemetry.io/) standard.
-Performance metrics are stored and queried from the [Tempo](https://grafana.com/oss/tempo/) component.
+## How tracing works
 
-## Visualizing application performance
+When a request is made to your application, a Trace is started. The Trace serves as a container for all the work done for that request.
 
-Visualization of traces can be done in [the new Grafana installation](https://grafana.<<tenant()>>.cloud.nais.io).
+![Tracing](../../assets/tracing.png)
 
-You can use the **Explore** feature of Grafana with the _prod-gcp-tempo_ and _dev-gcp-tempo_ data sources.
+<small>Trace visualization by Logshero licensed under Apache License 2.0</small>
 
-There are no ready-made dashboards at this point, but feel free to make one yourself and contribute to this page.
+The work done by individual services (or components of a single service) is captured in Spans. A span represents a single unit of work in a trace, like a SQL query or a call to an external service.
+
+Spans can be nested and form a trace tree. The Trace is the root of the tree, and each Span is a node that represents a specific operation in your application. The tree of spans captures the causal relationships between the operations in your application (i.e., which operations caused others to occur).
+
+Each Span carries a Context that includes metadata about the trace (like a unique trace identifier and span identifier) and any other data you choose to include. This context is propagated across process boundaries, allowing all the work that's part of a single trace to be linked together, even if it spans multiple services.
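+
+To make the relationship between a Trace, its Spans, and their Context more concrete, here is a rough sketch of a trace with three spans. The YAML layout, span names, durations, and IDs are hypothetical and chosen for readability; this is not the format OpenTelemetry uses to export data.
+
+```yaml
+# Illustrative only: field names and values are assumptions, not an OpenTelemetry export format.
+trace_id: 0af7651916cd43dd8448eb211c80319c   # shared by every span in the trace
+spans:
+  - name: GET /orders                        # root span, started when the request arrives
+    span_id: b7ad6b7169203331
+    parent_span_id: null                     # no parent, so this is the root of the tree
+    duration_ms: 120
+  - name: SELECT orders                      # a SQL query made while handling the request
+    span_id: 00f067aa0ba902b7
+    parent_span_id: b7ad6b7169203331         # child of the root span
+    duration_ms: 35
+  - name: POST /payments                     # a call to an external service
+    span_id: 53995c3f42cd8ad8
+    parent_span_id: b7ad6b7169203331         # also a child of the root span
+    duration_ms: 60
+```
+
+The `parent_span_id` links are what turn a flat list of spans into a tree.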
+
+By analyzing the data captured in traces and spans, you can gain a deep understanding of how requests flow through your system, where time is being spent, and where problems might be occurring. This can be invaluable for debugging, performance optimization, and understanding the overall health of your system.
+
+## OpenTelemetry
+
+OpenTelemetry, a project under the Cloud Native Computing Foundation (CNCF), has become the standard for tracing and application telemetry due to its unified APIs for tracing and metrics, which simplify instrumentation and data collection from applications.
+
+It supports a wide range of programming languages, including Java, JavaScript, Python, Go, and more, allowing for consistent tooling across different parts of a tech stack.
+
+OpenTelemetry also provides automatic instrumentation for popular frameworks and libraries, enabling the collection of traces and metrics without the need for modifying application code.
+
+It's vendor-neutral, allowing telemetry data export to any backend, providing the flexibility to switch between different analysis tools as needs change. Backed by leading companies in the cloud and software industry, and with a vibrant community, OpenTelemetry ensures project longevity and continuous improvement.
+
+[:octicons-link-external-24: Learn more about OpenTelemetry][open-telemetry]
+
+## Tracing in NAIS
+
+NAIS does not collect application trace data automatically, but it provides the infrastructure to do so using OpenTelemetry, Grafana Tempo for storage and querying, and easy-to-use configuration options.
+
+### The easy way: Auto-instrumentation
+
+The preferred way to get started with tracing is to enable auto-instrumentation for your application. This will automatically collect traces and send them to the correct place using the OpenTelemetry Agent.
+
+This is the easiest way to get started with tracing, as it requires little to no effort on the part of the team developing the application and provides instrumentation for popular libraries, frameworks and external services such as PostgreSQL, Redis, Kafka and HTTP clients.
+
+[:bulb: Get started with auto-instrumentation](../../how-to-guides/observability/auto-instrumentation.md)
+
+### The hard way: Manual instrumentation
+
+If you want more control over how your application is instrumented, you can manually instrument your application using the OpenTelemetry SDK for your programming language.
+
+To get the correct configuration, you can still use the auto-instrumentation configuration, but set the `runtime` to `sdk`. This only sets up the OpenTelemetry configuration, without injecting the OpenTelemetry Agent.
+
+[:bulb: Get started with manual instrumentation](../../how-to-guides/observability/auto-instrumentation.md#enable-auto-instrumentation-for-other-applications)
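+
+For reference, a minimal sketch of the `sdk` setting in the application manifest, using the same fields as the auto-instrumentation guide (the rest of the manifest is elided):
+
+```yaml
+...
+spec:
+  observability:
+    autoInstrumentation:
+      enabled: true
+      runtime: sdk   # only sets up the OpenTelemetry configuration; no agent is injected
+```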
+
+### OpenTelemetry SDKs
+
+OpenTelemetry provides SDKs for a wide range of programming languages:
+
+* [:fontawesome-brands-java: OpenTelemetry Java][otel-java]
+* [:fontawesome-brands-js: OpenTelemetry JavaScript][otel-node]
+* [:fontawesome-brands-python: OpenTelemetry Python][otel-python]
+* [:fontawesome-brands-golang: OpenTelemetry Go][otel-go]
+
+## Visualizing traces in Grafana Tempo
+
+Visualizing and querying traces is done in Grafana using Grafana Tempo. Tempo is an open-source, easy-to-use, high-scale, and cost-effective distributed tracing backend that stores and queries traces.
+
+The easiest way to get started with Tempo is to use the [Explore view in Grafana][grafana-explore], which provides a user-friendly interface for querying and visualizing traces.
+
+[:octicons-link-external-24: Open Grafana Explore][grafana-explore]
+
+[:bulb: Get started with Grafana Tempo](../../how-to-guides/observability/tracing/tempo.md)
+
+![Grafana Tempo](../../assets/grafana-tempo.png)
+
+[open-telemetry]: https://opentelemetry.io/
+[otel-java]: https://opentelemetry.io/docs/languages/java/
+[otel-node]: https://opentelemetry.io/docs/languages/js/
+[otel-python]: https://opentelemetry.io/docs/languages/python/
+[otel-go]: https://opentelemetry.io/docs/languages/go/
+[grafana]: <<tenant_url("grafana")>>
+[grafana-explore]: <<tenant_url("grafana", "explore")>>
diff --git a/docs/how-to-guides/observability/auto-instrumentation.md b/docs/how-to-guides/observability/auto-instrumentation.md
@@ -0,0 +1,67 @@
+---
+description: Get started with auto-instrumentation for your applications with OpenTelemetry data for Tracing, Metrics and Logs using the OpenTelemetry Agent.
+tags: [guide, tracing]
+---
+# Get started with auto-instrumentation
+
+This guide explains how to auto-instrument your applications to produce OpenTelemetry data for [Tracing](../../explanation/observability/tracing.md), [Metrics](../../explanation/observability/metrics.md) and [Logs](../../explanation/observability/logging.md) using the OpenTelemetry Agent.
+
+The main benefit of auto-instrumentation is that it requires little to no effort on the part of the team developing the application while providing insight into popular libraries, frameworks and external services such as PostgreSQL, Redis, Kafka and HTTP clients.
+
+Auto-instrumentation is the preferred way to get started with tracing in NAIS, and can also be used for metrics and logs collection. This type of instrumentation is available for Java, Node.js and Python applications, but can also be used for other runtimes in `sdk` mode, where it will only set up the OpenTelemetry configuration.
+
+!!! info
+
+    :new: Auto-instrumentation is a new feature and is only available for NAIS applications running in GCP.
+
+## Enable auto-instrumentation for Java/Kotlin applications
+
+```yaml
+...
+spec:
+  observability:
+    autoInstrumentation:
+      enabled: true
+      runtime: java
+```
+
+## Enable auto-instrumentation for Node.js applications
+
+```yaml
+...
+spec:
+  observability:
+    autoInstrumentation:
+      enabled: true
+      runtime: node
+```
+
+## Enable auto-instrumentation for Python applications
+
+```yaml
+...
+spec:
+  observability:
+    autoInstrumentation:
+      enabled: true
+      runtime: python
+```
+
+## Enable auto-instrumentation for other applications
+
+If your application runtime is not one of the supported runtimes, or you want to instrument your application yourself, you can still benefit from the auto-instrumentation configuration.
+
+Setting `runtime` to `sdk` only sets up the OpenTelemetry configuration for the application; it does not inject the OpenTelemetry Agent into the application.
+
+```yaml
+...
+spec:
+  observability:
+    autoInstrumentation:
+      enabled: true
+      runtime: sdk
+```
+
+## Resources
+
+[:bulb: OpenTelemetry Auto-Instrumentation Configuration Reference](../../reference/observability/auto-config.md)
diff --git a/docs/how-to-guides/observability/tracing/context-propagation.md b/docs/how-to-guides/observability/tracing/context-propagation.md
@@ -0,0 +1,18 @@
+---
+description: Learn how to propagate trace context across process boundaries in a few common scenarios.
+tags: [guide, tracing]
+---
+# Trace context propagation
+
+Each Span carries a Context that includes metadata about the trace (like a unique trace identifier and span identifier) and any other data you choose to include. This context is propagated across process boundaries, allowing all the work that's part of a single trace to be linked together, even if it spans multiple services.
+
+This guide explains how to propagate trace context across process boundaries in a few common scenarios. If you are using [auto-instrumentation](../auto-instrumentation.md), trace context propagation is already handled for you.
+
+[:octicons-link-external-24: OpenTelemetry Context Propagation](https://opentelemetry.io/docs/concepts/context-propagation/)
+
+## Propagate trace context in HTTP requests
+
+When a service makes an HTTP request to another service, it should include the trace context in the request headers. The receiving service can then use this context to create a new Span that's part of the same trace. By default, OpenTelemetry propagates trace context in HTTP requests using the [W3C Trace Context](https://www.w3.org/TR/trace-context/) standard.
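+
+As a sketch of what this looks like on the wire, a request carrying W3C Trace Context includes a `traceparent` header with four dash-separated fields: version, trace ID, parent span ID, and trace flags. The IDs below are example values, and the header is shown as a key/value pair rather than a full HTTP request.
+
+```yaml
+# traceparent: <version>-<trace-id>-<parent-id>-<trace-flags>
+#   version      2 hex characters, currently 00
+#   trace-id     32 hex characters, identical for every span in the trace
+#   parent-id    16 hex characters, the span ID of the caller
+#   trace-flags  2 hex characters, 01 means the trace is sampled
+traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
+```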
+
+* [OpenTelemetry Setup in Spring Boot Application](https://opentelemetry.io/docs/languages/java/automatic/spring-boot)
+* [OpenTelemetry Setup in Ktor Application](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/ktor/ktor-2.0/library)
diff --git a/docs/how-to-guides/observability/tracing/correlate-traces-logs.md b/docs/how-to-guides/observability/tracing/correlate-traces-logs.md
@@ -0,0 +1,83 @@
+---
+description: Learn how to correlate traces with logs in Grafana Tempo.
+tags: [guide, tracing]
+---
+# Correlate traces and logs
+
+This guide will explain how to correlate traces with logs in Grafana Tempo.
+
+## Step 1: Configure Tracing
+
+First you need to configure OpenTelemetry tracing in your application. The easiest way to get started with tracing is to enable auto-instrumentation for your application. This will automatically collect traces and send them to the correct place using the OpenTelemetry Agent.
+
+[:bulb: Get started with auto-instrumentation](../auto-instrumentation.md)
+
+## Step 2: Configure Logging
+
+If you are using auto-instrumentation for logs, they are automatically correlated with traces. If you are not using auto-instrumentation for logs, you need to configure your log output to include trace information.
+
+
+=== "log4j"
+
+    Add the [opentelemetry-javaagent-log4j-context-data-2.17](https://mvnrepository.com/artifact/io.opentelemetry.javaagent.instrumentation/opentelemetry-javaagent-log4j-context-data-2.17) package to your `pom.xml` or `build.gradle` to include trace information in your logs:
+
+    ```
+    io.opentelemetry.instrumentation:opentelemetry-log4j-context-data-2.17-autoconfigure:2.1.0-alpha
+    ```
+
+    Add the following pattern to your log4j configuration to include trace information in your logs:
+
+    ```xml
+    <?xml version="1.0" encoding="UTF-8"?>
+    <Configuration status="WARN">
+        <Appenders>
+            <Console name="Console" target="SYSTEM_OUT">
+                <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} traceId: %X{trace_id} spanId: %X{span_id} - %msg%n" />
+            </Console>
+        </Appenders>
+        <Loggers>
+            <Root level="All" >
+                <AppenderRef ref="Console"/>
+            </Root>
+        </Loggers>
+    </Configuration>
+    ```
+
+=== "logback"
+
+    Add the [opentelemetry-logback-mdc-1.0](https://mvnrepository.com/artifact/io.opentelemetry.instrumentation/opentelemetry-logback-mdc-1.0) package to your `pom.xml` or `build.gradle` to include trace information in your logs:
+
+    ```
+    io.opentelemetry.instrumentation:opentelemetry-logback-mdc-1.0:2.1.0-alpha
+    ```
+
+    Add the following pattern to your logback configuration to include trace information in your logs:
+
+    ```xml
+    <?xml version="1.0" encoding="UTF-8" ?>
+    <configuration>
+        <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
+            <encoder>
+                <pattern><![CDATA[%date{HH:mm:ss.SSS} [%thread] %-5level %logger{15}#%line %X{req.requestURI} traceId: %X{trace_id} spanId: %X{span_id} %msg\n]]></pattern>
+            </encoder>
+        </appender>
+
+        <appender name="OTEL" class="io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender">
+            <appender-ref ref="STDOUT" />
+        </appender>
+
+        <root>
+            <level value="DEBUG" />
+            <appender-ref ref="OTEL" />
+        </root>
+
+    </configuration>
+    ```
+
+## Step 3: Profit
+
+Now that you have tracing and logging set up, you can use Grafana Tempo to correlate traces and logs. When you view a trace in Grafana Tempo, you can see the logs that are associated with that trace. This makes it easy to understand what happened in your application and troubleshoot issues.
+
+![Correlate traces and logs](../../../assets/grafana-tempo-logs.png)
+
+[:arrow_backward: Back to the list of guides](../index.md)