Commit 598b4d8

committed
Add more alerting stuff
1 parent d440c59 commit 598b4d8

2 files changed

+38
-8
lines changed

docs/explanation/observability/alerting.md

+14-6
@@ -1,12 +1,20 @@
1+
---
2+
description: >-
3+
Alerting is a crucial part of observability, and it's the first step in knowing when something is wrong with your application.
4+
---
15
# Alerting
26

3-
<iframe width="560" height="315" src="https://www.youtube.com/embed/CGldVD5wR-g?si=luayvJTiZBsWK24u" title="YouTube video player" frameborder="0" allowfullscreen></iframe> -->
7+
<iframe width="560" height="315" src="https://www.youtube.com/embed/CGldVD5wR-g?si=luayvJTiZBsWK24u" title="Video about Actionable Alerting" frameborder="0" allowfullscreen></iframe>
48

59
You can't fix what you can't see. Alerting is a crucial part of observability, and it's the first step in knowing when something is wrong with your application.
610

711
However, alerting is only as good as the data you have available and the conditions you set. It's important to have a good understanding of what you want to monitor and how you want to be notified. We call this the _alerting strategy_.
812

9-
While many metrics can be useful for monitoring, not all of them are useful for alerting. When setting up alerts, it's important to choose metrics that are relevant to the user experience and that can be used to detect problems early.
13+
While many metrics can be useful to gain insights into different aspects of a system, not all of them are useful for alerting. When setting up alerts, it's important to choose metrics that are relevant to the user experience and that can be used to detect problems early.
14+
15+
## Reliability
16+
17+
TBD
1018

1119
## Critical user journeys
1220

@@ -43,7 +51,7 @@ Continuing with the case management system example, let's say you want to monito
4351

4452
## Alerting objectives
4553

46-
54+
Service Level Objectives (SLOs) are the target values, or ranges of values, that the team aims for at a given service level. Each SLO is defined by three things: an indicator, which is a quantitative measure of some aspect of the level of service; a target value or range of values for that indicator; and a time period over which the indicator is measured.
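As a rough illustration of how the indicator, target, and time period fit together, here is a minimal Python sketch. The `Slo` class, its field names, and the example numbers are hypothetical and only meant to show the arithmetic; they are not part of nais or any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical container for the three parts of an SLO."""
    indicator: str    # what is measured, e.g. share of successful requests
    target: float     # desired fraction of good events, e.g. 0.995
    window_days: int  # time period over which the indicator is measured


def attainment(good_events: int, total_events: int) -> float:
    """Measured value of the indicator: the fraction of good events."""
    if total_events == 0:
        # Assumption: with no traffic in the window, nothing has failed.
        return 1.0
    return good_events / total_events


def meets_target(slo: Slo, good_events: int, total_events: int) -> bool:
    """True if the measured indicator is at or above the SLO target."""
    return attainment(good_events, total_events) >= slo.target


if __name__ == "__main__":
    # Made-up numbers for a case management system over a 30-day window.
    slo = Slo(indicator="successful case submissions", target=0.995, window_days=30)
    measured = attainment(good_events=9_960, total_events=10_000)
    print(f"{slo.indicator}: {measured:.3%} over {slo.window_days} days")
    print("objective met:", meets_target(slo, 9_960, 10_000))
```

In practice the counts of good and total events would come from your metrics system, evaluated over the SLO's time window.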
4755

4856
## Alerting conditions
4957

@@ -59,7 +67,7 @@ Consider the following attributes when setting up alerts:
5967

6068
* _Reset Time_. The amount of time it takes for the alerting system to resolve an alert after the problem has been fixed. Short reset times are desirable as they reduce the amount of time spent dealing with alerts for problems that have already been resolved.
6169

70+
## Reference
6271

63-
https://cloud.google.com/blog/products/management-tools/practical-guide-to-setting-slos
64-
65-
https://cloud.google.com/blog/products/management-tools/good-relevance-and-outcomes-for-alerting-and-monitoring
72+
* https://cloud.google.com/blog/products/management-tools/practical-guide-to-setting-slos
73+
* https://cloud.google.com/blog/products/management-tools/good-relevance-and-outcomes-for-alerting-and-monitoring

docs/explanation/observability/logging.md

+24-2
@@ -1,6 +1,28 @@
1-
# Logs (TODO: nav only)
1+
---
2+
description: >-
3+
4+
---
5+
# Logging
6+
7+
## Purpose of logs
8+
9+
Logs are a way to understand what is happening in your application. They are usually text-based and are often used for debugging. Since the format of logs is usually not standardized, it can be difficult to query and aggregate them, so we recommend using metrics for dashboards and alerting.
10+
11+
There are many types of logs, and they can be used for different purposes. Some logs are used for debugging, some are used for auditing, and some are used for security. Our primary use case for logs is to understand the flow of a request through a system.
12+
13+
Application logs in nais are first and foremost a tool for developers to debug their applications. They are not intended to be used for auditing or security purposes. We do not condone writing sensitive information to application logs.
14+
15+
## Good practice
16+
17+
- [x] **Establish a clear logging strategy** for your application. What do you want to log? What do you not want to log? What is the purpose of your logs?
18+
- [x] **Use log levels** to differentiate between different types of logs. We recommend using the following log levels: `INFO`, `WARN`, `ERROR`, and `FATAL`.
19+
- [x] **Use structured logging**. This means that your logs must be written in a JSON format. This makes it easier to query and aggregate logs.
20+
- [x] **Write meaningful log messages** and attach relevant metadata to your logs. This makes it easier to understand what is happening in your application.
21+
- [ ] **Do not log sensitive information**. This includes personal information, passwords, and secrets. If you need to log sensitive information, use [secure logs](#secure-logs) or [audit logs](#audit-logs).
22+
- [ ] **Do not underestimate the cost and performance** of logging. Logging is a trade-off between observability, performance, and cost. Logging can be computationally and financially expensive, so make sure you log only what you actually need.
23+
- [ ] **Do not rely on logs for monitoring**. Use metrics for monitoring, visualization, and alerting as your first line of defense and use logs for debugging when something goes wrong.
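
The log level, structured logging, and metadata practices above could look something like the following minimal Python sketch, which uses only the standard library. The `JsonFormatter` class, the `build_logger` helper, the `fields` key, and the JSON field names are assumptions made for illustration, not a format required by nais. Python names its levels `WARNING` and `CRITICAL`, so the sketch maps them to `WARN` and `FATAL`.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Map Python's level names to the names recommended above (illustrative choice).
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Metadata passed as extra={"fields": {...}}; "fields" is a hypothetical key.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str, stream=sys.stdout, level: int = logging.INFO) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("case-service")
    # Attach identifiers as metadata, never sensitive values.
    log.info("case created", extra={"fields": {"case_id": "c-123", "duration_ms": 42}})
    log.warning("retrying upstream call", extra={"fields": {"attempt": 2}})
```

Writing one JSON object per line to stdout keeps each entry easy to parse, and keeping identifiers in separate fields rather than in the message text makes them easier to query.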
24+
225

3-
## Logging
426

527
Configure your application to log to the console \(stdout/stderr\). The output is scraped by [FluentD](https://www.fluentd.org/) running inside the cluster, sent to [Elasticsearch](https://www.elastic.co/products/elasticsearch), and made available via [Kibana](https://www.elastic.co/products/kibana). Visit our Kibana at [logs.adeo.no](https://logs.adeo.no/).
628
