
Commit 0d4cf64

Merge branch 'diataxis' of https://github.com/nais/doc into diataxis

2 parents eebbc20 + 63b7769
File tree

6 files changed: +129 -5 lines

Dockerfile (+2 -2)
@@ -7,12 +7,12 @@ COPY docs ./docs-base
 COPY .git ./.git
 COPY tenants ./tenants
 RUN poetry install --no-dev --no-interaction --ansi --remove-untracked
-RUN for TENANT in nav dev-nais ssb tenant; do rm -rf ./docs; mkdir -p ./docs; cp -rf ./docs-base/* ./docs/; cp -rf ./tenants/$TENANT/* ./docs;TENANT=$TENANT poetry run mkdocs build -d out/$TENANT; done
+RUN for TENANT in nav dev-nais ci-nais ssb tenant; do rm -rf ./docs; mkdir -p ./docs; cp -rf ./docs-base/* ./docs/; cp -rf ./tenants/$TENANT/* ./docs;TENANT=$TENANT poetry run mkdocs build -d out/$TENANT; done

 FROM busybox:latest
 ENV PORT=8080
 COPY --from=builder ./src/out /www
 HEALTHCHECK CMD nc -z localhost $PORT
 ENV TENANT=tenant
 # Create a basic webserver and run it until the container is stopped
-CMD echo "httpd started" && trap "exit 0;" TERM INT; httpd -v -p $PORT -h /www/$TENANT -f & wait
+CMD echo "httpd started" && trap "exit 0;" TERM INT; if [ -d "/www/$TENANT" ]; then DIR=$TENANT; else DIR="tenant"; fi; httpd -v -p $PORT -h /www/$DIR -f & wait

docs/explanation/.pages (+6)
@@ -0,0 +1,6 @@
+nav:
+  - nais.md
+  - what-is-naisdevice.md
+  - nais-teams.md
+  - workloads.md
+  - ...
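With the awesome-pages plugin, the trailing `...` entry stands for all remaining pages in the directory that are not listed explicitly, so only the ordering of the named pages is pinned.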
+62 -2

@@ -1,5 +1,65 @@
# Alerting

<iframe width="560" height="315" src="https://www.youtube.com/embed/CGldVD5wR-g?si=luayvJTiZBsWK24u" title="YouTube video player" frameborder="0" allowfullscreen></iframe>

You can't fix what you can't see. Alerting is a crucial part of observability, and it's the first step in knowing when something is wrong with your application.

However, alerting is only as good as the data you have available and the conditions you set. It's important to have a good understanding of what you want to monitor and how you want to be notified. We call this the _alerting strategy_.

While many metrics can be useful for monitoring, not all of them are useful for alerting. When setting up alerts, it's important to choose metrics that are relevant to the user experience and that can be used to detect problems early.
## Critical user journeys

A good place to start when choosing what to monitor is to consider the most critical user journeys in your application. A critical user journey is a set of interactions a user has with a service to achieve a concrete end result. It is important that these journeys are modeled from the user's perspective, not a technical perspective, because not all technical issues are equally important, or even visible, to the user.

Imagine your team is responsible for a hypothetical case management system. Let's look at a few of the actions your users will take when they use the system:

* List open cases
* Search for cases
* View case details
* Case resolution
Not all of these actions are equally important. For example, the "List open cases" journey might be less critical than the "Case resolution" journey. This is why it's important to prioritize the journeys and set up alerts accordingly.

## Alerting indicators

Service level indicators (or SLIs) are metrics that correlate with the user experience for a given user journey. By correlation, we mean that if the indicator is trending downwards, the user experience is likely trending in the same direction.
While many metrics can function as indicators of user experience, we recommend choosing an indicator that is a ratio of two numbers: the number of good events divided by the total number of events. Typical service level indicators include:

* Number of successful HTTP requests divided by the total number of HTTP requests (success rate)
* Number of cache hits divided by the total number of cache lookups (cache hit rate)
* Number of successful database queries divided by the total number of database queries (database success rate)

We recommend a ratio because it is more stable than a raw count and less likely to be affected by a sudden change in traffic. Indicators of this type have other desirable properties: they range from 0% to 100%, where 100% means that all is well and 0% means that nothing is working, and they are easy to understand and easy to calculate.
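For example, a success-rate SLI of this kind could be written as a Prometheus-style recording rule. This is only a sketch: it assumes your application exposes an `http_requests_total` counter with a `status` label, and the rule name is made up for illustration.

```yaml
groups:
  - name: sli
    rules:
      # Hypothetical recording rule: share of HTTP requests over the last
      # 5 minutes that did not return a 5xx status.
      - record: sli:http_request_success:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```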
Other types of indicators can also be useful, such as latency, throughput, and saturation. However, these are often more difficult to calculate and understand, and they are often less stable than ratios.

Continuing with the case management system example, let's say you want to monitor the "Case resolution" user journey. You could monitor the following indicators, sketched as metrics below:

* The rate of successful submissions of case resolutions
* The rate of validation errors (e.g. missing fields, invalid data)
* The latency until the data is persisted to the database
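As an illustration only, the first two indicators could be expressed as ratios over a hypothetical counter `case_resolution_submissions_total` with a `result` label; the metric name and labels are assumptions, not something the platform provides:

```yaml
groups:
  - name: case-resolution-sli
    rules:
      # Hypothetical: share of case resolution submissions that succeeded.
      - record: sli:case_resolution_success:ratio_rate5m
        expr: |
          sum(rate(case_resolution_submissions_total{result="success"}[5m]))
          /
          sum(rate(case_resolution_submissions_total[5m]))
      # Hypothetical: share of submissions rejected with validation errors.
      - record: sli:case_resolution_validation_error:ratio_rate5m
        expr: |
          sum(rate(case_resolution_submissions_total{result="validation_error"}[5m]))
          /
          sum(rate(case_resolution_submissions_total[5m]))
```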
## Alerting objectives

## Alerting conditions

When setting up alerts, you need to define the conditions that should trigger the alert. This could be the number of requests, the latency of your application, or the number of errors exceeding a certain threshold.

Consider the following attributes when setting up alerts (a sketch of an alerting rule that balances them follows the list):
* _Precision_. The proportion of events detected that are significant. In other words, it's the ratio of true positive alerts (alerts that correctly indicate a problem) to the total number of alerts (both true and false positives). High precision means that most of the alerts you receive are meaningful and require action.

* _Recall_. The proportion of significant events that are detected. It's the ratio of true positive alerts to the total number of actual problems (both detected and undetected). High recall means that you are catching most of the problems that occur.

* _Detection Time_. The amount of time it takes for the alerting system to trigger an alert after a problem has occurred. Short detection times are desirable as they allow for quicker response to problems.

* _Reset Time_. The amount of time it takes for the alerting system to resolve an alert after the problem has been fixed. Short reset times are desirable as they reduce the amount of time spent dealing with alerts for problems that have already been resolved.
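As a sketch only, here is what an alerting condition over the hypothetical success-rate SLI recorded above might look like as a Prometheus-style alerting rule; the threshold, duration, and labels are assumptions you would tune to balance precision, recall, detection time, and reset time:

```yaml
groups:
  - name: case-management-alerts
    rules:
      - alert: LowSuccessRate
        # Fire only when the success-rate SLI has stayed below 99% for 5 minutes.
        # A longer `for` duration improves precision (fewer flapping alerts) at the
        # cost of detection time; a shorter one does the opposite.
        expr: sli:http_request_success:ratio_rate5m < 0.99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP success rate has been below 99% for 5 minutes"
```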
https://cloud.google.com/blog/products/management-tools/practical-guide-to-setting-slos

https://cloud.google.com/blog/products/management-tools/good-relevance-and-outcomes-for-alerting-and-monitoring

mkdocs.yml (+1)
@@ -218,6 +218,7 @@ extra_css:
   - material_theme_stylesheet_overrides/uu.css
   - material_theme_stylesheet_overrides/grid.css
 plugins:
+  - awesome-pages
   - macros:
       j2_variable_start_string: "<<"
       j2_variable_end_string: ">>"

poetry.lock (+57 -1)

Generated file; diff not rendered.

pyproject.toml (+1)
@@ -15,6 +15,7 @@ mkdocs-git-revision-date-localized-plugin = "^1.2.1"
 mkdocs-redirects = "^1.2.1"
 mkdocs-git-committers-plugin-2 = "^2.2.2"
 mkdocs-macros-plugin = "^1.0.5"
+mkdocs-awesome-pages-plugin = "^2.9.2"

 [tool.poetry.dev-dependencies]
