
Commit 0d4cf64

Merge branch 'diataxis' of https://github.com/nais/doc into diataxis

2 parents eebbc20 + 63b7769
File tree

6 files changed: +129 -5 lines

Dockerfile (+2 -2)
@@ -7,12 +7,12 @@ COPY docs ./docs-base
 COPY .git ./.git
 COPY tenants ./tenants
 RUN poetry install --no-dev --no-interaction --ansi --remove-untracked
-RUN for TENANT in nav dev-nais ssb tenant; do rm -rf ./docs; mkdir -p ./docs; cp -rf ./docs-base/* ./docs/; cp -rf ./tenants/$TENANT/* ./docs;TENANT=$TENANT poetry run mkdocs build -d out/$TENANT; done
+RUN for TENANT in nav dev-nais ci-nais ssb tenant; do rm -rf ./docs; mkdir -p ./docs; cp -rf ./docs-base/* ./docs/; cp -rf ./tenants/$TENANT/* ./docs;TENANT=$TENANT poetry run mkdocs build -d out/$TENANT; done

 FROM busybox:latest
 ENV PORT=8080
 COPY --from=builder ./src/out /www
 HEALTHCHECK CMD nc -z localhost $PORT
 ENV TENANT=tenant
 # Create a basic webserver and run it until the container is stopped
-CMD echo "httpd started" && trap "exit 0;" TERM INT; httpd -v -p $PORT -h /www/$TENANT -f & wait
+CMD echo "httpd started" && trap "exit 0;" TERM INT; if [ -d "/www/$TENANT" ]; then DIR=$TENANT; else DIR="tenant"; fi; httpd -v -p $PORT -h /www/$DIR -f & wait

docs/explanation/.pages (+6)
@@ -0,0 +1,6 @@
+nav:
+  - nais.md
+  - what-is-naisdevice.md
+  - nais-teams.md
+  - workloads.md
+  - ...
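With the awesome-pages plugin, the trailing `...` entry stands for all remaining pages in the directory that are not listed explicitly, so only the ordering of the named pages is pinned.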
+62 -2

@@ -1,5 +1,65 @@
# Alerting

<iframe width="560" height="315" src="https://www.youtube.com/embed/CGldVD5wR-g?si=luayvJTiZBsWK24u" title="YouTube video player" frameborder="0" allowfullscreen></iframe>

You can't fix what you can't see. Alerting is a crucial part of observability, and it's the first step in knowing when something is wrong with your application.

However, alerting is only as good as the data you have available and the conditions you set. It's important to have a good understanding of what you want to monitor and how you want to be notified. We call this the _alerting strategy_.

While many metrics can be useful for monitoring, not all of them are useful for alerting. When setting up alerts, it's important to choose metrics that are relevant to the user experience and that can be used to detect problems early.
## Critical user journeys

A good place to start when choosing what to monitor is to consider the most critical user journeys in your application. A critical user journey is a set of interactions a user has with a service to achieve a concrete end result. It is important that these journeys are modeled from the user's perspective, not a technical perspective, because not all technical issues are equally important, or even visible, to the user.

Imagine your team is responsible for a hypothetical case management system. Let's look at a few of the actions your users will take when they use the system:

* List open cases
* Search for cases
* View case details
* Case resolution
Not all of these actions are equally important. For example, the "List open cases" journey might be less critical than the "Case resolution" journey. This is why it's important to prioritize the journeys and set up alerts accordingly.

## Alerting indicators

Service level indicators (or SLIs) are metrics that correlate with the user experience for a given user journey. By correlation, we mean that if the indicator is trending downwards, the user experience is likely trending in the same direction.
While many metrics can function as indicators of user experience, we recommend choosing an indicator that is a ratio of two numbers: the number of good events divided by the total number of events. Typical service level indicators include:

* Number of successful HTTP requests divided by the total number of HTTP requests (success rate)
* Number of cache hits divided by the total number of cache lookups (cache hit rate)
* Number of successful database queries divided by the total number of database queries (database success rate)

We recommend a ratio because it is more stable than a raw count and less likely to be affected by a sudden change in traffic. Indicators of this type have other desirable properties: they range from 0% to 100%, where 100% means that all is well and 0% means that nothing is working, and they are easy to understand and easy to calculate.
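For example, a success-rate SLI of this kind could be written as a Prometheus-style recording rule. This is only a sketch: it assumes your application exposes an `http_requests_total` counter with a `status` label, and the rule name is made up for illustration.

```yaml
groups:
  - name: sli
    rules:
      # Hypothetical recording rule: share of HTTP requests over the last
      # 5 minutes that did not return a 5xx status.
      - record: sli:http_request_success:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```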
Other types of indicators can also be useful, such as latency, throughput, and saturation. However, these are often more difficult to calculate and understand, and they are often less stable than ratios.

Continuing with the case management system example, let's say you want to monitor the "Case resolution" user journey. You could monitor the following indicators, sketched as metrics below:

* The rate of successful submissions of case resolutions
* The rate of validation errors (e.g. missing fields, invalid data)
* The latency until the data is persisted to the database
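As an illustration only, the first two indicators could be expressed as ratios over a hypothetical counter `case_resolution_submissions_total` with a `result` label; the metric name and labels are assumptions, not something the platform provides:

```yaml
groups:
  - name: case-resolution-sli
    rules:
      # Hypothetical: share of case resolution submissions that succeeded.
      - record: sli:case_resolution_success:ratio_rate5m
        expr: |
          sum(rate(case_resolution_submissions_total{result="success"}[5m]))
          /
          sum(rate(case_resolution_submissions_total[5m]))
      # Hypothetical: share of submissions rejected with validation errors.
      - record: sli:case_resolution_validation_error:ratio_rate5m
        expr: |
          sum(rate(case_resolution_submissions_total{result="validation_error"}[5m]))
          /
          sum(rate(case_resolution_submissions_total[5m]))
```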
## Alerting objectives

## Alerting conditions

When setting up alerts, you need to define the conditions that should trigger the alert. This could be the number of requests, the latency of your application, or the number of errors exceeding a certain threshold.

Consider the following attributes when setting up alerts (a sketch of an alerting rule that balances them follows the list):
* _Precision_. The proportion of events detected that are significant. In other words, it's the ratio of true positive alerts (alerts that correctly indicate a problem) to the total number of alerts (both true and false positives). High precision means that most of the alerts you receive are meaningful and require action.

* _Recall_. The proportion of significant events that are detected. It's the ratio of true positive alerts to the total number of actual problems (both detected and undetected). High recall means that you are catching most of the problems that occur.

* _Detection Time_. The amount of time it takes for the alerting system to trigger an alert after a problem has occurred. Short detection times are desirable as they allow for quicker response to problems.

* _Reset Time_. The amount of time it takes for the alerting system to resolve an alert after the problem has been fixed. Short reset times are desirable as they reduce the amount of time spent dealing with alerts for problems that have already been resolved.
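As a sketch only, here is what an alerting condition over the hypothetical success-rate SLI recorded above might look like as a Prometheus-style alerting rule; the threshold, duration, and labels are assumptions you would tune to balance precision, recall, detection time, and reset time:

```yaml
groups:
  - name: case-management-alerts
    rules:
      - alert: LowSuccessRate
        # Fire only when the success-rate SLI has stayed below 99% for 5 minutes.
        # A longer `for` duration improves precision (fewer flapping alerts) at the
        # cost of detection time; a shorter one does the opposite.
        expr: sli:http_request_success:ratio_rate5m < 0.99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP success rate has been below 99% for 5 minutes"
```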
https://cloud.google.com/blog/products/management-tools/practical-guide-to-setting-slos

https://cloud.google.com/blog/products/management-tools/good-relevance-and-outcomes-for-alerting-and-monitoring

mkdocs.yml (+1)
@@ -218,6 +218,7 @@ extra_css:
   - material_theme_stylesheet_overrides/uu.css
   - material_theme_stylesheet_overrides/grid.css
 plugins:
+  - awesome-pages
   - macros:
       j2_variable_start_string: "<<"
       j2_variable_end_string: ">>"

poetry.lock (+57 -1)

Generated file; diff not rendered.

pyproject.toml (+1)
@@ -15,6 +15,7 @@ mkdocs-git-revision-date-localized-plugin = "^1.2.1"
 mkdocs-redirects = "^1.2.1"
 mkdocs-git-committers-plugin-2 = "^2.2.2"
 mkdocs-macros-plugin = "^1.0.5"
+mkdocs-awesome-pages-plugin = "^2.9.2"

 [tool.poetry.dev-dependencies]
