docs(decisions): add architectural decision records structure

gustavovalverde · gustavovalverde · commit 340fc26278b0 · 2025-02-28T10:07:39.000Z
Create a structured decision records system to document important technical choices across multiple domains (DevOps, Network, Consensus, etc.).

This implements a modified MADR template approach for preserving context, trade-offs, and reasoning behind significant architectural decisions.
diff --git a/docs/decisions/README.md b/docs/decisions/README.md
@@ -0,0 +1,22 @@
+# Decision Log
+
+We capture important decisions with [architectural decision records](https://adr.github.io/).
+
+These records capture the context, trade-offs, and reasoning behind choices made at our community and technical crossroads. Our goal is to preserve an understanding of how the project grew, and to capture enough insight to revisit previous decisions effectively.
+
+Get started by creating a new decision record from the template:
+
+```sh
+cp template.md NNNN-title-with-dashes.md
+```
+
+For more of the rationale behind this approach, see [Michael Nygard's article](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).
+
+We've inherited the MADR [ADR template](https://adr.github.io/madr/), which is a bit more verbose than Nygard's original template. We may simplify it in the future.
+
+## Evolving Decisions
+
+Many decisions build on each other, a driver of iterative change and messiness
+in software. By laying out the "story arc" of a particular system within the
+application, we hope future maintainers will be able to identify how to rewind
+decisions when refactoring the application becomes necessary.
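
Note on `docs/decisions/README.md`: the `cp` step leaves picking the next `NNNN` and the dashed title to the author. A minimal sketch of a helper that derives both could look like the following. `next_adr_name` is a hypothetical function, not part of this commit, and it assumes records are numbered per directory with four digits, as in `devops/0001-docker-high-uid.md`.

```sh
#!/bin/sh
# next_adr_name DIR TITLE
# Prints the next free "NNNN-title-with-dashes.md" name for DIR.
# Hypothetical helper: the four-digit, per-directory numbering is an assumption.
next_adr_name() {
    dir=$1
    title=$2
    last=0
    for f in "$dir"/[0-9][0-9][0-9][0-9]-*.md; do
        [ -e "$f" ] || continue                 # the glob matched nothing
        n=${f##*/}                              # drop the directory part
        n=${n%%-*}                              # keep the numeric prefix, e.g. 0002
        n=$(printf '%s' "$n" | sed 's/^0*//')   # strip leading zeros
        if [ "${n:-0}" -gt "$last" ]; then
            last=${n:-0}
        fi
    done
    # Lowercase the title and turn every run of other characters into one dash.
    slug=$(printf '%s' "$title" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -cs 'a-z0-9' '-' \
        | sed 's/^-*//; s/-*$//')
    printf '%04d-%s.md\n' "$((last + 1))" "$slug"
}
```

It could then be combined with the documented step, for example `cp template.md "devops/$(next_adr_name devops 'Docker use gosu')"`.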
diff --git a/docs/decisions/devops/0001-docker-high-uid.md b/docs/decisions/devops/0001-docker-high-uid.md
@@ -0,0 +1,51 @@
+---
+status: accepted
+date: 2025-02-28
+story: Appropriate UID/GID values for container users
+---
+
+# Use High UID/GID Values for Container Users
+
+## Context & Problem Statement
+
+Docker containers share the host's user namespace by default. If container UIDs/GIDs overlap with privileged host accounts, this could lead to privilege escalation if a container escape vulnerability is exploited. Low UIDs (especially in the system user range of 100-999) are particularly risky as they often map to privileged system users on the host.
+
+Our previous approach used UID/GID 101 with the `--system` flag for user creation, which falls within the system user range and could potentially overlap with critical system users on the host.
+
+## Priorities & Constraints
+
+* Enhance security by reducing the risk of container user namespace overlaps
+* Avoid warnings during container build related to system user ranges
+* Maintain compatibility with common Docker practices
+* Prevent potential privilege escalation in case of container escape
+
+## Considered Options
+
+* Option 1: Keep using low UID/GID (101) with `--system` flag
+* Option 2: Use unprivileged UID/GID (1000+) without `--system` flag
+* Option 3: Use high UID/GID (10000+) without `--system` flag
+
+## Decision Outcome
+
+Chosen option: [Option 3: Use high UID/GID (10000+) without `--system` flag]
+
+We decided to:
+
+1. Change the default UID/GID from 101 to 10001
+2. Remove the `--system` flag from user/group creation commands
+3. Document the security rationale for these changes
+
+This approach significantly reduces the risk of UID/GID collision with host system users while avoiding build-time warnings related to system user ranges. Using a very high UID/GID (10001) provides an additional security boundary in containers where user namespaces are shared with the host.
+
+### Expected Consequences
+
+* Improved security posture by reducing the risk of container escapes leading to privilege escalation
+* Elimination of build-time warnings related to system user UID/GID ranges
+* Consistency with industry best practices for container security
+* No functional impact on container operation, as the internal user permissions remain the same
+
+## More Information
+
+* [NGINX Docker User ID Issue](https://github.com/nginxinc/docker-nginx/issues/490) - Demonstrates the risks of using UID 101 which overlaps with `systemd-network` user on Debian systems
+* [.NET Docker Issue on System Users](https://github.com/dotnet/dotnet-docker/issues/4624) - Details the problems with using `--system` flag and the SYS_UID_MAX warnings
+* [Docker Security Best Practices](https://docs.docker.com/develop/security-best-practices/) - General security recommendations for Docker containers
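
Note on `docs/decisions/devops/0001-docker-high-uid.md`: the three options differ only in which UID range the container user falls into. A small sketch of that classification follows. `classify_uid` is a hypothetical helper, and the thresholds (below 1000, 1000 to 9999, 10000 and up) are taken from the options listed in the record; the real system range on a given host is set by `SYS_UID_MIN`/`SYS_UID_MAX` in `/etc/login.defs` and may differ.

```sh
#!/bin/sh
# classify_uid UID
# Prints "root", "system", "regular" or "high" for a numeric UID.
# Thresholds mirror the ADR's options, not any particular distribution.
classify_uid() {
    uid=$1
    case $uid in
        '' | *[!0-9]*)
            echo "invalid"
            return 1
            ;;
    esac
    if [ "$uid" -eq 0 ]; then
        echo "root"
    elif [ "$uid" -lt 1000 ]; then
        echo "system"      # Option 1 territory, e.g. the previous UID 101
    elif [ "$uid" -lt 10000 ]; then
        echo "regular"     # Option 2: overlaps with ordinary host accounts
    else
        echo "high"        # Option 3: e.g. the chosen UID 10001
    fi
}
```

A build or CI check could, under these assumptions, refuse any image whose configured UID does not classify as `high`.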
diff --git a/docs/decisions/devops/0002-docker-use-gosu.md b/docs/decisions/devops/0002-docker-use-gosu.md
@@ -0,0 +1,51 @@
+---
+status: accepted
+date: 2025-02-28
+story: Volumes permissions and privilege management in container entrypoint
+---
+
+# Use gosu for Privilege Dropping in Entrypoint
+
+## Context & Problem Statement
+
+Running containerized applications as the root user is a security risk. If an attacker compromises the application, they gain root access within the container, potentially facilitating a container escape. However, some operations during container startup, such as creating directories or modifying file permissions in locations not owned by the application user, require root privileges. We need a way to perform these initial setup tasks as root, but then switch to a non-privileged user *before* executing the main application (`zebrad`). Setting `USER` in the Dockerfile is insufficient because it fixes the user for the whole runtime, including the entrypoint, whereas we need to change permissions as root *after* volumes are mounted.
+
+## Priorities & Constraints
+
+* Minimize the security risk by running the main application (`zebrad`) as a non-privileged user.
+* Allow initial setup tasks (file/directory creation, permission changes) that require root privileges.
+* Maintain a clean and efficient entrypoint script.
+* Avoid complex signal handling and TTY issues associated with `su` and `sudo`.
+* Ensure 1:1 parity with Docker's `--user` flag behavior.
+
+## Considered Options
+
+* Option 1: Use `USER` directive in Dockerfile.
+* Option 2: Use `su` within the entrypoint script.
+* Option 3: Use `sudo` within the entrypoint script.
+* Option 4: Use `gosu` within the entrypoint script.
+* Option 5: Use `chroot --userspec`
+* Option 6: Use `setpriv`
+
+## Decision Outcome
+
+Chosen option: [Option 4: Use `gosu` within the entrypoint script]
+
+We chose to use `gosu` because it provides a simple and secure way to drop privileges from root to a non-privileged user *after* performing necessary setup tasks. `gosu` avoids the TTY and signal-handling complexities of `su` and `sudo`. It's designed specifically for this use case (dropping privileges in container entrypoints) and leverages the same underlying mechanisms as Docker itself for user/group handling, ensuring consistent behavior.
+
+### Expected Consequences
+
+* Improved security by running `zebrad` as a non-privileged user.
+* Simplified entrypoint script compared to using `su` or `sudo`.
+* Avoidance of TTY and signal-handling issues.
+* Consistent behavior with Docker's `--user` flag.
+* No negative impact on functionality, as initial setup tasks can still be performed.
+
+## More Information
+
+* [gosu GitHub repository](https://github.com/tianon/gosu#why) - Explains the rationale behind `gosu` and its advantages over `su` and `sudo`.
+* [gosu usage warning](https://github.com/tianon/gosu#warning) - Highlights the core use case (stepping down from root) and potential vulnerabilities in other scenarios.
+* Alternatives considered:
+  * `chroot --userspec`: While functional, it's less common and less directly suited to this specific task than `gosu`.
+  * `setpriv`: A viable alternative, but `gosu` is already well-established in our workflow and offers the desired functionality with a smaller footprint than a full `util-linux` installation.
+  * `su-exec`: Another minimal alternative, but it has known parser bugs that could lead to unexpected root execution.
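
Note on `docs/decisions/devops/0002-docker-use-gosu.md`: the pattern described in the record (root-only setup, then stepping down before starting `zebrad`) can be sketched as below. This is not the project's entrypoint. `APP_USER`, `DATA_DIR`, `DRY_RUN`, and the `run` and `entrypoint` functions are illustrative names, and the user name `zebra` is an assumption. The only external call relied on is `gosu USER COMMAND [ARGS...]`.

```sh
#!/bin/sh
# Entrypoint sketch: do root-only setup, then drop privileges with gosu.
# APP_USER, DATA_DIR and DRY_RUN are illustrative, not taken from the ADR.
APP_USER=${APP_USER:-zebra}
DATA_DIR=${DATA_DIR:-/var/lib/zebrad}

# run CMD...: execute CMD, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# entrypoint EUID CMD...
# EUID is passed in explicitly (normally "$(id -u)") so the branch is visible.
entrypoint() {
    euid=$1
    shift
    if [ "$euid" -eq 0 ]; then
        # Still root: fix ownership of the mounted volume, then step down.
        run mkdir -p "$DATA_DIR"
        run chown -R "$APP_USER:$APP_USER" "$DATA_DIR"
        # exec replaces the shell, so the application receives signals directly.
        run exec gosu "$APP_USER" "$@"
    else
        # Started with --user: already unprivileged, nothing to drop.
        run exec "$@"
    fi
}
```

In a real script the last line would be something like `entrypoint "$(id -u)" "$@"`. Setting `DRY_RUN=1` prints the commands instead of running them, which makes it possible to try the sketch outside a container.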
diff --git a/docs/decisions/devops/0003-filesystem-hierarchy.md b/docs/decisions/devops/0003-filesystem-hierarchy.md
@@ -0,0 +1,115 @@
+---
+status: proposed
+date: 2025-02-28
+story: Standardize filesystem hierarchy for Zebra deployments
+---
+
+# Standardize Filesystem Hierarchy: FHS vs. XDG
+
+## Context & Problem Statement
+
+Zebra currently has inconsistencies in its filesystem layout, particularly regarding where configuration, data, cache files, and binaries are stored. We need a standardized approach compatible with:
+
+1. Traditional Linux systems.
+2. Containerized deployments (Docker).
+3. Cloud environments with stricter filesystem restrictions (e.g., Google's Container-Optimized OS).
+
+We previously considered using the Filesystem Hierarchy Standard (FHS) exclusively ([Issue #3432](https://github.com/ZcashFoundation/zebra/issues/3432)). However, recent changes introduced the XDG Base Directory Specification, which offers a user-centric approach. We need to decide whether to:
+
+* Adhere to FHS.
+* Adopt XDG Base Directory Specification.
+* Use a hybrid approach, leveraging the strengths of both.
+
+The choice impacts how we structure our Docker images, where configuration files are located, and how users interact with Zebra in different environments.
+
+## Priorities & Constraints
+
+* **Security:** Minimize the risk of privilege escalation by adhering to least-privilege principles.
+* **Maintainability:** Ensure a clear and consistent filesystem layout that is easy to understand and maintain.
+* **Compatibility:** Work seamlessly across various Linux distributions, Docker, and cloud environments (particularly those with restricted filesystems like Google's Container-Optimized OS).
+* **User Experience:** Provide a predictable and user-friendly experience for locating configuration and data files.
+* **Flexibility:** Allow users to override default locations via environment variables where appropriate.
+* **Avoid Breaking Changes:** Minimize disruption to existing users and deployments, if possible.
+
+## Considered Options
+
+### Option 1: FHS
+
+* Configuration: `/etc/zebrad/`
+* Data: `/var/lib/zebrad/`
+* Cache: `/var/cache/zebrad/`
+* Logs: `/var/log/zebrad/`
+* Binary: `/opt/zebra/bin/zebrad` or `/usr/local/bin/zebrad`
+
+### Option 2: XDG Base Directory Specification
+
+* Configuration: `$HOME/.config/zebrad/`
+* Data: `$HOME/.local/share/zebrad/`
+* Cache: `$HOME/.cache/zebrad/`
+* State: `$HOME/.local/state/zebrad/`
+* Binary: `$HOME/.local/bin/zebrad` or `/usr/local/bin/zebrad`
+
+### Option 3: Hybrid Approach (FHS for System-Wide, XDG for User-Specific)
+
+* System-wide configuration: `/etc/zebrad/`
+* User-specific configuration: `$XDG_CONFIG_HOME/zebrad/`
+* System-wide data (read-only, shared): `/usr/share/zebrad/` (e.g., checkpoints)
+* User-specific data: `$XDG_DATA_HOME/zebrad/`
+* Cache: `$XDG_CACHE_HOME/zebrad/`
+* State: `$XDG_STATE_HOME/zebrad/`
+* Runtime: `$XDG_RUNTIME_DIR/zebrad/`
+* Binary: `/opt/zebra/bin/zebrad` (system-wide) or `$HOME/.local/bin/zebrad` (user-specific)
+
+## Pros and Cons of the Options
+
+### FHS
+
+* **Pros:**
+  * Traditional and well-understood by system administrators.
+  * Clear separation of configuration, data, cache, and binaries.
+  * Suitable for packaged software installations.
+
+* **Cons:**
+  * Less user-friendly; requires root access to modify configuration.
+  * Can conflict with stricter cloud environments restricting writes to `/etc` and `/var`.
+  * Doesn't handle multi-user scenarios as gracefully as XDG.
+
+### XDG Base Directory Specification
+
+* **Pros:**
+  * User-centric: configuration and data stored in user-writable locations.
+  * Better suited for containerized and cloud environments.
+  * Handles multi-user scenarios gracefully.
+  * Clear separation of configuration, data, cache, and state.
+
+* **Cons:**
+  * Less traditional; might be unfamiliar to some system administrators.
+  * Requires environment variables to be set correctly.
+  * Binary placement less standardized.
+
+### Hybrid Approach (FHS for System-Wide, XDG for User-Specific)
+
+* **Pros:**
+  * Combines strengths of FHS and XDG.
+  * Allows system-wide defaults while prioritizing user-specific configurations.
+  * Flexible and adaptable to different deployment scenarios.
+  * Clear binary placement in `/opt`.
+
+* **Cons:**
+  * More complex than either FHS or XDG alone.
+  * Requires careful consideration of precedence rules.
+
+## Decision Outcome
+
+Pending
+
+## Expected Consequences
+
+Pending
+
+## More Information
+
+* [Filesystem Hierarchy Standard (FHS) v3.0](https://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.html)
+* [XDG Base Directory Specification](https://specifications.freedesktop.org/basedir-spec/latest/)
+* [Zebra Issue #3432: Use the Filesystem Hierarchy Standard (FHS) for deployments and artifacts](https://github.com/ZcashFoundation/zebra/issues/3432)
+* [Google Container-Optimized OS: Working with the File System](https://cloud.google.com/container-optimized-os/docs/concepts/disks-and-filesystem#working_with_the_file_system)
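
Note on `docs/decisions/devops/0003-filesystem-hierarchy.md`: Option 3 depends on two rules, the XDG fallbacks when a variable is unset and the precedence of user-specific over system-wide configuration. A sketch of both follows, using the default locations defined by the XDG Base Directory Specification. The `xdg_dir` and `find_config` functions, the `SYSTEM_CONFIG_DIR` override, and the `zebrad.toml` file name are assumptions made for illustration, since the record does not settle any of them.

```sh
#!/bin/sh
# xdg_dir KIND
# Prints the per-user zebrad directory for config, data, cache or state,
# falling back to the XDG default when the variable is unset or empty.
xdg_dir() {
    case $1 in
        config) base=${XDG_CONFIG_HOME:-$HOME/.config} ;;
        data)   base=${XDG_DATA_HOME:-$HOME/.local/share} ;;
        cache)  base=${XDG_CACHE_HOME:-$HOME/.cache} ;;
        state)  base=${XDG_STATE_HOME:-$HOME/.local/state} ;;
        *)
            echo "unknown kind: $1" >&2
            return 1
            ;;
    esac
    printf '%s/zebrad\n' "$base"
}

# find_config
# Prints the first existing config file: user-specific first, then system-wide.
# SYSTEM_CONFIG_DIR and the zebrad.toml name are hypothetical.
find_config() {
    for dir in "$(xdg_dir config)" "${SYSTEM_CONFIG_DIR:-/etc/zebrad}"; do
        if [ -f "$dir/zebrad.toml" ]; then
            printf '%s\n' "$dir/zebrad.toml"
            return 0
        fi
    done
    return 1
}
```

Searching the user directory first is one possible precedence rule. The record lists the choice of precedence as an open concern for the hybrid approach.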
diff --git a/docs/decisions/template.md b/docs/decisions/template.md
@@ -0,0 +1,49 @@
+---
+# status and date are the only required elements. Feel free to remove the rest.
+status: {[proposed | rejected | accepted | deprecated | … | superseded by [ADR-NAME](adr-file-name.md)]}
+date: {YYYY-MM-DD when the decision was last updated}
+builds-on: {[Short Title](2021-05-15-short-title.md)}
+story: {description or link to contextual issue}
+---
+
+# {short title of solved problem and solution}
+
+## Context & Problem Statement
+
+{2-3 sentences explaining the problem and the forces influencing the decision.}
+<!-- The language in this section is value-neutral. It is simply describing facts. -->
+
+## Priorities & Constraints <!-- optional -->
+
+* {List of concerns or constraints}
+* {Factors influencing the decision}
+
+## Considered Options
+
+* Option 1: Thing
+* Option 2: Another
+
+### Pros and Cons of the Options <!-- optional -->
+
+#### Option 1: {Brief description}
+
+* Good, because {reason}
+* Bad, because {reason}
+
+## Decision Outcome
+
+Chosen option: [Option 1: Thing]
+
+{Clearly state the chosen option and provide justification. Reference the "Pros and Cons of the Options" section above if applicable.}
+
+### Expected Consequences <!-- optional -->
+
+* List of outcomes resulting from this decision
+<!-- Positive, negative, and/or neutral consequences, as long as they affect the team and project in the future. -->
+
+## More Information <!-- optional -->
+
+<!-- * Resources reviewed as part of making this decision -->
+<!-- * Links to any supporting documents or resources -->
+<!-- * Related PRs -->
+<!-- * Related User Journeys -->
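
Note on `docs/decisions/template.md`: the template states that `status` and `date` are the only required front matter elements. A minimal sketch of a check for that rule is given below. `check_front_matter` is a hypothetical helper, not part of this commit, and it assumes the front matter is the block between the first two `---` lines of the file.

```sh
#!/bin/sh
# check_front_matter FILE
# Succeeds only if FILE starts with a front matter block containing both
# a "status:" and a "date:" key. Hypothetical helper for illustration.
check_front_matter() {
    awk '
        NR == 1 && !/^---$/ { exit }                 # no front matter at all
        /^---$/ { if (++fences == 2) exit; next }    # stop at the closing fence
        /^status:/ { has_status = 1 }
        /^date:/   { has_date = 1 }
        END { exit !(has_status && has_date) }
    ' "$1"
}
```

Under these assumptions, `check_front_matter devops/0001-docker-high-uid.md && echo ok` could serve as a pre-commit or CI lint for new records.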