# About `pkg/store`

## TL;DR

You _may_ want to read this if you are developing something in nerdctl that involves storing persistent information.

If there is a "store" already in the codebase (e.g. volumestore, namestore, etc.) that provides the methods you need,
you are fine and should just stick to that.

On the other hand, if you are curious, or if what you want to write is "new", then _you should_ have a look at this document:
it provides detailed information about how we manage persistent data storage, especially with regard to concurrency
and atomicity.

## Motivation

The core of nerdctl aims at keeping its dependencies as lightweight as possible.
For that reason, nerdctl does not use a database to store persistent information, but instead uses the filesystem,
under a variety of directories.

That "information" is typically volume metadata, container lifecycle info, the "name store" (which ensures no two
containers can have the same name), etc.

However, storing data on the filesystem in a reliable way comes with challenges:
- incomplete writes may happen (because of a system restart, or an application crash), leaving important structured files
in a broken state
- concurrent writes, or reading while writing, would obviously be a problem as well, be it across goroutines, between
concurrent executions of the nerdctl binary, or in a third-party application that embeds nerdctl and accesses resources concurrently

The `pkg/store` package provides a "storage" abstraction that takes care of these issues, generally providing
guarantees that concurrent operations can be performed safely, and that writes are "atomic", ensuring we never brick
user installs.

For details about how, and what is done, read on.

## The problem with writing a file

A write may very well be interrupted.

Reading the resulting mangled file will typically break `json.Unmarshal`, for example. While we should still
handle such cases gracefully and provide meaningful information to the user about which file is damaged (which could also be due
to the user manually modifying it), using "atomic" writes will (almost always (*)) prevent this from happening
on our part.

An "atomic" write is usually performed by first writing data to a temporary file, and then, only if the write operation
succeeded, moving that temporary file to its final destination.

The `rename` syscall (see https://man7.org/linux/man-pages/man2/rename.2.html) is indeed "atomic"
(that is, it either fully succeeds or fails), providing the guarantee that you end up with a complete file containing
exactly what was meant to be written.

This is an "almost always", as an _operating system crash_ MAY break that promise (this is highly dependent on specifics
that are out of scope here, and that nerdctl has no control over).
Though, a crashing operating system is (hopefully) a sufficiently rare event that we can consider writes to "always" be atomic.
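
To make this concrete, here is a minimal sketch of the technique, in the spirit of what `pkg/store` does (the helper
name is hypothetical, and the actual implementation may differ):

```golang
package store

import (
	"os"
	"path/filepath"
)

// writeFileAtomic is a hypothetical helper illustrating a rename-based atomic
// write: write everything to a temporary file, then rename it into place.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	// The temporary file must live on the same filesystem as the destination,
	// or the rename would degrade into a (non-atomic) copy.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we bail out early

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	// Flush to stable storage before publishing the file under its final name.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) either fully succeeds or fails: readers will see the old
	// content or the new content, never a partial write.
	return os.Rename(tmp.Name(), path)
}
```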

There is one caveat with "rename-based atomic writes" though: if you mount the file itself inside a container,
an atomic write will not work as you expect, as the inode will (obviously) change when you modify the file,
and these changes will not be propagated inside the container.

This caveat is the reason why `hostsstore` does NOT use an atomic write to update the `hosts` file, but a traditional write.
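
For contrast, a sketch of such a traditional, inode-preserving write (the helper name and signature are illustrative,
not the actual `hostsstore` code):

```golang
package store

import "os"

// updateInPlace truncates and rewrites an existing file. This keeps the same
// inode, so a bind-mounted copy inside a container sees the update - at the
// cost of readers potentially observing a partially-written file.
func updateInPlace(path string, content []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.Write(content); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}
```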

## Concurrency between goroutines

This is a (simple) well-known problem. Just use a mutex to prevent concurrent modifications of the same object.

Note that this is not much of a problem right now in nerdctl itself - but it might be in third-party applications using
our codebase.

This is just generally good hygiene when building concurrency-safe packages.
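
For instance, a minimal (hypothetical) illustration of that hygiene:

```golang
package store

import "sync"

// memStore is a toy example: the mutex guarantees that two goroutines can
// never mutate - or observe a half-mutated - map at the same time.
type memStore struct {
	mu   sync.Mutex
	data map[string][]byte
}

func (s *memStore) Set(key string, value []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

func (s *memStore) Get(key string) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}
```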

## Concurrency between distinct binary invocations

This is much more of a problem.

There are many good reasons and real-life scenarios where concurrent binary execution may happen.
A third-party deployment tool (similar to Terraform, for example) may batch a number of operations to be performed
to achieve a desired infrastructure state, and run many `nerdctl` invocations in parallel to do so.
This is also commonplace in testing (subpackages).
And of course, a third-party long-running tool that allows parallel execution, leveraging the nerdctl codebase
as a library, may certainly produce these circumstances.

The known answer to that problem is to use a filesystem lock (or flock).

Concretely, the first process will "lock" a directory. All other processes trying to do the same will then be put
in a queue and wait for the prior lock to be released before they can "lock" it themselves, in turn.

Filesystem locking comes with its own set of challenges:
- the implementation is somewhat low-level: the Go standard library keeps its implementation internal, and you have to
reimplement your own with platform-dependent APIs and syscalls (see the sketch after this list)
- it is tricky to get right - there are reasons why the Go team does not make it public
- the locking "scope" should be chosen carefully: having ONE global lock for everything will definitely hurt performance badly,
as you will basically make everything "sequential", effectively destroying some of the benefits of parallelizing code
in the first place...
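
To give a feel for what is involved, here is a minimal, Linux-only sketch using `golang.org/x/sys/unix` (the names are
illustrative, and a real implementation must also cover the other platforms):

```golang
package store

import (
	"os"

	"golang.org/x/sys/unix"
)

// lockDir blocks until it acquires an exclusive lock on dir, and returns the
// open handle that materializes the lock.
func lockDir(dir string) (*os.File, error) {
	f, err := os.Open(dir)
	if err != nil {
		return nil, err
	}
	// LOCK_EX requests an exclusive lock; the call blocks until available.
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// unlockDir releases the lock and closes the handle.
func unlockDir(f *os.File) error {
	defer f.Close()
	return unix.Flock(int(f.Fd()), unix.LOCK_UN)
}
```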

## Lock design...

While it is tempting to just provide self-locking individual methods as an API (`Get`, `Set`), this is not the right
answer.

Imagine a context where consuming code would first like to check if something exists, then later on create it if it does not:
```golang
if !store.Exists("something") {
	// do things
	// ...
	// Now, create
	store.Set([]byte("data"), "something")
}
```

You do have two methods (`Get` and `Set`) that _may individually_ guarantee they are the sole user of that resource,
but a concurrent change _in between_ these two calls may very well (and _will_) happen and change the state of the world.

Effectively, in that case, `Set` might overwrite changes made by another goroutine or concurrent execution, possibly
wreaking havoc in another process.
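
The way out is to let the calling code hold the lock across the entire sequence. A sketch in the style of the snippet
above, assuming a hypothetical `WithLock` helper that locks the store for the duration of the callback:

```golang
// Both the existence check and the write happen under the same lock,
// so no concurrent invocation can slip in between them.
err := store.WithLock(func() error {
	if !store.Exists("something") {
		// do things
		// ...
		// Now, create
		return store.Set([]byte("data"), "something")
	}
	return nil
})
```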

_When_ to lock, and _for how long_, is a decision that only the embedding code can make.

A good example is container creation.
It may require the creation of several different volumes.
In that case, you want to lock at the start of the container creation process, and only release the lock when you are fully
done creating the container - not just when done creating a volume (nor even when done creating all volumes).

## ... while safeguarding the developer

nerdctl still provides some safeguards for the developer.

Any store method that DOES require locking will fail loudly if it does not detect a lock.

This is obviously not bullet-proof.
For example, the lock may belong to another goroutine instead of the one we are in (and we cannot detect that).
But this is still better than nothing, and will help developers make sure they **do** lock.

## Using the `store` API to implement your own storage

While - as mentioned above - the store does _not_ lock on its own, specific store implementations may, and should,
provide higher-level methods that best fit their data-model usage, and that **do** lock on their own.

For example, the namestore (which is the simplest store) provides three simple methods:
- Acquire
- Release
- Rename

Users of the `namestore` do not have to bother with locking. These methods are safe to use concurrently.

This is a good example of how to leverage core store primitives to implement a developer-friendly, safe storage for
"something" (in that case "names").

Finally, note an important point - mentioned above: locking should be scoped to the smallest possible "segment" of sub-directories.
Specifically, any store should lock only - at most - the resources under the _namespace_ being manipulated.

For example, a container lifecycle storage should not lock out any other container, but only its own private directory.

## Scope, ambitions and future

`pkg/store` has no ambition whatsoever to be a generic solution, usable outside nerdctl.

It is solely designed to fit nerdctl's needs, and if it were to be made usable standalone, it would probably have to be modified
extensively, which is clearly out of scope here.

Furthermore, there are already much more advanced generic solutions out there that you should use instead for outside-of-nerdctl projects.

As for the future, one nice thing we should consider is implementing read-only locks in addition to the exclusive write locks
we currently use.
The net benefit would be a performance boost in certain contexts (massively parallel, mostly-read environments).