
Commit 2f9a4f6

Merge pull request #3940 from apostasie/analyze-dockerfile
Adding document analyzing CI/dockerfile
2 parents bd79675 + e52580e commit 2f9a4f6


docs/dev/auditing_dockerfile.md

Lines changed: 321 additions & 0 deletions

# Auditing dockerfile

Because of the nature of the GitHub cache, and the time it takes to build the dockerfile for testing, it is desirable
to be able to audit what is going on there.

This document provides a few pointers on how to do that, and some results as of 2025-02-26 (run inside lima, nerdctl main,
on a MacBook Pro M1).

## Intercept network traffic

### On macOS

Use Charles:
- start SSL proxying
- enable SOCKS proxy
- export the root certificate

### On Linux

Left as an exercise to the reader.

### If using lima

- restart your lima instance with `HTTP_PROXY=http://X.Y.Z.W:8888 HTTPS_PROXY=socks5://X.Y.Z.W:8888 limactl start instance` - where X.Y.Z.W
is the local IP of the Charles proxy (not localhost)
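
For example, assuming the lima instance is named `default` and Charles listens on `192.168.1.10` (both names are illustrative):

```bash
# Stop the instance, then start it again with the proxy variables set, so that
# traffic from inside the VM goes through Charles.
limactl stop default
HTTP_PROXY=http://192.168.1.10:8888 HTTPS_PROXY=socks5://192.168.1.10:8888 limactl start default
```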

### On the host where you are running containerd

- copy the root certificate from above into `/usr/local/share/ca-certificates/charles-ssl-proxying-certificate.crt`
- update your host: `sudo update-ca-certificates`
- now copy the root certificate again to your current nerdctl clone
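
Concretely, assuming the exported certificate sits in the current directory and the nerdctl clone lives at `~/nerdctl` (both paths are illustrative):

```bash
# Trust the Charles root certificate on the host, then copy it into the build
# context so the dockerfile hack below can pick it up.
sudo cp charles-ssl-proxying-certificate.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
cp charles-ssl-proxying-certificate.crt ~/nerdctl/
```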

### Hack the dockerfile to insert our certificate

Add the following stages in the dockerfile:
```dockerfile
FROM --platform=$BUILDPLATFORM golang:${GO_VERSION}-bookworm AS hack-build-base-debian
RUN apt-get update -qq; apt-get -qq install ca-certificates
COPY charles-ssl-proxying-certificate.crt /usr/local/share/ca-certificates/
RUN update-ca-certificates

FROM --platform=$BUILDPLATFORM golang:${GO_VERSION}-alpine AS hack-build-base
RUN apk add --no-cache ca-certificates
COPY charles-ssl-proxying-certificate.crt /usr/local/share/ca-certificates/
RUN update-ca-certificates

FROM ubuntu:${UBUNTU_VERSION} AS hack-base
RUN apt-get update -qq; apt-get -qq install ca-certificates
COPY charles-ssl-proxying-certificate.crt /usr/local/share/ca-certificates/
RUN update-ca-certificates
```

Then replace any later "FROM" with our modified bases:
```
golang:${GO_VERSION}-bookworm => hack-build-base-debian
golang:${GO_VERSION}-alpine => hack-build-base
ubuntu:${UBUNTU_VERSION} => hack-base
```
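
If you prefer to do the substitution mechanically, one possible approach is the following (file names are illustrative, and the hack stages are kept in their own file so the `sed` rewrite does not touch them):

```bash
# Prepend the hack stages, then rewrite the base-image references of the
# original dockerfile into the hacked copy.
cat hack-stages.dockerfile > Dockerfile.hacked
sed \
  -e 's|golang:${GO_VERSION}-bookworm|hack-build-base-debian|g' \
  -e 's|golang:${GO_VERSION}-alpine|hack-build-base|g' \
  -e 's|ubuntu:${UBUNTU_VERSION}|hack-base|g' \
  Dockerfile >> Dockerfile.hacked
```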

## Mimicking what the CI is doing

A quick helper:

```bash
run(){
  local no_cache="${1:-}"
  local platform="${2:-arm64}"
  local dockerfile="${3:-Dockerfile}"
  local target="${4:-test-integration}"

  local cache_shard="$CONTAINERD_VERSION"-"$platform"
  local shard="$cache_shard"-"$target"-"$UBUNTU_VERSION"-"$no_cache"-"$dockerfile"

  local cache_location=$HOME/bk-cache-"$cache_shard"
  local destination=$HOME/bk-output-"$shard"
  local logs="$HOME"/bk-debug-"$shard"

  if [ "$no_cache" != "" ]; then
    nerdctl system prune -af
    nerdctl builder prune -af
    rm -Rf "$cache_location"
  fi

  nerdctl build \
    --build-arg UBUNTU_VERSION="$UBUNTU_VERSION" \
    --build-arg CONTAINERD_VERSION="$CONTAINERD_VERSION" \
    --platform="$platform" \
    --output type=tar,dest="$destination" \
    --progress plain \
    --build-arg HTTP_PROXY="$HTTP_PROXY" \
    --build-arg HTTPS_PROXY="$HTTPS_PROXY" \
    --cache-to type=local,dest="$cache_location",mode=max \
    --cache-from type=local,src="$cache_location" \
    --target "$target" \
    -f "$dockerfile" . 2>&1 | tee "$logs"
}
```
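
For a one-off build outside of the full matrix, the helper can be invoked directly; for example (versions mirror the ones pinned by `ci_run` below):

```bash
# Cold-cache arm64 build of the test-integration target against containerd v2.0.3,
# using the in-tree Dockerfile.
export UBUNTU_VERSION=24.04
CONTAINERD_VERSION=v2.0.3 run no_cache arm64 Dockerfile test-integration
```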

And here is what the CI is doing:

```bash
ci_run(){
  local no_cache="${1:-}"
  export UBUNTU_VERSION=24.04

  CONTAINERD_VERSION=v1.6.36 run "$no_cache" arm64 Dockerfile.origin build-dependencies
  UBUNTU_VERSION=20.04 CONTAINERD_VERSION=v1.6.36 run "" arm64 Dockerfile.origin test-integration

  CONTAINERD_VERSION=v1.7.25 run "$no_cache" arm64 Dockerfile.origin build-dependencies
  UBUNTU_VERSION=22.04 CONTAINERD_VERSION=v1.7.25 run "" arm64 Dockerfile.origin test-integration

  CONTAINERD_VERSION=v2.0.3 run "$no_cache" arm64 Dockerfile.origin build-dependencies
  UBUNTU_VERSION=24.04 CONTAINERD_VERSION=v2.0.3 run "" arm64 Dockerfile.origin test-integration

  CONTAINERD_VERSION=v2.0.3 run "$no_cache" amd64 Dockerfile.origin build-dependencies
  UBUNTU_VERSION=24.04 CONTAINERD_VERSION=v2.0.3 run "" amd64 Dockerfile.origin test-integration
}

# To simulate what happens when there is no cache, go with:
ci_run no_cache

# Once you have a cached run, you can simulate what happens with cache
# First modify something in the nerdctl tree
# Then run it
touch mimick_nerdctl_change
ci_run
```

## Analyzing results

### Network

#### Full CI run, cold cache (the first three pipelines, and part of the fourth)

The following numbers are based on the above script, with a cold cache.

Unfortunately, golang segfaulted during the last run (the cross-build targeting amd64), so these numbers should be taken
as (slightly) underestimated.

Total number of requests: 7190

Total network duration: 13 minutes 11 seconds

Outbound: 1.31MB

Inbound: 5202MB

Breakdown per domain:

| Destination                                   | # requests        | transferred | duration        |
|-----------------------------------------------|-------------------|-------------|-----------------|
| https://registry-1.docker.io                  | 123 (2 failed)    | 1.22MB      | 26s             |
| https://production.cloudflare.docker.com      | 60                | 1242.41MB   | 2m6s            |
| http://deb.debian.org                         | 207               | 107.14MB    | 13s             |
| https://github.com                            | 105               | 977.88MB    | 1m25s           |
| https://proxy.golang.org                      | 5343 (57 failed)  | 753.69MB    | 4m8s            |
| https://objects.githubusercontent.com         | 42                | 900.22MB    | 50s             |
| https://raw.githubusercontent.com             | 8                 | 92KB        | 2s              |
| https://storage.googleapis.com                | 19 (3 failed)     | 537.21MB    | 35s             |
| https://ghcr.io                               | 65                | 588.68KB    | 13s             |
| https://auth.docker.io                        | 10                | 259KB       | 5s              |
| https://pkg-containers.githubusercontent.com  | 48                | 183.63MB    | 20s             |
| http://ports.ubuntu.com                       | 300               | 165.36MB    | 1m55s           |
| https://golang.org                            | 4                 | 228.93KB    | <1s             |
| https://go.dev                                | 4                 | 95.51KB     | <1s             |
| https://dl.google.com                         | 4                 | 271.42MB    | 11s             |
| https://sum.golang.org                        | 746               | 3.89MB      | 17s             |
| http://security.ubuntu.com                    | 7                 | 2.70MB      | 3s              |
| http://archive.ubuntu.com                     | 95                | 55.95MB     | 19s             |
|                                               | -                 | -           | -               |
| Total                                         | 7190              | 5203MB      | 13 mins 11 secs |

#### Full CI run, warm cache (only the first three pipelines)

| Destination                               | # requests       | transferred | duration       |
|-------------------------------------------|------------------|-------------|----------------|
| https://registry-1.docker.io              | 25               | 537KB       | 14s            |
| https://production.cloudflare.docker.com  | 2                | 25MB        | 1s             |
| https://github.com                        | 7 (1 failed)     | 105KB       | 2s             |
| https://proxy.golang.org                  | 930 (11 failed)  | 150MB       | 37s            |
| https://objects.githubusercontent.com     | 4                | 86MB        | 4s             |
| https://storage.googleapis.com            | 3                | 112MB       | 6s             |
| https://auth.docker.io                    | 1                | 26KB        | <1s            |
| http://ports.ubuntu.com                   | 133              | 67MB        | 50s            |
| https://golang.org                        | 2                | 114KB       | <1s            |
| https://go.dev                            | 2                | 45KB        | <1s            |
| https://dl.google.com                     | 2                | 134MB       | 5s             |
| https://sum.golang.org                    | 484              | 3MB         | 11s            |
|                                           | -                | -           | -              |
| Total                                     | 1595 (12 failed) | 579MB       | 2 mins 10 secs |

#### Analysis

##### Docker Hub

Images from Docker Hub are clearly a source of concern (made even worse by the fact that Docker Hub applies strict limits on the
number of requests permitted).

When the cache is cold, this is about 1GB per run, for 200 requests and 3 minutes.

Actions:
- [ ] reduce the number of images
  - we currently use 2 golang images, which does not make sense
- [ ] reduce the round trips
  - there is no reason why any of the images should be queried more than once per build
- [ ] move away from the Hub golang image, and instead use a raw distro + golang download (see the sketch below)
  - the Hub golang image is a source of pain and issues (its diverging version scheme forces ugly shell contortions, and delays in availability create
    broken situations)
  - we are already downloading the go release tarball anyhow, so this is just wasted bandwidth with no added value
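
A sketch of the last point - the shell a plain-distro stage would run instead of pulling `golang` from the Hub (the `GO_VERSION`/`TARGETARCH` names are illustrative; the tarball URL is the standard upstream one):

```bash
# Install the Go toolchain from the official release tarball rather than from a
# Hub image; this is the same tarball the build already downloads anyway.
curl -fsSL "https://go.dev/dl/go${GO_VERSION}.linux-${TARGETARCH}.tar.gz" | tar -C /usr/local -xz
export PATH=/usr/local/go/bin:$PATH
go version
```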

Success criteria:
- on a cold cache, reduce the total number of requests against Docker properties by 50% or more
- on a cold cache, cut the data transfer and time in half

##### Distro packages

On a WARM cache, close to 1 minute is spent fetching Ubuntu packages.
This should not happen, and distro downloads should always be cached.

On a cold cache, distro package downloads take close to 3 minutes.
Very likely there is stage duplication that could be reduced, and some of that time could be cut.

Actions:
- [ ] ensure distro package downloading is staged in a way we can cache it
- [ ] review stages to reduce package installation duplication

Success criteria:
- [ ] 0 package installations on a warm cache
- [ ] cut cold-cache package install time by 50% (XXX not realistic?)

##### GitHub repositories

Clones from GitHub clock in at about 1GB on a cold cache.
Containerd alone accounts for more than half of it (160MB+, cloned four times).

Fortunately, on a warm cache this is practically non-existent.

Still, 1GB of cloning is excessive.

Actions:
- [ ] shallow clone (see the sketch below)
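
A minimal sketch of what that means for containerd (the biggest offender), reusing the version variable from the helper above:

```bash
# Fetch only the tagged commit instead of the full repository history.
git clone --depth 1 --branch "$CONTAINERD_VERSION" https://github.com/containerd/containerd.git
```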

Success criteria:
- [ ] reduce network traffic from cloning by 80%

##### Go modules

At 750+MB and over 4 minutes, this is the number one speed bottleneck on a cold cache.

On a warm cache, it is still over 150MB and 30+ seconds.

In and of itself, this is hard to reduce, as we do need these modules.

Actions:
- [ ] we could cache the module download location to reduce round trips on modules that are shared across
different projects (see the sketch below)
- [ ] we are likely installing nerdctl's modules six times (once per architecture during the build phase, then once per
Ubuntu version and architecture during the test runs - the latter is not even accounted for in the audit above); it should
only happen twice (once per architecture)
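
A sketch of the first idea, assuming a shared, persistent directory can be made available to every stage that builds Go code (the `/cache/gomod` path is illustrative):

```bash
# Point Go at a single shared module cache: modules already downloaded by a
# previous stage or project are not fetched from proxy.golang.org again.
export GOMODCACHE=/cache/gomod
go mod download
```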

Success criteria:
- [ ] achieve 20% reduction of total time spent downloading go modules

##### Other downloads

1. At 500MB+ and 30 seconds, storage.googleapis.com is serving a SINGLE go module that gets special treatment: klauspost/compress.
This module is very small, but it ships with a very large `testdata` folder.
The fact that nerdctl downloads its modules multiple times further compounds the effect.

2. The golang archive is downloaded multiple times - it should be downloaded only once per run, and only on a cold cache.

3. Some of the binary releases we retrieve are fetched again even on a warm cache, and they are generally quite large.
We could consider building certain things from source instead, and in all cases ensure that we only download on a cold cache.

Success criteria:
- [ ] 0 static downloads on a warm cache
- [ ] cut extra downloads by 20%

#### Duration

Unscientific numbers, per pipeline:

dependencies, no cache:
- 224 seconds total
- 53 seconds exporting cache

dependencies, with cache:
- 12 seconds

test-integration, no cache:
- 282 seconds

#### Caching

Number of layers in cache:
```
after dependencies stage: 78
intermediate size: 1.5G
after test-integration stage: 118
total size: 2.8G
```
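
For reference, one way to look at a local cache export produced by `--cache-to type=local` (assuming the usual OCI-layout structure, and reusing `$cache_location` from the helper above):

```bash
# Total size of the cache directory, and a rough blob count (layers plus a few
# metadata blobs).
du -sh "$cache_location"
find "$cache_location/blobs" -type f | wc -l
```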

## Generic considerations

### Caching compression

This is obviously heavily dependent on the runner properties.

With a local cache, on high-performance IO (laptop SSD), zstd is considerably better (about twice as fast).

With GHA, the impact is minimal, since network IO is heavily dominant, but zstd still has the upper
hand with regard to cache size.
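
For example, with a BuildKit version whose local cache exporter accepts a compression option, the cache flags of the `run()` helper above could be adjusted as follows (sketch only):

```bash
# Same cache flags as the run() helper, but asking for zstd-compressed cache blobs.
nerdctl build \
  --cache-to type=local,dest="$cache_location",mode=max,compression=zstd \
  --cache-from type=local,src="$cache_location" \
  --target test-integration .
```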

### Output

Loading the image into the Docker store comes at a somewhat significant cost.
It is quite possible that a significant performance boost could be achieved by using
the buildkit containerd worker and nerdctl instead.
