> This repository has been archived by the owner on Oct 2, 2023. It is now read-only.


# Bazel Docker Intermediate Format

## Background

Over the years, the intermediate format of the Bazel Docker rules has evolved considerably. In the beginning they were simply docker save-style tarballs. This evolved into a sharded form of those tarballs; however, that format ultimately proved unsuitable for the kind of access tools need to the data contained within them. Significant time was spent seeking through the tarballs, and in some cases gzipping and hashing their contents.

In particular, the docker save format is almost exactly wrong for docker_push, since each layer needs to be extracted, gzipped, and hashed before even a basic existence check can be performed against the registry. This has to happen for every layer, so a no-op upload can easily take tens of seconds.
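To illustrate that per-layer cost, here is a minimal sketch of the work a push from a docker save tarball forces on every layer, just to learn the digest needed for an existence check. The helper name and the layer member path are hypothetical, not part of the real rules:

```python
import gzip
import hashlib
import io
import tarfile

def blob_digest_from_save_tarball(save_tar_path, layer_member):
    """Extract one layer from a docker save tarball, gzip it, and hash it.

    All of this work is needed just to learn the registry blob digest for
    a basic existence check, and it must be repeated for every layer.
    """
    with tarfile.open(save_tar_path) as tar:
        layer_bytes = tar.extractfile(layer_member).read()  # extract
    # mtime=0 keeps the gzip output (and therefore the digest) deterministic.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        gz.write(layer_bytes)                               # gzip
    return "sha256:" + hashlib.sha256(buf.getvalue()).hexdigest()  # hash
```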

## Scenarios and Requirements

Our intermediate format's requirements were informed by the following scenarios:

  • docker_build needs to read/modify/write the base image's metadata (requirement: config file must be readily accessible)

  • docker_build needs to be able to incrementally load an image's layers or link them into a full tarball (requirement: uncompressed layers and their "diff ids" must be readily accessible, config file must be readily accessible)

  • docker_build needs to be able to derive from a legacy docker save tarball (requirement: legacy base must be readily accessible)

  • docker_push needs to be able to publish an image without reading layer blobs (requirement: zipped layers and their precomputed sha256 must be readily accessible, config file must be readily accessible).

This gives us a pretty good sense of what we need in our intermediate form:

  • config file
  • zipped layers (ordered as in the config file)
  • blob sums (sha256 of zipped layers, ordered as above)
  • unzipped layers (ordered as above)
  • diff ids (sha256 of unzipped layers, ordered as above)
  • legacy base tarball

It is notable that there is a fair amount of redundancy here: some of these members are simple functions of others (e.g. zip, hash). However, the Bazel action graph gives us a nice way to cache those computations when the input isn't changing, and Bazel will prune the action graph so that only the actions needed by downstream dependencies are run (read: it is fine for the intermediate format to be a superset of what we need).
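The derived members can be sketched as one pure function per layer; in the real rules each derivation is a separate Bazel action, which is what lets Bazel cache and prune them. The function name is hypothetical:

```python
import gzip
import hashlib
import io

def derive_layer_outputs(unzipped_layer: bytes):
    """From one uncompressed layer, derive the redundant members of the
    intermediate form: the zipped layer, its blob sum, and its diff id.
    """
    # diff id: sha256 of the uncompressed layer
    diff_id = "sha256:" + hashlib.sha256(unzipped_layer).hexdigest()
    # mtime=0 makes the gzip output (and so the blob sum) deterministic
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        gz.write(unzipped_layer)
    zipped = buf.getvalue()
    # blob sum: sha256 of the compressed layer
    blob_sum = "sha256:" + hashlib.sha256(zipped).hexdigest()
    return zipped, blob_sum, diff_id
```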

## Roots

### docker_import

As mentioned above, there is a fair amount of redundancy in our intermediate format; however, we needn't surface that level of redundancy to users in how they import base images. docker_import only requires users to specify the first two items above, from which it computes all of the rest (except legacy base, which doesn't apply).
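Going in the other direction from the derivation above, everything docker_import needs can be computed from a zipped layer alone; a hedged per-layer sketch (the helper name is hypothetical):

```python
import gzip
import hashlib

def derive_from_import_inputs(zipped_layer: bytes):
    """Given one of the zipped layers a user hands to docker_import,
    compute the remaining intermediate-form members for that layer.
    """
    blob_sum = "sha256:" + hashlib.sha256(zipped_layer).hexdigest()  # blob sum
    unzipped = gzip.decompress(zipped_layer)                         # unzipped layer
    diff_id = "sha256:" + hashlib.sha256(unzipped).hexdigest()       # diff id
    return blob_sum, unzipped, diff_id
```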

### docker_pull

docker_pull is essentially the combination of google/containerregistry's fast puller.par, which writes the following under a specified directory:

    directory/
      config.json  <-- the image's config
      001.tar.gz   <-- the first layer's .tar.gz filesystem delta
      001.sha256   <-- the sha256 of 001.tar.gz
      ...
      N.tar.gz     <-- the Nth layer's .tar.gz filesystem delta
      N.sha256     <-- the sha256 of N.tar.gz

... and docker_import, which is passed a subset of these files and operates as outlined above.
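Consuming that layout is straightforward; a minimal sketch, assuming the numbered file naming shown above (the function name is hypothetical):

```python
import os
import re

def read_pulled_image(directory):
    """Walk the directory written by the fast puller, returning the config
    and (zipped layer filename, sha256) pairs in layer order.
    """
    with open(os.path.join(directory, "config.json")) as f:
        config = f.read()
    layers = []
    layer_re = re.compile(r"^(\d+)\.tar\.gz$")
    # Zero-padded numeric names sort lexically in layer order.
    for name in sorted(os.listdir(directory)):
        match = layer_re.match(name)
        if not match:
            continue
        with open(os.path.join(directory, match.group(1) + ".sha256")) as f:
            digest = f.read().strip()
        layers.append((name, digest))
    return config, layers
```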

### docker save tarball

A docker_build rule may build upon a legacy docker save tarball, so when such a tarball is passed as the base attribute, it is propagated as the legacy base.

## Augmentation

docker_build largely forwards its base rule's layer-related attributes, augmenting them with the filesystem delta generated by the current rule. The config file is always read and written back out to include at least the new layer, as well as any new metadata properties introduced by the current rule.
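The read/modify/write step can be sketched as follows. This is a hedged illustration, not the rules' actual implementation: the helper name is hypothetical, and environment variables stand in for the broader set of metadata properties docker_build can introduce.

```python
import json

def augment_config(base_config_json, new_diff_id, extra_env=None):
    """Read the base image's config, append the new layer's diff id, and
    merge new metadata (env shown here as one example property).
    """
    config = json.loads(base_config_json)
    # Record the new uncompressed layer's digest in order.
    rootfs = config.setdefault("rootfs", {"type": "layers", "diff_ids": []})
    rootfs.setdefault("diff_ids", []).append(new_diff_id)
    # Merge any new metadata properties introduced by the current rule.
    if extra_env:
        image_config = config.setdefault("config", {})
        image_config["Env"] = image_config.get("Env", []) + extra_env
    return json.dumps(config, sort_keys=True)
```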

## Consumption

### docker_bundle

docker_bundle is considered "consumption" because it doesn't produce a single image and isn't usable in single-image contexts. It supports operations like incremental loading and assembling a docker save tarball, both of which work exactly as they do for docker_build, except that the properties of N images are effectively concatenated.

### docker_push

docker_push publishes this format efficiently by reading the config file and the N blob sums, which together dictate the v2.2 manifest. With these, we can determine the exact set of blobs that need to be uploaded without reading any large objects.
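A minimal sketch of that manifest assembly, using the standard Docker v2.2 media types. The function name and the explicit size parameter are illustrative (the real rules can stat the zipped layer files for sizes); the point is that no layer blob is ever opened:

```python
import hashlib

# Media types from the Docker image manifest v2.2 specification.
MANIFEST_TYPE = "application/vnd.docker.distribution.manifest.v2+json"
CONFIG_TYPE = "application/vnd.docker.container.image.v1+json"
LAYER_TYPE = "application/vnd.docker.image.rootfs.diff.tar.gzip"

def v22_manifest(config_bytes, blob_sums, zipped_sizes):
    """Assemble a v2.2 manifest from the config file and the precomputed
    blob sums alone; layer contents are never read.
    """
    return {
        "schemaVersion": 2,
        "mediaType": MANIFEST_TYPE,
        "config": {
            "mediaType": CONFIG_TYPE,
            "size": len(config_bytes),
            "digest": "sha256:" + hashlib.sha256(config_bytes).hexdigest(),
        },
        "layers": [
            {"mediaType": LAYER_TYPE, "size": size, "digest": digest}
            for digest, size in zip(blob_sums, zipped_sizes)
        ],
    }
```

With the manifest in hand, an existence check per digest against the registry is all that remains before deciding what to upload.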