What does "Pod Sandbox" mean to Aurae? #433

@krisnova

How did we get here?

So I was streaming recently and started to look through the implementation details of how we are implementing the Container Runtime Interface (CRI).

Naturally this opened up a can of worms. One implementation detail led to another, and it quickly spiraled out of control. This left me spending the weekend thinking about what the project should do. I want this GitHub issue to serve as an architecture decision record (ADR) for what we intend to do.

But first, some context, history, and vocabulary.

What is a "Pod Sandbox" and where did it come from?

Here is the shape that I think of when I think of what most of the industry refers to as "a pod".

pod/
├── container-app
│   └── bin
│       └── my-app
├── container-log-aggregate
│   └── app
│       ├── logger.go
│       └── main.go
├── container-profiler
│   └── program.exe
└── container-proxy
    └── bin
        └── nginx

which is basically to say that it's a bounded set of containers that exist within some isolation zone. Kubernetes, for example, likes to pretend that the containers within a pod all share the same localhost, storage, network, etc.

In the context of OpenShift sandboxed containers, a pod is implemented as a virtual machine. Several containers can run in the same pod on the same virtual machine.[1]

The history of a pod (as I understand it) is relatively simple, and makes sense given the behavior of the clone(2) and clone3(2) system calls. Basically you cannot "create" a free-standing namespace in Linux; even unshare(2) only moves the calling process into new namespaces. You can, however, execute a new process in a new namespace. So what do you do when you just want an "empty" boundary and aren't ready to start any work in your namespace yet? Or more importantly, how do you keep the namespace around if your container exits? Linux will destroy a namespace once nothing is executing in it (unless it is pinned open, e.g. by a bind mount on /proc/<pid>/ns/*).

There is some historical context suggesting that the Kubernetes pause container was the answer to this problem:

  1. A user executes clone(2) or clone3(2) with a new Pause process.
  2. The new process gets a PID and a shiny new set of namespaces, and basically just falls asleep and does nothing.
  3. Now that the namespaces are established, we can schedule and reschedule other processes alongside each other in the new namespaces.

Thus, the paradigm of the pod sandbox was created as a way to hold a set of these containers together.
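The pause trick above can be sketched with plain util-linux tooling. This is an illustration only (not Aurae code), and it assumes a kernel with unprivileged user namespaces enabled:

```shell
# Illustration of the pause trick (assumes unprivileged user namespaces).
# --user: new user namespace (lets this run without root)
# --pid:  new PID namespace
# --fork: fork a child into the new namespaces (required for --pid)
unshare --user --pid --fork sh -c 'echo "pid in new namespace: $$"'
# Prints "pid in new namespace: 1" -- the child is PID 1, the same role
# the pause process plays; the namespaces live for as long as it does.
```

As long as that child process sleeps instead of exiting, other processes can be scheduled into its namespaces alongside it.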

Option 1) A Pod is a VM

This is a straightforward proposal and can be a viable and powerful path for Aurae to adopt.

Basically we follow suit with OpenShift, gVisor, and Firecracker and establish a virtualization zone (basically a VM) for every pod by default.

Once the VM has been started, we can delegate to the nested auraed to run a container using our own RPC. The containers can share the same namespaces as the (virtual) host, and we can mount volumes between them, communicate over the local network, and so on. We can bake in more logic (such as network devices) in the future.

Implementation would look like:

  1. We finalize our decision on VM software (I think I am leaning towards KVM) and schedule an auraed VM.
  2. We connect to the nested auraed over the network, and schedule a container using a new RPC such as RunContainer().
  3. We persist the VM regardless of workload, and containers become mutable. The user destroys the VM when they are done.
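For concreteness, here is one hypothetical shape such an RPC could take. Nothing here is settled; the service, message, and field names are all placeholders, not an existing Aurae API.

```proto
// Hypothetical sketch only -- service and field names are placeholders,
// not a settled Aurae API.
syntax = "proto3";

service Runtime {
  // Called against the nested auraed running inside the pod VM.
  rpc RunContainer(RunContainerRequest) returns (RunContainerResponse) {}
}

message RunContainerRequest {
  string name = 1;              // container name within the pod
  string image_ref = 2;         // OCI image reference to pull and run
  repeated string command = 3;  // entrypoint override, if any
}

message RunContainerResponse {
  string container_id = 1;
}
```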

Option 2) A pod is a container, and we spritz up our cells

In this option we would need to do two things.

  1. Schedule a pod as a plain ol' container.
  2. Establish the ability to "install" a tarball into the container filesystem using a new feature in the cells service.

This option is attractive because it addresses the package management and supply chain concerns: everything becomes a tarball/OCI image at the end of the day.

Basically we would create a new Youki container with a nested auraed running as the init process. Then we can access the auraed RPC for cells and send an OCI image to the cells service, which would un-tar the image and "install" it as we would with any package manager. This kind of violates the image-immutability and supply chain guarantees that everyone seems to love about containers, so I am not sure this is a good approach. However, it also feels a lot more intuitive to anyone used to systemd and bare-metal machines.

This approach would involve a new RPC on the cells service that allows the user to pass a remote URL for an OCI image/tarball, which the cells service downloads and installs. The cells would be created inside the container, and they could just do what they needed.
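Stripped of the RPC plumbing and the download step, the "install" itself is just an extraction into a per-cell root directory. A minimal stand-in (directory and file names are made up for illustration):

```shell
# Stand-in for the proposed install flow: the image is just a tarball,
# and "installing" it means extracting it into the cell's root directory.
# (Directory and file names here are made up for illustration.)
mkdir -p image/bin cell-root
printf 'hello from the image\n' > image/bin/my-app   # fake payload
tar -C image -cf app.tar .                           # the OCI-ish artifact
tar -C cell-root -xf app.tar                         # the "install" step
cat cell-root/bin/my-app                             # prints: hello from the image
```

The real cells service would do this against a downloaded OCI image layer rather than a locally built tarball, but the mutation of the cell filesystem is the same.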

One thing to figure out would be the need to chroot each cell's filesystem, as otherwise we have no way of preventing two "containers" from sharing the same files/paths/directory structure. The fact that we would need to chroot each cell (which the user would be calling a container) is a red flag.

Option 3) A pod is a container, and your containers are also containers

Basically we create a new auraed container when a user creates a new pod sandbox, and establish new namespaces for that container. When it comes time to schedule a nested container inside the pod sandbox, we call out to the nested auraed and say "RunContainer", but reuse the namespaces from the original pod sandbox.
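A rough sketch of that flow with util-linux tools, hedged heavily: it only joins a user namespace (so it can run unprivileged), and in Aurae the join would presumably happen via setns(2) inside auraed rather than by shelling out to nsenter(1):

```shell
# 1) "Create the pod sandbox": a pause-like process holding a fresh
#    user namespace (other namespaces omitted to keep this unprivileged).
unshare --user sleep 30 &
SANDBOX=$!
sleep 1   # crude: give unshare time to enter the new namespace

# 2) "RunContainer": rather than creating new namespaces, join the
#    sandbox's existing ones via setns(2) (here through nsenter).
nsenter --target "$SANDBOX" --user --preserve-credentials \
  sh -c 'echo "joined the sandbox user namespace"'

kill "$SANDBOX"
```

Every "container" scheduled this way is a sibling process on the host that merely shares the sandbox's namespaces, which is exactly the flat structure described below.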

The output here would be a node with a LOT of containers floating around, all with a "virtual" structure. In other words, we would have a flat list of containers from the host's perspective, and the structure and isolation would only be enforced by how we expose namespaces to containers.

This feels... wrong... I can't explain why. I believe this is how a lot of container runtimes do things today, and it just seems like an anti-pattern given that we could actually build recursive isolation boundaries.

The Decision

I want some help talking through the decision -- we can close the issue once we have come to conviction internally.
