Modernize Vortex Build Process #288

@Udit8348

Overview

This issue was prompted by PR #282, which adds MPI support to simx but does not update the CI pipelines with the corresponding packages. I took some time to address the root cause of this oversight.

We suggest local development through a container tool such as Apptainer or Docker. Local developers reconfigure the container to suit their needs, but it is completely independent of our GitHub Actions CI pipeline. This disconnect makes it easy to forget to update both. Additionally, because GitHub Actions runs the environment on bare metal while our local development environment is containerized, there is no direct 1:1 mapping between changes made in the local dev container and the GitHub Actions CI yml file. Furthermore, there is no streamlined way to test minor changes to the CI pipeline without building and running the whole pipeline, which takes between 2 and 4 hours on the GitHub runners. This can be costly in both money and time**.

Solution

One proposed solution is to provide a single source of truth. This means spinning up the Apptainer container directly in the GitHub Actions pipeline (.ci/workflows/*.yml).

Drawbacks

One drawback of this approach is that we are no longer running the pipeline on bare metal. It also adds complexity to the CI pipeline. Finally, without GitHub Actions caching, the pipeline would take significantly longer.

Strengths

The main benefit of this approach is consistency since we now share the same environment locally and in GitHub Actions. With a single source of truth, setup steps like ./ci/install_dependencies.sh can be baked directly into the container (it just installs apt packages). This simplifies setup and makes the build process easier to debug and reproduce since everyone’s using the same environment.

Next, I set up caching so that the pipeline matches containerless performance.

Finally, it has been tested on [Ubuntu 22.04, Ubuntu 24.04] for [32, 64] XLEN and passes all tests besides VM and OpenCL, which are discussed below.

Full Run
vortex.yml

Results

VM and OpenCL tests fail with the following error: "stack smashing detected" (here).

This run is based on commit 4b252faa5318f84add0885a7d9ba74e6a7a49b24. I suspect this is a source code issue and unrelated to the build pipeline.

I re-ran all tests (except VM and OpenCL) on all configurations, and 43 of 44 tests completed. **I had to time out one test after 4 hours because I am running low on free GitHub Actions runner minutes for this month (1846 / 2000 minutes used over 48 hours of development).

The table below provides some context for the test that timed out:

| Workflow | OS | XLEN | Suite | Duration / Status | Logs |
|---|---|---|---|---|---|
| test | ubuntu-22.04 | 32 | stress | Manually timed out after 4 hours | Logs |
| test | ubuntu-22.04 | 64 | stress | ~16 minutes | Logs |
| test | ubuntu-24.04 | 32 | stress | ~15 minutes | Logs |
| test | ubuntu-24.04 | 64 | stress | ~17 minutes | Logs |

Based on these results, the timeout appears to be a one-off error unrelated to the new CI pipeline configuration.

Minimum Viable Product Exploration

The minimum viable product needs to:

  1. Install the Apptainer binary.
  2. Build the Apptainer def file (.def) into an image (.sif) file.
  3. Execute a demo program inside the container.
  4. Safely use / invalidate the cached .sif between CI runs to save time.
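Steps 2 to 4 above can be sketched as a small planning function. This is only an illustration, not the logic in vortex.yml: the function name plan_mvp_commands and its parameters are hypothetical, and the sketch treats "a .sif file is already present" as a stand-in for "a valid cached image was restored". It only builds the command lists for `apptainer build` and `apptainer exec`; it does not run them.

```python
from pathlib import Path


def plan_mvp_commands(def_file: str, sif_file: str, demo_cmd: list[str]) -> list[list[str]]:
    """Return the Apptainer commands to run (hypothetical helper, not from vortex.yml).

    Assumption: if sif_file already exists, it was restored from a valid cache,
    so the build step can be skipped.
    """
    commands = []
    if not Path(sif_file).exists():
        # No cached image available: build the .sif image from the .def file.
        commands.append(["apptainer", "build", sif_file, def_file])
    # Execute the demo program inside the container.
    commands.append(["apptainer", "exec", sif_file, *demo_cmd])
    return commands
```

On a cold run the plan contains a build followed by an exec; once the image is present, only the exec remains, which is where the cached timings come from.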

I found two MVPs that could accomplish this as summarized below.

| Step | Uncached | Cached | Notes |
|---|---|---|---|
| Install Apptainer binaries from APT sources (yml) | 3 min 10 s | 1 min 40 s | Install from APT sources |
| Install a GitHub Actions plugin (yml) | 1 min 59 s | 34 s | Plugin Link |

I chose to use the plugin. If this plugin fails in the future, we can fall back to the manual install from APT sources.

Full Solution Description

vortex.yml. Still a draft, but fully functional as described above.

There are 4 jobs:

  1. Container job: builds or restores a cached vortex.sif from vortex.def and uploads it as an artifact (artifacts are optimized for sharing between jobs). A hash detects when vortex.def changes and invalidates the cached vortex.sif.

  2. Build job: restores or rebuilds the toolchain and third-party deps from cache, using a hash to detect changes. Mounts the dependencies into the Apptainer container and builds Vortex binaries across a matrix of OS versions (ubuntu-22.04, ubuntu-24.04) and bit-widths (32/64). Saves the build outputs as artifacts for the upcoming test job.

  3. Test job: fetches the corresponding build artifacts, mounts the files into the container, and runs the tests in Apptainer.

  4. Complete job: checks whether all tests succeeded and cleans up.
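The hash-based invalidation used by the container and build jobs can be illustrated with a minimal sketch. It assumes the cache key is simply a prefix plus a SHA-256 digest of the definition file's contents; the function name sif_cache_key and the "vortex-sif" prefix are hypothetical, and the real workflow may derive its key differently (GitHub Actions also provides a hashFiles expression for this purpose).

```python
import hashlib


def sif_cache_key(def_contents: bytes, prefix: str = "vortex-sif") -> str:
    """Derive a cache key from the contents of the .def file (hypothetical helper).

    Any change to the definition file changes the digest, so the old cache
    entry is no longer matched and the image gets rebuilt.
    """
    digest = hashlib.sha256(def_contents).hexdigest()
    return f"{prefix}-{digest}"
```

Because the key depends only on the file contents, two runs with an unchanged vortex.def resolve to the same cache entry, while any edit to the file produces a new key and forces a rebuild.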

If you look at vortex.yml linked above, you can see there are a lot of instructions that deal with updating caches and artifacts between jobs. Why not just use one job? The choice to use multiple jobs rather than a monolithic job is forced. Caching and passing artifacts between jobs adds a small overhead, but each stage increases the "fan out" over the previous one, and GitHub Actions does not allow you to dynamically change the "fan out" in the middle of a job. This means:

There is 1 container job -> 2 x 2 build jobs [OS, XLEN] -> 2 x 2 x n [OS, XLEN, tests] test jobs

It does not make sense to have 2 x 2 x n container jobs when the same container is reused across [OS, XLEN, tests]. GitHub Actions allocates a VM for each of these jobs, and their minutes all accumulate toward your total. For example, five one-minute VMs that execute perfectly in parallel accumulate 5 minutes of usage on your account.

Here is how GitHub Actions visualizes this.

Image
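The fan-out and billing arithmetic described above can be written out as a short sketch. The function names are hypothetical, and the billing model is deliberately simplified to "every job's minutes are summed, regardless of parallelism", as stated in the text.

```python
def job_fan_out(num_os: int, num_xlen: int, num_suites: int) -> dict[str, int]:
    """Count jobs per stage: 1 container -> OS x XLEN builds -> OS x XLEN x n tests."""
    build_jobs = num_os * num_xlen
    test_jobs = build_jobs * num_suites
    return {"container": 1, "build": build_jobs, "test": test_jobs}


def billed_minutes(job_minutes: list[float]) -> float:
    """Sum the minutes of every job; running in parallel does not reduce the total."""
    return sum(job_minutes)


# Two OS versions, two XLEN values, and an illustrative n = 3 test suites.
print(job_fan_out(2, 2, 3))        # {'container': 1, 'build': 4, 'test': 12}
# Five one-minute VMs running in parallel still accumulate 5 minutes.
print(billed_minutes([1] * 5))     # 5
```

Keeping the container build in a single upstream job means its cost is paid once rather than once per test job.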

Next Steps

There are some other options to explore for modernizing the Vortex build / CI process.

  1. Nix / NixOS
  2. Bazel or another advanced build system instead of manual scripts
  3. Some static analysis to check whether the CI env and Apptainer env match

These take more work for possibly minimal reward. I will post here if any progress is made.
