
Conversation

@mart-r (Collaborator) commented Nov 6, 2025

Previously, the workflow often failed its first couple of attempts due to running out of disk space. For instance, this run only succeeded on the 4th attempt:
https://github.com/CogStack/cogstack-nlp/actions/runs/19104323916

The specific failure reads:

build
System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20251105-143106-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20251105-143106-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20251105-143106-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.Tracing.Dispose(Boolean disposing)
   at GitHub.Runner.Common.Tracing.Dispose()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)

This generally happened in the "Build and push Docker Jupyter singleuser image with GPU support" step.

This PR attempts to rectify that by:

  • Cleaning up Docker before the GPU build step
    • This should remove a bunch of the files Docker wrote to disk, avoiding the lack of disk space in the next step
    • The entire (successful) workflow job previously took less than 4 minutes
      • So the small slowdown due to pruning the cache is unlikely to be significant
  • Cleaning up the runner state before the workflow starts (see the sketch after this list)
    • To remove a bunch of unnecessary pre-installed packages / tools
    • Seems to give around 16GB of extra space on disk
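
For context, a minimal sketch of what such a runner cleanup step might look like, assuming the usual large pre-installed toolchains on hosted ubuntu runners are the ones removed; the step name and exact paths are illustrative and not necessarily what this PR uses:

- name: Free up runner disk space
  run: |
    # Illustrative only: remove large toolchains pre-installed on the
    # hosted runner image that this workflow does not need.
    sudo rm -rf /usr/share/dotnet           # .NET SDKs
    sudo rm -rf /usr/local/lib/android      # Android SDK / NDK
    sudo rm -rf /opt/ghc                    # Haskell toolchain
    sudo rm -rf /usr/local/share/boost      # Boost libraries
    sudo rm -rf "$AGENT_TOOLSDIRECTORY"     # hosted tool cache
    df -h /                                 # report the space reclaimed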

@tomolopolis (Member)

@alhendrickson (Collaborator) left a comment

Looks fine so I've approved it - though it generally seems off that this has now started to fail when it didn't previously. Ideally we wouldn't have to mess around with the runners.

One guess for why it failed is that we copy the whole medcat-v2 folder, which might have grown.

A second guess is that, in general, the GPU docker image is massive and probably growing in size as libs get updated...

As a wildcard option we could split this into two jobs - one for the CPU image and one for the GPU image - and could probably parameterise them. They would then go on two different runners, which I'd hope fixes any issue with Docker layers being kept around.

- name: Clean Docker to free up space
  # NOTE: otherwise the runner tends to run out of disk space roughly 75% of the time
  run: docker system prune -af
Would want to see a before/after here as well if possible - or remove it altogether if the other step fixes the issue

@mart-r (Collaborator, Author) commented Nov 6, 2025

Just as an FYI - the current setup does seem to run without an issue; I reran it 3 times. Yet it didn't when I had only added the docker system prune -af command.

But I will check the disk space before and after the prune to see how much it actually frees. You're right - we don't know whether it's the later addition or the combination that we need.
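
One simple way to capture that before/after in the job log, sketched here with illustrative step names wrapped around the existing prune step:

- name: Disk space before prune
  run: df -h /
- name: Clean Docker to free up space
  run: docker system prune -af
- name: Disk space after prune
  run: df -h /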

@alhendrickson (Collaborator) left a comment

Approved - clearing 15G is pretty nice, way more than I thought it would be

One task for a later date would be to see how to speed this up; adding 2 minutes here is quite a lot.

@mart-r (Collaborator, Author) commented Nov 6, 2025

> Looks fine so I've approved it - though it generally seems off that this has now started to fail when it didn't previously. Ideally we wouldn't have to mess around with the runners.

I do recall having to restart this earlier as well, though I'm pretty sure it was a lot less frequent. And I think it was something to do with failing network calls instead.

> One guess for why it failed is that we copy the whole medcat-v2 folder, which might have grown.

It's unlikely to be an issue of folder size. It's 77MB, though we could avoid copying over the 75MB (i.e. the vast majority) taken up by the tests folder. But I don't think the size of the folder is the issue here.

> A second guess is that, in general, the GPU docker image is massive and probably growing in size as libs get updated...

This sounds more likely. The dependencies are locked (with uv.lock) at CI time only for certain parts of the workflow. I'm pretty sure the docker images are still built without anything being locked, so new dependency versions with new features (or just a bigger footprint) sound like the likely culprit.

> As a wildcard option we could split this into two jobs - one for the CPU image and one for the GPU image - and could probably parameterise them. They would then go on two different runners, which I'd hope fixes any issue with Docker layers being kept around.

I did think about that as an option as well. I don't think I had a specific reason to go the route I chose (i.e. cleaning up the things that take up space), though in hindsight this means there are fewer runners running (potentially in parallel).
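
For reference, a rough sketch of what that split might look like, assuming a build matrix over the two image variants; the job name, action versions, tags, and build args below are hypothetical, not the actual workflow:

jobs:
  build-images:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        variant: [cpu, gpu]   # one runner per image variant
    steps:
      - uses: actions/checkout@v4
      - name: Build and push Jupyter singleuser image (${{ matrix.variant }})
        uses: docker/build-push-action@v6
        with:
          push: true
          tags: example/jupyter-singleuser:${{ matrix.variant }}
          build-args: |
            VARIANT=${{ matrix.variant }}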

@mart-r (Collaborator, Author) commented Nov 6, 2025

> One task for a later date would be to see how to speed this up; adding 2 minutes here is quite a lot.

I don't think this added much time to the run. It still seems to be less than 4 minutes for the build job, just as before:
Before:
https://github.com/CogStack/cogstack-nlp/actions/runs/19104323916
Now:
https://github.com/CogStack/cogstack-nlp/actions/runs/19136113881

The overall time has gone up from 5m5s to 5m20s, so 15 seconds. That could be related to the removal of files (removing 15GB in 15 seconds is quite good overall, I would say).

@mart-r mart-r changed the title build(medcat-service): CU-869b2zjay Clean docker to free up space before GPU build build(medcat-service): CU-869b2zjay Clean runner to free up space for docker builds Nov 6, 2025
@mart-r mart-r merged commit 4499501 into main Nov 6, 2025
11 checks passed
@mart-r mart-r deleted the build/medcat-service/CU-869b2zjay-avoid-disk-issues-during-workflow branch November 6, 2025 13:19