
Conversation

@mart-r (Collaborator) commented Nov 6, 2025

Previously, the workflow often failed its first couple of attempts due to running out of disk space. For instance, this run only succeeded on the 4th attempt:
https://github.com/CogStack/cogstack-nlp/actions/runs/19104323916

The specific failure reads:

build
System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20251105-143106-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20251105-143106-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20251105-143106-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.Tracing.Dispose(Boolean disposing)
   at GitHub.Runner.Common.Tracing.Dispose()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)

This generally happened in the "Build and push Docker Jupyter singleuser image with GPU support" step.

This PR attempts to rectify that by:

  • Cleaning up Docker before the GPU build step
    • This should remove a bunch of the files Docker wrote to disk, avoiding the lack of disk space in the next step
    • The entire (successful) workflow job previously took less than 4 minutes
      • So the small slowdown due to pruning the cache is unlikely to be significant
  • Cleaning up the runner state before the workflow starts (see the sketch after this list)
    • To remove a bunch of unnecessary pre-installed packages / tools
    • Seems to give around 16GB of extra space on disk
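
For context, a minimal sketch of what such a runner cleanup step might look like, assuming the usual large pre-installed toolchains on hosted ubuntu runners are the ones removed; the step name and exact paths are illustrative and not necessarily what this PR uses:

- name: Free up runner disk space
  run: |
    # Illustrative only: remove large toolchains pre-installed on the
    # hosted runner image that this workflow does not need.
    sudo rm -rf /usr/share/dotnet           # .NET SDKs
    sudo rm -rf /usr/local/lib/android      # Android SDK / NDK
    sudo rm -rf /opt/ghc                    # Haskell toolchain
    sudo rm -rf /usr/local/share/boost      # Boost libraries
    sudo rm -rf "$AGENT_TOOLSDIRECTORY"     # hosted tool cache
    df -h /                                 # report the space reclaimed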

@tomolopolis (Member)

@alhendrickson (Collaborator) left a comment

Looks fine so I've approved it - though it generally seems off that this has now started to fail when it didn't previously. Ideally we wouldn't have to mess around with the runners.

One guess for why it failed is that we copy the whole medcat-v2 folder, which might have grown.

A second guess is that, in general, the GPU docker image is massive and probably growing in size as libs get updated...

As a wildcard option we could split this into two jobs - one for the CPU image and one for the GPU image - and could probably parameterise them. They would then go on two different runners, which I'd hope fixes any issue with Docker layers being kept around.

- name: Clean Docker to free up space
  # NOTE: otherwise the runner tends to run out of disk space roughly 75% of the time
  run: docker system prune -af
Would want to see a before/after here as well if possible - or remove it altogether if the other step fixes the issue

@mart-r (Collaborator, Author) commented Nov 6, 2025

Just as an FYI - the current setup does seem to run without an issue; I reran it 3 times. Yet it didn't when I had only added the docker system prune -af command.

But I will check the disk space before and after the prune to see how much it actually frees. You're right - we don't know whether it's the later addition or the combination that we need.
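
One simple way to capture that before/after in the job log, sketched here with illustrative step names wrapped around the existing prune step:

- name: Disk space before prune
  run: df -h /
- name: Clean Docker to free up space
  run: docker system prune -af
- name: Disk space after prune
  run: df -h /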

@alhendrickson (Collaborator) left a comment

Approved - clearing 15G is pretty nice, way more than I thought it would be

One task for a later date would be to see how to speed this up; adding 2 minutes here is quite a lot.

@mart-r (Collaborator, Author) commented Nov 6, 2025

> Looks fine so I've approved it - though it generally seems off that this has now started to fail when it didn't previously. Ideally we wouldn't have to mess around with the runners.

I do recall having to restart this earlier as well, though I'm pretty sure it was a lot less frequent. And I think it was something to do with failing network calls instead.

> One guess for why it failed is that we copy the whole medcat-v2 folder, which might have grown.

It's unlikely to be an issue of folder size. It's 77MB, though we could avoid copying over the 75MB (i.e. the vast majority) taken up by the tests folder. But I don't think the size of the folder is the issue here.

> A second guess is that, in general, the GPU docker image is massive and probably growing in size as libs get updated...

This sounds more likely. The dependencies are locked (with uv.lock) at CI time only for certain parts of the workflow. I'm pretty sure the docker images are still built without anything being locked, so new dependency versions with new features (or just a bigger footprint) sound like the likely culprit.

> As a wildcard option we could split this into two jobs - one for the CPU image and one for the GPU image - and could probably parameterise them. They would then go on two different runners, which I'd hope fixes any issue with Docker layers being kept around.

I did think about that as an option as well. I don't think I had a specific reason to go the route I chose (i.e. cleaning up the things that take up space), though in hindsight this means there are fewer runners running (potentially in parallel).
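
For reference, a rough sketch of what that split might look like, assuming a build matrix over the two image variants; the job name, action versions, tags, and build args below are hypothetical, not the actual workflow:

jobs:
  build-images:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        variant: [cpu, gpu]   # one runner per image variant
    steps:
      - uses: actions/checkout@v4
      - name: Build and push Jupyter singleuser image (${{ matrix.variant }})
        uses: docker/build-push-action@v6
        with:
          push: true
          tags: example/jupyter-singleuser:${{ matrix.variant }}
          build-args: |
            VARIANT=${{ matrix.variant }}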

@mart-r (Collaborator, Author) commented Nov 6, 2025

> One task for a later date would be to see how to speed this up; adding 2 minutes here is quite a lot.

I don't think this added much time to the run. It still seems to be less than 4 minutes for the build job, just as before:
Before:
https://github.com/CogStack/cogstack-nlp/actions/runs/19104323916
Now:
https://github.com/CogStack/cogstack-nlp/actions/runs/19136113881

The overall time has gone up from 5m5s to 5m20s, so 15 seconds. That could be related to the removal of files (removing 15GB in 15 seconds is quite good overall, I would say).

@mart-r mart-r changed the title build(medcat-service): CU-869b2zjay Clean docker to free up space before GPU build build(medcat-service): CU-869b2zjay Clean runner to free up space for docker builds Nov 6, 2025
@mart-r mart-r merged commit 4499501 into main Nov 6, 2025
11 checks passed
@mart-r mart-r deleted the build/medcat-service/CU-869b2zjay-avoid-disk-issues-during-workflow branch November 6, 2025 13:19