Commit bf76bab

d-netto authored and KristofferC committed
[Linux] Prevent GC from running during process teardown (#57832)
## Context

We send signal 15 (SIGTERM) to shut down our servers. We noticed that some of the servers that receive the termination signal segfault in GC, which leads to false alarms in our internal monitors that track GC-related crashes.

## Hypothesis

We suspect this pathological case may be happening:

- The process receives signal 15, which is captured by the signal listener thread.
- The signal listener initiates process teardown (e.g. through `raise`).
- IIRC this operation is not atomic on Linux, i.e. the kernel kills the threads gradually, so it's possible for us to spend a few ms in a state where some of the threads in the system are alive and some have already been killed (this point needs some confirmation).
- With part of the process alive and part of it dead, we try to enter a GC, see a bunch of Julia data structures in an intermediate/corrupted state, and crash while running the GC.

## Mitigation

Since our main goal is to get rid of the GC crashes that happen around server shutdown, we believe it is sufficient to prevent just the last bullet point. I.e. we prevent the system from even starting a GC when we're about to kill the process, and we wait for any ongoing GC to finish.

Co-debugged with @kpamnany.

(cherry picked from commit e1e3a46)
1 parent 15147da commit bf76bab

1 file changed: +5 -0 lines changed
src/signals-unix.c

@@ -559,6 +559,7 @@ static int thread0_exit_signo = 0;
 static void JL_NORETURN jl_exit_thread0_cb(void)
 {
 CFI_NORETURN
+    jl_atomic_fetch_add(&jl_gc_disable_counter, -1);
     jl_critical_error(thread0_exit_signo, 0, NULL, jl_current_task);
     jl_atexit_hook(128);
     jl_raise(thread0_exit_signo);
@@ -1089,6 +1090,10 @@ static void *signal_listener(void *arg)
 //#if defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 199309L && !HAVE_KEVENT
 //                si_code = info.si_code;
 //#endif
+                // Let's forbid threads from running GC while we're trying to exit,
+                // also let's make sure we're not in the middle of GC.
+                jl_atomic_fetch_add(&jl_gc_disable_counter, 1);
+                jl_safepoint_wait_gc(NULL);
                 jl_exit_thread0(sig, signal_bt_data, signal_bt_size);
             }
             else if (critical) {
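
The pattern in the two hunks above (bump a disable counter, wait for any in-flight collection, then drop the counter again in the exit callback) can be sketched outside the runtime as follows. This is a minimal illustration under stated assumptions, not Julia's implementation: `gc_disable_counter`, `gc_running`, `try_begin_gc`, `end_gc`, and `prepare_for_exit` are hypothetical names, and a mutex plus condition variable stands in for whatever `jl_safepoint_wait_gc` does internally.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for runtime state; not Julia's real definitions. */
static atomic_int gc_disable_counter = 0;  /* > 0 means "do not start a GC" */
static bool gc_running = false;            /* protected by gc_lock */
static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gc_done = PTHREAD_COND_INITIALIZER;

/* Try to start a collection. Returns false if GC is currently disabled
 * or another collection is already in progress. */
static bool try_begin_gc(void)
{
    bool started = false;
    pthread_mutex_lock(&gc_lock);
    if (atomic_load(&gc_disable_counter) == 0 && !gc_running) {
        gc_running = true;
        started = true;
    }
    pthread_mutex_unlock(&gc_lock);
    return started;
}

/* Mark the current collection as finished and wake any waiters. */
static void end_gc(void)
{
    pthread_mutex_lock(&gc_lock);
    gc_running = false;
    pthread_cond_broadcast(&gc_done);
    pthread_mutex_unlock(&gc_lock);
}

/* Shutdown path: first forbid new collections, then wait for an
 * in-flight collection (if any) to finish before tearing down. */
static void prepare_for_exit(void)
{
    atomic_fetch_add(&gc_disable_counter, 1);
    pthread_mutex_lock(&gc_lock);
    while (gc_running)
        pthread_cond_wait(&gc_done, &gc_lock);
    pthread_mutex_unlock(&gc_lock);
}
```

The order matters in this sketch: the counter is incremented before waiting, so a collection either started before the increment (and is waited for) or is refused afterwards. Decrementing the counter later, as the first hunk does in `jl_exit_thread0_cb`, makes collections possible again for the code that runs after that point.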
