Skip to content

Fix the damn deadlock #7553

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 27, 2020
Merged

Fix the damn deadlock #7553

merged 6 commits into from
Dec 27, 2020

Conversation

andrewrk
Copy link
Member

@andrewrk andrewrk commented Dec 26, 2020

This branch adds debug code for detecting any deadlock in the Cache system. The fix is not in this branch yet but hopefully this new debug code will lead us to it.

I confirmed that when removing the code protecting detection of the same C source files:

--- a/src/Compilation.zig
+++ b/src/Compilation.zig
@@ -1865,25 +1865,6 @@ fn updateCObject(comp: *Compilation, c_object: *CObject, c_comp_progress_node: *
         }
     }
 
-    {
-        const is_collision = blk: {
-            const bin_digest = man.hash.peekBin();
-
-            const lock = comp.mutex.acquire();
-            defer lock.release();
-
-            const gop = try comp.c_object_cache_digest_set.getOrPut(comp.gpa, bin_digest);
-            break :blk gop.found_existing;
-        };
-        if (is_collision) {
-            return comp.failCObj(
-                c_object,
-                "the same source file was already added to the same compilation with the same flags",
-                .{},
-            );
-        }
-    }
-
     var arena_allocator = std.heap.ArenaAllocator.init(comp.gpa);
     defer arena_allocator.deinit();
     const arena = &arena_allocator.allocator;

The detection code is indeed triggered:

[nix-shell:~/Downloads/zig/build]$ ./zig cc simple_race.c simple_race.c 
Cache deadlock detected in Cache.hit. Manifest has 1 files:
  file: /home/andy/Downloads/zig/build/simple_race.c
Cache deadlock detected
/home/andy/Downloads/zig/src/Cache.zig:277:17: 0xb54a1a in Cache.Manifest.hit (zig1)
                @panic("Cache deadlock detected");
                ^
/home/andy/Downloads/zig/src/Compilation.zig:1890:63: 0xd95574 in Compilation.updateCObject (zig1)
    const digest = if (!comp.disable_c_depfile and try man.hit()) man.final() else blk: {
                                                              ^
/home/andy/Downloads/zig/src/Compilation.zig:1805:23: 0xd94c48 in Compilation.workerUpdateCObject (zig1)
    comp.updateCObject(c_object, progress_node) catch |err| switch (err) {
                      ^
/home/andy/Downloads/zig/src/ThreadPool.zig:117:28: 0xd97cc8 in ThreadPool.Closure.runFn (zig1)
            const result = @call(.{}, func, closure.arguments);
                           ^
/home/andy/Downloads/zig/src/ThreadPool.zig:35:38: 0xbd2452 in ThreadPool.Worker.run (zig1)
                (run_node.data.runFn)(&run_node.data);
                                     ^
/home/andy/Downloads/zig/lib/std/thread.zig:268:32: 0xbd2847 in std.thread.MainFuncs.posixThreadMain (zig1)
                        startFn(arg);
                               ^
???:?:?: 0x7f66c692deac in ??? (???)
Aborted (core dumped)

@andrewrk
Copy link
Member Author

Managed to repro this. Here is a backtrace of all threads:

https://gist.github.com/andrewrk/ee0fc81ebd4d05912a633f0108ca68cb

2 of the threads are waiting on std.child_process.ChildProcess.spawnAndWait.

We were violating the POSIX standard which resulted in a deadlock on
musl v1.1.24 on aarch64 alpine linux, uncovered with the new ThreadPool
usage in the stage2 compiler.

std.os execv functions that accept an Allocator parameter are removed
because they are footguns. The POSIX standard does not allow calls to
malloc() between fork() and execv() and since it is common to both
(1) call execv() after fork() and (2) use std.heap.c_allocator,
Programmers are encouraged to go through the `std.process` API
instead, causing some dissonance when combined with `std.os` APIs.

I also slapped a big warning message on all the relevant doc comments.
Surely this time all the problems have been fixed
@andrewrk andrewrk marked this pull request as ready for review December 26, 2020 20:56
// This means we must do the null-termination of argv and env vars here.
const argv_buf = try arena.alloc(?[*:0]u8, self.argv.len + 1);
for (self.argv) |arg, i| {
const arg_buf = try arena.alloc(u8, arg.len + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use allocSentinel here

Suggested change
const arg_buf = try arena.alloc(u8, arg.len + 1);
const arg_buf = try arena.allocSentinel(u8, arg.len, 0);

// can fail between fork() and execve().
// Therefore, we do all the allocation for the execve() before the fork().
// This means we must do the null-termination of argv and env vars here.
const argv_buf = try arena.alloc(?[*:0]u8, self.argv.len + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const argv_buf = try arena.alloc(?[*:0]u8, self.argv.len + 1);
const argv_buf = try arena.allocSentinel(?[*:0]u8, self.argv.len, null);


pub fn createNullDelimitedEnvMap(arena: *mem.Allocator, env_map: *const std.BufMap) ![:null]?[*:0]u8 {
const envp_count = env_map.count();
const envp_buf = try arena.alloc(?[*:0]u8, envp_count + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const envp_buf = try arena.alloc(?[*:0]u8, envp_count + 1);
const envp_buf = try arena.allocSentinel(?[*:0]u8, envp_count, null);

src/Cache.zig Outdated
Comment on lines 16 to 17
/// The plan is to enable this for debug builds only, but for now we enable
/// it always to catch a deadlock.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now enabled only for debug builds as of 2b97b75

Copy link
Member

@ifreund ifreund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. There's a good bit of cleanup that can be achieved using std.mem.Allocator.alloSentinel() but the code was already working before it got moved around so whether or not to do that is up to you.

@andrewrk andrewrk force-pushed the fix-the-damn-deadlock branch from 2b97b75 to cb290ed Compare December 27, 2020 00:33
@andrewrk andrewrk merged commit e589422 into master Dec 27, 2020
@andrewrk andrewrk deleted the fix-the-damn-deadlock branch December 27, 2020 00:33
@ifreund
Copy link
Member

ifreund commented Dec 27, 2020

Did the allocSentinel() cleanup and added a test here: #7563

break :m @ptrCast([*:null]?[*:0]u8, os.environ.ptr);
} else {
// TODO come up with a solution for this.
@compileError("missing std lib enhancement: ChildProcess implementation has no way to collect the environment variables to forward to the child process");
Copy link

@jsirois jsirois Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the upshot of not allowing Linux libraries not linking libc to use std.process.Child even when setting an env_map on the Child (this has since moved, but the logic is the same). At first I thought this was a compiler bug since not all arms of the condition are comptime known:

const std = @import("std")

fn IterThingOnce(Thing: anytype) type {
    return struct {
        yielded: bool = false,
        optional_thing: ?Thing,

        pub fn next(self: *@This()) ?Thing {
            if (self.optional_thing) |thing| {
                if (self.yielded) {
                    return null;
                } else {
                    defer self.yielded = true;
                    return thing;
                }
            } else if (false) {
                @compileError("This can't happen since `false` is a comptime known literal.");
            } else {
                // Compiles and test passes by exchanging commented out line below.
                @compileError("This seems like it should not happen.");
                // return null;
            }
        }
    };
}

test "Compiler Bug?" {
    var example: IterThingOnce([]const u8) = .{.optional_thing = "Slatibartfast"};
    try std.testing.expectEqual("Slatibartfast", example.next().?);
    try std.testing.expect(example.next() == null);
}
... error: This seems like it should not happen.
                @compileError("This seems like it should not happen.");
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

But, on reflection, that's the point of the else arm triggering a compile error - that branch can be conditionally hit at runtime.

I guess what I'm wondering is if that was the intention; i.e.: Not allow the Linux library with no libc case to get a runtime error instead iff they do not set up an env_map. I'd prefer the runtime error, since I know what I'm doing and set the env_map up (I actually go further and setup std.os.environ with my own ~start code in the library entry point).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please file an issue about this; it's very likely to get lost on a 4 years old PR. It seems like a case we should fix.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha - ok, excellent. I'll do so.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nudge @alexrp: #23372

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants