asan fail: ((ptr[0] == kCurrentStackFrameMagic)) != (0) #7556

@chu11

This issue is mostly documentation, so that we'll have notes in the future if we choose to investigate this again.

When running the sharness tests under asan on RHEL 8, I got this error log several times.

=================================================================
==659311==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_thread.cpp:367 "((ptr[0] == kCurrentStackFrameMagic)) != (0)" (0x0, 0x0)
    #0 0x1555549f8ac8 in __asan::AsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) (/usr/lib64/libasan.so.6+0xbdac8)
    #1 0x155554a193ee in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) (/usr/lib64/libasan.so.6+0xde3ee)
    #2 0x1555549fe24c in __asan::AsanThread::GetStackFrameAccessByAddr(unsigned long, __asan::AsanThread::StackFrameAccess*) (/usr/lib64/libasan.so.6+0xc324c)
    #3 0x155554968fdb in __asan::GetStackAddressInformation(unsigned long, unsigned long, __asan::StackAddressDescription*) (/usr/lib64/libasan.so.6+0x2dfdb)
    #4 0x15555496a418 in __asan::AddressDescription::AddressDescription(unsigned long, unsigned long, bool) (/usr/lib64/libasan.so.6+0x2f418)
    #5 0x15555496cbc4 in __asan::ErrorGeneric::ErrorGeneric(unsigned int, unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long) (/usr/lib64/libasan.so.6+0x31bc4)
    #6 0x1555549f80e5 in __asan::ReportGenericError(unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long, unsigned int, bool) (/usr/lib64/libasan.so.6+0xbd0e5)
    #7 0x15555498f340 in __interceptor_uname.part.0 (/usr/lib64/libasan.so.6+0x54340)
    #8 0x15555479b207 in ev_linux_version /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libev/ev.c:1954
    #9 0x15555479cbd7 in ev_supported_backends /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libev/ev.c:3139
    #10 0x15555479cbf4 in ev_recommended_backends /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libev/ev.c:3150
    #11 0x15555479eb17 in loop_init /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libev/ev.c:3321
    #12 0x15555479f527 in ev_loop_new /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libev/ev.c:3557
    #13 0x1555547b3524 in flux_reactor_create /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libflux/reactor.c:69
    #14 0x1555547d4280 in now_context_create /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libflux/future.c:107
    #15 0x1555547d4280 in flux_future_wait_for /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libflux/future.c:483
    #16 0x1555547ccb17 in module_set_finalizing /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libflux/module.c:68
    #17 0x1555547ccb17 in flux_module_finalize /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/libflux/module.c:377
    #18 0x41ab0c in module_thread_cleanup /g/g0/achu/chaos/git/flux-framework/flux-core/src/broker/module_thread.c:171
    #19 0x41b522 in module_thread /g/g0/achu/chaos/git/flux-framework/flux-core/src/broker/module_thread.c:95
    #20 0x1555545111c9 in start_thread (/lib64/libpthread.so.0+0x81c9)
    #21 0x1555531ca952 in clone (/lib64/libc.so.6+0x39952)

The error showed up in 4 test scripts. Each is listed below with the test case that produced the asan log:

t2317-resource-shrink.t

test_expect_success 'disconnect rank 3' '
	flux overlay disconnect 3 &&
	waitup 3
'

t3305-system-rpctrack-up.t

test_expect_success 'disconnect host fake6 (rank 6)' '
	flux overlay disconnect fake6
'

t3306-system-routercrash.t

test_expect_success 'restart broker 1' '
	$startctl run 1
'

t3310-system-heartbeat.t

test_expect_success 'stop heartbeat' '
	flux module remove heartbeat
'

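So all tests producing this error log involve something going down or a module being removed. This matches the stack trace, which goes down a module cleanup path (module_thread_cleanup(), flux_module_finalize(), etc.). From there it goes through libev and calls uname(). libasan's uname interceptor starts to report an error (ReportGenericError()), and the CHECK fails while libasan is trying to describe the stack address involved.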

I tried to debug this with a number of asan options to get more info, and also ran valgrind on some of the tests above to see if anything fishy turned up.

Then by chance, the asan errors went away when I tried these asan options:

export ASAN_OPTIONS="detect_leaks=0:detect_stack_use_after_return=1:thread_local_quarantine_size_kb=512:quarantine_size_mb=64"

After a bit of binary searching over those options, it turned out that detect_stack_use_after_return=1 is the one that makes the asan error go away.
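
For future reference, here is a minimal sketch of one way to isolate the option. It tries each option on its own instead of doing a strict bisection. The helper name try_options is hypothetical (not part of flux-core), and it assumes the asan report ends up on the test command's stdout or stderr. Since the failure only showed up some of the time, a single "clean" run is a hint, not proof.

```shell
#!/bin/sh
# Options taken from the ASAN_OPTIONS string in this issue, one per word.
opts="detect_leaks=0 detect_stack_use_after_return=1 thread_local_quarantine_size_kb=512 quarantine_size_mb=64"

# try_options CMD [ARGS...]
# Hypothetical helper: runs CMD once per option with ASAN_OPTIONS set to
# only that option, and reports whether the asan CHECK failure appeared
# in the combined stdout/stderr of the command.
try_options() {
    for opt in $opts; do
        # grep without -q reads the whole output, so CMD is not cut short
        # by a closed pipe before it finishes.
        if ASAN_OPTIONS="$opt" "$@" 2>&1 |
            grep "AddressSanitizer CHECK failed" >/dev/null; then
            echo "$opt: CHECK failed"
        else
            echo "$opt: clean"
        fi
    done
}
```

Usage would look something like `try_options ./t3310-system-heartbeat.t` from the test directory of an asan build (the working directory and invocation are assumptions here). Running it a few times per option is probably wise given how intermittent the error was.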

That single flag making the errors go away makes me doubt that this is a "real bug". More likely there's something quirky going on between asan and the thread cleanup path.

So just documenting this so we won't be surprised if this comes up again in the future.
