[minicern] Move address computation to 64 bit.#22622
Conversation
|
Thanks for this! Consider also removing fortran=OFF from mac14 ci file. |
dpiparo
left a comment
There was a problem hiding this comment.
Thanks for this great fix.
I would suggest, if you agree, to backport it to all branches, even the ones in Maintenance mode, because the failures fixed here have been hitting ROOT's CI on all branches
Test Results 21 files 21 suites 3d 8h 17m 31s ⏱️ For more details on these failures, see this check. Results for commit 32d5b7b. ♻️ This comment has been updated with latest results. |
kernlib was using the LOCF function to compute distances between common blocks. It shifts the numbers, but in 32 bits. On 64 bit architectures, this can lead to unreasonable values, and when taking the difference, it's either correct or wraps around to very large numbers. Due to aslr, the common blocks move around, which can crash the program at init time when the blocks are on different sides of the "wrap around" addresss. Here, the LOCF function is changed to use 64 bit integers for the address, which makes the distance computation independent of where the kernel places the common blocks. Fix root-project#19329
Minicern was switched to -O0 in the hope to reduce crashes. This workaround seems unnecessary now.
|
The arm failure is real, so I'm on that node building for arm now. |
If you turn on the warnings, there are some 'concerning ones' about arrays having inconsistent size 5 vs 9, etc, as well as some never-initialized variables such as IDOL. I tried to do some fixing in #22623 and #20538 With those hacks, all warnings finally disappear, but that just makes things worse on all platforms xD (or maybe more reproducible?) probably not dealing well with global variables or passing things by reference (not a Fortran expert). |
The situation is not great, I agree. The reason many things are all over the place is that they copied the code that initialises the common blocks, but they removed quite some code that actually uses them. So there is dead code and dead memory in these programs, but this is without consequences. |
kernlib.fwas using the LOCF function to compute distances between common blocks. It shifts the numbers, but in 32 bits. On 64 bit architectures, this can lead to unreasonable values, and when taking the difference, it's either correct or wraps around to very large numbers. Due to aslr, the common blocks move around, which can crash the program at init time when the blocks are on different sides of the "wrap around" addresss.Here, the LOCF function is changed to use 64 bit integers for the address, which makes the distance computation independent of where the kernel places the common blocks.
Locally, I could usually observe the crash within a minute of running the Hbook tests in a loop. After this change, I didn't get it to crash even after 10 000 tries.
Furthermore, addressing the reports in the linked issue:
-O0workaround from minicern. Locally, the tests are stable.-O2 -march=nativethe same: Stable in 10 000 runs.Let's see what the full CI run has to say, then.
Note: The crash didn't happen in LOCF, but in the setup routines of zebra.
Fixes #19329