
[minicern] Move address computation to 64 bit. #22622

Open
hageboeck wants to merge 6 commits into root-project:master from hageboeck:THbookCrash

Conversation

@hageboeck hageboeck commented Jun 16, 2026

Member

kernlib.f was using the LOCF function to compute distances between common blocks. LOCF shifts the addresses, but in 32 bits. On 64-bit architectures, this can produce unreasonable values: when taking the difference, the result is either correct or wraps around to a very large number. Due to ASLR, the common blocks move around, which can crash the program at init time when the blocks end up on different sides of the "wrap-around" address.

Here, the LOCF function is changed to use 64 bit integers for the address, which makes the distance computation independent of where the kernel places the common blocks.
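The wrap-around can be illustrated with a small model. This is only a sketch, not the code in kernlib.f: it assumes that a byte address becomes a word address through a right shift by 2 bits (4-byte words) and that the 32-bit variant truncates the address before shifting. The helper names and the example addresses are hypothetical.

```python
# Illustrative model only: the real LOCF implementation is not shown in this
# thread. Assumption: a byte address is turned into a 4-byte-word address by
# a right shift of 2 bits.
MASK32 = 0xFFFFFFFF
WORD_SHIFT = 2


def to_int32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def locf32(address):
    """Word address when the byte address is first truncated to 32 bits."""
    return to_int32((address & MASK32) >> WORD_SHIFT)


def locf64(address):
    """Word address when the full 64-bit byte address is kept."""
    return address >> WORD_SHIFT


def distance32(first, second):
    """Distance in words between two addresses, using 32-bit arithmetic."""
    return to_int32(locf32(second) - locf32(first))


def distance64(first, second):
    """Distance in words between two addresses, using 64-bit arithmetic."""
    return locf64(second) - locf64(first)


# Two hypothetical pairs of blocks, each 512 bytes (128 words) apart.
SAME_SIDE = (0x555500001000, 0x555500001200)
# This pair sits on either side of a multiple of 2**32.
STRADDLING = (0x5555FFFFFF00, 0x555600000100)
```

In this model, `distance64` gives 128 words for both pairs. `distance32` gives 128 for `SAME_SIDE` but -1073741696 for `STRADDLING`: the shifted values only span 30 bits, so the subtraction cannot wrap back to the true distance.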

Locally, I could usually observe the crash within a minute of running the Hbook tests in a loop. After this change, I didn't get it to crash even after 10 000 tries.

Furthermore, addressing the reports in the linked issue:

  • I managed to remove the -O0 workaround from minicern. Locally, the tests are stable.
  • With -O2 -march=native the same: Stable in 10 000 runs.
  • And I managed to run with address sanitizer. In 1000 runs, tests remain stable.
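Stability checks like these amount to rerunning a test command many times and counting the failures. The following is a minimal sketch of such a loop, not the script used for this PR; the function name `count_failures` and the example command are hypothetical.

```python
import subprocess
import sys


def count_failures(command, runs):
    """Run command `runs` times and count the runs with a non-zero exit status.

    A process killed by a signal (for example a segmentation fault) is
    reported by subprocess with a negative return code, so it counts too.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__":
    # Hypothetical invocation: replace the command with the real test.
    test_command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print(count_failures(test_command, runs=10), "failing runs")
```

For an intermittent crash that depends on where ASLR places the blocks, the number of runs matters more than the duration of a single run, which is why the counts above are given in thousands of tries.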

Let's see what the full CI run has to say, then.

Note: The crash didn't happen in LOCF, but in the setup routines of zebra.

Fixes #19329

@hageboeck hageboeck self-assigned this Jun 16, 2026
@hageboeck hageboeck requested a review from dpiparo as a code owner June 16, 2026 07:50
@hageboeck hageboeck requested a review from bellenot as a code owner June 16, 2026 10:08
@ferdymercury

Collaborator

Thanks for this! Consider also removing fortran=OFF from mac14 ci file.

@hageboeck hageboeck linked an issue Jun 16, 2026 that may be closed by this pull request
1 task
Comment thread misc/minicern/src/kernlib.f
ferdymercury added a commit to ferdymercury/root that referenced this pull request Jun 16, 2026

@dpiparo dpiparo left a comment

Member

Thanks for this great fix.
I would suggest, if you agree, backporting it to all branches, even the ones in maintenance mode, because the failures fixed here have been hitting ROOT's CI on all branches.

@github-actions

github-actions Bot commented Jun 16, 2026


Test Results

    21 files      21 suites   3d 8h 17m 31s ⏱️
 3 869 tests  3 856 ✅ 0 💤 13 ❌
72 703 runs  72 689 ✅ 1 💤 13 ❌

For more details on these failures, see this check.

Results for commit 32d5b7b.

♻️ This comment has been updated with latest results.

Comment thread misc/minicern/CMakeLists.txt Outdated
kernlib was using the LOCF function to compute distances between common
blocks. LOCF shifts the addresses, but in 32 bits. On 64-bit
architectures, this can produce unreasonable values: when taking the
difference, the result is either correct or wraps around to a very large
number. Due to ASLR, the common blocks move around, which can crash the
program at init time when the blocks end up on different sides of the
"wrap-around" address.

Here, the LOCF function is changed to use 64 bit integers for the
address, which makes the distance computation independent of where the
kernel places the common blocks.

Fix root-project#19329
Minicern was switched to -O0 in the hope of reducing crashes. This
workaround seems unnecessary now.
@hageboeck

Member Author

The arm failure is real, so I'm on that node building for arm now.

@ferdymercury

ferdymercury commented Jun 18, 2026

Collaborator

> The arm failure is real, so I'm on that node building for arm now.

If you turn on the warnings, there are some 'concerning ones' about arrays having inconsistent size 5 vs 9, etc, as well as some never-initialized variables such as IDOL.

I tried to do some fixing in #22623 and #20538

With those hacks, all warnings finally disappear, but that just makes things worse on all platforms xD (or maybe more reproducible?) probably not dealing well with global variables or passing things by reference (not a Fortran expert).

@hageboeck

Member Author

> > The arm failure is real, so I'm on that node building for arm now.
>
> If you turn on the warnings, there are some 'concerning ones' about arrays having inconsistent size 5 vs 9, etc, as well as some never-initialized variables such as IDOL.
>
> I tried to do some fixing in #22623 and #20538
>
> With those hacks, all warnings finally disappear, but that just makes things worse on all platforms xD (or maybe more reproducible?) probably not dealing well with global variables or passing things by reference (not a Fortran expert).

The situation is not great, I agree. The reason many things are all over the place is that they copied the code that initialises the common blocks, but removed quite a lot of the code that actually uses them. So there is dead code and dead memory in these programs, but this is without consequences.
Our problem is that it's hard to say which parts are used and which are not.

Development

Successfully merging this pull request may close these issues.

h2root fragilities

3 participants