
Conversation

@david-cortes-intel (Contributor) commented Sep 11, 2025

Description

This PR modifies the calculations of means / variances / counts / gini in random forests, for both regression and classification, to use exclusively the float64 type (feature borders and values are still taken at whatever floating-point width they come in). These are set as a separate type, in line with the current logic in regression forests, so that it would be easier to change them back to float32 if needed once we have some mechanism to detect when that wouldn't harm the quality of results.

Motivation: when the data contains more than a few million rows, calculations of these quantities done in float32 can become imprecise, since large integers are not representable exactly, and these imprecisions accumulate as trees get deeper, to the point that leaves deep in a regression tree can end up computing means that are outside the range of the 'y' variable that was passed, potentially even off by orders of magnitude, which makes the resulting models perform worse than random guessing.

Note that this comes with a performance impact even for float64 data, as some calculations in the classification variant are currently done in float32 regardless of the input type (which is also a problem today when there are many millions of rows).

Picture before/after showing the kind of issue that it intends to fix:
[screenshot: before/after comparison]


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

@david-cortes-intel (author):

/intelci: run

@david-cortes-intel (author):

/intelci: run

@david-cortes-intel (author):

CI issue is not related to the changes here.

@icfaust (Contributor) commented Sep 12, 2025:

Benchmarking is required here.

  const size_t maxFeatures = nFeatures();
  /* minimum fraction of all samples per bin */
- const algorithmFPType qMax = 0.02;
+ const intermSummFPType qMax = 0.02;
@icfaust (Contributor):

Why must this be changed?

@david-cortes-intel (author):

Because it's used in operations with data of type intermSummFPType, so it would need to get cast either way when it gets used.

@icfaust (Contributor):

Please link to locations where qMax is used beyond line 1213

@david-cortes-intel (author):

> Please link to locations where qMax is used beyond line 1213

qMax is used at line 1213 in operations with values of type intermSummFPType. So if it were a float32 and intermSummFPType is float64, it would get cast to float64.

@icfaust (Contributor):

All the other operands in that computation are integer types. Given the value of qMax, you could even turn that hard-coded value into a pure integer computation if you really wanted to. I guess I was trying to figure out what the key to that specific change was, and why it was done.

@david-cortes-intel (author):

Because otherwise it would get cast. Why would you want to define a variable in one type if it would then get promoted to a different type the moment it gets used?

@icfaust (Contributor):

I do not understand your argument @david-cortes-intel; the only use of qMax is the comparison static_cast<intermSummFPType>(n) > qMax * static_cast<intermSummFPType>(_helper.indexedFeatures().numIndices(iFeature)), where qMax times the number of indices cannot exceed the range of a float32 (even with a 64-bit size_t). At worst the integer values would be cast to float32 and the result would stay as such.

@david-cortes-intel (author):

The line, seen in terms of data types, is as follows:

bool result = bool && (double > qMax * double)

If qMax is a float, it will get cast to double to compute the RHS. If it's a double, it will be used as-is.

@icfaust (Contributor) commented Oct 7, 2025:

The previous logic converted the two ints to algorithmFPType implicitly, and now you are casting them to double using static_cast; the total number of casts hasn't changed. Again, I still don't see the logic here.

EDIT: I see now that they were previously explicitly cast to algorithmFPType.

@david-cortes-intel (author):

[screenshot attached]

@david-cortes-intel (author):

> Benchmarking is required here.

The regular benchmark doesn't contain any case where this PR would make a difference:
https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74

Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.

@icfaust (Contributor) commented Sep 12, 2025:

> Benchmarking is required here.
>
> Regular benchmark doesn't contain any case where this PR would make a difference: https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74
>
> Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.

I want to benchmark regressions from oneDAL version to oneDAL version, which would come out via the values vs scikit-learn. Given that some of the performance of the RF code is tied to the vector register width, where the amount of data per operation is impacted by the size of the float type, and this PR impacts exactly that (possibly? maybe in a tangential way? I am not sure yet), a level of security and guarantee would be required for me.

Maybe I am misunderstanding your argument? I think it would be safest to just do so.

@david-cortes-intel (author):

> Benchmarking is required here.
>
> Regular benchmark doesn't contain any case where this PR would make a difference: https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74
>
> Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.
>
> I want to benchmark regressions from oneDAL version to oneDAL version, which would come out via the values vs scikit-learn. Given that some of the performance of the RF code is tied to the vector register width, where the amount of data per operation is impacted by the size of the float type, and this PR impacts exactly that (possibly? maybe in a tangential way? I am not sure yet), a level of security and guarantee would be required for me.
>
> Maybe I am misunderstanding your argument? I think it would be safest to just do so.

  • The benchmarks do not test float32 data. If all data is float64, then there would be no changes with this PR, save for one extra memory allocation in ExtraTrees.
  • If the data were to be passed as float32, then either:
    • If the sample size is small enough, you wouldn't see any differences in the generated trees - in this case, you should expect to see a slowdown, but if the times are already short it might be missed due to inherent volatility of times and low number of repetitions in benchmarks.
    • If the sample size is large, the resulting trees will likely differ unless you start playing with parameters, and the quality of the predictions will also differ. You'd then be comparing running times for a model that is incorrect to some degree (where the degree of incorrectness depends on the data size, distribution of 'y', and other aspects) against running times for a model that is correct, which is not insightful.

@icfaust (Contributor) commented Sep 12, 2025:

> Benchmarking is required here.
>
> Regular benchmark doesn't contain any case where this PR would make a difference: https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74
>
> Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.
>
> I want to benchmark regressions from oneDAL version to oneDAL version, which would come out via the values vs scikit-learn. Given that some of the performance of the RF code is tied to the vector register width, where the amount of data per operation is impacted by the size of the float type, and this PR impacts exactly that (possibly? maybe in a tangential way? I am not sure yet), a level of security and guarantee would be required for me.
>
> Maybe I am misunderstanding your argument? I think it would be safest to just do so.
>
> • The benchmarks do not test float32 data. If all data is float64, then there would be no changes with this PR, save for one extra memory allocation in ExtraTrees.
>
> • If the data were to be passed as float32, then either:
>
>   • If the sample size is small enough, you wouldn't see any differences in the generated trees - in this case, you should expect to see a slowdown, but if the times are already short it might be missed due to the inherent volatility of timings and the low number of repetitions in benchmarks.
>   • If the sample size is large, the resulting trees will likely differ unless you start playing with parameters, and the quality of the predictions will also differ. You'd then be comparing running times for a model that is incorrect to some degree (where the degree of incorrectness depends on the data size, distribution of 'y', and other aspects) against running times for a model that is correct, which is not insightful.

Extending your initial argument would imply that the entire oneDAL codebase (beyond clustering) only respects regressions on float64 data, which I don't think is the case (I will wait for information from above on this).

Again, I want benchmarks on this to be sure, to give me assurance of what to look for in review. I appreciate your analysis; the reviewing process should go about verifying it, and having concrete evidence to check against, beyond the explanation and reading the code as a reviewer, would be nice (trust, but verify).

@icfaust (Contributor) commented Sep 12, 2025:

@david-cortes-intel Could you also add some more code comments to guide future development (if necessary)? This algorithm was already a mess, and it was never clear to me whether some of the previous coding choices by other devs were purposeful or not. The commentary would help separate your good code from the bad existing code, in terms of where to look when maintaining it and in trusting it (i.e. code comments are a sign of trustworthy code here).

@david-cortes-intel (author):

> @david-cortes-intel Could you also add some more code comments to guide future development (if necessary)? This algorithm was already a mess, and it was never clear to me whether some of the previous coding choices by other devs were purposeful or not. The commentary would help separate your good code from the bad existing code, in terms of where to look when maintaining it and in trusting it (i.e. code comments are a sign of trustworthy code here).

It has a comment block like that at the beginning, and the PR removes the comments that mentioned being unsure about why some type was used. Other than that, this PR doesn't change the logic in any function, just the types. I haven't checked the logic of any other functions.

@david-cortes-intel (author):

/intelci: run

@david-cortes-intel (author):

Attaching performance comparisons before/after on selected servers:
report_rf64_gnr.xlsx
report_rf64_emr.xlsx

They show slight accuracy gains in this branch at the expense of slight timing degradations, as expected, but the variances are very high and the numbers sometimes show either branch being faster, so I wouldn't put much trust in the stability of those aggregates.

The changes are mostly in the random forest classifier, as expected, since it is the one that switches to float64 here.

@david-cortes-intel (author):

/intelci: run
