
Conversation

@david-cortes-intel (Contributor) commented Sep 11, 2025

Description

This PR modifies the calculations of means / variances / counts / gini in random forests, for both regression and classification, to use exclusively the float64 type (feature borders and values are still taken at whatever floating-point width they come in). These are set as a separate type, in line with the current logic in regression forests, so that it would be easier to change them back to float32 if needed once we have some mechanism to detect when that wouldn't harm the quality of results.

Motivation: when the data contains more than a few million rows, calculations of these quantities done in float32 can become imprecise, since large integers are not representable exactly, and these imprecisions accumulate as trees get deeper, to the point that leaves deep in a regression tree can end up computing means that are outside the range of the 'y' variable that was passed, potentially even off by orders of magnitude, which makes the resulting models perform worse than random guessing.

Note that this comes with a performance impact even for float64 data, as some calculations in the classification variant are currently done in float32 regardless of the input type (which is also a problem today when there are many millions of rows).

Picture before/after showing the kind of issue that it intends to fix:
[screenshot: before/after comparison]


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

@david-cortes-intel (author):

/intelci: run

@david-cortes-intel (author):

/intelci: run

@david-cortes-intel (author):

CI issue is not related to the changes here.

@icfaust (Contributor) commented Sep 12, 2025:

Benchmarking is required here.

  const size_t maxFeatures = nFeatures();
  /* minimum fraction of all samples per bin */
- const algorithmFPType qMax = 0.02;
+ const intermSummFPType qMax = 0.02;
@icfaust (Contributor):

Why must this be changed?

@david-cortes-intel (author):

Because it's used in operations with data of type intermSummFPType, so it would need to get cast either way when it gets used.

@icfaust (Contributor):

Please link to locations where qMax is used beyond line 1213

@david-cortes-intel (author):

> Please link to locations where qMax is used beyond line 1213

qMax is used at line 1213 in operations with values of type intermSummFPType. So if it were a float32 and intermSummFPType is float64, it would get cast to float64.

@icfaust (Contributor):

All the other operands in that computation are integer types. Given the value of qMax, you could even turn that hard-coded value into a pure integer computation if you really wanted to. I guess I was trying to figure out what the key to that specific change was, and why it was done.

@david-cortes-intel (author):

Because otherwise it would get cast. Why would you want to define a variable in one type if it would then get promoted to a different type the moment it gets used?

@icfaust (Contributor):

I do not understand your argument @david-cortes-intel; the only use of qMax is the comparison static_cast<intermSummFPType>(n) > qMax * static_cast<intermSummFPType>(_helper.indexedFeatures().numIndices(iFeature)), where qMax times the number of indices cannot exceed the range of a float32 (even with a 64-bit size_t). At worst the integer values would be cast to float32 and the result would stay as such.

@david-cortes-intel (author):

The line, seen in terms of data types, is as follows:

bool result = bool && (double > qMax * double)

If qMax is a float, it will get cast to double to compute the RHS. If it's a double, it will be used as-is.

@icfaust (Contributor) commented Oct 7, 2025:

The previous logic converted the two ints to algorithmFPType implicitly, and now you are casting them to double using static_cast; the total number of casts hasn't changed. Again, I still don't see the logic here.

EDIT: I see now that they were previously explicitly cast to algorithmFPType.

@david-cortes-intel (author):

[screenshot attached]

@david-cortes-intel (author):

> Benchmarking is required here.

The regular benchmark doesn't contain any case where this PR would make a difference:
https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74

Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.

@icfaust (Contributor) commented Sep 12, 2025:

> Benchmarking is required here.
>
> Regular benchmark doesn't contain any case where this PR would make a difference: https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74
>
> Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.

I want to benchmark regressions from oneDAL version to oneDAL version, which would come out via the values vs scikit-learn. Given that some of the performance of the RF code is tied to the vector register width, where the amount of data per operation is impacted by the size of the float type, and this PR impacts exactly that (possibly? maybe in a tangential way? I am not sure yet), a level of security and guarantee would be required for me.

Maybe I am misunderstanding your argument? I think it would be safest to just do so.

@david-cortes-intel (author):

> Benchmarking is required here.
>
> Regular benchmark doesn't contain any case where this PR would make a difference: https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74
>
> Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.
>
> I want to benchmark regressions from oneDAL version to oneDAL version, which would come out via the values vs scikit-learn. Given that some of the performance of the RF code is tied to the vector register width, where the amount of data per operation is impacted by the size of the float type, and this PR impacts exactly that (possibly? maybe in a tangential way? I am not sure yet), a level of security and guarantee would be required for me.
>
> Maybe I am misunderstanding your argument? I think it would be safest to just do so.

  • The benchmarks do not test float32 data. If all data is float64, then there would be no changes with this PR, save for one extra memory allocation in ExtraTrees.
  • If the data were to be passed as float32, then either:
    • If the sample size is small enough, you wouldn't see any differences in the generated trees - in this case, you should expect to see a slowdown, but if the times are already short it might be missed due to inherent volatility of times and low number of repetitions in benchmarks.
    • If the sample size is large, the resulting trees will likely differ unless you start playing with parameters, and the quality of the predictions will also differ. You'd then be comparing running times for a model that is incorrect to some degree (where the degree of incorrectness depends on the data size, distribution of 'y', and other aspects) against running times for a model that is correct, which is not insightful.

@icfaust (Contributor) commented Sep 12, 2025:

> Benchmarking is required here.
>
> Regular benchmark doesn't contain any case where this PR would make a difference: https://github.com/IntelPython/scikit-learn_bench/blob/525dade1f96d039ec3421ee9bae65b7a7403840f/configs/regular/ensemble.json#L74
>
> Also, what would you expect to get out of a benchmark? In the case of float32 data where it makes a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and likely end up being deeper and making predictions closer to the real values of 'y'.
>
> I want to benchmark regressions from oneDAL version to oneDAL version, which would come out via the values vs scikit-learn. Given that some of the performance of the RF code is tied to the vector register width, where the amount of data per operation is impacted by the size of the float type, and this PR impacts exactly that (possibly? maybe in a tangential way? I am not sure yet), a level of security and guarantee would be required for me.
>
> Maybe I am misunderstanding your argument? I think it would be safest to just do so.
>
> • The benchmarks do not test float32 data. If all data is float64, then there would be no changes with this PR, save for one extra memory allocation in ExtraTrees.
>
> • If the data were to be passed as float32, then either:
>
>   • If the sample size is small enough, you wouldn't see any differences in the generated trees - in this case, you should expect to see a slowdown, but if the times are already short it might be missed due to the inherent volatility of timings and the low number of repetitions in benchmarks.
>   • If the sample size is large, the resulting trees will likely differ unless you start playing with parameters, and the quality of the predictions will also differ. You'd then be comparing running times for a model that is incorrect to some degree (where the degree of incorrectness depends on the data size, distribution of 'y', and other aspects) against running times for a model that is correct, which is not insightful.

Extending your initial argument would imply that the entire oneDAL codebase (beyond clustering) only respects regressions on float64 data, which I don't think is the case (I will wait for information from above on this).

Again, I want benchmarks on this to be sure, to give me assurance of what to look for in review. I appreciate your analysis; the reviewing process should go about verifying it, and having concrete evidence to check against, beyond the explanation and reading the code as a reviewer, would be nice (trust, but verify).

@icfaust (Contributor) commented Sep 12, 2025:

@david-cortes-intel Could you also add some more code comments to guide future development (if necessary)? This algorithm was already a mess, and it was never clear to me whether some of the previous coding choices by other devs were purposeful or not. The commentary would help separate your good code from the bad existing code, in terms of where to look when maintaining it and in trusting it (i.e. code comments are a sign of trustworthy code here).

@david-cortes-intel (author):

> @david-cortes-intel Could you also add some more code comments to guide future development (if necessary)? This algorithm was already a mess, and it was never clear to me whether some of the previous coding choices by other devs were purposeful or not. The commentary would help separate your good code from the bad existing code, in terms of where to look when maintaining it and in trusting it (i.e. code comments are a sign of trustworthy code here).

It has a comment block like that at the beginning, and the PR removes the comments that mentioned being unsure about why some type was used. Other than that, this PR doesn't change the logic in any function, just the types. I haven't checked the logic of any other functions.

@david-cortes-intel (author):

/intelci: run

@david-cortes-intel (author):

Attaching performance comparisons before/after on selected servers:
report_rf64_gnr.xlsx
report_rf64_emr.xlsx

They show slight accuracy gains in this branch at the expense of slight timing degradations, as expected, but the variances are very high and the numbers sometimes show either branch being faster, so I wouldn't put much trust in the stability of those aggregates.

The changes are mostly in the random forest classifier, as expected, since it is the one that switches to float64 here.

@david-cortes-intel (author):

/intelci: run
