FIX: Use float64 for random forest calculations #3360
base: main
Conversation
/intelci: run

/intelci: run

CI issue is not related to the changes here.

Benchmarking is required here.
```diff
  const size_t maxFeatures = nFeatures();
  /* minimum fraction of all samples per bin */
- const algorithmFPType qMax = 0.02;
+ const intermSummFPType qMax = 0.02;
```
Why must this be changed?
Because it's used in operations on data of intermSummFPType type, so it would need to get cast either way when used.
Please link to locations where qMax is used beyond line 1213
qMax is used in line 1213 in operations with the intermSummFPType type. So if it was a float32 and intermSummFPType is float64, it would get cast to float64.
All the other aspects of that computation are integer-typed. Given the value of qMax, you could turn that hard-coded value into a pure integer computation if you really wanted to. I guess I was trying to figure out what the key to that specific change was, and why it was done.
Because otherwise it would get cast. Why would you want to define a variable in one type if it then gets promoted to a different type the moment it is used?
I do not understand your argument @david-cortes-intel. The only use of qMax is the comparison `static_cast<intermSummFPType>(n) > qMax * static_cast<intermSummFPType>(_helper.indexedFeatures().numIndices(iFeature))`, where qMax times the number of indices cannot exceed the range of a float32 (even for a 64-bit size_t). At worst the integer values would be set as float32 and the result would stay as such.
The line, seen in terms of data types, is as follows:

`bool result = bool && double > qMax * double`

If qMax is a float, it will get cast to double to calculate the RHS. If it's a double, it will be used as-is.
The previous logic converted the two ints to AlgorithmFPType implicitly, and now you are explicitly casting them to double using static_cast; the total number of casts hasn't changed. Again, I still don't see the logic here.

EDIT: I see now that it was previously explicitly cast to AlgorithmFPType.
Regular benchmark doesn't contain any case where this PR would make a difference. Also, what would you expect to get out of a benchmark? In the case of float32 data, where it does make a difference, it wouldn't be an apples-to-apples comparison either way, because the generated trees would differ and would likely end up deeper, making predictions closer to the real values of 'y'.
I want to benchmark oneDAL version against oneDAL version for regressions, which would show up via the values vs scikit-learn. Given that some of the performance of the RF code is tied to the vector register width, that the amount of data per operation can be impacted by the size of the float type, and that this PR touches exactly that (possibly? maybe in a tangential way? I am not sure yet), I would need some level of assurance here. Maybe I am misunderstanding your argument? I think it would be safest to just do so.

Extending the initial argument, it would imply that the entire oneDAL codebase (beyond clustering) only respects regressions in float64 data, which I don't think is the case (will wait for information from above on this). Again, I want benchmarks on this to be sure, to give me assurance of how to look at this in review. I appreciate your analysis; the reviewing process should go about verifying it, and having concrete evidence beyond the explanation and reading the code as a reviewer would be nice (trust, but verify).
@david-cortes-intel Could you also add some more code comments to guide future development (if necessary)? This algorithm was already a mess, and it was never clear to me whether some of the previous coding choices by other devs were purposeful or not. The commentary would help separate your good code from the bad existing code in terms of where to look when maintaining it and trusting it (i.e. code comments are a sign of trustworthy code here).
It has a comment block like that at the beginning, and the PR removes the comments that were mentioning they were unsure about why some type was used. Other than that, this PR doesn't change the logic in any function, just the types. I haven't checked the logic of any further functions. |
/intelci: run
Attaching performance comparisons before/after on selected servers. They show slight accuracy gains in this branch at the expense of slight timing degradations, as expected, but the variances are very high and the numbers sometimes show either branch being faster, so I wouldn't put much trust in the stability of those aggregates. The changes are mostly in the random forest classifier, as expected, which switches to float64 here.
cpp/daal/src/algorithms/dtrees/forest/df_train_dense_default_impl.i (outdated review threads, resolved)
/intelci: run
Description
This PR modifies the calculations of means / variances / counts / gini in random forests, for both regression and classification, to use exclusively the float64 type (feature borders and values are still taken at the float width they come in). They are set as a separate type, in line with the current logic in regression forests, so that it'd be easier to change them back to float32 if needed once we have some mechanism to detect when that wouldn't harm quality of results.
Motivation: when the data contains more than a few million rows, calculations of these quantities done in float32 can become imprecise, since large integers are not exactly representable, and these imprecisions accumulate as trees get deeper, to the point that leaves deep in a regression tree can end up computing means that are outside the range of the 'y' variable that was passed, potentially even of a different order of magnitude, which makes the resulting models perform worse than random guessing.
Note that this comes with a performance impact even for float64 data, as some calculations for the classification variant are currently done in float32 regardless (which is also a problem when there are many millions of rows).
Before/after picture showing the kind of issue this PR intends to fix: (image omitted)
Checklist:
- Completeness and readability
- Testing