
Conversation

@icfaust (Contributor) commented Dec 2, 2024

Description

This PR refactors the ensemble algorithms (RandomForestRegressor, RandomForestClassifier, ExtraTreesRegressor and ExtraTreesClassifier) to follow repository standards and adds array API support. This removes 500+ lines of code and requires the following changes (a short usage sketch follows the list):

  • Remove BaseEstimator inheritance from onedal ensemble estimators
  • Change estimator __init__ signatures to remove sklearn conformant kwargs in onedal ensemble estimators
  • Add inline code comments explaining the function of various aspects for future maintenance
  • Remove random_state use from onedal estimators
  • Add a class_count kwarg to fit, since calculating it in Python is part of scikit-learn conformance (oneDAL expects it a priori)
  • Remove input parameter checks from the onedal estimators
  • Generalize the return of out-of-bag values from oneDAL for use by both classifiers and regressors
  • Remove unused _create_model function
  • Centralize the predict method
  • Create ForestRegressor and ForestClassifier classes to minimize maintenance
  • Swap from max_samples to observations_per_tree_fraction to follow oneDAL conventions
  • Modify onedal tests to use numpy arrays (which oneDAL can consume, unlike lists)
  • Reorder warnings and errors based on type (e.g. parameter checks vs input checks etc.)
  • Refactor _save_attributes method to be specific to Classifiers vs Regressors
  • Refactor _onedal_fit_ready, _onedal_cpu_supported and _onedal_gpu_supported to reduce code duplication via inheritance and to enable array API support
  • Add enable_array_api decorators to public-facing estimators
  • Place _check_parameters function behind sklearn_check_version for future removal
  • Remove check for min_impurity_split which was removed in sklearn 0.25
  • Add array API-enabled _validate_y_class_weight method designed specifically for sklearnex estimators (missing some functionality which is irrelevant to the sklearnex estimator)
  • Remove check_n_features from sklearnex.utils.validation as it is no longer necessary
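
For context, a minimal usage sketch of the intended end state (not part of the PR diff; it assumes dpnp is installed and mirrors the reproducer quoted later in this thread):

import numpy as np
import dpnp
from sklearnex import set_config
from sklearnex.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
Xd = dpnp.array(X, dtype=np.float32, device="cpu")
yd = dpnp.array(y, dtype=np.float32, device="cpu")

# With array API dispatch enabled, fit and predict consume dpnp arrays directly
set_config(array_api_dispatch=True)
model = RandomForestClassifier(n_estimators=10, max_depth=5).fit(Xd, yd)
pred = model.predict(Xd)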

The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for standard requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require performance checkboxes, while a PR with any change to actual code should have them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with the update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the appropriate label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if a performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided the corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@icfaust (Contributor, Author) commented Oct 22, 2025

/intelci: run

@icfaust (Contributor, Author) commented Oct 22, 2025

/intelci: run

@icfaust changed the title from "[enhancement] WIP: move finite check to sklearnex in ensemble algos" to "[enhancement] Enable Array API in ensemble algos" on Oct 22, 2025

@icfaust (Contributor, Author) commented Oct 23, 2025

/intelci: run

@david-cortes-intel (Contributor) commented Oct 23, 2025

I'm not sure if this is a bug in DPNP or an issue with this PR where not all inputs are converted as necessary, but I see this error if I try to make predictions on numpy arrays from a model that was fitted to dpnp arrays:

File /export/users/dcortes/repos/sklex-dpnp/sklearnex/ensemble/_forest.py:867, in ForestClassifier._onedal_predict(self, X, queue)
    864 res = self._onedal_estimator.predict(X, queue=queue)
    866 if is_array_api_compliant:
--> 867     return xp.take(self.classes_, xp.astype(xp.reshape(res, (-1,)), xp.int64))
    868 else:
    869     return xp.take(self.classes_, res.ravel().astype(xp.int64, casting="unsafe"))

Reproducer:

import os, sys
os.environ["SCIPY_ARRAY_API"] = "1"
import numpy as np
import dpnp
from sklearnex import config_context, set_config
from sklearnex.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(random_state=123)
Xd = dpnp.array(X, dtype=np.float32, device="cpu")
yd = dpnp.array(y, dtype=np.float32, device="cpu")

set_config(array_api_dispatch=True)
model = RandomForestClassifier(n_estimators=1, max_depth=5).fit(Xd, yd)
model.predict(X[:5])

@david-cortes-intel (Contributor) commented:

@icfaust Since the equivalent classes in sklearn do not have array API support: how is this meant to work for attributes? I see that the arrays in the Tree objects, for example, are always numpy regardless of what was passed as input, while other internal attributes appear to use the array API class that was passed as input, with the data on the corresponding device.

I guess this might be intended, but if that is the case, please add this kind of behavior to the documentation.
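
A hedged illustration of that observation, reusing the dpnp-fitted model from the reproducer above (the attribute names below are assumptions based on the description, not verified against this branch):

print(type(model.estimators_[0].tree_.threshold))  # numpy.ndarray - sklearn Tree internals stay numpy
print(type(model.classes_))                        # dpnp.ndarray (assumed) - follows the namespace of the fit input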

Inline review comment on the following snippet from the diff:

),
(
not sp.issparse(data[2]),
"sample_weight is sparse. " "Sparse input is not supported.",

A Contributor commented:

Was this perhaps meant to have an additional check on 'X'?
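
One hypothetical reading of that suggestion, extending the tuple to also check X (the data[0] index for X is a guess, not code from this branch):

(
    not sp.issparse(data[0]) and not sp.issparse(data[2]),
    "X or sample_weight is sparse. Sparse input is not supported.",
),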

@icfaust (Contributor, Author) commented Oct 23, 2025

> I'm not sure if this is a bug in DPNP or an issue with this PR where not all inputs are converted as necessary, but I see this error if I try to make predictions on numpy arrays from a model that was fitted to dpnp arrays: [traceback and reproducer quoted from the comment above]

@david-cortes-intel #2747 This is an issue beyond just this estimator, and should be addressed as a follow-up with additional testing added in a central way.
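
Until that follow-up lands, a possible workaround for the reproducer above is to keep predict inputs in the same namespace the model was fitted with, e.g.:

model.predict(dpnp.array(X[:5], dtype=np.float32, device="cpu"))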

@icfaust (Contributor, Author) commented Oct 23, 2025

/intelci: run
