@icfaust icfaust commented Jun 11, 2025

Description

The first step is to generate code that represents the basic requirements of a sklearnex estimator and an onedal estimator.

This PR serves two purposes: to ease understanding of the codebase for external development, and to standardize the development occurring for array API support.

The next step will be to add the necessary doc page links to the various aspects, so that they act as a guide for array API development. This will help external user contributions.

My goal is to see whether an LLM given this information can generate StandardScaler using BasicStatistics. If it can, an LLM with this starting prompt can help guide a user through more difficult scenarios.


The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are checked.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require the performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes, or created a separate PR with the update and provided its number in the description, if necessary.
  • The Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

codecov bot commented Jun 11, 2025

Codecov Report

❌ Patch coverage is 86.77686% with 16 lines in your changes missing coverage. Please review.

Files with missing lines  | Patch % | Lines
sklearnex/dummy/_dummy.py | 82.75%  | 11 Missing and 4 partials ⚠️
onedal/dummy/dummy.py     | 95.65%  | 0 Missing and 1 partial ⚠️

Flag   | Coverage Δ
azure  | 80.44% <84.29%> (+0.08%) ⬆️
github | 81.98% <100.00%> (+8.96%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines    | Coverage Δ
onedal/__init__.py          | 86.66% <ø> (ø)
onedal/dummy/__init__.py    | 100.00% <100.00%> (ø)
sklearnex/__init__.py       | 92.85% <ø> (ø)
sklearnex/dispatcher.py     | 91.13% <100.00%> (+0.31%) ⬆️
sklearnex/dummy/__init__.py | 100.00% <100.00%> (ø)
onedal/dummy/dummy.py       | 95.65% <95.65%> (ø)
sklearnex/dummy/_dummy.py   | 82.75% <82.75%> (ø)

... and 40 files with indirect coverage changes

#
# 1) All sklearnex estimators must inherit oneDALestimator and the sklearn
# estimator that it is replicating (i.e. before in the mro). If there is
# not an equivalent sklearn estimator, then sklearn's BaseEstimator must be
Contributor

It should also inherit from the mixin that corresponds to what the estimator does, such as RegressorMixin.

Contributor Author

Added a snippet that refers to the Mixins. In most cases they should be handled by the underlying sklearn estimator, but we need to be careful with sklearnex-only versions, so good call.

Contributor

@icfaust It appears to have been missed after the latest commits.
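
For readers following this thread, here is a minimal sketch of the two inheritance cases being discussed. Every class below is a local stand-in written for illustration (including `OneDALEstimatorStandIn`, which only mimics the base class named in the reviewed comment); none of this is the code from the PR.

```python
# Stand-in classes only: the real bases live in sklearnex and scikit-learn.
class BaseEstimator:
    """Stand-in for sklearn.base.BaseEstimator."""


class RegressorMixin:
    """Stand-in for sklearn.base.RegressorMixin."""


class OneDALEstimatorStandIn:
    """Hypothetical stand-in for the sklearnex base class named in the comment."""


class SklearnRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical sklearn estimator that already carries its mixin."""


class ReplicatingRegressor(OneDALEstimatorStandIn, SklearnRegressor):
    """Case 1: replicates an sklearn estimator, so the mixin comes along."""


class SklearnexOnlyRegressor(OneDALEstimatorStandIn, RegressorMixin, BaseEstimator):
    """Case 2: no sklearn equivalent, so the mixin is listed explicitly."""


def mro_names(cls):
    """Return the class names in method resolution order."""
    return [c.__name__ for c in cls.__mro__]


print(mro_names(ReplicatingRegressor))
print(mro_names(SklearnexOnlyRegressor))
```

In both cases the oneDAL base appears before the sklearn classes in the MRO. The difference is only whether the mixin arrives through the replicated sklearn estimator or has to be named directly.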

# inherited.
#
# 2) ``check_is_fitted`` is required for any method in an estimator which
# requires first calling ``fit`` or ``partial_fit``. This is a sklearn
Contributor

It's not actually a requirement to call this specific sklearn function within .fit; the requirement is only that the estimator works correctly when that function is called on it:
https://scikit-learn.org/stable/developers/develop.html#developer-api-for-check-is-fitted

Contributor Author

Fair point, though such a use should be considered way out of the norm.
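
To make the distinction in this thread concrete, here is a simplified sketch of the convention described on the linked developer page: an estimator counts as fitted either because it exposes `__sklearn_is_fitted__` or because it has instance attributes with a trailing underscore. `check_is_fitted_sketch` and both estimator classes are hypothetical stand-ins, not sklearn's implementation, and the real function accepts more options.

```python
class NotFittedError(ValueError, AttributeError):
    """Stand-in for sklearn.exceptions.NotFittedError."""


def check_is_fitted_sketch(estimator):
    """Simplified stand-in for sklearn.utils.validation.check_is_fitted."""
    if hasattr(estimator, "__sklearn_is_fitted__"):
        # The estimator reports its own fitted state.
        fitted = estimator.__sklearn_is_fitted__()
    else:
        # Fall back to the trailing-underscore attribute convention.
        fitted = any(
            name.endswith("_") and not name.startswith("__")
            for name in vars(estimator)
        )
    if not fitted:
        raise NotFittedError(f"{type(estimator).__name__} is not fitted yet.")


class TrailingUnderscoreEstimator:
    """Passes the check because fit sets an attribute ending in '_'."""

    def fit(self, X):
        self.n_features_in_ = len(X[0])
        return self

    def predict(self, X):
        check_is_fitted_sketch(self)  # guard any method that needs fit first
        return [0.0 for _ in X]


class ExplicitFlagEstimator:
    """Passes the check through the explicit hook instead."""

    def __init__(self):
        self._ready = False

    def fit(self, X):
        self._ready = True
        return self

    def __sklearn_is_fitted__(self):
        return self._ready
```

Either style satisfies the check when someone else calls it on the estimator, which is the reviewer's point. Calling it inside the estimator's own methods is the usual way to get a clear error before `fit`.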

Comment on lines 95 to 96
# examples are ``fit`` and ``predict``. They use a direct equivalent oneDAL
# function for evaluation. These methods are of highest priority and have
Contributor

Suggested change
- # examples are ``fit`` and ``predict``. They use a direct equivalent oneDAL
- # function for evaluation. These methods are of highest priority and have
+ # examples are ``fit`` and ``predict``. They use a direct equivalent function
+ # from oneDAL. These methods are of highest priority and have

Contributor

@Vika-F Vika-F left a comment

Thank you for adding this example! It makes many aspects of sklearnex implementation much clearer.

It would also be good to place a link to this file somewhere here:
https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/doc/sources/contribute.rst

Another thing that bothers me a bit (not super important, but I have to mention it): how do we keep the recommendations here valid? For example, this get_namespace functionality was implemented several months ago. How would the developer of a new product-wide decorator or method know that this file also needs to be updated?

Comment on lines 166 to 174
# Sklearnex estimators follow a Matryoshka doll pattern with respect to
# the underlying oneDAL library. The sklearnex estimator is a
# public-facing API which mimics sklearn. Sklearnex estimators will
# create another estimator, defined in the ``onedal`` module, for
# having a python interface with oneDAL. Finally, this python object
# will use pybind11 to call oneDAL directly via pybind11-generated
# objects and functions This is known as the ``backend``. These are
# separate entities and do not inherit from one another. The clear
# separation has utility so long that the following rules are followed:
Contributor

Can you bring a bit more structure to this part?
I think it is very important for understanding the overall sklearnex implementation, but the idea is rather hard to grasp when it is written as a single text block. I do like the Matryoshka doll association, though =)

It can be something like:

  • The sklearnex estimator is a public facing API ...
  • The onedal estimator ...
  • The pybind11 backend ...
    These are separate entities...
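
As a rough illustration of the layering in the quoted comment, the sketch below nests three stand-ins by composition rather than inheritance. All names here (`backend_compute_mean`, `OnedalDummy`, `SklearnexDummy`, `_onedal_estimator`) are assumptions made for this example; the backend function is plain Python standing in for a pybind11-generated call, and none of it is the code under review.

```python
def backend_compute_mean(rows):
    """Stand-in for a pybind11-generated oneDAL function (the ``backend``)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


class OnedalDummy:
    """Stand-in for an ``onedal`` estimator: the Python interface to the backend."""

    def fit(self, rows):
        self.mean_ = backend_compute_mean(rows)
        return self


class SklearnexDummy:
    """Stand-in for the public, sklearn-like estimator.

    It creates and holds an OnedalDummy; it does not inherit from it.
    """

    def fit(self, X):
        # Input validation and conversion would live at this layer.
        rows = [[float(value) for value in row] for row in X]
        self._onedal_estimator = OnedalDummy().fit(rows)
        # Copy fitted attributes up to the public estimator.
        self.mean_ = self._onedal_estimator.mean_
        return self


est = SklearnexDummy().fit([[1, 2], [3, 4]])
print(est.mean_)
```

Each layer only talks to the one directly inside it, which is what keeps the three entities separable.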

icfaust commented Oct 2, 2025

/intelci: run

icfaust commented Oct 13, 2025

/intelci: run

@icfaust icfaust requested a review from ahuber21 October 13, 2025 13:41

icfaust commented Oct 13, 2025

The private CI failure is due to infrastructure issues.

Contributor

@Vika-F Vika-F left a comment

Thank you for presenting all this semi-hidden knowledge in a well-structured and understandable way!

icfaust commented Oct 14, 2025

/intelci: run

Contributor

@ethanglaser ethanglaser left a comment

Great work! I assume the code coverage is not a concern.

// policy_list is defined elsewhere which is dependent on the backend
// which is being built. Placed within a macro-check in order to prevent
// use with an spmd policy.
#ifndef ONEDAL_DATA_PARALLEL_SPMD
Contributor

What about the else case (if spmd is to be instantiated)?

Contributor Author

Added a comment.


#include "onedal/common.hpp"
#include "onedal/version.hpp"
#include "onedal/dummy/dummy_onedal.hpp"
Contributor

A comment here specifying that in practice this would instead look like #include oneapi/dal/algo/... would be useful

Contributor Author

added!

icfaust commented Oct 15, 2025

/intelci: run

@icfaust icfaust merged commit 1531c63 into uxlfoundation:main Oct 15, 2025
31 checks passed
@icfaust icfaust deleted the dev/estimator_design_docs branch October 15, 2025 07:50
5 participants