Skip to content

Conversation

@lorentzenchr
Copy link
Member

@lorentzenchr lorentzenchr commented Dec 7, 2025

What

This SLEP is for finding a consensus on how to removing accuracy as a default metric for classifiers which currently is: classifiers.score.

Why

Because accuracy has many severe weaknesses and we even baked a probability threashold of 50% into it.

What else

I hope the irony of the title is not lost by its passive aggressiveness.

This SLEP is based on @amueller's proposal scikit-learn/scikit-learn#28995.

@scikit-learn/core-devs @scikit-learn/communication-team @scikit-learn/contributor-experience-team @scikit-learn/documentation-team ping

@jjerphan
Copy link
Member

jjerphan commented Dec 7, 2025

Thank you for writing the SLEP.

While discussing the rectification of this design choice is appropriate (I really think it is valuable), I find the title a bit aggressive or at least detrimental to the intent of the proposal (improving scikit-learn's theoretical consistency): how about "Redefining default metrics"?

I think this could be implemented as part of scikit-learn 2.0.

@lorentzenchr lorentzenchr changed the title SLEP025: Killing Accuracy in Scikit-Learn SLEP025: Losing Accuracy in Scikit-Learn Dec 7, 2025
@reshamas
Copy link
Member

reshamas commented Dec 7, 2025

@lorentzenchr How about: Removing the accuracy metric in scikit-learn scoring


The fact that different scoring metrics focus on different things, i.e. ``predict``
vs. ``predict_proba``, and not all classifiers provide ``predict_proba`` complicates
a unified choice.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to choose the same metric for all classifiers?

I think the answer is yes because people will use the results of est1.score(X, y) and est2.score(X, y) to evaluate which one is the better estimator. It seems very hard to educate people that they can't compare scores from different estimators

(This is almost a rhetorical question, but I wanted to double check my thinking)

Copy link
Member Author

@lorentzenchr lorentzenchr Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given your assumption that users will continue to compare score results of different estimators, and given that a generally satisfying metric does not exist, the conclusion is to remove the score method.

My currently best choice for a general classifier metric is the skill score (R2) variant of the Brier score. Classifiers and regressors would then have the same metric, which is nice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I am ready to remove score().

@lucyleeow
Copy link
Member

What about 'Replacing accuracy' - it reads clearer IMO and could mean that we replace it with another metric or replace it with 'nothing'.

Comment on lines +47 to +48
a. The time frame of the deprecation period. Should it be longer than the usual 2 minor
releases? Should step 1 and 2 happen in the same minor release?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think we need to decide on a default by step 2, so we can tell users what to set to emulate the new default.
  • I am okay with 1 and 2 happening at the same time as long as we choose a default. If we have not chosen a new default, then only 1 can happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we can't agree a new default we should not start the process. I think this because agreeing the new default is the hard part of this task and if we start it we are on the hook for finishing the transition, which will be impossible without agreement.

Otherwise I agree with Thomas that we can do 1 and 2 at the same time.


There are three questions with this approach:

a. The time frame of the deprecation period. Should it be longer than the usual 2 minor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this needs a longer than the usual 2 minor releases. Personally, I'll be okay with 3 minor releases or 1 major release.

available in scikit-learn, see ``sklearn.metric`` module and [2]_. The advantages of
removing ``score`` are:

- An active choice by the user is triggered as there is no more default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption is that most people who blindly use score() do not know better. It is unclear to me if forcing them to make a decision is going to improve the quality of the decision they make. scikit-learn is about "machine learning without the learning curve", so we are on the hook for making a "not unreasonable" decision for the beginner user.

It doesn't stop us from extending our documentation and educational material to increase the chances of people reading it (eg we could have a blog post about this topic when the deprecation starts to explain why this is a much bigger deal than it might seem) and hopefully making a better decision than the default argument to score().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants