
Commit d28f7ee

Pushing the docs to dev/ for branch: main, commit be5316fb57ac5dfe429d7a994a4ef34aaa0d79c7
1 parent c1089f4 commit d28f7ee

File tree

1,341 files changed (+8918 / -7063 lines)


dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 910925378d81091757cadf53646763e2
+config: 9b326f5a24b70ee04933e1c4cba406d7
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file not shown.
Binary file not shown.
Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Release Highlights for scikit-learn 1.5\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 1.5! Many bug fixes\nand improvements were added, as well as some key new features. Below we\ndetail the highlights of this release. **For an exhaustive list of\nall the changes**, please refer to the `release notes <release_notes_1_5>`.\n\nTo install the latest version (with pip)::\n\n    pip install --upgrade scikit-learn\n\nor with conda::\n\n    conda install -c conda-forge scikit-learn\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## FixedThresholdClassifier: Setting the decision threshold of a binary classifier\nAll binary classifiers of scikit-learn use a fixed decision threshold of 0.5 to\nconvert probability estimates (i.e. the output of `predict_proba`) into class\npredictions. However, 0.5 is almost never the desired threshold for a given problem.\n:class:`~model_selection.FixedThresholdClassifier` allows wrapping any binary\nclassifier and setting a custom decision threshold.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import confusion_matrix\n\nX, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)\nclassifier = LogisticRegression(random_state=0).fit(X, y)\n\nprint(\"confusion matrix:\\n\", confusion_matrix(y, classifier.predict(X)))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lowering the threshold, i.e. allowing more samples to be classified as the positive\nclass, increases the number of true positives at the cost of more false positives\n(as is well known from the concavity of the ROC curve).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import FixedThresholdClassifier\n\nwrapped_classifier = FixedThresholdClassifier(classifier, threshold=0.1).fit(X, y)\n\nprint(\"confusion matrix:\\n\", confusion_matrix(y, wrapped_classifier.predict(X)))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## TunedThresholdClassifierCV: Tuning the decision threshold of a binary classifier\nThe decision threshold of a binary classifier can be tuned to optimize a given\nmetric, using :class:`~model_selection.TunedThresholdClassifierCV`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import balanced_accuracy_score\n\n# Due to the class imbalance, the balanced accuracy is not optimal for the default\n# threshold. The classifier tends to over-predict the majority class.\nprint(f\"balanced accuracy: {balanced_accuracy_score(y, classifier.predict(X)):.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Tuning the threshold to optimize the balanced accuracy gives a smaller threshold\nthat allows more samples to be classified as the positive class.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import TunedThresholdClassifierCV\n\ntuned_classifier = TunedThresholdClassifierCV(\n    classifier, cv=5, scoring=\"balanced_accuracy\"\n).fit(X, y)\n\nprint(f\"new threshold: {tuned_classifier.best_threshold_:.4f}\")\nprint(\n    f\"balanced accuracy: {balanced_accuracy_score(y, tuned_classifier.predict(X)):.2f}\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ":class:`~model_selection.TunedThresholdClassifierCV` also benefits from\nmetadata routing support (`Metadata Routing User Guide <metadata_routing>`),\nallowing the optimization of complex business metrics, as detailed\nin `Post-tuning the decision threshold for cost-sensitive learning\n<sphx_glr_auto_examples_model_selection_plot_cost_sensitive_learning.py>`.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Performance improvements in PCA\n:class:`~decomposition.PCA` has a new solver, \"covariance_eigh\", which is faster\nand more memory efficient than the other solvers for datasets with a large number\nof samples and a small number of features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_low_rank_matrix\nfrom sklearn.decomposition import PCA\n\nX = make_low_rank_matrix(\n    n_samples=10_000, n_features=100, tail_strength=0.1, random_state=0\n)\n\npca = PCA(n_components=10).fit(X)\n\nprint(f\"explained variance: {pca.explained_variance_ratio_.sum():.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The \"full\" solver has also been improved to use less memory and to\ntransform faster. The \"auto\" option for the solver takes advantage of the\nnew solver and is now able to select an appropriate solver for sparse\ndatasets.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from scipy.sparse import random\n\nX = random(10_000, 100, format=\"csr\", random_state=0)\n\npca = PCA(n_components=10, svd_solver=\"auto\").fit(X)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## ColumnTransformer is subscriptable\nThe transformers of a :class:`~compose.ColumnTransformer` can now be directly\naccessed by indexing with their names.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\n\nX = np.array([[0, 1, 2], [3, 4, 5]])\ncolumn_transformer = ColumnTransformer(\n    [(\"std_scaler\", StandardScaler(), [0]), (\"one_hot\", OneHotEncoder(), [1, 2])]\n)\n\ncolumn_transformer.fit(X)\n\nprint(column_transformer[\"std_scaler\"])\nprint(column_transformer[\"one_hot\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Custom imputation strategies for the SimpleImputer\n:class:`~impute.SimpleImputer` now supports custom strategies for imputation,\nusing a callable that computes a scalar value from the non-missing values of\na column vector.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.impute import SimpleImputer\n\nX = np.array(\n    [\n        [-1.1, 1.1, 1.1],\n        [3.9, -1.2, np.nan],\n        [np.nan, 1.3, np.nan],\n        [-0.1, -1.4, -1.4],\n        [-4.9, 1.5, -1.5],\n        [np.nan, 1.6, 1.6],\n    ]\n)\n\n\ndef smallest_abs(arr):\n    \"\"\"Return the smallest absolute value of a 1D array.\"\"\"\n    return np.min(np.abs(arr))\n\n\nimputer = SimpleImputer(strategy=smallest_abs)\n\nimputer.fit_transform(X)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Pairwise distances with non-numeric arrays\n:func:`~metrics.pairwise_distances` can now compute distances between\nnon-numeric arrays using a callable metric.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import pairwise_distances\n\nX = [\"cat\", \"dog\"]\nY = [\"cat\", \"fox\"]\n\n\ndef levenshtein_distance(x, y):\n    \"\"\"Return the Levenshtein distance between two strings.\"\"\"\n    if x == \"\" or y == \"\":\n        return max(len(x), len(y))\n    if x[0] == y[0]:\n        return levenshtein_distance(x[1:], y[1:])\n    return 1 + min(\n        levenshtein_distance(x[1:], y),\n        levenshtein_distance(x, y[1:]),\n        levenshtein_distance(x[1:], y[1:]),\n    )\n\n\npairwise_distances(X, Y, metric=levenshtein_distance)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.19"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
# ruff: noqa
"""
=======================================
Release Highlights for scikit-learn 1.5
=======================================

.. currentmodule:: sklearn

We are pleased to announce the release of scikit-learn 1.5! Many bug fixes
and improvements were added, as well as some key new features. Below we
detail the highlights of this release. **For an exhaustive list of
all the changes**, please refer to the :ref:`release notes <release_notes_1_5>`.

To install the latest version (with pip)::

    pip install --upgrade scikit-learn

or with conda::

    conda install -c conda-forge scikit-learn

"""

# %%
# FixedThresholdClassifier: Setting the decision threshold of a binary classifier
# -------------------------------------------------------------------------------
# All binary classifiers of scikit-learn use a fixed decision threshold of 0.5 to
# convert probability estimates (i.e. the output of `predict_proba`) into class
# predictions. However, 0.5 is almost never the desired threshold for a given
# problem. :class:`~model_selection.FixedThresholdClassifier` allows wrapping any
# binary classifier and setting a custom decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
classifier = LogisticRegression(random_state=0).fit(X, y)

print("confusion matrix:\n", confusion_matrix(y, classifier.predict(X)))

# %%
# Lowering the threshold, i.e. allowing more samples to be classified as the positive
# class, increases the number of true positives at the cost of more false positives
# (as is well known from the concavity of the ROC curve).
from sklearn.model_selection import FixedThresholdClassifier

wrapped_classifier = FixedThresholdClassifier(classifier, threshold=0.1).fit(X, y)

print("confusion matrix:\n", confusion_matrix(y, wrapped_classifier.predict(X)))
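# %%
# A minimal sketch (reusing the estimators fitted above): summarizing the same
# trade-off with precision and recall makes the effect of the lower threshold
# explicit.
from sklearn.metrics import precision_score, recall_score

for name, clf in [
    ("default (0.5)", classifier),
    ("fixed (0.1)", wrapped_classifier),
]:
    y_pred = clf.predict(X)
    print(
        f"{name}: precision={precision_score(y, y_pred):.2f}, "
        f"recall={recall_score(y, y_pred):.2f}"
    )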
# %%
# TunedThresholdClassifierCV: Tuning the decision threshold of a binary classifier
# --------------------------------------------------------------------------------
# The decision threshold of a binary classifier can be tuned to optimize a given
# metric, using :class:`~model_selection.TunedThresholdClassifierCV`.
from sklearn.metrics import balanced_accuracy_score

# Due to the class imbalance, the balanced accuracy is not optimal for the default
# threshold. The classifier tends to over-predict the majority class.
print(f"balanced accuracy: {balanced_accuracy_score(y, classifier.predict(X)):.2f}")

# %%
# Tuning the threshold to optimize the balanced accuracy gives a smaller threshold
# that allows more samples to be classified as the positive class.
from sklearn.model_selection import TunedThresholdClassifierCV

tuned_classifier = TunedThresholdClassifierCV(
    classifier, cv=5, scoring="balanced_accuracy"
).fit(X, y)

print(f"new threshold: {tuned_classifier.best_threshold_:.4f}")
print(
    f"balanced accuracy: {balanced_accuracy_score(y, tuned_classifier.predict(X)):.2f}"
)

# %%
# :class:`~model_selection.TunedThresholdClassifierCV` also benefits from
# metadata routing support (:ref:`Metadata Routing User Guide <metadata_routing>`),
# allowing the optimization of complex business metrics, as detailed
# in :ref:`Post-tuning the decision threshold for cost-sensitive learning
# <sphx_glr_auto_examples_model_selection_plot_cost_sensitive_learning.py>`.
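# %%
# As a minimal sketch of such a business metric (the pay-off values below are
# assumed for illustration, not taken from the example linked above), a custom
# gain can be wrapped with :func:`~metrics.make_scorer` and passed as ``scoring``:
from sklearn.metrics import make_scorer


def business_gain(y_true, y_pred):
    # Hypothetical pay-offs: each true positive earns 5, each false positive
    # costs 1; true negatives and false negatives are neutral.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return 5 * tp - fp


business_classifier = TunedThresholdClassifierCV(
    classifier, cv=5, scoring=make_scorer(business_gain)
).fit(X, y)

print(
    f"threshold tuned for the assumed gain: {business_classifier.best_threshold_:.4f}"
)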
# %%
# Performance improvements in PCA
# -------------------------------
# :class:`~decomposition.PCA` has a new solver, "covariance_eigh", which is faster
# and more memory efficient than the other solvers for datasets with a large number
# of samples and a small number of features.
from sklearn.datasets import make_low_rank_matrix
from sklearn.decomposition import PCA

X = make_low_rank_matrix(
    n_samples=10_000, n_features=100, tail_strength=0.1, random_state=0
)

pca = PCA(n_components=10).fit(X)

print(f"explained variance: {pca.explained_variance_ratio_.sum():.2f}")
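# %%
# The new solver can also be requested explicitly; on this dense matrix, the
# explained variance should closely match the default solver (a small sketch):
pca_cov = PCA(n_components=10, svd_solver="covariance_eigh").fit(X)

print(f"explained variance: {pca_cov.explained_variance_ratio_.sum():.2f}")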
# %%
# The "full" solver has also been improved to use less memory and to
# transform faster. The "auto" option for the solver takes advantage of the
# new solver and is now able to select an appropriate solver for sparse
# datasets.
from scipy.sparse import random

X = random(10_000, 100, format="csr", random_state=0)

pca = PCA(n_components=10, svd_solver="auto").fit(X)
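# %%
# A quick sanity check (sketch): the components fitted on the sparse input have
# the expected shape.
print(f"components shape: {pca.components_.shape}")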
# %%
# ColumnTransformer is subscriptable
# ----------------------------------
# The transformers of a :class:`~compose.ColumnTransformer` can now be directly
# accessed by indexing with their names.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

X = np.array([[0, 1, 2], [3, 4, 5]])
column_transformer = ColumnTransformer(
    [("std_scaler", StandardScaler(), [0]), ("one_hot", OneHotEncoder(), [1, 2])]
)

column_transformer.fit(X)

print(column_transformer["std_scaler"])
print(column_transformer["one_hot"])
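# %%
# A short usage sketch: after ``fit``, the indexed entry is assumed to behave
# like a lookup in ``named_transformers_``, so the fitted attributes of the
# returned transformer are directly available.
scaler = column_transformer["std_scaler"]
print(f"mean of the scaled column: {scaler.mean_}")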
# %%
# Custom imputation strategies for the SimpleImputer
# --------------------------------------------------
# :class:`~impute.SimpleImputer` now supports custom strategies for imputation,
# using a callable that computes a scalar value from the non-missing values of
# a column vector.
from sklearn.impute import SimpleImputer

X = np.array(
    [
        [-1.1, 1.1, 1.1],
        [3.9, -1.2, np.nan],
        [np.nan, 1.3, np.nan],
        [-0.1, -1.4, -1.4],
        [-4.9, 1.5, -1.5],
        [np.nan, 1.6, 1.6],
    ]
)


def smallest_abs(arr):
    """Return the smallest absolute value of a 1D array."""
    return np.min(np.abs(arr))


imputer = SimpleImputer(strategy=smallest_abs)

imputer.fit_transform(X)
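# %%
# As with the built-in strategies, the per-column fill values computed by the
# callable are exposed through the fitted ``statistics_`` attribute:
print(f"fill values: {imputer.statistics_}")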
# %%
# Pairwise distances with non-numeric arrays
# ------------------------------------------
# :func:`~metrics.pairwise_distances` can now compute distances between
# non-numeric arrays using a callable metric.
from sklearn.metrics import pairwise_distances

X = ["cat", "dog"]
Y = ["cat", "fox"]


def levenshtein_distance(x, y):
    """Return the Levenshtein distance between two strings."""
    if x == "" or y == "":
        return max(len(x), len(y))
    if x[0] == y[0]:
        return levenshtein_distance(x[1:], y[1:])
    return 1 + min(
        levenshtein_distance(x[1:], y),
        levenshtein_distance(x, y[1:]),
        levenshtein_distance(x[1:], y[1:]),
    )


pairwise_distances(X, Y, metric=levenshtein_distance)
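# %%
# A follow-up sketch: such a distance matrix can be reused by estimators that
# accept precomputed distances, e.g. :class:`~cluster.DBSCAN` with
# ``metric="precomputed"`` (``eps`` is chosen here only for illustration).
from sklearn.cluster import DBSCAN

words = ["cat", "dog", "cats", "fox"]
distance_matrix = pairwise_distances(words, metric=levenshtein_distance)
clustering = DBSCAN(eps=1.5, min_samples=1, metric="precomputed").fit(distance_matrix)

print(f"cluster labels: {clustering.labels_}")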
