Description
I tried switching colsample_bytree between 2 values at random on each iteration using the reset_parameter callback. It worked perfectly with the native API but seemed to do nothing with the scikit-learn interface.
Code to reproduce:
import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(
    10000, 8, n_classes=7, n_informative=8,
    n_redundant=0,
    random_state=42
)
params = {
    'objective': 'multiclass',
    'num_class': 7,
    'colsample_bytree': 4/8,
    'random_state': 0,
    'verbose': -1
}

rng = np.random.default_rng(0)
booster = lgb.train(
    params, lgb.Dataset(X, y), num_boost_round=100,
    callbacks=[
        lgb.reset_parameter(
            colsample_bytree=lambda _: rng.choice([3/8, 4/8])
        )
    ],
)
trees = booster.trees_to_dataframe()
counts = trees.groupby('tree_index')['split_feature'].nunique().value_counts().to_dict()
print(f'Native: {counts}')

rng = np.random.default_rng(0)
model = LGBMClassifier(
    **params, n_estimators=100,
    callbacks=[
        lgb.reset_parameter(
            colsample_bytree=lambda _: rng.choice([3/8, 4/8])
        )
    ]
).fit(X, y)
trees = model.booster_.trees_to_dataframe()
counts = trees.groupby('tree_index')['split_feature'].nunique().value_counts().to_dict()
print(f'sklearn: {counts}')

print(f'{lgb.__version__=}')
Output:
Native: {4: 392, 3: 308}
sklearn: {4: 700}
lgb.__version__='4.6.0'
The native API produced an ensemble that mixes trees using 3 features with trees using 4 features. The scikit-learn API produced an ensemble in which every tree uses 4 features, so colsample_bytree appears to have stayed at its initial value of 4/8 throughout training.