[c++] forcedsplits_filename pointing at a non-existent file is silently ignored #6830

@jameslamb

Description

If you pass a non-existent file via the forcedsplits_filename parameter, LightGBM appears to silently ignore it.

It should raise an informative error if reading that file fails, or at least log a warning.

Reproducible Example

Using lightgbm==4.6.0 installed from PyPI.

import json
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=10_000,
    n_features=5,
    n_informative=5,
    random_state=42
)

# add a noise feature
noise_feature = np.random.random(size=(X.shape[0], 1))
X = np.concatenate((X, noise_feature), axis=1)

# force the use of that noise feature in every tree
forced_split = {
    "feature": 5,
    "threshold": np.mean(noise_feature),
}
with open("forced_splits.json", "w") as f:
    f.write(json.dumps(forced_split))

# train a model, forcing it to use that split
model = lgb.LGBMRegressor(
    random_state=708,
    n_estimators=10,
    verbose=1,
    forcedsplits_filename="forced_splits.json",
)
model.fit(X, y)

# noise feature was used exactly once in every tree
# (because we forced LightGBM to use it)
model.feature_importances_
# array([  0, 109, 132,   0,  49,  10], dtype=int32)

# passing a non-existent file... no warning, no error
model2 = lgb.LGBMRegressor(
    random_state=708,
    n_estimators=10,
    verbose=1,
    forcedsplits_filename="does-not-exist.json",
)
model2.fit(X, y)

Logs from that second .fit():

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000568 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1530
[LightGBM] [Info] Number of data points in the train set: 10000, number of used features: 6
[LightGBM] [Info] Start training from score -0.889445
LGBMRegressor(forcedsplits_filename='does-not-exist.json', n_estimators=10,
              random_state=708, verbose=1)
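
Since the logs give no hint either way, one way to check whether a forced split was actually applied is to look at the root split of every tree. The following is a minimal sketch, not part of the original report: `root_split_features` is a hypothetical helper name, and it assumes the dictionary layout returned by `Booster.dump_model()` (a `tree_info` list whose entries have a `tree_structure` with a `split_feature` key).

```python
def root_split_features(model_dump):
    """Return the feature index used at the root of each tree.

    Hypothetical helper. `model_dump` is assumed to be the dict returned
    by Booster.dump_model(). A tree consisting of a single leaf has no
    root split, so None is returned for it.
    """
    return [
        tree["tree_structure"].get("split_feature")
        for tree in model_dump["tree_info"]
    ]


# Usage with the models from the example above (requires lightgbm):
#   root_split_features(model.booster_.dump_model())
#   root_split_features(model2.booster_.dump_model())
```

If the forced split is honored, every root of `model` would be expected to split on feature 5. Roots of `model2` that split on other features would be consistent with the file having been ignored.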

Notes

Noticed this while working on https://stackoverflow.com/a/79435055/3986677.

I strongly suspect it is not specific to the Python package, and that changes need to be made in the C++ code.
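
Until the library handles this itself, a caller-side check can fail loudly before training starts. This is only a sketch of a possible workaround, not LightGBM behavior: `validate_forced_splits_file` is a hypothetical helper, and it checks only that the file exists and parses as JSON, not that its contents form a valid forced-splits specification.

```python
import json
from pathlib import Path


def validate_forced_splits_file(path):
    """Hypothetical helper: raise if the forced-splits file is missing or not valid JSON.

    Returns the path as a string so it can be passed straight through
    to the forcedsplits_filename parameter.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"forcedsplits_filename does not exist: {p}")
    with p.open() as f:
        try:
            json.load(f)
        except json.JSONDecodeError as err:
            raise ValueError(f"forcedsplits_filename is not valid JSON: {p}") from err
    return str(p)


# Usage (requires lightgbm):
#   model = lgb.LGBMRegressor(
#       forcedsplits_filename=validate_forced_splits_file("forced_splits.json"),
#   )
```

With this in place, the does-not-exist.json case from the example would raise FileNotFoundError in Python instead of training silently without the forced split.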
