Commit b855758

Merge branch 'glum-v3' into convert-nas-unseen
2 parents ee05d5d + 9cd1d8e commit b855758

File tree

7 files changed: +94 −180 lines

README.md

Lines changed: 10 additions & 1 deletion
@@ -68,7 +68,7 @@ Why did we choose the name `glum`? We wanted a name that had the letters GLM and
 >>>
 >>> _ = model.fit(X=X, y=y)
 >>>
->>> # .report_diagnostics shows details about the steps taken by the iterative solver
+>>> # .report_diagnostics shows details about the steps taken by the iterative solver.
 >>> diags = model.get_formatted_diagnostics(full_report=True)
 >>> diags[['objective_fct']]
 objective_fct
@@ -79,6 +79,15 @@ n_iter
 3 0.443681
 4 0.443498
 5 0.443497
+>>>
+>>> # Models can also be built with formulas from formulaic.
+>>> model_formula = GeneralizedLinearRegressor(
+...     family='binomial',
+...     l1_ratio=1.0,
+...     alpha=0.001,
+...     formula="bedrooms + np.log(bathrooms + 1) + bs(sqft_living, 3) + C(waterfront)"
+... )
+>>> _ = model_formula.fit(X=house_data.data, y=y)
 
 ```
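Note (not part of the commit): an illustrative sketch of how the formula-based model added to the README might be inspected after fitting. It assumes the added example above has been run, so `model_formula` is the fitted estimator and `house_data.data` is the pandas data frame it was fit on; `coef_`, `feature_names_`, and `predict` are the usual glum estimator attributes.

```python
# Illustrative continuation of the README example above (assumes model_formula
# has been fit on house_data.data as shown in the added lines).
import pandas as pd

# One coefficient per term expanded from the formula (log transform, spline
# basis columns, categorical dummies).
coefs = pd.Series(model_formula.coef_, index=model_formula.feature_names_)
print(coefs)

# Predictions reuse the transformations recorded in the model spec at fit time.
preds = model_formula.predict(house_data.data)
```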
conda.recipe/meta.yaml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ requirements:
     - pandas
     - scikit-learn >=0.23
     - scipy
-    - formulaic >=0.4
+    - formulaic >=0.6
     - tabmat >=4.0.0a
 
 test:

docs/tutorials/formula_interface/formula_interface.ipynb

Lines changed: 15 additions & 132 deletions
@@ -23,7 +23,7 @@
 "Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:\n",
 "\n",
 "```\n",
-"{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method=\"zero\") + bs(BonusMalus, 3) + 1\n",
+"{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method=\"zero\") + bs(BonusMalus, 3)\n",
 "```\n",
 "\n",
 "Despite its brevity, it describes all of the following:\n",
@@ -32,7 +32,6 @@
 " - If there are missing values in `DrivAge`, they should be treated as a separate category.\n",
 " - On the other hand, missing values in `VehPower` should be treated as all-zero indicators.\n",
 " - The predictors should also include a third degree B-spline interpolation of `BonusMalus`.\n",
-" - The model should include an intercept.\n",
 "\n",
 "The following chapters demonstrate each of these features in some detail, as well as some additional advantages of using the formula interface."
 ]
@@ -59,6 +58,7 @@
 "import matplotlib.pyplot as plt\n",
 "import numpy as np\n",
 "import pandas as pd\n",
+"import pytest\n",
 "import scipy.optimize as optimize\n",
 "import scipy.stats\n",
 "from dask_ml.preprocessing import Categorizer\n",
@@ -1261,144 +1261,27 @@
 "source": [
 "### Intercept Term\n",
 "\n",
-"Just like in the case of the non-formula interface, an intercept term is added by default. This can be disabled by either setting the `fit_intercept` parameter to `False`, or adding `+0` or `-1` to the end of the formula. In the case of conflict, a warning is emitted, and the latter takes precedence."
+"Just like in the case of the non-formula interface, the presence of an intercept is determined by the `fit_intercept` argument. In case that the formula specifies a different behavior (e.g., adding `+0` or `-1` while `fit_intercept=True`), an error will be raised."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 12,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-"/Users/stanmart/work/glum/src/glum/_glm.py:2354: UserWarning: The formula explicitly sets the intercept to False, overriding fit_intercept=True.\n",
-"  warnings.warn(\n"
-]
-},
-[... removed execute_result output (original lines 1280-1382): the rendered coefficient table, 1 rows × 56 columns ...]
-],
+"outputs": [],
 "source": [
 "formula_noint = \"PurePremium ~ DrivAge * VehPower - 1\"\n",
 "\n",
-"t_glm6 = GeneralizedLinearRegressor(\n",
-"    family=TweedieDist,\n",
-"    alpha_search=True,\n",
-"    l1_ratio=1,\n",
-"    fit_intercept=True,\n",
-"    formula=formula_noint,\n",
-"    interaction_separator=\"__x__\",\n",
-"    categorical_format=\"{name}__{category}\",\n",
-")\n",
-"t_glm6.fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n",
-"\n",
-"pd.DataFrame(\n",
-"    {\"coefficient\": np.concatenate(([t_glm6.intercept_], t_glm6.coef_))},\n",
-"    index=[\"intercept\"] + t_glm6.feature_names_,\n",
-").T"
+"with pytest.raises(ValueError, match=\"The formula sets the intercept to False\"):\n",
+"    t_glm6 = GeneralizedLinearRegressor(\n",
+"        family=TweedieDist,\n",
+"        alpha_search=True,\n",
+"        l1_ratio=1,\n",
+"        fit_intercept=True,\n",
+"        formula=formula_noint,\n",
+"        interaction_separator=\"__x__\",\n",
+"        categorical_format=\"{name}__{category}\",\n",
+"    )"
 ]
 },
 {
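Note (not part of the commit): a condensed sketch of the formula interface features this notebook exercises. The formula string is the one from the updated notebook text; `df_train`, `exposure`, and `TweedieDist` stand in for objects defined in earlier tutorial cells that are not part of this diff.

```python
# Sketch of the tutorial's formula interface, under the assumptions stated above.
from glum import GeneralizedLinearRegressor

formula = (
    "{ClaimAmountCut / Exposure} ~ "
    "C(DrivAge, missing_method='convert') * C(VehPower, missing_method='zero') "
    "+ bs(BonusMalus, 3)"
)

model = GeneralizedLinearRegressor(
    family=TweedieDist,                       # Tweedie distribution object from the tutorial
    formula=formula,                          # two-sided formula: response is built from the data frame
    interaction_separator="__x__",            # naming of interaction columns
    categorical_format="{name}__{category}",  # naming of categorical dummy columns
)
model.fit(df_train, sample_weight=exposure)
```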

setup.py

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@
         "pandas",
         "scikit-learn>=0.23",
         "scipy",
-        "formulaic>=0.4",
+        "formulaic>=0.6",
         "tabmat>=4.0.0a",
     ],
     entry_points=None

src/glum/_distribution.py

Lines changed: 1 addition & 1 deletion
@@ -1355,7 +1355,7 @@ def guess_intercept(
         second = np.log((mu ** (2 - p)).dot(sample_weight))
         return first - second
     elif isinstance(link, LogitLink):
-        log_odds = np.log(avg_y) - np.log(np.average(1 - y, weights=sample_weight))
+        log_odds = np.log(avg_y) - np.log(1 - avg_y)
         if eta is None:
             return log_odds
         avg_eta = eta if np.isscalar(eta) else np.average(eta, weights=sample_weight)

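Note (not part of the commit): a small numerical sketch of why the simplified log-odds expression above is equivalent to the old one, assuming `avg_y` is the `sample_weight`-weighted mean of `y`, as its name suggests.

```python
# A minimal check, assuming avg_y = np.average(y, weights=sample_weight).
import numpy as np
from scipy.special import logit

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100).astype(float)
sample_weight = rng.uniform(0.5, 2.0, size=100)

avg_y = np.average(y, weights=sample_weight)

# The weighted average of (1 - y) equals 1 - avg_y, so the two forms agree;
# the new expression simply avoids a second pass over the data.
old = np.log(avg_y) - np.log(np.average(1 - y, weights=sample_weight))
new = np.log(avg_y) - np.log(1 - avg_y)

assert np.isclose(old, new)
assert np.isclose(new, logit(avg_y))  # i.e. the log-odds of the weighted mean
```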
src/glum/_glm.py

Lines changed: 7 additions & 6 deletions
@@ -244,8 +244,7 @@ def _parse_formula(
     formula : FormulaSpec
         The formula to parse.
     include_intercept: bool, default True
-        Whether to include an intercept column if the formula does not
-        include (``+ 1``) or exclude (``+ 0`` or ``- 1``) it explicitly.
+        Whether to include an intercept column.
 
     Returns
     -------
@@ -2683,11 +2682,11 @@ def _set_up_and_check_fit_args(
 
             intercept = "1" in X.model_spec.terms
             if intercept != self.fit_intercept:
-                warnings.warn(
-                    f"The formula explicitly sets the intercept to {intercept}, "
-                    f"overriding fit_intercept={self.fit_intercept}."
+                raise ValueError(
+                    f"The formula sets the intercept to {intercept}, "
+                    f"contradicting fit_intercept={self.fit_intercept}. "
+                    "You should use fit_intercept to specify the intercept."
                 )
-            self.fit_intercept = intercept
 
             self.X_model_spec_ = X.model_spec
 
@@ -3114,6 +3113,7 @@ class GeneralizedLinearRegressor(GeneralizedLinearRegressorBase):
     expected_information : bool, optional (default = False)
         If true, then the expected information matrix is computed by default.
         Only relevant when computing robust standard errors.
+
     formula : FormulaSpec
         A formula accepted by formulaic. It can either be a one-sided formula, in
         which case ``y`` must be specified in ``fit``, or a two-sided formula, in
@@ -3140,6 +3140,7 @@ class GeneralizedLinearRegressor(GeneralizedLinearRegressorBase):
         - if 'zero', missing values will represent all-zero indicator columns.
         - if 'convert', missing values will be converted to the ``cat_missing_name``
           category.
+
     cat_missing_name: str, default='(MISSING)'
         Name of the category to which missing values will be converted if
         ``cat_missing_method='convert'``. Only used if ``X`` is a pandas data frame.

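Note (not part of the commit): a minimal sketch of the stricter intercept handling introduced above, using hypothetical data. If the formula removes the intercept (via `- 1` or `+ 0`) while `fit_intercept=True`, `fit` now raises a `ValueError` instead of emitting a warning and overriding the flag.

```python
# Hypothetical example of the new ValueError behavior.
import pandas as pd
from glum import GeneralizedLinearRegressor

df = pd.DataFrame({"y": [0.2, 0.4, 0.3, 0.8], "x": [1.0, 2.0, 3.0, 4.0]})

# Two-sided formula that drops the intercept, contradicting fit_intercept=True.
model = GeneralizedLinearRegressor(fit_intercept=True, formula="y ~ x - 1")

try:
    model.fit(df)
except ValueError as err:
    # "The formula sets the intercept to False, contradicting fit_intercept=True. ..."
    print(err)
```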