|
23 | 23 | "Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:\n",
|
24 | 24 | "\n",
|
25 | 25 | "```\n",
|
26 |
| - "{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method=\"zero\") + bs(BonusMalus, 3) + 1\n", |
| 26 | + "{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method=\"zero\") + bs(BonusMalus, 3)\n", |
27 | 27 | "```\n",
|
28 | 28 | "\n",
|
29 | 29 | "Despite its brevity, it describes all of the following:\n",
|
|
32 | 32 | " - If there are missing values in `DrivAge`, they should be treated as a separate category.\n",
|
33 | 33 | " - On the other hand, missing values in `VehPower` should be treated as all-zero indicators.\n",
|
34 | 34 | " - The predictors should also include a third-degree B-spline interpolation of `BonusMalus`.\n",
|
35 |
| - " - The model should include an intercept.\n", |
36 | 35 | "\n",
|
37 | 36 | "The following chapters demonstrate each of these features in some detail, as well as some additional advantages of using the formula interface."
|
38 | 37 | ]
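The two `missing_method` options described in the list above can be illustrated with a small, library-free sketch. It is not how glum or its formula backend encodes categoricals: the `one_hot` helper and the `MISSING_LABEL` name are hypothetical, `None` stands in for a missing value, and the sketch ignores reference-level dropping.

```python
MISSING_LABEL = "(MISSING)"  # hypothetical label for the extra category


def one_hot(values, categories, missing_method):
    """One-hot encode `values`; `None` marks a missing entry (an assumption)."""
    if missing_method not in ("convert", "zero"):
        raise ValueError(f"unsupported missing_method: {missing_method!r}")

    columns = list(categories)
    if missing_method == "convert":
        # Missing values get a category (and therefore a column) of their own.
        columns.append(MISSING_LABEL)

    rows = []
    for value in values:
        label = MISSING_LABEL if value is None else value
        # Under "zero" no column matches MISSING_LABEL, so the row is all zeros.
        rows.append([int(label == column) for column in columns])
    return columns, rows


columns_convert, rows_convert = one_hot(["a", None, "b"], ["a", "b"], "convert")
columns_zero, rows_zero = one_hot(["a", None, "b"], ["a", "b"], "zero")
```

With `"convert"`, the missing entry sets the indicator of the additional column; with `"zero"`, it produces a row whose indicators are all zero.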
|
|
59 | 58 | "import matplotlib.pyplot as plt\n",
|
60 | 59 | "import numpy as np\n",
|
61 | 60 | "import pandas as pd\n",
|
| 61 | + "import pytest\n", |
62 | 62 | "import scipy.optimize as optimize\n",
|
63 | 63 | "import scipy.stats\n",
|
64 | 64 | "from dask_ml.preprocessing import Categorizer\n",
|
|
1261 | 1261 | "source": [
|
1262 | 1262 | "### Intercept Term\n",
|
1263 | 1263 | "\n",
|
1264 |
| - "Just like in the case of the non-formula interface, an intercept term is added by default. This can be disabled by either setting the `fit_intercept` parameter to `False`, or adding `+0` or `-1` to the end of the formula. In the case of conflict, a warning is emitted, and the latter takes precedence." |
| 1264 | + "As with the non-formula interface, the `fit_intercept` argument determines whether the model includes an intercept. If the formula specifies a different behavior (e.g., it contains `+0` or `-1` while `fit_intercept=True`), a `ValueError` is raised." |
1265 | 1265 | ]
|
1266 | 1266 | },
|
1267 | 1267 | {
|
1268 | 1268 | "cell_type": "code",
|
1269 |
| - "execution_count": 12, |
| 1269 | + "execution_count": null, |
1270 | 1270 | "metadata": {},
|
1271 |
| - "outputs": [ |
1272 |
| - { |
1273 |
| - "name": "stderr", |
1274 |
| - "output_type": "stream", |
1275 |
| - "text": [ |
1276 |
| - "/Users/stanmart/work/glum/src/glum/_glm.py:2354: UserWarning: The formula explicitly sets the intercept to False, overriding fit_intercept=True.\n", |
1277 |
| - " warnings.warn(\n" |
1278 |
| - ] |
1279 |
| - }, |
1280 |
| - { |
1281 |
| - "data": { |
1282 |
| - "text/html": [ |
1283 |
| - "<div>\n", |
1284 |
| - "<style scoped>\n", |
1285 |
| - " .dataframe tbody tr th:only-of-type {\n", |
1286 |
| - " vertical-align: middle;\n", |
1287 |
| - " }\n", |
1288 |
| - "\n", |
1289 |
| - " .dataframe tbody tr th {\n", |
1290 |
| - " vertical-align: top;\n", |
1291 |
| - " }\n", |
1292 |
| - "\n", |
1293 |
| - " .dataframe thead th {\n", |
1294 |
| - " text-align: right;\n", |
1295 |
| - " }\n", |
1296 |
| - "</style>\n", |
1297 |
| - "<table border=\"1\" class=\"dataframe\">\n", |
1298 |
| - " <thead>\n", |
1299 |
| - " <tr style=\"text-align: right;\">\n", |
1300 |
| - " <th></th>\n", |
1301 |
| - " <th>intercept</th>\n", |
1302 |
| - " <th>DrivAge__0</th>\n", |
1303 |
| - " <th>DrivAge__1</th>\n", |
1304 |
| - " <th>DrivAge__2</th>\n", |
1305 |
| - " <th>DrivAge__3</th>\n", |
1306 |
| - " <th>DrivAge__4</th>\n", |
1307 |
| - " <th>DrivAge__5</th>\n", |
1308 |
| - " <th>DrivAge__6</th>\n", |
1309 |
| - " <th>VehPower__4</th>\n", |
1310 |
| - " <th>VehPower__5</th>\n", |
1311 |
| - " <th>...</th>\n", |
1312 |
| - " <th>DrivAge__4__x__VehPower__8</th>\n", |
1313 |
| - " <th>DrivAge__5__x__VehPower__8</th>\n", |
1314 |
| - " <th>DrivAge__6__x__VehPower__8</th>\n", |
1315 |
| - " <th>DrivAge__0__x__VehPower__9</th>\n", |
1316 |
| - " <th>DrivAge__1__x__VehPower__9</th>\n", |
1317 |
| - " <th>DrivAge__2__x__VehPower__9</th>\n", |
1318 |
| - " <th>DrivAge__3__x__VehPower__9</th>\n", |
1319 |
| - " <th>DrivAge__4__x__VehPower__9</th>\n", |
1320 |
| - " <th>DrivAge__5__x__VehPower__9</th>\n", |
1321 |
| - " <th>DrivAge__6__x__VehPower__9</th>\n", |
1322 |
| - " </tr>\n", |
1323 |
| - " </thead>\n", |
1324 |
| - " <tbody>\n", |
1325 |
| - " <tr>\n", |
1326 |
| - " <th>coefficient</th>\n", |
1327 |
| - " <td>0.0</td>\n", |
1328 |
| - " <td>1.713298</td>\n", |
1329 |
| - " <td>0.783505</td>\n", |
1330 |
| - " <td>0.205914</td>\n", |
1331 |
| - " <td>0.016085</td>\n", |
1332 |
| - " <td>0.0</td>\n", |
1333 |
| - " <td>0.000094</td>\n", |
1334 |
| - " <td>0.223685</td>\n", |
1335 |
| - " <td>4.66123</td>\n", |
1336 |
| - " <td>4.736272</td>\n", |
1337 |
| - " <td>...</td>\n", |
1338 |
| - " <td>-0.144927</td>\n", |
1339 |
| - " <td>0.001657</td>\n", |
1340 |
| - " <td>0.515373</td>\n", |
1341 |
| - " <td>0.714834</td>\n", |
1342 |
| - " <td>-0.325666</td>\n", |
1343 |
| - " <td>-0.370935</td>\n", |
1344 |
| - " <td>0.20417</td>\n", |
1345 |
| - " <td>0.013222</td>\n", |
1346 |
| - " <td>-0.273913</td>\n", |
1347 |
| - " <td>0.115693</td>\n", |
1348 |
| - " </tr>\n", |
1349 |
| - " </tbody>\n", |
1350 |
| - "</table>\n", |
1351 |
| - "<p>1 rows × 56 columns</p>\n", |
1352 |
| - "</div>" |
1353 |
| - ], |
1354 |
| - "text/plain": [ |
1355 |
| - " intercept DrivAge__0 DrivAge__1 DrivAge__2 DrivAge__3 \\\n", |
1356 |
| - "coefficient 0.0 1.713298 0.783505 0.205914 0.016085 \n", |
1357 |
| - "\n", |
1358 |
| - " DrivAge__4 DrivAge__5 DrivAge__6 VehPower__4 VehPower__5 \\\n", |
1359 |
| - "coefficient 0.0 0.000094 0.223685 4.66123 4.736272 \n", |
1360 |
| - "\n", |
1361 |
| - " ... DrivAge__4__x__VehPower__8 DrivAge__5__x__VehPower__8 \\\n", |
1362 |
| - "coefficient ... -0.144927 0.001657 \n", |
1363 |
| - "\n", |
1364 |
| - " DrivAge__6__x__VehPower__8 DrivAge__0__x__VehPower__9 \\\n", |
1365 |
| - "coefficient 0.515373 0.714834 \n", |
1366 |
| - "\n", |
1367 |
| - " DrivAge__1__x__VehPower__9 DrivAge__2__x__VehPower__9 \\\n", |
1368 |
| - "coefficient -0.325666 -0.370935 \n", |
1369 |
| - "\n", |
1370 |
| - " DrivAge__3__x__VehPower__9 DrivAge__4__x__VehPower__9 \\\n", |
1371 |
| - "coefficient 0.20417 0.013222 \n", |
1372 |
| - "\n", |
1373 |
| - " DrivAge__5__x__VehPower__9 DrivAge__6__x__VehPower__9 \n", |
1374 |
| - "coefficient -0.273913 0.115693 \n", |
1375 |
| - "\n", |
1376 |
| - "[1 rows x 56 columns]" |
1377 |
| - ] |
1378 |
| - }, |
1379 |
| - "execution_count": 12, |
1380 |
| - "metadata": {}, |
1381 |
| - "output_type": "execute_result" |
1382 |
| - } |
1383 |
| - ], |
| 1271 | + "outputs": [], |
1384 | 1272 | "source": [
|
1385 | 1273 | "formula_noint = \"PurePremium ~ DrivAge * VehPower - 1\"\n",
|
1386 | 1274 | "\n",
|
1387 |
| - "t_glm6 = GeneralizedLinearRegressor(\n", |
1388 |
| - " family=TweedieDist,\n", |
1389 |
| - " alpha_search=True,\n", |
1390 |
| - " l1_ratio=1,\n", |
1391 |
| - " fit_intercept=True,\n", |
1392 |
| - " formula=formula_noint,\n", |
1393 |
| - " interaction_separator=\"__x__\",\n", |
1394 |
| - " categorical_format=\"{name}__{category}\",\n", |
1395 |
| - ")\n", |
1396 |
| - "t_glm6.fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", |
1397 |
| - "\n", |
1398 |
| - "pd.DataFrame(\n", |
1399 |
| - " {\"coefficient\": np.concatenate(([t_glm6.intercept_], t_glm6.coef_))},\n", |
1400 |
| - " index=[\"intercept\"] + t_glm6.feature_names_,\n", |
1401 |
| - ").T" |
| 1275 | + "with pytest.raises(ValueError, match=\"The formula sets the intercept to False\"):\n", |
| 1276 | + " t_glm6 = GeneralizedLinearRegressor(\n", |
| 1277 | + " family=TweedieDist,\n", |
| 1278 | + " alpha_search=True,\n", |
| 1279 | + " l1_ratio=1,\n", |
| 1280 | + " fit_intercept=True,\n", |
| 1281 | + " formula=formula_noint,\n", |
| 1282 | + " interaction_separator=\"__x__\",\n", |
| 1283 | + " categorical_format=\"{name}__{category}\",\n", |
| 1284 | + " )" |
1402 | 1285 | ]
|
1403 | 1286 | },
|
1404 | 1287 | {
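The conflict rule this change introduces can be sketched without glum. The helpers below are hypothetical and are not the library's implementation: `formula_removes_intercept` is a naive text check for a `+0` or `-1` term (a real formula parser is more thorough, and a leading `0 +` is not detected here), and only the beginning of the error message is taken from the test in the diff.

```python
import re


def formula_removes_intercept(formula):
    """Return True if the right-hand side contains a `+0` or `-1` term."""
    rhs = formula.split("~", 1)[-1]
    # Drop whitespace so that "- 1" and "-1" are treated alike.
    rhs = re.sub(r"\s+", "", rhs)
    # Match "+0" or "-1" only as a whole term: it must be followed by
    # another "+"/"-" operator or by the end of the formula.
    return re.search(r"(\+0|-1)(?=[+\-]|$)", rhs) is not None


def resolve_intercept(formula, fit_intercept):
    """Raise on a conflict; otherwise `fit_intercept` decides (hypothetical helper)."""
    if fit_intercept and formula_removes_intercept(formula):
        # The text after the first clause is an assumption, not glum's wording.
        raise ValueError(
            "The formula sets the intercept to False, "
            "which contradicts fit_intercept=True."
        )
    return fit_intercept
```

Calling `resolve_intercept("PurePremium ~ DrivAge * VehPower - 1", True)` raises, whereas passing `fit_intercept=False` with the same formula returns `False` without complaint.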
|
|