{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formula Interface Tutorial: Revisiting French Motor Third-Party Liability Claims\n", "\n", "\n", "**Intro**\n", "\n", "This tutorial showcases the formula interface of `glum`. It allows for the specification of the design matrix and the response variable using so-called [Wilkinson-formulas](https://www.jstor.org/stable/2346786) instead of constructing them by hand. This kind of model specification should be familiar to R users or those who have used the `statsmodels` or `linearmodels` Python packages before. This tutorial aims to introduce the basics of working with formulas to other users, and to highlight some important differences between `glum`'s and other packages' formula implementations.\n", "\n", "For a more in-depth look at how formulas work, please take a look at the [documentation of `formulaic`](https://matthewwardrop.github.io/formulaic/), the package on which `glum`'s formula interface is based.\n", "\n", "\n", "**Background**\n", "\n", "This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the [GLM tutorial](tutorials/glm_french_motor_tutorial/glm_french_motor.html). If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.\n", "\n", "**Sneak Peek**\n", "\n", "Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. 
As an example, consider the following formula:\n", "\n", "```\n", "{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method=\"zero\") + bs(BonusMalus, 3)\n", "```\n", "\n", "Despite its brevity, it describes all of the following:\n", " - The outcome variable is the ratio of `ClaimAmountCut` and `Exposure`.\n", " - The predictors should include the interactions of the categorical variables `DrivAge` and `VehPower`, as well as those two variables themselves. (Although they behave as if they were dummy-encoded, neither the individual variables nor their interaction is actually dummy-encoded by `glum`. For categoricals with many levels, this can lead to a substantial performance improvement over dummy encoding, especially for the interaction.)\n", " - If there are missing values in `DrivAge`, they should be treated as a separate category.\n", " - On the other hand, missing values in `VehPower` should be treated as all-zero indicators.\n", " - The predictors should also include a third-degree B-spline interpolation of `BonusMalus`.\n", "\n", "The following chapters demonstrate each of these features in some detail, as well as some additional advantages of using the formula interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": "## Table of Contents\n* [1. Load and Prepare Datasets from Openml](#1.-Load-and-Prepare-Datasets-from-Openml)\n* [2. Reproducing the Model From the GLM Tutorial](#2.-Reproducing-the-Model-From-the-GLM-Tutorial)\n* [3. Categorical Variables](#3.-Categorical-Variables)\n* [4. Interactions and Structural Full-Rankness](#4.-Interactions-and-Structural-Full-Rankness)\n* [5. Fun with Functions](#5.-Fun-with-Functions)\n* [6. Monotonic Constraints](#6.-Monotonic-Constraints)\n* [7. 
Miscellaneous Features](#7.-Miscellaneous-Features)" }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:11.302339Z", "iopub.status.busy": "2026-03-16T16:10:11.302010Z", "iopub.status.idle": "2026-03-16T16:10:12.926504Z", "shell.execute_reply": "2026-03-16T16:10:12.926009Z" } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import scipy.optimize as optimize\n", "import scipy.stats\n", "from dask_ml.preprocessing import Categorizer\n", "from sklearn.metrics import mean_absolute_error\n", "from sklearn.model_selection import ShuffleSplit\n", "from glum import GeneralizedLinearRegressor\n", "from glum import TweedieDistribution\n", "\n", "from load_transform_formula import load_transform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load and Prepare Datasets from Openml\n", "[back to table of contents](#Table-of-Contents)\n", "\n", "First, we load our [dataset from OpenML](https://www.openml.org/d/41214) and apply several transformations. In the interest of simplicity, we do not include the data loading and preparation code in this notebook; it lives in the `load_transform` helper imported above." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:12.928178Z", "iopub.status.busy": "2026-03-16T16:10:12.927994Z", "iopub.status.idle": "2026-03-16T16:10:15.244947Z", "shell.execute_reply": "2026-03-16T16:10:15.244591Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
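<table>
<thead>
<tr><th>IDpol</th><th>ClaimNb</th><th>Exposure</th><th>Area</th><th>VehPower</th><th>VehAge</th><th>DrivAge</th><th>BonusMalus</th><th>VehBrand</th><th>VehGas</th><th>Density</th><th>Region</th><th>ClaimAmount</th><th>ClaimAmountCut</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>0</td><td>0.10000</td><td>D</td><td>5</td><td>0</td><td>5</td><td>50</td><td>B12</td><td>Regular</td><td>1217</td><td>R82</td><td>0.0</td><td>0.0</td></tr>
<tr><th>3</th><td>0</td><td>0.77000</td><td>D</td><td>5</td><td>0</td><td>5</td><td>50</td><td>B12</td><td>Regular</td><td>1217</td><td>R82</td><td>0.0</td><td>0.0</td></tr>
<tr><th>5</th><td>0</td><td>0.75000</td><td>B</td><td>6</td><td>1</td><td>5</td><td>50</td><td>B12</td><td>Diesel</td><td>54</td><td>R22</td><td>0.0</td><td>0.0</td></tr>
<tr><th>10</th><td>0</td><td>0.09000</td><td>B</td><td>7</td><td>0</td><td>4</td><td>50</td><td>B12</td><td>Diesel</td><td>76</td><td>R72</td><td>0.0</td><td>0.0</td></tr>
<tr><th>11</th><td>0</td><td>0.84000</td><td>B</td><td>7</td><td>0</td><td>4</td><td>50</td><td>B12</td><td>Diesel</td><td>76</td><td>R72</td><td>0.0</td><td>0.0</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>6114326</th><td>0</td><td>0.00274</td><td>E</td><td>4</td><td>0</td><td>5</td><td>50</td><td>B12</td><td>Regular</td><td>3317</td><td>R93</td><td>0.0</td><td>0.0</td></tr>
<tr><th>6114327</th><td>0</td><td>0.00274</td><td>E</td><td>4</td><td>0</td><td>4</td><td>95</td><td>B12</td><td>Regular</td><td>9850</td><td>R11</td><td>0.0</td><td>0.0</td></tr>
<tr><th>6114328</th><td>0</td><td>0.00274</td><td>D</td><td>6</td><td>1</td><td>4</td><td>50</td><td>B12</td><td>Diesel</td><td>1323</td><td>R82</td><td>0.0</td><td>0.0</td></tr>
<tr><th>6114329</th><td>0</td><td>0.00274</td><td>B</td><td>4</td><td>0</td><td>5</td><td>50</td><td>B12</td><td>Regular</td><td>95</td><td>R26</td><td>0.0</td><td>0.0</td></tr>
<tr><th>6114330</th><td>0</td><td>0.00274</td><td>B</td><td>7</td><td>1</td><td>2</td><td>54</td><td>B12</td><td>Diesel</td><td>65</td><td>R72</td><td>0.0</td><td>0.0</td></tr>
</tbody>
</table>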
\n", "

678013 rows × 13 columns

\n", "
" ], "text/plain": [ " ClaimNb Exposure Area VehPower VehAge DrivAge BonusMalus \\\n", "IDpol \n", "1 0 0.10000 D 5 0 5 50 \n", "3 0 0.77000 D 5 0 5 50 \n", "5 0 0.75000 B 6 1 5 50 \n", "10 0 0.09000 B 7 0 4 50 \n", "11 0 0.84000 B 7 0 4 50 \n", "... ... ... ... ... ... ... ... \n", "6114326 0 0.00274 E 4 0 5 50 \n", "6114327 0 0.00274 E 4 0 4 95 \n", "6114328 0 0.00274 D 6 1 4 50 \n", "6114329 0 0.00274 B 4 0 5 50 \n", "6114330 0 0.00274 B 7 1 2 54 \n", "\n", " VehBrand VehGas Density Region ClaimAmount ClaimAmountCut \n", "IDpol \n", "1 B12 Regular 1217 R82 0.0 0.0 \n", "3 B12 Regular 1217 R82 0.0 0.0 \n", "5 B12 Diesel 54 R22 0.0 0.0 \n", "10 B12 Diesel 76 R72 0.0 0.0 \n", "11 B12 Diesel 76 R72 0.0 0.0 \n", "... ... ... ... ... ... ... \n", "6114326 B12 Regular 3317 R93 0.0 0.0 \n", "6114327 B12 Regular 9850 R11 0.0 0.0 \n", "6114328 B12 Diesel 1323 R82 0.0 0.0 \n", "6114329 B12 Regular 95 R26 0.0 0.0 \n", "6114330 B12 Diesel 65 R72 0.0 0.0 \n", "\n", "[678013 rows x 13 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = load_transform()\n", "with pd.option_context('display.max_rows', 10):\n", " display(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Reproducing the Model From the GLM Tutorial\n", "\n", "Now, let us start by fitting a very simple model. As usual, let's divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate `y` and `X` data frames for our label and features – the formula will take care of that for us.\n", "\n", "We still have some preprocessing to do:\n", " - Many of the ordinal or nominal variables are encoded as integers, instead of as categoricals. We will need to convert these so that `glum` will know to estimate a separate coefficient for each of their levels.\n", " - The outcome variable is a transformation of other columns. 
We need to create it first.\n", "\n", "As we will see later on, these steps can be incorporated into the formula itself, but let's not overcomplicate things at first." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:15.259806Z", "iopub.status.busy": "2026-03-16T16:10:15.259694Z", "iopub.status.idle": "2026-03-16T16:10:15.471707Z", "shell.execute_reply": "2026-03-16T16:10:15.470808Z" } }, "outputs": [], "source": [ "ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)\n", "train, test = next(ss.split(df))\n", "\n", "df = df.assign(PurePremium=lambda x: x[\"ClaimAmountCut\"] / x[\"Exposure\"])\n", "\n", "glm_categorizer = Categorizer(\n", " columns=[\"VehBrand\", \"VehGas\", \"Region\", \"Area\", \"DrivAge\", \"VehAge\", \"VehPower\"]\n", ")\n", "df_train = glm_categorizer.fit_transform(df.iloc[train])\n", "df_test = glm_categorizer.transform(df.iloc[test])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different predictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to build the `y` and `X` data frames by hand." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:15.473697Z", "iopub.status.busy": "2026-03-16T16:10:15.473584Z", "iopub.status.idle": "2026-03-16T16:10:22.769294Z", "shell.execute_reply": "2026-03-16T16:10:22.768802Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
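<table>
<thead>
<tr><th></th><th>intercept</th><th>VehBrand[B1]</th><th>VehBrand[B10]</th><th>VehBrand[B11]</th><th>VehBrand[B12]</th><th>VehBrand[B13]</th><th>VehBrand[B14]</th><th>VehBrand[B2]</th><th>VehBrand[B3]</th><th>VehBrand[B4]</th><th>...</th><th>VehAge[1]</th><th>VehAge[2]</th><th>VehPower[4]</th><th>VehPower[5]</th><th>VehPower[6]</th><th>VehPower[7]</th><th>VehPower[8]</th><th>VehPower[9]</th><th>BonusMalus</th><th>Density</th></tr>
</thead>
<tbody>
<tr><th>coefficient</th><td>2.88667</td><td>-0.064157</td><td>0.0</td><td>0.231868</td><td>-0.211061</td><td>0.054979</td><td>-0.270346</td><td>-0.071453</td><td>0.00291</td><td>0.059324</td><td>...</td><td>0.008117</td><td>-0.229906</td><td>-0.111796</td><td>-0.123388</td><td>0.060757</td><td>0.005179</td><td>-0.021832</td><td>0.208158</td><td>0.032508</td><td>0.000002</td></tr>
</tbody>
</table>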
\n", "

1 rows × 60 columns

\n", "
" ], "text/plain": [ " intercept VehBrand[B1] VehBrand[B10] VehBrand[B11] \\\n", "coefficient 2.88667 -0.064157 0.0 0.231868 \n", "\n", " VehBrand[B12] VehBrand[B13] VehBrand[B14] VehBrand[B2] \\\n", "coefficient -0.211061 0.054979 -0.270346 -0.071453 \n", "\n", " VehBrand[B3] VehBrand[B4] ... VehAge[1] VehAge[2] \\\n", "coefficient 0.00291 0.059324 ... 0.008117 -0.229906 \n", "\n", " VehPower[4] VehPower[5] VehPower[6] VehPower[7] VehPower[8] \\\n", "coefficient -0.111796 -0.123388 0.060757 0.005179 -0.021832 \n", "\n", " VehPower[9] BonusMalus Density \n", "coefficient 0.208158 0.032508 0.000002 \n", "\n", "[1 rows x 60 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula = \"PurePremium ~ VehBrand + VehGas + Region + Area + DrivAge + VehAge + VehPower + BonusMalus + Density\"\n", "\n", "TweedieDist = TweedieDistribution(1.5)\n", "t_glm1 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=True,\n", " formula=formula,\n", ")\n", "t_glm1.fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm1.intercept_], t_glm1.coef_))},\n", " index=[\"intercept\"] + t_glm1.feature_names_,\n", ").T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Categorical Variables\n", "\n", "`glum` also provides extensive support for categorical variables. The main function one needs to be aware of in the context of categoricals is simply called `C()`. A variable placed within it is always converted to a categorical, regardless of its type.\n", "\n", "A huge part of tabmat's/glum's performance advantage is that categoricals need not be one-hot encoded, but are treated as if they were. For this reason, we do not support using other coding schemes within the formula interface. 
If you need a categorical encoding other than one-hot, you can always apply it manually (or even using `formulaic` directly) before the estimation.\n", "\n", "Let's try it out on our dataset!" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:22.770437Z", "iopub.status.busy": "2026-03-16T16:10:22.770347Z", "iopub.status.idle": "2026-03-16T16:10:22.819876Z", "shell.execute_reply": "2026-03-16T16:10:22.819390Z" } }, "outputs": [ { "data": { "text/plain": [ "ClaimNb int64\n", "Exposure float64\n", "Area str\n", "VehPower int64\n", "VehAge int64\n", "DrivAge int64\n", "BonusMalus int64\n", "VehBrand str\n", "VehGas str\n", "Density int64\n", "Region str\n", "ClaimAmount float64\n", "ClaimAmountCut float64\n", "PurePremium float64\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train_noncat = df.iloc[train]\n", "df_test_noncat = df.iloc[test]\n", "\n", "df_train_noncat.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas`, would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a categorical variable, its only effect is on the feature name." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:22.821622Z", "iopub.status.busy": "2026-03-16T16:10:22.821528Z", "iopub.status.idle": "2026-03-16T16:10:30.355393Z", "shell.execute_reply": "2026-03-16T16:10:30.355061Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
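<table>
<thead>
<tr><th></th><th>intercept</th><th>C(VehBrand)[B1]</th><th>C(VehBrand)[B10]</th><th>C(VehBrand)[B11]</th><th>C(VehBrand)[B12]</th><th>C(VehBrand)[B13]</th><th>C(VehBrand)[B14]</th><th>C(VehBrand)[B2]</th><th>C(VehBrand)[B3]</th><th>C(VehBrand)[B4]</th><th>...</th><th>C(VehAge)[1]</th><th>C(VehAge)[2]</th><th>C(VehPower)[4]</th><th>C(VehPower)[5]</th><th>C(VehPower)[6]</th><th>C(VehPower)[7]</th><th>C(VehPower)[8]</th><th>C(VehPower)[9]</th><th>BonusMalus</th><th>Density</th></tr>
</thead>
<tbody>
<tr><th>coefficient</th><td>2.88667</td><td>-0.064157</td><td>0.0</td><td>0.231868</td><td>-0.211061</td><td>0.054979</td><td>-0.270346</td><td>-0.071453</td><td>0.00291</td><td>0.059324</td><td>...</td><td>0.008117</td><td>-0.229906</td><td>-0.111796</td><td>-0.123388</td><td>0.060757</td><td>0.005179</td><td>-0.021832</td><td>0.208158</td><td>0.032508</td><td>0.000002</td></tr>
</tbody>
</table>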
\n", "

1 rows × 60 columns

\n", "
" ], "text/plain": [ " intercept C(VehBrand)[B1] C(VehBrand)[B10] C(VehBrand)[B11] \\\n", "coefficient 2.88667 -0.064157 0.0 0.231868 \n", "\n", " C(VehBrand)[B12] C(VehBrand)[B13] C(VehBrand)[B14] \\\n", "coefficient -0.211061 0.054979 -0.270346 \n", "\n", " C(VehBrand)[B2] C(VehBrand)[B3] C(VehBrand)[B4] ... \\\n", "coefficient -0.071453 0.00291 0.059324 ... \n", "\n", " C(VehAge)[1] C(VehAge)[2] C(VehPower)[4] C(VehPower)[5] \\\n", "coefficient 0.008117 -0.229906 -0.111796 -0.123388 \n", "\n", " C(VehPower)[6] C(VehPower)[7] C(VehPower)[8] C(VehPower)[9] \\\n", "coefficient 0.060757 0.005179 -0.021832 0.208158 \n", "\n", " BonusMalus Density \n", "coefficient 0.032508 0.000002 \n", "\n", "[1 rows x 60 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula_cat = (\n", " \"PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area) \"\n", " \"+ C(DrivAge) + C(VehAge) + C(VehPower) + BonusMalus + Density\"\n", ")\n", "\n", "t_glm3 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=True,\n", " formula=formula_cat,\n", ")\n", "t_glm3.fit(df_train_noncat, sample_weight=df[\"Exposure\"].values[train])\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm3.intercept_], t_glm3.coef_))},\n", " index=[\"intercept\"] + t_glm3.feature_names_,\n", ").T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, prediction works as expected with categorical variables. 
`glum` keeps track of the levels present in the training dataset, and makes sure that categorical variables in unseen datasets are also properly aligned, even if they have missing or unknown levels.<sup>3</sup> Therefore, one can simply use `predict`, and `glum` does The Right Thing™ by default.\n", "\n", "<sup>3</sup> This is possible because `glum` saves a [`ModelSpec` object](https://matthewwardrop.github.io/formulaic/guides/model_specs/), which contains any information necessary for reapplying the transformations that were done during the formula materialization process. It is especially relevant in the case of [stateful transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/), such as creating categorical variables." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:30.356740Z", "iopub.status.busy": "2026-03-16T16:10:30.356657Z", "iopub.status.idle": "2026-03-16T16:10:30.401890Z", "shell.execute_reply": "2026-03-16T16:10:30.401512Z" } }, "outputs": [ { "data": { "text/plain": [ "array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,\n", " 67.98332028, 297.21717383], shape=(67802,))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_glm3.predict(df_test_noncat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Interactions and Structural Full-Rankness\n", "\n", "One of the biggest strengths of Wilkinson-formulas lies in their ability to concisely specify interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. 
As with `glum`'s categorical handling in general, you do not have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n", "\n", "Let's see what that looks like on the insurance example! Suppose that we expect `VehPower` to have a different effect depending on `DrivAge` (e.g. performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:30.403259Z", "iopub.status.busy": "2026-03-16T16:10:30.403170Z", "iopub.status.idle": "2026-03-16T16:10:38.273339Z", "shell.execute_reply": "2026-03-16T16:10:38.272842Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
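<table>
<thead>
<tr><th></th><th>intercept</th><th>C(VehBrand)[B1]</th><th>C(VehBrand)[B10]</th><th>C(VehBrand)[B11]</th><th>C(VehBrand)[B12]</th><th>C(VehBrand)[B13]</th><th>C(VehBrand)[B14]</th><th>C(VehBrand)[B2]</th><th>C(VehBrand)[B3]</th><th>C(VehBrand)[B4]</th><th>...</th><th>C(DrivAge)[4]:C(VehPower)[8]</th><th>C(DrivAge)[5]:C(VehPower)[8]</th><th>C(DrivAge)[6]:C(VehPower)[8]</th><th>C(DrivAge)[0]:C(VehPower)[9]</th><th>C(DrivAge)[1]:C(VehPower)[9]</th><th>C(DrivAge)[2]:C(VehPower)[9]</th><th>C(DrivAge)[3]:C(VehPower)[9]</th><th>C(DrivAge)[4]:C(VehPower)[9]</th><th>C(DrivAge)[5]:C(VehPower)[9]</th><th>C(DrivAge)[6]:C(VehPower)[9]</th></tr>
</thead>
<tbody>
<tr><th>coefficient</th><td>2.88023</td><td>-0.069076</td><td>0.0</td><td>0.221037</td><td>-0.211854</td><td>0.052355</td><td>-0.272058</td><td>-0.074836</td><td>0.0</td><td>0.052523</td><td>...</td><td>-0.147844</td><td>-0.03567</td><td>0.504407</td><td>0.682528</td><td>-0.106569</td><td>-0.308257</td><td>0.173206</td><td>0.010684</td><td>-0.220273</td><td>0.070334</td></tr>
</tbody>
</table>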
\n", "

1 rows × 102 columns

\n", "
" ], "text/plain": [ " intercept C(VehBrand)[B1] C(VehBrand)[B10] C(VehBrand)[B11] \\\n", "coefficient 2.88023 -0.069076 0.0 0.221037 \n", "\n", " C(VehBrand)[B12] C(VehBrand)[B13] C(VehBrand)[B14] \\\n", "coefficient -0.211854 0.052355 -0.272058 \n", "\n", " C(VehBrand)[B2] C(VehBrand)[B3] C(VehBrand)[B4] ... \\\n", "coefficient -0.074836 0.0 0.052523 ... \n", "\n", " C(DrivAge)[4]:C(VehPower)[8] C(DrivAge)[5]:C(VehPower)[8] \\\n", "coefficient -0.147844 -0.03567 \n", "\n", " C(DrivAge)[6]:C(VehPower)[8] C(DrivAge)[0]:C(VehPower)[9] \\\n", "coefficient 0.504407 0.682528 \n", "\n", " C(DrivAge)[1]:C(VehPower)[9] C(DrivAge)[2]:C(VehPower)[9] \\\n", "coefficient -0.106569 -0.308257 \n", "\n", " C(DrivAge)[3]:C(VehPower)[9] C(DrivAge)[4]:C(VehPower)[9] \\\n", "coefficient 0.173206 0.010684 \n", "\n", " C(DrivAge)[5]:C(VehPower)[9] C(DrivAge)[6]:C(VehPower)[9] \n", "coefficient -0.220273 0.070334 \n", "\n", "[1 rows x 102 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula_int = (\n", " \"PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area)\"\n", " \" + C(DrivAge) * C(VehPower) + C(VehAge) + BonusMalus + Density\"\n", ")\n", "\n", "t_glm4 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=True,\n", " formula=formula_int,\n", ")\n", "t_glm4.fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm4.intercept_], t_glm4.coef_))},\n", " index=[\"intercept\"] + t_glm4.feature_names_,\n", ").T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that, in addition to the interactions, the non-interacted variants of `DrivAge` and `VehPower` are also included in the model. This is a result of using the `*` operator to interact the variables. Using `:` instead would only include the interactions, and not the marginals. 
(In short, `a * b` is equivalent to `a + b + a:b`.)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:38.274479Z", "iopub.status.busy": "2026-03-16T16:10:38.274413Z", "iopub.status.idle": "2026-03-16T16:10:38.276592Z", "shell.execute_reply": "2026-03-16T16:10:38.276208Z" } }, "outputs": [ { "data": { "text/plain": [ "['C(VehPower)[4]',\n", " 'C(VehPower)[5]',\n", " 'C(VehPower)[6]',\n", " 'C(VehPower)[7]',\n", " 'C(VehPower)[8]',\n", " 'C(VehPower)[9]',\n", " 'C(DrivAge)[0]:C(VehPower)[4]',\n", " 'C(DrivAge)[1]:C(VehPower)[4]',\n", " 'C(DrivAge)[2]:C(VehPower)[4]',\n", " 'C(DrivAge)[3]:C(VehPower)[4]',\n", " 'C(DrivAge)[4]:C(VehPower)[4]',\n", " 'C(DrivAge)[5]:C(VehPower)[4]',\n", " 'C(DrivAge)[6]:C(VehPower)[4]',\n", " 'C(DrivAge)[0]:C(VehPower)[5]',\n", " 'C(DrivAge)[1]:C(VehPower)[5]',\n", " 'C(DrivAge)[2]:C(VehPower)[5]',\n", " 'C(DrivAge)[3]:C(VehPower)[5]',\n", " 'C(DrivAge)[4]:C(VehPower)[5]',\n", " 'C(DrivAge)[5]:C(VehPower)[5]',\n", " 'C(DrivAge)[6]:C(VehPower)[5]',\n", " 'C(DrivAge)[0]:C(VehPower)[6]',\n", " 'C(DrivAge)[1]:C(VehPower)[6]',\n", " 'C(DrivAge)[2]:C(VehPower)[6]',\n", " 'C(DrivAge)[3]:C(VehPower)[6]',\n", " 'C(DrivAge)[4]:C(VehPower)[6]',\n", " 'C(DrivAge)[5]:C(VehPower)[6]',\n", " 'C(DrivAge)[6]:C(VehPower)[6]',\n", " 'C(DrivAge)[0]:C(VehPower)[7]',\n", " 'C(DrivAge)[1]:C(VehPower)[7]',\n", " 'C(DrivAge)[2]:C(VehPower)[7]',\n", " 'C(DrivAge)[3]:C(VehPower)[7]',\n", " 'C(DrivAge)[4]:C(VehPower)[7]',\n", " 'C(DrivAge)[5]:C(VehPower)[7]',\n", " 'C(DrivAge)[6]:C(VehPower)[7]',\n", " 'C(DrivAge)[0]:C(VehPower)[8]',\n", " 'C(DrivAge)[1]:C(VehPower)[8]',\n", " 'C(DrivAge)[2]:C(VehPower)[8]',\n", " 'C(DrivAge)[3]:C(VehPower)[8]',\n", " 'C(DrivAge)[4]:C(VehPower)[8]',\n", " 'C(DrivAge)[5]:C(VehPower)[8]',\n", " 'C(DrivAge)[6]:C(VehPower)[8]',\n", " 'C(DrivAge)[0]:C(VehPower)[9]',\n", " 'C(DrivAge)[1]:C(VehPower)[9]',\n", " 'C(DrivAge)[2]:C(VehPower)[9]',\n", " 
'C(DrivAge)[3]:C(VehPower)[9]',\n", " 'C(DrivAge)[4]:C(VehPower)[9]',\n", " 'C(DrivAge)[5]:C(VehPower)[9]',\n", " 'C(DrivAge)[6]:C(VehPower)[9]']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in t_glm4.feature_names_ if \"VehPower\" in name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The attentive reader might have also noticed that the first level of each categorical variable is omitted from the model. This is a manifestation of the more general concept of [ensuring structural full-rankness](https://matthewwardrop.github.io/formulaic/guides/contrasts/#guaranteeing-structural-full-rankness).<sup>4</sup> By default, `glum` and `formulaic` will try to make sure that one does not fall into the [Dummy Variable Trap](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)). This also works for (possibly multi-way) interactions involving categorical variables: exactly as many levels are dropped as necessary, and no more. If you want to opt out of this behavior (for example because you would like to penalize all levels equally), simply set the `drop_first` parameter during model initialization to `False`. If you only want to include all levels of a certain variable, and not of others, you can do so with the `spans_intercept` parameter (e.g. `C(VehPower, spans_intercept=False)` would include all levels of `VehPower` even if `drop_first` is set to `True`).\n", "\n", "<sup>4</sup> Note that this does not guarantee that the design matrix is actually full rank. For example, two identical numerical variables will still lead to a rank-deficient design matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Fun with Functions\n", "\n", "The previous example is only scratching the surface of what formulas are capable of. 
For example, they are capable of evaluating arbitrary Python expressions, which act as if they saw the columns of the input data frame as local variables (`pandas.Series`). The way to tell `glum` that a part of the formula should be evaluated as a Python expression before applying the formula grammar to it is to enclose it in curly braces. As an example, we can easily do the following within the formula itself:\n", "\n", " 1. Create the outcome variable on the fly instead of doing it beforehand.\n", " 2. Include the logarithm of a certain variable in the model.\n", " 3. Include a basis spline interpolation of a variable to capture non-linearities in its effect.\n", "\n", "1\\. works because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n", "\n", "Let's try it out!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:38.277724Z", "iopub.status.busy": "2026-03-16T16:10:38.277661Z", "iopub.status.idle": "2026-03-16T16:10:47.196866Z", "shell.execute_reply": "2026-03-16T16:10:47.196455Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
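<table>
<thead>
<tr><th></th><th>intercept</th><th>VehBrand[B1]</th><th>VehBrand[B10]</th><th>VehBrand[B11]</th><th>VehBrand[B12]</th><th>VehBrand[B13]</th><th>VehBrand[B14]</th><th>VehBrand[B2]</th><th>VehBrand[B3]</th><th>VehBrand[B4]</th><th>...</th><th>VehPower[4]</th><th>VehPower[5]</th><th>VehPower[6]</th><th>VehPower[7]</th><th>VehPower[8]</th><th>VehPower[9]</th><th>bs(BonusMalus, 3)[1]</th><th>bs(BonusMalus, 3)[2]</th><th>bs(BonusMalus, 3)[3]</th><th>np.log(Density)</th></tr>
</thead>
<tbody>
<tr><th>coefficient</th><td>3.808829</td><td>-0.060201</td><td>0.0</td><td>0.242194</td><td>-0.202517</td><td>0.063471</td><td>-0.345415</td><td>-0.072546</td><td>0.00777</td><td>0.079391</td><td>...</td><td>-0.113038</td><td>-0.127255</td><td>0.060209</td><td>0.005577</td><td>-0.032114</td><td>0.207355</td><td>3.178178</td><td>0.361951</td><td>8.231846</td><td>0.121944</td></tr>
</tbody>
</table>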
\n", "

1 rows × 62 columns

\n", "
" ], "text/plain": [ " intercept VehBrand[B1] VehBrand[B10] VehBrand[B11] \\\n", "coefficient 3.808829 -0.060201 0.0 0.242194 \n", "\n", " VehBrand[B12] VehBrand[B13] VehBrand[B14] VehBrand[B2] \\\n", "coefficient -0.202517 0.063471 -0.345415 -0.072546 \n", "\n", " VehBrand[B3] VehBrand[B4] ... VehPower[4] VehPower[5] \\\n", "coefficient 0.00777 0.079391 ... -0.113038 -0.127255 \n", "\n", " VehPower[6] VehPower[7] VehPower[8] VehPower[9] \\\n", "coefficient 0.060209 0.005577 -0.032114 0.207355 \n", "\n", " bs(BonusMalus, 3)[1] bs(BonusMalus, 3)[2] bs(BonusMalus, 3)[3] \\\n", "coefficient 3.178178 0.361951 8.231846 \n", "\n", " np.log(Density) \n", "coefficient 0.121944 \n", "\n", "[1 rows x 62 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula_fun = (\n", " \"{ClaimAmountCut / Exposure} ~ VehBrand + VehGas + Region + Area\"\n", " \" + DrivAge + VehAge + VehPower + bs(BonusMalus, 3) + np.log(Density)\"\n", ")\n", "\n", "t_glm5 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=True,\n", " formula=formula_fun,\n", ")\n", "t_glm5.fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm5.intercept_], t_glm5.coef_))},\n", " index=[\"intercept\"] + t_glm5.feature_names_,\n", ").T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To allow for even more flexibility, you can add custom transformations that are defined in the context from which the call is made. E.g., we can define a transformation that takes the logarithm of ``VehAge + 1`` after casting it to numeric. To make the formula recognize this transform, you need to explicitly set ``context=0`` when calling the fit method (note that this differs from ``formulaic``'s default, which is already ``context=0``)." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:47.197980Z", "iopub.status.busy": "2026-03-16T16:10:47.197903Z", "iopub.status.idle": "2026-03-16T16:10:47.783163Z", "shell.execute_reply": "2026-03-16T16:10:47.782765Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
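<table>
<thead>
<tr><th></th><th>intercept</th><th>_log_plus_one(VehAge)</th></tr>
</thead>
<tbody>
<tr><th>coefficient</th><td>5.046712</td><td>-0.151043</td></tr>
</tbody>
</table>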
\n", "
" ], "text/plain": [ " intercept _log_plus_one(VehAge)\n", "coefficient 5.046712 -0.151043" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def _log_plus_one(x):\n", " return np.log(pd.to_numeric(x) + 1)\n", "\n", "formula_custom_fun = (\n", " \"{ClaimAmountCut / Exposure} ~ _log_plus_one(VehAge)\"\n", ")\n", "\n", "t_glm6 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=True,\n", " formula=formula_custom_fun,\n", ")\n", "t_glm6.fit(df_train, sample_weight=df[\"Exposure\"].values[train], context=0)\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm6.intercept_], t_glm6.coef_))},\n", " index=[\"intercept\"] + t_glm6.feature_names_,\n", ").T" ] }, { "cell_type": "markdown", "source": "## 6. Monotonic Constraints\n\nWhen using the formula interface, `glum` supports **monotonic constraints** via the `monotonic_constraints` parameter. This is useful when domain knowledge tells you that a variable's effect should only go in one direction. For example, we might expect that a higher `BonusMalus` score (which indicates a worse claims history) should never *decrease* the predicted pure premium.\n\nMonotonic constraints work with any term that expands to multiple columns — such as B-splines or ordered categoricals — by enforcing that consecutive coefficients are ordered. For single-column terms (e.g. a bare numeric variable), they enforce a sign constraint (non-negative for `\"increasing\"`, non-positive for `\"decreasing\"`).\n\nConstraints are specified as a dictionary mapping **variable names** (not formula terms) to `\"increasing\"` or `\"decreasing\"`. `glum` automatically resolves variable names to the corresponding expanded terms. Under the hood, this builds linear inequality constraints and solves them via a penalty-based IRLS algorithm (`solver='irls-ls-monotonic'`). 
You can also set `solver='trust-constr'` explicitly for an alternative approach based on scipy's constrained optimizer.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "formula_mono = (\n", " \"{ClaimAmountCut / Exposure} ~ VehBrand + VehGas + Region + Area\"\n", " \" + DrivAge + VehAge + VehPower + bs(BonusMalus, 3) + np.log(Density)\"\n", ")\n", "\n", "t_glm_mono = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha=0.001,\n", " l1_ratio=0,\n", " fit_intercept=True,\n", " formula=formula_mono,\n", " monotonic_constraints={\"BonusMalus\": \"increasing\"},\n", " max_iter=500,\n", ")\n", "t_glm_mono.fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", "\n", "# The BonusMalus spline coefficients are monotonically increasing:\n", "bm_mask = [\"BonusMalus\" in n for n in t_glm_mono.feature_names_]\n", "bm_coefs = t_glm_mono.coef_[bm_mask]\n", "bm_names = [n for n in t_glm_mono.feature_names_ if \"BonusMalus\" in n]\n", "\n", "pd.DataFrame({\"coefficient\": bm_coefs}, index=bm_names).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": "## 7. Miscellaneous Features" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variable Names\n", "\n", "`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` parameters." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:47.784411Z", "iopub.status.busy": "2026-03-16T16:10:47.784343Z", "iopub.status.idle": "2026-03-16T16:10:50.289682Z", "shell.execute_reply": "2026-03-16T16:10:50.289219Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
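<table><thead><tr><th></th><th>intercept</th><th>DrivAge__0</th><th>DrivAge__1</th><th>DrivAge__2</th><th>DrivAge__3</th><th>DrivAge__4</th><th>DrivAge__5</th><th>DrivAge__6</th><th>VehPower__4</th><th>VehPower__5</th><th>...</th><th>DrivAge__4__x__VehPower__8</th><th>DrivAge__5__x__VehPower__8</th><th>DrivAge__6__x__VehPower__8</th><th>DrivAge__0__x__VehPower__9</th><th>DrivAge__1__x__VehPower__9</th><th>DrivAge__2__x__VehPower__9</th><th>DrivAge__3__x__VehPower__9</th><th>DrivAge__4__x__VehPower__9</th><th>DrivAge__5__x__VehPower__9</th><th>DrivAge__6__x__VehPower__9</th></tr></thead>
<tbody><tr><th>coefficient</th><td>5.007277</td><td>1.497079</td><td>0.53565</td><td>0.0</td><td>-0.152974</td><td>-0.210998</td><td>-0.205689</td><td>0.017896</td><td>-0.096153</td><td>-0.05484</td><td>...</td><td>-0.143822</td><td>-0.002094</td><td>0.512258</td><td>0.730534</td><td>-0.280869</td><td>-0.367669</td><td>0.171063</td><td>0.022052</td><td>-0.270456</td><td>0.119634</td></tr></tbody></table>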
\n", "

1 rows × 56 columns

\n", "
" ], "text/plain": [ " intercept DrivAge__0 DrivAge__1 DrivAge__2 DrivAge__3 \\\n", "coefficient 5.007277 1.497079 0.53565 0.0 -0.152974 \n", "\n", " DrivAge__4 DrivAge__5 DrivAge__6 VehPower__4 VehPower__5 \\\n", "coefficient -0.210998 -0.205689 0.017896 -0.096153 -0.05484 \n", "\n", " ... DrivAge__4__x__VehPower__8 DrivAge__5__x__VehPower__8 \\\n", "coefficient ... -0.143822 -0.002094 \n", "\n", " DrivAge__6__x__VehPower__8 DrivAge__0__x__VehPower__9 \\\n", "coefficient 0.512258 0.730534 \n", "\n", " DrivAge__1__x__VehPower__9 DrivAge__2__x__VehPower__9 \\\n", "coefficient -0.280869 -0.367669 \n", "\n", " DrivAge__3__x__VehPower__9 DrivAge__4__x__VehPower__9 \\\n", "coefficient 0.171063 0.022052 \n", "\n", " DrivAge__5__x__VehPower__9 DrivAge__6__x__VehPower__9 \n", "coefficient -0.270456 0.119634 \n", "\n", "[1 rows x 56 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula_name = \"PurePremium ~ DrivAge * VehPower\"\n", "\n", "t_glm7 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=True,\n", " formula=formula_name,\n", " interaction_separator=\"__x__\",\n", " categorical_format=\"{name}__{category}\",\n", ")\n", "t_glm7.fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm7.intercept_], t_glm7.coef_))},\n", " index=[\"intercept\"] + t_glm7.feature_names_,\n", ").T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Intercept Term\n", "\n", "Just like in the case of the non-formula interface, the presence of an intercept is determined by the `fit_intercept` argument. In case that the formula specifies a different behavior (e.g., adding `+0` or `-1` while `fit_intercept=True`), an error will be raised." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:50.290873Z", "iopub.status.busy": "2026-03-16T16:10:50.290798Z", "iopub.status.idle": "2026-03-16T16:10:50.305162Z", "shell.execute_reply": "2026-03-16T16:10:50.304698Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Caught expected ValueError: The formula sets the intercept to False, contradicting fit_intercept=True. You should use fit_intercept to specify the intercept.\n" ] } ], "source": [ "formula_noint = \"PurePremium ~ DrivAge * VehPower - 1\"\n", "\n", "try:\n", " t_glm8 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=True,\n", " formula=formula_noint,\n", " interaction_separator=\"__x__\",\n", " categorical_format=\"{name}__{category}\",\n", " ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", " raise AssertionError(\"Expected ValueError was not raised\")\n", "except ValueError as e:\n", " assert \"The formula sets the intercept to False\" in str(e)\n", " print(f\"Caught expected ValueError: {e}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-Sided Formulas\n", "\n", "Even when using formulas, the outcome variable can be specified as a vector, as in the interface without formulas. In that case the supplied formula should be one-sided (not contain a `~`), and only describe the right-hand side of the regression." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:50.306335Z", "iopub.status.busy": "2026-03-16T16:10:50.306261Z", "iopub.status.idle": "2026-03-16T16:10:52.806289Z", "shell.execute_reply": "2026-03-16T16:10:52.805883Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
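<table><thead><tr><th></th><th>intercept</th><th>DrivAge__0</th><th>DrivAge__1</th><th>DrivAge__2</th><th>DrivAge__3</th><th>DrivAge__4</th><th>DrivAge__5</th><th>DrivAge__6</th><th>VehPower__4</th><th>VehPower__5</th><th>...</th><th>DrivAge__4__x__VehPower__8</th><th>DrivAge__5__x__VehPower__8</th><th>DrivAge__6__x__VehPower__8</th><th>DrivAge__0__x__VehPower__9</th><th>DrivAge__1__x__VehPower__9</th><th>DrivAge__2__x__VehPower__9</th><th>DrivAge__3__x__VehPower__9</th><th>DrivAge__4__x__VehPower__9</th><th>DrivAge__5__x__VehPower__9</th><th>DrivAge__6__x__VehPower__9</th></tr></thead>
<tbody><tr><th>coefficient</th><td>0.0</td><td>1.713298</td><td>0.783505</td><td>0.205914</td><td>0.016085</td><td>0.0</td><td>0.000094</td><td>0.223685</td><td>4.66123</td><td>4.736272</td><td>...</td><td>-0.144927</td><td>0.001657</td><td>0.515373</td><td>0.714834</td><td>-0.325666</td><td>-0.370935</td><td>0.20417</td><td>0.013222</td><td>-0.273913</td><td>0.115693</td></tr></tbody></table>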
\n", "

1 rows × 56 columns

\n", "
" ], "text/plain": [ " intercept DrivAge__0 DrivAge__1 DrivAge__2 DrivAge__3 \\\n", "coefficient 0.0 1.713298 0.783505 0.205914 0.016085 \n", "\n", " DrivAge__4 DrivAge__5 DrivAge__6 VehPower__4 VehPower__5 \\\n", "coefficient 0.0 0.000094 0.223685 4.66123 4.736272 \n", "\n", " ... DrivAge__4__x__VehPower__8 DrivAge__5__x__VehPower__8 \\\n", "coefficient ... -0.144927 0.001657 \n", "\n", " DrivAge__6__x__VehPower__8 DrivAge__0__x__VehPower__9 \\\n", "coefficient 0.515373 0.714834 \n", "\n", " DrivAge__1__x__VehPower__9 DrivAge__2__x__VehPower__9 \\\n", "coefficient -0.325666 -0.370935 \n", "\n", " DrivAge__3__x__VehPower__9 DrivAge__4__x__VehPower__9 \\\n", "coefficient 0.20417 0.013222 \n", "\n", " DrivAge__5__x__VehPower__9 DrivAge__6__x__VehPower__9 \n", "coefficient -0.273913 0.115693 \n", "\n", "[1 rows x 56 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula_onesie = \"DrivAge * VehPower\"\n", "\n", "t_glm8 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=False,\n", " formula=formula_onesie,\n", " interaction_separator=\"__x__\",\n", " categorical_format=\"{name}__{category}\",\n", ")\n", "t_glm8.fit(\n", " X=df_train, y=df_train[\"PurePremium\"], sample_weight=df[\"Exposure\"].values[train]\n", ")\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm8.intercept_], t_glm8.coef_))},\n", " index=[\"intercept\"] + t_glm8.feature_names_,\n", ").T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing Values in Categorical Columns\n", "\n", "By default, `glum` raises a `ValueError` when it encounters a missing value in a categorical variable (`\"raise\"` option). However, there are two other options for handling these cases. 
Missing values can instead be encoded as all-zero indicators (the `\"zero\"` option, which is also how `pandas.get_dummies` behaves) or treated as a separate category of their own (the `\"convert\"` option).\n", "\n", "As in the non-formula-based interface, `glum`'s behavior can be set globally using the `cat_missing_method` parameter during model initialization. However, formulas provide some additional flexibility: the `C` function has a `missing_method` parameter, which lets users select an option on a column-by-column basis. Here is an example (our dataset does not have any missing values, so these options have no actual effect in this case):" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2026-03-16T16:10:52.807428Z", "iopub.status.busy": "2026-03-16T16:10:52.807354Z", "iopub.status.idle": "2026-03-16T16:10:54.748684Z", "shell.execute_reply": "2026-03-16T16:10:54.748153Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
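<table><thead><tr><th></th><th>intercept</th><th>C(DrivAge, missing_method='zero')[0]</th><th>C(DrivAge, missing_method='zero')[1]</th><th>C(DrivAge, missing_method='zero')[2]</th><th>C(DrivAge, missing_method='zero')[3]</th><th>C(DrivAge, missing_method='zero')[4]</th><th>C(DrivAge, missing_method='zero')[5]</th><th>C(DrivAge, missing_method='zero')[6]</th><th>C(VehPower, missing_method='convert')[4]</th><th>C(VehPower, missing_method='convert')[5]</th><th>C(VehPower, missing_method='convert')[6]</th><th>C(VehPower, missing_method='convert')[7]</th><th>C(VehPower, missing_method='convert')[8]</th><th>C(VehPower, missing_method='convert')[9]</th></tr></thead>
<tbody><tr><th>coefficient</th><td>0.0</td><td>1.786703</td><td>0.742765</td><td>0.239528</td><td>0.096531</td><td>0.071118</td><td>0.0</td><td>0.201078</td><td>4.637267</td><td>4.679391</td><td>4.863387</td><td>4.77263</td><td>4.749673</td><td>4.970188</td></tr></tbody></table>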
\n", "
" ], "text/plain": [ " intercept C(DrivAge, missing_method='zero')[0] \\\n", "coefficient 0.0 1.786703 \n", "\n", " C(DrivAge, missing_method='zero')[1] \\\n", "coefficient 0.742765 \n", "\n", " C(DrivAge, missing_method='zero')[2] \\\n", "coefficient 0.239528 \n", "\n", " C(DrivAge, missing_method='zero')[3] \\\n", "coefficient 0.096531 \n", "\n", " C(DrivAge, missing_method='zero')[4] \\\n", "coefficient 0.071118 \n", "\n", " C(DrivAge, missing_method='zero')[5] \\\n", "coefficient 0.0 \n", "\n", " C(DrivAge, missing_method='zero')[6] \\\n", "coefficient 0.201078 \n", "\n", " C(VehPower, missing_method='convert')[4] \\\n", "coefficient 4.637267 \n", "\n", " C(VehPower, missing_method='convert')[5] \\\n", "coefficient 4.679391 \n", "\n", " C(VehPower, missing_method='convert')[6] \\\n", "coefficient 4.863387 \n", "\n", " C(VehPower, missing_method='convert')[7] \\\n", "coefficient 4.77263 \n", "\n", " C(VehPower, missing_method='convert')[8] \\\n", "coefficient 4.749673 \n", "\n", " C(VehPower, missing_method='convert')[9] \n", "coefficient 4.970188 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula_missing = \"C(DrivAge, missing_method='zero') + C(VehPower, missing_method='convert')\"\n", "\n", "t_glm9 = GeneralizedLinearRegressor(\n", " family=TweedieDist,\n", " \n", " alpha_search=True,\n", " l1_ratio=1,\n", " fit_intercept=False,\n", " formula=formula_missing,\n", "\n", ")\n", "t_glm9.fit(\n", " X=df_train, y=df_train[\"PurePremium\"], sample_weight=df[\"Exposure\"].values[train]\n", ")\n", "\n", "pd.DataFrame(\n", " {\"coefficient\": np.concatenate(([t_glm9.intercept_], t_glm9.coef_))},\n", " index=[\"intercept\"] + t_glm9.feature_names_,\n", ").T" ] } ], "metadata": { "kernelspec": { "display_name": "glum-dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.3" } }, "nbformat": 4, "nbformat_minor": 2 }