Formula Interface Tutorial: Revisiting French Motor Third-Party Liability Claims
Intro
This tutorial showcases the formula interface of glum. It allows for the specification of the design matrix and the response variable using so-called Wilkinson-formulas instead of constructing it by hand. This kind of model specification should be familiar to R users or those who have used the statsmodels or linearmodels Python packages before. This tutorial aims to introduce the basics of working with formulas to other users, as well as
highlighting some important differences between glums and other packages’ formula implementations.
For a more in-depth look at how formulas work, please take a look at the documentation of ``formulaic` <https://matthewwardrop.github.io/formulaic/>`__, the package on which glum’s formula interface is based.
Background
This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the GLM tutorial. If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.
Sneak Peek
Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:
{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method="zero") + bs(BonusMalus, 3)
Despite its brevity, it describes all of the following:
The outcome variable is the ratio of
ClaimAmountCutandExposure.The predictors should include the interactions of the categorical variables
DrivAgeandVehPower, as well as those two variables themselves. (Even though they behave as such, neither the individual variables nor their interaction will be dummy-encoded by glum. For categoricals with many levels, this can lead to a substantial performance improvement over dummy encoding, especially for the interaction.)If there are missing values in
DrivAge, they should be treated as a separate category.On the other hand, missing values in
VehPowershould be treated as all-zero indicators.The predictors should also include a third degree B-spline interpolation of
BonusMalus.
The following chapters demonstrate each of these features in some detail, as well as some additional advantages of using the formula interface.
Table of Contents
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as optimize
import scipy.stats
from dask_ml.preprocessing import Categorizer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit
from glum import GeneralizedLinearRegressor
from glum import TweedieDistribution
from load_transform_formula import load_transform
1. Load and Prepare Datasets from Openml
First, we load in our dataset from openML and apply several transformations. In the interest of simplicity, we do not include the data loading and preparation code in this notebook.
[2]:
df = load_transform()
with pd.option_context('display.max_rows', 10):
display(df)
| ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | ClaimAmount | ClaimAmountCut | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IDpol | |||||||||||||
| 1 | 0 | 0.10000 | D | 5 | 0 | 5 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 |
| 3 | 0 | 0.77000 | D | 5 | 0 | 5 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 |
| 5 | 0 | 0.75000 | B | 6 | 1 | 5 | 50 | B12 | Diesel | 54 | R22 | 0.0 | 0.0 |
| 10 | 0 | 0.09000 | B | 7 | 0 | 4 | 50 | B12 | Diesel | 76 | R72 | 0.0 | 0.0 |
| 11 | 0 | 0.84000 | B | 7 | 0 | 4 | 50 | B12 | Diesel | 76 | R72 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6114326 | 0 | 0.00274 | E | 4 | 0 | 5 | 50 | B12 | Regular | 3317 | R93 | 0.0 | 0.0 |
| 6114327 | 0 | 0.00274 | E | 4 | 0 | 4 | 95 | B12 | Regular | 9850 | R11 | 0.0 | 0.0 |
| 6114328 | 0 | 0.00274 | D | 6 | 1 | 4 | 50 | B12 | Diesel | 1323 | R82 | 0.0 | 0.0 |
| 6114329 | 0 | 0.00274 | B | 4 | 0 | 5 | 50 | B12 | Regular | 95 | R26 | 0.0 | 0.0 |
| 6114330 | 0 | 0.00274 | B | 7 | 1 | 2 | 54 | B12 | Diesel | 65 | R72 | 0.0 | 0.0 |
678013 rows × 13 columns
2. Reproducing the Model From the GLM Tutorial
Now, let us start by fitting a very simple model. As usual, let’s divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate y and X data frames for our label and features – the formula will take care of that for us.
We still have some preprocessing to do:
Many of the ordinal or nominal variables are encoded as integers, instead of as categoricals. We will need to convert these so that
glumwill know to estimate a separate coefficient for each of their levels.The outcome variable is a transformation of other columns. We need to create it first.
As we will see later on, these steps can be incorporated into the formula itself, but let’s not overcomplicate things at first.
[3]:
ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train, test = next(ss.split(df))
df = df.assign(PurePremium=lambda x: x["ClaimAmountCut"] / x["Exposure"])
glm_categorizer = Categorizer(
columns=["VehBrand", "VehGas", "Region", "Area", "DrivAge", "VehAge", "VehPower"]
)
df_train = glm_categorizer.fit_transform(df.iloc[train])
df_test = glm_categorizer.transform(df.iloc[test])
This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (~), and different predictors are separated by plus signs (+). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand.
[4]:
formula = "PurePremium ~ VehBrand + VehGas + Region + Area + DrivAge + VehAge + VehPower + BonusMalus + Density"
TweedieDist = TweedieDistribution(1.5)
t_glm1 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula,
)
t_glm1.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm1.intercept_], t_glm1.coef_))},
index=["intercept"] + t_glm1.feature_names_,
).T
[4]:
| intercept | VehBrand[B1] | VehBrand[B10] | VehBrand[B11] | VehBrand[B12] | VehBrand[B13] | VehBrand[B14] | VehBrand[B2] | VehBrand[B3] | VehBrand[B4] | ... | VehAge[1] | VehAge[2] | VehPower[4] | VehPower[5] | VehPower[6] | VehPower[7] | VehPower[8] | VehPower[9] | BonusMalus | Density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coefficient | 2.88667 | -0.064157 | 0.0 | 0.231868 | -0.211061 | 0.054979 | -0.270346 | -0.071453 | 0.00291 | 0.059324 | ... | 0.008117 | -0.229906 | -0.111796 | -0.123388 | 0.060757 | 0.005179 | -0.021832 | 0.208158 | 0.032508 | 0.000002 |
1 rows × 60 columns
3. Categorical Variables
glum also provides extensive support for categorical variables. The main function one needs to be aware of in the context of categoricals is simply called C(). A variable placed within it is always converted to a categorical, regardless of its type.
A huge part of tabmat’s/glum’s performance advantage is that categoricals need not be one-hot encoded, but are treated as if they were. For this reason, we do not support using other coding schemes within the formula interface. If one needs to use other categorical encodings than one-hot, they can always do so manually (or even using formulaic directly) before the estimation.
Let’s try it out on our dataset!
[5]:
df_train_noncat = df.iloc[train]
df_test_noncat = df.iloc[test]
df_train_noncat.dtypes
[5]:
ClaimNb int64
Exposure float64
Area str
VehPower int64
VehAge int64
DrivAge int64
BonusMalus int64
VehBrand str
VehGas str
Density int64
Region str
ClaimAmount float64
ClaimAmountCut float64
PurePremium float64
dtype: object
Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the C() function. Strings, such as VehBrand or VehGas would have been handled as categorical by default anyway, but using the C() function never hurts: if applied to something that is already a categorical variable, it does not have any effect outside of the feature name.
[6]:
formula_cat = (
"PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area) "
"+ C(DrivAge) + C(VehAge) + C(VehPower) + BonusMalus + Density"
)
t_glm3 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_cat,
)
t_glm3.fit(df_train_noncat, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm3.intercept_], t_glm3.coef_))},
index=["intercept"] + t_glm3.feature_names_,
).T
[6]:
| intercept | C(VehBrand)[B1] | C(VehBrand)[B10] | C(VehBrand)[B11] | C(VehBrand)[B12] | C(VehBrand)[B13] | C(VehBrand)[B14] | C(VehBrand)[B2] | C(VehBrand)[B3] | C(VehBrand)[B4] | ... | C(VehAge)[1] | C(VehAge)[2] | C(VehPower)[4] | C(VehPower)[5] | C(VehPower)[6] | C(VehPower)[7] | C(VehPower)[8] | C(VehPower)[9] | BonusMalus | Density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coefficient | 2.88667 | -0.064157 | 0.0 | 0.231868 | -0.211061 | 0.054979 | -0.270346 | -0.071453 | 0.00291 | 0.059324 | ... | 0.008117 | -0.229906 | -0.111796 | -0.123388 | 0.060757 | 0.005179 | -0.021832 | 0.208158 | 0.032508 | 0.000002 |
1 rows × 60 columns
Finally, prediction works as expected with categorical variables. glum keeps track of the levels present in the training dataset, and makes sure that categorical variables in unseen datasets are also properly aligned, even if they have missing or unknown levels.3 Therefore, one can simply use predict, and glum does The Right Thing™ by default.
3: This is made possible due to glum saving a `ModelSpec object <https://matthewwardrop.github.io/formulaic/guides/model_specs/>`__, which contains any information necessary for reapplying the transitions that were done during the formula materialization process. It is especially relevant in the case of stateful transforms, such as creating categorical variables.
[7]:
t_glm3.predict(df_test_noncat)
[7]:
array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,
67.98332028, 297.21717383], shape=(67802,))
4. Interactions and Structural Full-Rankness
One of the biggest strengths of Wilkinson-formulas lie in their ability of concisely specifying interactions between terms. glum implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. In
general, just as with glum’s categorical handling in general, you can be assured that you don’t have to worry too much about the actual implementation, and can expect that glum will do the most efficient thing behind the scenes.
Let’s see how that looks like on the insurance example! Suppose that we expect VehPower to have a different effect depending on DrivAge (e.g. performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows.
[8]:
formula_int = (
"PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area)"
" + C(DrivAge) * C(VehPower) + C(VehAge) + BonusMalus + Density"
)
t_glm4 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_int,
)
t_glm4.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm4.intercept_], t_glm4.coef_))},
index=["intercept"] + t_glm4.feature_names_,
).T
[8]:
| intercept | C(VehBrand)[B1] | C(VehBrand)[B10] | C(VehBrand)[B11] | C(VehBrand)[B12] | C(VehBrand)[B13] | C(VehBrand)[B14] | C(VehBrand)[B2] | C(VehBrand)[B3] | C(VehBrand)[B4] | ... | C(DrivAge)[4]:C(VehPower)[8] | C(DrivAge)[5]:C(VehPower)[8] | C(DrivAge)[6]:C(VehPower)[8] | C(DrivAge)[0]:C(VehPower)[9] | C(DrivAge)[1]:C(VehPower)[9] | C(DrivAge)[2]:C(VehPower)[9] | C(DrivAge)[3]:C(VehPower)[9] | C(DrivAge)[4]:C(VehPower)[9] | C(DrivAge)[5]:C(VehPower)[9] | C(DrivAge)[6]:C(VehPower)[9] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coefficient | 2.88023 | -0.069076 | 0.0 | 0.221037 | -0.211854 | 0.052355 | -0.272058 | -0.074836 | 0.0 | 0.052523 | ... | -0.147844 | -0.03567 | 0.504407 | 0.682528 | -0.106569 | -0.308257 | 0.173206 | 0.010684 | -0.220273 | 0.070334 |
1 rows × 102 columns
Note that, in addition to the interactions, the non-interacted variants of DrivAge and VehPower are also included in the model. This is a result of using the * operator to interact the variables. Using : instead would only include the interactions, and not the marginals. (In short, a * b is equivalent to a + b + a:b.)
[9]:
[name for name in t_glm4.feature_names_ if "VehPower" in name]
[9]:
['C(VehPower)[4]',
'C(VehPower)[5]',
'C(VehPower)[6]',
'C(VehPower)[7]',
'C(VehPower)[8]',
'C(VehPower)[9]',
'C(DrivAge)[0]:C(VehPower)[4]',
'C(DrivAge)[1]:C(VehPower)[4]',
'C(DrivAge)[2]:C(VehPower)[4]',
'C(DrivAge)[3]:C(VehPower)[4]',
'C(DrivAge)[4]:C(VehPower)[4]',
'C(DrivAge)[5]:C(VehPower)[4]',
'C(DrivAge)[6]:C(VehPower)[4]',
'C(DrivAge)[0]:C(VehPower)[5]',
'C(DrivAge)[1]:C(VehPower)[5]',
'C(DrivAge)[2]:C(VehPower)[5]',
'C(DrivAge)[3]:C(VehPower)[5]',
'C(DrivAge)[4]:C(VehPower)[5]',
'C(DrivAge)[5]:C(VehPower)[5]',
'C(DrivAge)[6]:C(VehPower)[5]',
'C(DrivAge)[0]:C(VehPower)[6]',
'C(DrivAge)[1]:C(VehPower)[6]',
'C(DrivAge)[2]:C(VehPower)[6]',
'C(DrivAge)[3]:C(VehPower)[6]',
'C(DrivAge)[4]:C(VehPower)[6]',
'C(DrivAge)[5]:C(VehPower)[6]',
'C(DrivAge)[6]:C(VehPower)[6]',
'C(DrivAge)[0]:C(VehPower)[7]',
'C(DrivAge)[1]:C(VehPower)[7]',
'C(DrivAge)[2]:C(VehPower)[7]',
'C(DrivAge)[3]:C(VehPower)[7]',
'C(DrivAge)[4]:C(VehPower)[7]',
'C(DrivAge)[5]:C(VehPower)[7]',
'C(DrivAge)[6]:C(VehPower)[7]',
'C(DrivAge)[0]:C(VehPower)[8]',
'C(DrivAge)[1]:C(VehPower)[8]',
'C(DrivAge)[2]:C(VehPower)[8]',
'C(DrivAge)[3]:C(VehPower)[8]',
'C(DrivAge)[4]:C(VehPower)[8]',
'C(DrivAge)[5]:C(VehPower)[8]',
'C(DrivAge)[6]:C(VehPower)[8]',
'C(DrivAge)[0]:C(VehPower)[9]',
'C(DrivAge)[1]:C(VehPower)[9]',
'C(DrivAge)[2]:C(VehPower)[9]',
'C(DrivAge)[3]:C(VehPower)[9]',
'C(DrivAge)[4]:C(VehPower)[9]',
'C(DrivAge)[5]:C(VehPower)[9]',
'C(DrivAge)[6]:C(VehPower)[9]']
The attentive reader might have also noticed that the first level of each categorical variable is omitted from the model. This is a manifestation of the more general concept of ensuring structural full-rankedness4. By default, glum and formulaic will try to make sure that one does not fall into the Dummy Variable Trap.
Moreover, it even does it in the case of (possibly multi-way) interactions involving categorical variables. It will always drop the necessary number of levels, and no more. If you want to opt out of this behavior (for example because you would like to penalize all levels equally), simply set the drop_first parameter during model initialization to False. If one only aims to include all levels of a certain variable, and not others, it is possible to do so by using the spans_intercept
parameter (e.g. C(VehPower, spans_intercept=False) would include all levels of VehPower even if drop_first is set to True).
4: Note, that it does not guarantee that the design matrix is actually full rank. For example, two identical numerical variables will still lead to a rank-deficient design matrix.
5. Fun with Functions
The previous example is only scratching the surface of what formulas are capable of. For example, they are capable of evaluating arbitrary Python expressions, which act as if they saw the columns of the input data frame as local variables (pandas.Series). The way to tell glum that a part of the formula should be evaluated as a Python expression before applying the formula grammar to it is to enclose it in curly braces. As an example, we can easily do the following within the formula
itself:
Create the outcome variable on the fly instead of doing it beforehand.
Include the logarithm of a certain variable in the model.
Include a basis spline interpolation of a variable to capture non-linearities in its effect.
works because formulas can contain Python operations. 2. and 3. work because formulas are evaluated within a context that is aware of a number of transforms. To be precise, 2. is a regular transform and 3. is a stateful transform.
Let’s try it out!
[10]:
formula_fun = (
"{ClaimAmountCut / Exposure} ~ VehBrand + VehGas + Region + Area"
" + DrivAge + VehAge + VehPower + bs(BonusMalus, 3) + np.log(Density)"
)
t_glm5 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_fun,
)
t_glm5.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm5.intercept_], t_glm5.coef_))},
index=["intercept"] + t_glm5.feature_names_,
).T
[10]:
| intercept | VehBrand[B1] | VehBrand[B10] | VehBrand[B11] | VehBrand[B12] | VehBrand[B13] | VehBrand[B14] | VehBrand[B2] | VehBrand[B3] | VehBrand[B4] | ... | VehPower[4] | VehPower[5] | VehPower[6] | VehPower[7] | VehPower[8] | VehPower[9] | bs(BonusMalus, 3)[1] | bs(BonusMalus, 3)[2] | bs(BonusMalus, 3)[3] | np.log(Density) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coefficient | 3.808829 | -0.060201 | 0.0 | 0.242194 | -0.202517 | 0.063471 | -0.345415 | -0.072546 | 0.00777 | 0.079391 | ... | -0.113038 | -0.127255 | 0.060209 | 0.005577 | -0.032114 | 0.207355 | 3.178178 | 0.361951 | 8.231846 | 0.121944 |
1 rows × 62 columns
To allow for even more flexibility, you can add custom transformations that are defined in the context from which the call is made. E.g., we can define a transformation that takes the logarithm of VehAge + 1 after casting it to numeric. To make the formula recognize this transform, you need to explicitly set context=0 when calling the fit method (note that this differs from formulaic’s default, which is already context=0).
[11]:
def _log_plus_one(x):
return np.log(pd.to_numeric(x) + 1)
formula_custom_fun = (
"{ClaimAmountCut / Exposure} ~ _log_plus_one(VehAge)"
)
t_glm6 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_custom_fun,
)
t_glm6.fit(df_train, sample_weight=df["Exposure"].values[train], context=0)
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm6.intercept_], t_glm6.coef_))},
index=["intercept"] + t_glm6.feature_names_,
).T
[11]:
| intercept | _log_plus_one(VehAge) | |
|---|---|---|
| coefficient | 5.046712 | -0.151043 |
6. Monotonic Constraints
When using the formula interface, glum supports monotonic constraints via the monotonic_constraints parameter. This is useful when domain knowledge tells you that a variable’s effect should only go in one direction. For example, we might expect that a higher BonusMalus score (which indicates a worse claims history) should never decrease the predicted pure premium.
Monotonic constraints work with any term that expands to multiple columns — such as B-splines or ordered categoricals — by enforcing that consecutive coefficients are ordered. For single-column terms (e.g. a bare numeric variable), they enforce a sign constraint (non-negative for "increasing", non-positive for "decreasing").
Constraints are specified as a dictionary mapping variable names (not formula terms) to "increasing" or "decreasing". glum automatically resolves variable names to the corresponding expanded terms. Under the hood, this builds linear inequality constraints and solves them via a penalty-based IRLS algorithm (solver='irls-ls-monotonic'). You can also set solver='trust-constr' explicitly for an alternative approach based on scipy’s constrained optimizer.
[ ]:
formula_mono = (
"{ClaimAmountCut / Exposure} ~ VehBrand + VehGas + Region + Area"
" + DrivAge + VehAge + VehPower + bs(BonusMalus, 3) + np.log(Density)"
)
t_glm_mono = GeneralizedLinearRegressor(
family=TweedieDist,
alpha=0.001,
l1_ratio=0,
fit_intercept=True,
formula=formula_mono,
monotonic_constraints={"BonusMalus": "increasing"},
max_iter=500,
)
t_glm_mono.fit(df_train, sample_weight=df["Exposure"].values[train])
# The BonusMalus spline coefficients are monotonically increasing:
bm_mask = ["BonusMalus" in n for n in t_glm_mono.feature_names_]
bm_coefs = t_glm_mono.coef_[bm_mask]
bm_names = [n for n in t_glm_mono.feature_names_ if "BonusMalus" in n]
pd.DataFrame({"coefficient": bm_coefs}, index=bm_names).T
7. Miscellaneous Features
Variable Names
glum’s formula interface provides a lot of control over how the resulting features are named. By default, it follows formulaic’s standards, but it can be customized by setting the interaction_separator and categorical_format parameters.
[12]:
formula_name = "PurePremium ~ DrivAge * VehPower"
t_glm7 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_name,
interaction_separator="__x__",
categorical_format="{name}__{category}",
)
t_glm7.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm7.intercept_], t_glm7.coef_))},
index=["intercept"] + t_glm7.feature_names_,
).T
[12]:
| intercept | DrivAge__0 | DrivAge__1 | DrivAge__2 | DrivAge__3 | DrivAge__4 | DrivAge__5 | DrivAge__6 | VehPower__4 | VehPower__5 | ... | DrivAge__4__x__VehPower__8 | DrivAge__5__x__VehPower__8 | DrivAge__6__x__VehPower__8 | DrivAge__0__x__VehPower__9 | DrivAge__1__x__VehPower__9 | DrivAge__2__x__VehPower__9 | DrivAge__3__x__VehPower__9 | DrivAge__4__x__VehPower__9 | DrivAge__5__x__VehPower__9 | DrivAge__6__x__VehPower__9 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coefficient | 5.007277 | 1.497079 | 0.53565 | 0.0 | -0.152974 | -0.210998 | -0.205689 | 0.017896 | -0.096153 | -0.05484 | ... | -0.143822 | -0.002094 | 0.512258 | 0.730534 | -0.280869 | -0.367669 | 0.171063 | 0.022052 | -0.270456 | 0.119634 |
1 rows × 56 columns
Intercept Term
Just like in the case of the non-formula interface, the presence of an intercept is determined by the fit_intercept argument. In case that the formula specifies a different behavior (e.g., adding +0 or -1 while fit_intercept=True), an error will be raised.
[13]:
formula_noint = "PurePremium ~ DrivAge * VehPower - 1"
try:
t_glm8 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_noint,
interaction_separator="__x__",
categorical_format="{name}__{category}",
).fit(df_train, sample_weight=df["Exposure"].values[train])
raise AssertionError("Expected ValueError was not raised")
except ValueError as e:
assert "The formula sets the intercept to False" in str(e)
print(f"Caught expected ValueError: {e}")
Caught expected ValueError: The formula sets the intercept to False, contradicting fit_intercept=True. You should use fit_intercept to specify the intercept.
One-Sided Formulas
Even when using formulas, the outcome variable can be specified as a vector, as in the interface without formulas. In that case the supplied formula should be one-sided (not contain a ~), and only describe the right-hand side of the regression.
[14]:
formula_onesie = "DrivAge * VehPower"
t_glm8 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=False,
formula=formula_onesie,
interaction_separator="__x__",
categorical_format="{name}__{category}",
)
t_glm8.fit(
X=df_train, y=df_train["PurePremium"], sample_weight=df["Exposure"].values[train]
)
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm8.intercept_], t_glm8.coef_))},
index=["intercept"] + t_glm8.feature_names_,
).T
[14]:
| intercept | DrivAge__0 | DrivAge__1 | DrivAge__2 | DrivAge__3 | DrivAge__4 | DrivAge__5 | DrivAge__6 | VehPower__4 | VehPower__5 | ... | DrivAge__4__x__VehPower__8 | DrivAge__5__x__VehPower__8 | DrivAge__6__x__VehPower__8 | DrivAge__0__x__VehPower__9 | DrivAge__1__x__VehPower__9 | DrivAge__2__x__VehPower__9 | DrivAge__3__x__VehPower__9 | DrivAge__4__x__VehPower__9 | DrivAge__5__x__VehPower__9 | DrivAge__6__x__VehPower__9 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coefficient | 0.0 | 1.713298 | 0.783505 | 0.205914 | 0.016085 | 0.0 | 0.000094 | 0.223685 | 4.66123 | 4.736272 | ... | -0.144927 | 0.001657 | 0.515373 | 0.714834 | -0.325666 | -0.370935 | 0.20417 | 0.013222 | -0.273913 | 0.115693 |
1 rows × 56 columns
Missing Values in Categorical Columns
By default, glum raises a ValueError when it encounters a missing value in a categorical variable ("raise" option). However, there are two other options for handling these cases. They can also be treated as if they represented all-zeros indicators ("zero" option, which is also the way pandas.get_dummies works) or missing values can be treated as their own separate category ("convert" option).
Similarly to the non-formula-based interface, glum’s behavior can be set globally using the cat_missing_method parameter during model initialization. However, formulas provide some additional flexibility: the C function has a missing_method parameter, with which users can select an option on a column-by-column basis. Here is an example of doing that (although our dataset does not have any missing values, so these options have no actual effect in this case):
[ ]:
formula_missing = "C(DrivAge, missing_method='zero') + C(VehPower, missing_method='convert')"
t_glm9 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=False,
formula=formula_missing,
)
t_glm9.fit(
X=df_train, y=df_train["PurePremium"], sample_weight=df["Exposure"].values[train]
)
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm9.intercept_], t_glm9.coef_))},
index=["intercept"] + t_glm9.feature_names_,
).T
| intercept | C(DrivAge, missing_method='zero')[0] | C(DrivAge, missing_method='zero')[1] | C(DrivAge, missing_method='zero')[2] | C(DrivAge, missing_method='zero')[3] | C(DrivAge, missing_method='zero')[4] | C(DrivAge, missing_method='zero')[5] | C(DrivAge, missing_method='zero')[6] | C(VehPower, missing_method='convert')[4] | C(VehPower, missing_method='convert')[5] | C(VehPower, missing_method='convert')[6] | C(VehPower, missing_method='convert')[7] | C(VehPower, missing_method='convert')[8] | C(VehPower, missing_method='convert')[9] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coefficient | 0.0 | 1.786703 | 0.742765 | 0.239528 | 0.096531 | 0.071118 | 0.0 | 0.201078 | 4.637267 | 4.679391 | 4.863387 | 4.77263 | 4.749673 | 4.970188 |