Formula Interface Tutorial: Revisiting French Motor Third-Party Liability Claims
Intro
This tutorial showcases the formula interface of glum. It allows the specification of the design matrix and the response variable using so-called Wilkinson formulas instead of constructing them by hand. This kind of model specification should be familiar to R users and to those who have used the statsmodels or linearmodels Python packages before. This tutorial aims to introduce the basics of working with formulas to other users, as well as to highlight some important differences between glum's formula implementation and those of other packages.
For a more in-depth look at how formulas work, please take a look at the documentation of `formulaic <https://matthewwardrop.github.io/formulaic/>`__, the package on which glum's formula interface is based.
Background
This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the GLM tutorial. If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.
Sneak Peek
Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:
{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method="zero") + bs(BonusMalus, 3)
Despite its brevity, it describes all of the following:

- The outcome variable is the ratio of ClaimAmountCut and Exposure.
- The predictors should include the interactions of the categorical variables DrivAge and VehPower, as well as those two variables themselves. (Even though they behave as such, neither the individual variables nor their interaction will be dummy-encoded by glum. For categoricals with many levels, this can lead to a substantial performance improvement over dummy encoding, especially for the interaction.)
- If there are missing values in DrivAge, they should be treated as a separate category.
- On the other hand, missing values in VehPower should be treated as all-zero indicators.
- The predictors should also include a third-degree B-spline interpolation of BonusMalus.
The following chapters demonstrate each of these features in some detail, as well as some additional advantages of using the formula interface.
Table of Contents
1. Load and Prepare Datasets from OpenML
2. Reproducing the Model from the GLM Tutorial
3. Categorical Variables
4. Interactions and Structural Full-Rankness
5. Fun with Functions
6. Miscellaneous Features
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pytest
import scipy.optimize as optimize
import scipy.stats
from dask_ml.preprocessing import Categorizer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit
from glum import GeneralizedLinearRegressor
from glum import TweedieDistribution
from load_transform_formula import load_transform
1. Load and Prepare Datasets from OpenML
First, we load our dataset from OpenML and apply several transformations. In the interest of simplicity, we do not include the data loading and preparation code in this notebook.
[2]:
df = load_transform()
with pd.option_context('display.max_rows', 10):
display(df)
IDpol | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | ClaimAmount | ClaimAmountCut
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 0 | 0.10000 | D | 5 | 0 | 5 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 |
3 | 0 | 0.77000 | D | 5 | 0 | 5 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 |
5 | 0 | 0.75000 | B | 6 | 1 | 5 | 50 | B12 | Diesel | 54 | R22 | 0.0 | 0.0 |
10 | 0 | 0.09000 | B | 7 | 0 | 4 | 50 | B12 | Diesel | 76 | R72 | 0.0 | 0.0 |
11 | 0 | 0.84000 | B | 7 | 0 | 4 | 50 | B12 | Diesel | 76 | R72 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6114326 | 0 | 0.00274 | E | 4 | 0 | 5 | 50 | B12 | Regular | 3317 | R93 | 0.0 | 0.0 |
6114327 | 0 | 0.00274 | E | 4 | 0 | 4 | 95 | B12 | Regular | 9850 | R11 | 0.0 | 0.0 |
6114328 | 0 | 0.00274 | D | 6 | 1 | 4 | 50 | B12 | Diesel | 1323 | R82 | 0.0 | 0.0 |
6114329 | 0 | 0.00274 | B | 4 | 0 | 5 | 50 | B12 | Regular | 95 | R26 | 0.0 | 0.0 |
6114330 | 0 | 0.00274 | B | 7 | 1 | 2 | 54 | B12 | Diesel | 65 | R72 | 0.0 | 0.0 |
678013 rows × 13 columns
2. Reproducing the Model From the GLM Tutorial
Now, let us start by fitting a very simple model. As usual, we divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate y and X data frames for the label and features – the formula will take care of that for us.

We still have some preprocessing to do:

- Many of the ordinal or nominal variables are encoded as integers instead of as categoricals. We need to convert these so that glum knows to estimate a separate coefficient for each of their levels.
- The outcome variable is a transformation of other columns, so we need to create it first.

As we will see later on, these steps can be incorporated into the formula itself, but let's not overcomplicate things at first.
[3]:
ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train, test = next(ss.split(df))
df = df.assign(PurePremium=lambda x: x["ClaimAmountCut"] / x["Exposure"])
glm_categorizer = Categorizer(
columns=["VehBrand", "VehGas", "Region", "Area", "DrivAge", "VehAge", "VehPower"]
)
df_train = glm_categorizer.fit_transform(df.iloc[train])
df_test = glm_categorizer.transform(df.iloc[test])
This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (~), and the individual predictors are separated by plus signs (+). Thus, formulas provide a concise way of specifying a model without the need to create data frames by hand.
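To make the tilde/plus structure concrete, here is a minimal pandas sketch (toy data, not glum's actual code path) of what a formula like "PurePremium ~ VehGas + BonusMalus" saves us from writing by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "PurePremium": [100.0, 250.0, 80.0],
    "VehGas": ["Regular", "Diesel", "Regular"],
    "BonusMalus": [50, 68, 54],
})

# Left of "~": the response vector.
y = df["PurePremium"]

# Right of "~": each "+"-separated term contributes one or more columns.
# Categorical terms expand into indicator columns (here via get_dummies;
# glum handles categoricals far more efficiently internally).
X = pd.concat(
    [pd.get_dummies(df["VehGas"], prefix="VehGas"), df[["BonusMalus"]]],
    axis=1,
)
```

With the formula interface, this y/X construction happens inside fit.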
[4]:
formula = "PurePremium ~ VehBrand + VehGas + Region + Area + DrivAge + VehAge + VehPower + BonusMalus + Density"
TweedieDist = TweedieDistribution(1.5)
t_glm1 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula,
)
t_glm1.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm1.intercept_], t_glm1.coef_))},
index=["intercept"] + t_glm1.feature_names_,
).T
[4]:
  | intercept | VehBrand[B1] | VehBrand[B10] | VehBrand[B11] | VehBrand[B12] | VehBrand[B13] | VehBrand[B14] | VehBrand[B2] | VehBrand[B3] | VehBrand[B4] | ... | VehAge[1] | VehAge[2] | VehPower[4] | VehPower[5] | VehPower[6] | VehPower[7] | VehPower[8] | VehPower[9] | BonusMalus | Density
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
coefficient | 2.88667 | -0.064157 | 0.0 | 0.231868 | -0.211061 | 0.054979 | -0.270346 | -0.071453 | 0.00291 | 0.059324 | ... | 0.008117 | -0.229906 | -0.111796 | -0.123388 | 0.060757 | 0.005179 | -0.021832 | 0.208158 | 0.032508 | 0.000002 |
1 rows × 60 columns
3. Categorical Variables
glum also provides extensive support for categorical variables. The main function one needs to be aware of in the context of categoricals is simply called C(). A variable placed within it is always converted to a categorical, regardless of its type.
A huge part of tabmat's and glum's performance advantage is that categoricals need not be one-hot encoded, even though they are treated as if they were. For this reason, we do not support other coding schemes within the formula interface. If one needs a categorical encoding other than one-hot, one can always apply it manually (or even using formulaic directly) before the estimation.
Let's try it out on our dataset!
[5]:
df_train_noncat = df.iloc[train]
df_test_noncat = df.iloc[test]
df_train_noncat.dtypes
[5]:
ClaimNb int64
Exposure float64
Area object
VehPower int64
VehAge int64
DrivAge int64
BonusMalus int64
VehBrand object
VehGas object
Density int64
Region object
ClaimAmount float64
ClaimAmountCut float64
PurePremium float64
dtype: object
Even though some of the variables in this dataset are integers, they are handled as categoricals thanks to the C() function. Strings, such as VehBrand or VehGas, would have been handled as categoricals by default anyway, but using the C() function never hurts: applied to something that is already a categorical variable, it has no effect beyond the feature name.
[6]:
formula_cat = (
"PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area) "
"+ C(DrivAge) + C(VehAge) + C(VehPower) + BonusMalus + Density"
)
t_glm3 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_cat,
)
t_glm3.fit(df_train_noncat, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm3.intercept_], t_glm3.coef_))},
index=["intercept"] + t_glm3.feature_names_,
).T
[6]:
  | intercept | C(VehBrand)[B1] | C(VehBrand)[B10] | C(VehBrand)[B11] | C(VehBrand)[B12] | C(VehBrand)[B13] | C(VehBrand)[B14] | C(VehBrand)[B2] | C(VehBrand)[B3] | C(VehBrand)[B4] | ... | C(VehAge)[1] | C(VehAge)[2] | C(VehPower)[4] | C(VehPower)[5] | C(VehPower)[6] | C(VehPower)[7] | C(VehPower)[8] | C(VehPower)[9] | BonusMalus | Density
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
coefficient | 2.88667 | -0.064157 | 0.0 | 0.231868 | -0.211061 | 0.054979 | -0.270346 | -0.071453 | 0.00291 | 0.059324 | ... | 0.008117 | -0.229906 | -0.111796 | -0.123388 | 0.060757 | 0.005179 | -0.021832 | 0.208158 | 0.032508 | 0.000002 |
1 rows × 60 columns
Finally, prediction works as expected with categorical variables. glum keeps track of the levels present in the training dataset and makes sure that categorical variables in unseen datasets are properly aligned, even if they have missing or unknown levels.³ Therefore, one can simply use predict, and glum does The Right Thing™ by default.
³ This is made possible by glum saving a `ModelSpec object <https://matthewwardrop.github.io/formulaic/guides/model_specs/>`__, which contains all information necessary for reapplying the transformations that were done during the formula materialization process. This is especially relevant in the case of stateful transforms, such as creating categorical variables.
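The alignment idea can be illustrated with plain pandas (a conceptual sketch, not glum's implementation):

```python
import pandas as pd

train_levels = ["B1", "B2", "B3"]        # levels seen during fit
new_data = pd.Series(["B2", "B4"])       # "B4" never appeared in training

# Pinning the categories to the training levels keeps the indicator
# columns aligned; the unknown level "B4" becomes an all-zero row
# instead of introducing a new column.
aligned = pd.Categorical(new_data, categories=train_levels)
dummies = pd.get_dummies(aligned)
```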
[7]:
t_glm3.predict(df_test_noncat)
[7]:
array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,
67.98332028, 297.21717383])
4. Interactions and Structural Full-Rankness
One of the biggest strengths of Wilkinson formulas lies in their ability to concisely specify interactions between terms. glum implements this as well, and in a very efficient way: the interaction of categorical features is encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. When this is not possible, for example because a categorical is interacted with a numeric variable, sparse representations are used where appropriate. Just as with glum's categorical handling in general, you don't have to worry too much about the actual implementation: you can expect that glum will do the most efficient thing behind the scenes.
Let's see what that looks like on the insurance example! Suppose that we expect VehPower to have a different effect depending on DrivAge (e.g. performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows.
[8]:
formula_int = (
"PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area)"
" + C(DrivAge) * C(VehPower) + C(VehAge) + BonusMalus + Density"
)
t_glm4 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_int,
)
t_glm4.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm4.intercept_], t_glm4.coef_))},
index=["intercept"] + t_glm4.feature_names_,
).T
[8]:
  | intercept | C(VehBrand)[B1] | C(VehBrand)[B10] | C(VehBrand)[B11] | C(VehBrand)[B12] | C(VehBrand)[B13] | C(VehBrand)[B14] | C(VehBrand)[B2] | C(VehBrand)[B3] | C(VehBrand)[B4] | ... | C(DrivAge)[4]:C(VehPower)[8] | C(DrivAge)[5]:C(VehPower)[8] | C(DrivAge)[6]:C(VehPower)[8] | C(DrivAge)[0]:C(VehPower)[9] | C(DrivAge)[1]:C(VehPower)[9] | C(DrivAge)[2]:C(VehPower)[9] | C(DrivAge)[3]:C(VehPower)[9] | C(DrivAge)[4]:C(VehPower)[9] | C(DrivAge)[5]:C(VehPower)[9] | C(DrivAge)[6]:C(VehPower)[9]
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
coefficient | 2.88023 | -0.069076 | 0.0 | 0.221037 | -0.211854 | 0.052355 | -0.272058 | -0.074836 | 0.0 | 0.052523 | ... | -0.147844 | -0.03567 | 0.504407 | 0.682528 | -0.106569 | -0.308257 | 0.173206 | 0.010684 | -0.220273 | 0.070334 |
1 rows × 102 columns
Note that, in addition to the interactions, the non-interacted variants of DrivAge and VehPower are also included in the model. This is a result of using the * operator to interact the variables. Using : instead would only include the interactions, not the marginal terms. (In short, a * b is equivalent to a + b + a:b.)
[9]:
[name for name in t_glm4.feature_names_ if "VehPower" in name]
[9]:
['C(VehPower)[4]',
'C(VehPower)[5]',
'C(VehPower)[6]',
'C(VehPower)[7]',
'C(VehPower)[8]',
'C(VehPower)[9]',
'C(DrivAge)[0]:C(VehPower)[4]',
'C(DrivAge)[1]:C(VehPower)[4]',
'C(DrivAge)[2]:C(VehPower)[4]',
'C(DrivAge)[3]:C(VehPower)[4]',
'C(DrivAge)[4]:C(VehPower)[4]',
'C(DrivAge)[5]:C(VehPower)[4]',
'C(DrivAge)[6]:C(VehPower)[4]',
'C(DrivAge)[0]:C(VehPower)[5]',
'C(DrivAge)[1]:C(VehPower)[5]',
'C(DrivAge)[2]:C(VehPower)[5]',
'C(DrivAge)[3]:C(VehPower)[5]',
'C(DrivAge)[4]:C(VehPower)[5]',
'C(DrivAge)[5]:C(VehPower)[5]',
'C(DrivAge)[6]:C(VehPower)[5]',
'C(DrivAge)[0]:C(VehPower)[6]',
'C(DrivAge)[1]:C(VehPower)[6]',
'C(DrivAge)[2]:C(VehPower)[6]',
'C(DrivAge)[3]:C(VehPower)[6]',
'C(DrivAge)[4]:C(VehPower)[6]',
'C(DrivAge)[5]:C(VehPower)[6]',
'C(DrivAge)[6]:C(VehPower)[6]',
'C(DrivAge)[0]:C(VehPower)[7]',
'C(DrivAge)[1]:C(VehPower)[7]',
'C(DrivAge)[2]:C(VehPower)[7]',
'C(DrivAge)[3]:C(VehPower)[7]',
'C(DrivAge)[4]:C(VehPower)[7]',
'C(DrivAge)[5]:C(VehPower)[7]',
'C(DrivAge)[6]:C(VehPower)[7]',
'C(DrivAge)[0]:C(VehPower)[8]',
'C(DrivAge)[1]:C(VehPower)[8]',
'C(DrivAge)[2]:C(VehPower)[8]',
'C(DrivAge)[3]:C(VehPower)[8]',
'C(DrivAge)[4]:C(VehPower)[8]',
'C(DrivAge)[5]:C(VehPower)[8]',
'C(DrivAge)[6]:C(VehPower)[8]',
'C(DrivAge)[0]:C(VehPower)[9]',
'C(DrivAge)[1]:C(VehPower)[9]',
'C(DrivAge)[2]:C(VehPower)[9]',
'C(DrivAge)[3]:C(VehPower)[9]',
'C(DrivAge)[4]:C(VehPower)[9]',
'C(DrivAge)[5]:C(VehPower)[9]',
'C(DrivAge)[6]:C(VehPower)[9]']
The attentive reader might also have noticed that the first level of each categorical variable is omitted from the model. This is a manifestation of the more general concept of ensuring structural full-rankness.⁴ By default, glum and formulaic will try to make sure that one does not fall into the Dummy Variable Trap.
This even works in the case of (possibly multi-way) interactions involving categorical variables: the necessary number of levels is always dropped, and no more. If you want to opt out of this behavior (for example, because you would like to penalize all levels equally), simply set the drop_first parameter to False during model initialization. If one aims to include all levels of only certain variables, it is possible to do so using the spans_intercept parameter (e.g. C(VehPower, spans_intercept=False) would include all levels of VehPower even if drop_first is set to True).
⁴ Note that this does not guarantee that the design matrix is actually full rank. For example, two identical numerical variables will still lead to a rank-deficient design matrix.
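The Dummy Variable Trap can be demonstrated numerically with a small NumPy example (a sketch of the general linear-algebra fact, not glum code):

```python
import numpy as np

# Two-level categorical, fully dummy-encoded, next to an intercept:
intercept = np.ones(4)
level_a = np.array([1.0, 0.0, 1.0, 0.0])
level_b = 1.0 - level_a          # the indicators sum to the intercept column

X_full = np.column_stack([intercept, level_a, level_b])
X_drop = np.column_stack([intercept, level_b])   # first level dropped

rank_full = np.linalg.matrix_rank(X_full)  # 2, despite having 3 columns
rank_drop = np.linalg.matrix_rank(X_drop)  # 2, i.e. full column rank
```

Dropping one level per categorical restores full column rank, which is exactly what the default behavior described above does.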
5. Fun with Functions
The previous examples only scratch the surface of what formulas are capable of. For example, they can evaluate arbitrary Python expressions, which act as if they saw the columns of the input data frame as local variables (pandas.Series). The way to tell glum that a part of the formula should be evaluated as a Python expression before applying the formula grammar is to enclose it in curly braces. As an example, we can easily do the following within the formula itself:

1. Create the outcome variable on the fly instead of doing it beforehand.
2. Include the logarithm of a certain variable in the model.
3. Include a basis spline interpolation of a variable to capture non-linearities in its effect.

1 works because formulas can contain Python operations. 2 and 3 work because formulas are evaluated within a context that is aware of a number of transforms. To be precise, 2 is a regular transform and 3 is a stateful transform.
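The regular/stateful distinction can be sketched with plain NumPy, using centering as a stand-in for the stateful case (an illustration of the concept, not formulaic's implementation):

```python
import numpy as np

bonus_train = np.array([50.0, 60.0, 70.0, 100.0])
bonus_test = np.array([80.0])

# A regular transform needs no fitted state: np.log gives the same
# result no matter which dataset it is applied to.
log_test = np.log(bonus_test)

# A stateful transform learns parameters on the training data and
# re-applies them unchanged to new data. Centering is the simplest
# example; bs() works the same way, except that the remembered state
# is the spline knots.
train_mean = bonus_train.mean()
centered_test = bonus_test - train_mean   # uses the *training* mean
```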
Let’s try it out!
[10]:
formula_fun = (
"{ClaimAmountCut / Exposure} ~ VehBrand + VehGas + Region + Area"
" + DrivAge + VehAge + VehPower + bs(BonusMalus, 3) + np.log(Density)"
)
t_glm5 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_fun,
)
t_glm5.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm5.intercept_], t_glm5.coef_))},
index=["intercept"] + t_glm5.feature_names_,
).T
[10]:
  | intercept | VehBrand[B1] | VehBrand[B10] | VehBrand[B11] | VehBrand[B12] | VehBrand[B13] | VehBrand[B14] | VehBrand[B2] | VehBrand[B3] | VehBrand[B4] | ... | VehPower[4] | VehPower[5] | VehPower[6] | VehPower[7] | VehPower[8] | VehPower[9] | bs(BonusMalus, 3)[1] | bs(BonusMalus, 3)[2] | bs(BonusMalus, 3)[3] | np.log(Density)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
coefficient | 3.808829 | -0.060201 | 0.0 | 0.242194 | -0.202517 | 0.063471 | -0.345415 | -0.072546 | 0.00777 | 0.079391 | ... | -0.113038 | -0.127255 | 0.060209 | 0.005577 | -0.032114 | 0.207355 | 3.178178 | 0.361951 | 8.231846 | 0.121944 |
1 rows × 62 columns
To allow for even more flexibility, you can use custom transformations that are defined in the context from which the call is made. For example, we can define a transformation that takes the logarithm of VehAge + 1 after casting it to numeric. To make the formula recognize this transform, you need to explicitly set context=0 when calling the fit method (note that this differs from formulaic, whose default is already context=0).
[11]:
def _log_plus_one(x):
return np.log(pd.to_numeric(x) + 1)
formula_custom_fun = (
"{ClaimAmountCut / Exposure} ~ _log_plus_one(VehAge)"
)
t_glm6 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_custom_fun,
)
t_glm6.fit(df_train, sample_weight=df["Exposure"].values[train], context=0)
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm6.intercept_], t_glm6.coef_))},
index=["intercept"] + t_glm6.feature_names_,
).T
[11]:
  | intercept | _log_plus_one(VehAge)
---|---|---
coefficient | 5.046712 | -0.151043 |
6. Miscellaneous Features
Variable Names
glum's formula interface provides a lot of control over how the resulting features are named. By default, it follows formulaic's conventions, but this can be customized by setting the interaction_separator and categorical_format parameters.
[12]:
formula_name = "PurePremium ~ DrivAge * VehPower"
t_glm7 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_name,
interaction_separator="__x__",
categorical_format="{name}__{category}",
)
t_glm7.fit(df_train, sample_weight=df["Exposure"].values[train])
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm7.intercept_], t_glm7.coef_))},
index=["intercept"] + t_glm7.feature_names_,
).T
[12]:
  | intercept | DrivAge__0 | DrivAge__1 | DrivAge__2 | DrivAge__3 | DrivAge__4 | DrivAge__5 | DrivAge__6 | VehPower__4 | VehPower__5 | ... | DrivAge__4__x__VehPower__8 | DrivAge__5__x__VehPower__8 | DrivAge__6__x__VehPower__8 | DrivAge__0__x__VehPower__9 | DrivAge__1__x__VehPower__9 | DrivAge__2__x__VehPower__9 | DrivAge__3__x__VehPower__9 | DrivAge__4__x__VehPower__9 | DrivAge__5__x__VehPower__9 | DrivAge__6__x__VehPower__9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
coefficient | 5.007277 | 1.497079 | 0.53565 | 0.0 | -0.152974 | -0.210998 | -0.205689 | 0.017896 | -0.096153 | -0.05484 | ... | -0.143822 | -0.002094 | 0.512258 | 0.730534 | -0.280869 | -0.367669 | 0.171063 | 0.022052 | -0.270456 | 0.119634 |
1 rows × 56 columns
Intercept Term
Just like in the case of the non-formula interface, the presence of an intercept is determined by the fit_intercept argument. If the formula specifies a different behavior (e.g., adding +0 or -1 while fit_intercept=True), an error is raised.
[13]:
formula_noint = "PurePremium ~ DrivAge * VehPower - 1"
with pytest.raises(ValueError, match="The formula sets the intercept to False"):
t_glm8 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=True,
formula=formula_noint,
interaction_separator="__x__",
categorical_format="{name}__{category}",
).fit(df_train, sample_weight=df["Exposure"].values[train])
One-Sided Formulas
Even when using formulas, the outcome variable can be specified as a vector, as in the non-formula interface. In that case, the supplied formula should be one-sided (i.e., not contain a ~) and only describe the right-hand side of the regression.
[14]:
formula_onesie = "DrivAge * VehPower"
t_glm8 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=False,
formula=formula_onesie,
interaction_separator="__x__",
categorical_format="{name}__{category}",
)
t_glm8.fit(
X=df_train, y=df_train["PurePremium"], sample_weight=df["Exposure"].values[train]
)
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm8.intercept_], t_glm8.coef_))},
index=["intercept"] + t_glm8.feature_names_,
).T
[14]:
  | intercept | DrivAge__0 | DrivAge__1 | DrivAge__2 | DrivAge__3 | DrivAge__4 | DrivAge__5 | DrivAge__6 | VehPower__4 | VehPower__5 | ... | DrivAge__4__x__VehPower__8 | DrivAge__5__x__VehPower__8 | DrivAge__6__x__VehPower__8 | DrivAge__0__x__VehPower__9 | DrivAge__1__x__VehPower__9 | DrivAge__2__x__VehPower__9 | DrivAge__3__x__VehPower__9 | DrivAge__4__x__VehPower__9 | DrivAge__5__x__VehPower__9 | DrivAge__6__x__VehPower__9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
coefficient | 0.0 | 1.713298 | 0.783505 | 0.205914 | 0.016085 | 0.0 | 0.000094 | 0.223685 | 4.66123 | 4.736272 | ... | -0.144927 | 0.001657 | 0.515373 | 0.714834 | -0.325666 | -0.370935 | 0.20417 | 0.013222 | -0.273913 | 0.115693 |
1 rows × 56 columns
Missing Values in Categorical Columns
By default, glum raises a ValueError when it encounters a missing value in a categorical variable (the "raise" option). However, there are two other options for handling these cases: missing values can be treated as all-zero indicators (the "zero" option, which is also how pandas.get_dummies works), or they can be treated as their own separate category (the "convert" option).
As in the non-formula interface, glum's behavior can be set globally using the cat_missing_method parameter during model initialization. However, formulas provide some additional flexibility: the C function has a missing_method parameter, with which users can select an option on a column-by-column basis. Here is an example of doing that (although our dataset does not have any missing values, so these options have no actual effect in this case):
[15]:
formula_missing = "C(DrivAge, missing_method='zero') + C(VehPower, missing_method='convert')"
t_glm9 = GeneralizedLinearRegressor(
family=TweedieDist,
alpha_search=True,
l1_ratio=1,
fit_intercept=False,
formula=formula_missing,
)
t_glm9.fit(
X=df_train, y=df_train["PurePremium"], sample_weight=df["Exposure"].values[train]
)
pd.DataFrame(
{"coefficient": np.concatenate(([t_glm9.intercept_], t_glm9.coef_))},
index=["intercept"] + t_glm9.feature_names_,
).T
[15]:
  | intercept | C(DrivAge, missing_method='zero')[0] | C(DrivAge, missing_method='zero')[1] | C(DrivAge, missing_method='zero')[2] | C(DrivAge, missing_method='zero')[3] | C(DrivAge, missing_method='zero')[4] | C(DrivAge, missing_method='zero')[5] | C(DrivAge, missing_method='zero')[6] | C(VehPower, missing_method='convert')[4] | C(VehPower, missing_method='convert')[5] | C(VehPower, missing_method='convert')[6] | C(VehPower, missing_method='convert')[7] | C(VehPower, missing_method='convert')[8] | C(VehPower, missing_method='convert')[9]
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
coefficient | 0.0 | 1.786703 | 0.742765 | 0.239528 | 0.096531 | 0.071118 | 0.0 | 0.201078 | 4.637267 | 4.679391 | 4.863387 | 4.77263 | 4.749673 | 4.970188 |
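Conceptually, the "zero" and "convert" options map onto familiar pandas.get_dummies behavior (a sketch of the idea on toy data, not glum's internal code):

```python
import numpy as np
import pandas as pd

driv_age = pd.Series(["a", "b", np.nan])

# "zero": the missing row gets all-zero indicators
# (this is also pandas.get_dummies' default behavior).
zero_style = pd.get_dummies(driv_age)

# "convert": missing values form their own category, i.e. an extra column.
convert_style = pd.get_dummies(driv_age, dummy_na=True)
```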