Formula Interface Tutorial: Revisiting French Motor Third-Party Liability Claims

Intro

This tutorial showcases the formula interface of glum. It allows the design matrix and the response variable to be specified using so-called Wilkinson formulas instead of constructing them by hand. This kind of model specification should be familiar to R users and to those who have used the statsmodels or linearmodels Python packages before. This tutorial aims to introduce the basics of working with formulas to other users, as well as to highlight some important differences between glum’s formula implementation and those of other packages.

For a more in-depth look at how formulas work, please take a look at the documentation of `formulaic <https://matthewwardrop.github.io/formulaic/>`__, the package on which glum’s formula interface is based.

Background

This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the GLM tutorial. If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.

Sneak Peek

Formulas provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting columns to categorical types, creating interactions, applying transformations, or even adding B-spline basis expansions. As an example, consider the following formula:

{ClaimAmountCut / Exposure} ~ C(DrivAge, missing_method='convert') * C(VehPower, missing_method="zero") + bs(BonusMalus, 3)

Despite its brevity, it describes all of the following:

- The outcome variable is the ratio of ClaimAmountCut and Exposure.
- The predictors should include the interactions of the categorical variables DrivAge and VehPower, as well as those two variables themselves. (Even though they behave as such, neither the individual variables nor their interaction will be dummy-encoded by glum. For categoricals with many levels, this can lead to a substantial performance improvement over dummy encoding, especially for the interaction.)
- If there are missing values in DrivAge, they should be treated as a separate category.
- On the other hand, missing values in VehPower should be treated as all-zero indicators.
- The predictors should also include a third-degree B-spline basis expansion of BonusMalus.

The following chapters demonstrate each of these features in some detail, as well as some additional advantages of using the formula interface.

[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pytest
import scipy.optimize as optimize
import scipy.stats
from dask_ml.preprocessing import Categorizer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit
from glum import GeneralizedLinearRegressor
from glum import TweedieDistribution

from load_transform_formula import load_transform

1. Load and Prepare Datasets from OpenML

First, we load our dataset from OpenML and apply several transformations. In the interest of simplicity, we do not include the data loading and preparation code in this notebook.

[2]:
df = load_transform()
with pd.option_context('display.max_rows', 10):
    display(df)
ClaimNb Exposure Area VehPower VehAge DrivAge BonusMalus VehBrand VehGas Density Region ClaimAmount ClaimAmountCut
IDpol
1 0 0.10000 D 5 0 5 50 B12 Regular 1217 R82 0.0 0.0
3 0 0.77000 D 5 0 5 50 B12 Regular 1217 R82 0.0 0.0
5 0 0.75000 B 6 1 5 50 B12 Diesel 54 R22 0.0 0.0
10 0 0.09000 B 7 0 4 50 B12 Diesel 76 R72 0.0 0.0
11 0 0.84000 B 7 0 4 50 B12 Diesel 76 R72 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6114326 0 0.00274 E 4 0 5 50 B12 Regular 3317 R93 0.0 0.0
6114327 0 0.00274 E 4 0 4 95 B12 Regular 9850 R11 0.0 0.0
6114328 0 0.00274 D 6 1 4 50 B12 Diesel 1323 R82 0.0 0.0
6114329 0 0.00274 B 4 0 5 50 B12 Regular 95 R26 0.0 0.0
6114330 0 0.00274 B 7 1 2 54 B12 Diesel 65 R72 0.0 0.0

678013 rows × 13 columns

2. Reproducing the Model From the GLM Tutorial

Now, let us start by fitting a very simple model. As usual, we divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less typically, we do not create separate y and X data frames for the label and the features – the formula will take care of that for us.

We still have some preprocessing to do:

- Many of the ordinal or nominal variables are encoded as integers instead of as categoricals. We will need to convert these so that glum will know to estimate a separate coefficient for each of their levels.
- The outcome variable is a transformation of other columns, so we need to create it first.

As we will see later on, these steps can be incorporated into the formula itself, but let’s not overcomplicate things at first.

[3]:
ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train, test = next(ss.split(df))

df = df.assign(PurePremium=lambda x: x["ClaimAmountCut"] / x["Exposure"])

glm_categorizer = Categorizer(
    columns=["VehBrand", "VehGas", "Region", "Area", "DrivAge", "VehAge", "VehPower"]
)
df_train = glm_categorizer.fit_transform(df.iloc[train])
df_test = glm_categorizer.transform(df.iloc[test])

This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (~), and the individual predictors are separated by plus signs (+). Thus, formulas provide a concise way of specifying a model without the need to create data frames by hand.

[4]:
formula = "PurePremium ~ VehBrand + VehGas + Region + Area + DrivAge + VehAge + VehPower + BonusMalus + Density"

TweedieDist = TweedieDistribution(1.5)
t_glm1 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=True,
    formula=formula,
)
t_glm1.fit(df_train, sample_weight=df["Exposure"].values[train])

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm1.intercept_], t_glm1.coef_))},
    index=["intercept"] + t_glm1.feature_names_,
).T
[4]:
intercept VehBrand[B1] VehBrand[B10] VehBrand[B11] VehBrand[B12] VehBrand[B13] VehBrand[B14] VehBrand[B2] VehBrand[B3] VehBrand[B4] ... VehAge[1] VehAge[2] VehPower[4] VehPower[5] VehPower[6] VehPower[7] VehPower[8] VehPower[9] BonusMalus Density
coefficient 2.88667 -0.064157 0.0 0.231868 -0.211061 0.054979 -0.270346 -0.071453 0.00291 0.059324 ... 0.008117 -0.229906 -0.111796 -0.123388 0.060757 0.005179 -0.021832 0.208158 0.032508 0.000002

1 rows × 60 columns

3. Categorical Variables

glum also provides extensive support for categorical variables. The main function one needs to be aware of in the context of categoricals is simply called C(). A variable placed within it is always converted to a categorical, regardless of its type.

A huge part of tabmat’s and glum’s performance advantage comes from the fact that categoricals need not be one-hot encoded, yet are treated as if they were. For this reason, we do not support coding schemes other than one-hot within the formula interface. If you need a different categorical encoding, you can always apply it manually (or even use formulaic directly) before the estimation, as in the sketch below.
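For instance, switching to sum (deviation) coding could look like the following minimal sketch. It assumes formulaic’s contr.sum contrast spelling, and the names X_train_sum and t_glm_sum are ours, purely for illustration:

[ ]:
from formulaic import model_matrix

# Sketch: materialize a sum-coded design matrix with formulaic directly,
# then hand it to glum's non-formula interface. The "- 1" removes
# formulaic's intercept column, since glum adds its own.
X_train_sum = model_matrix("C(VehGas, contr.sum) + BonusMalus - 1", df_train)

t_glm_sum = GeneralizedLinearRegressor(
    family=TweedieDist, alpha_search=True, l1_ratio=1, fit_intercept=True
)
t_glm_sum.fit(
    pd.DataFrame(X_train_sum),
    df_train["PurePremium"],
    sample_weight=df["Exposure"].values[train],
)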

Let’s try it out on our dataset!

[5]:
df_train_noncat = df.iloc[train]
df_test_noncat = df.iloc[test]

df_train_noncat.dtypes
[5]:
ClaimNb             int64
Exposure          float64
Area               object
VehPower            int64
VehAge              int64
DrivAge             int64
BonusMalus          int64
VehBrand           object
VehGas             object
Density             int64
Region             object
ClaimAmount       float64
ClaimAmountCut    float64
PurePremium       float64
dtype: object

Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the C() function. Strings, such as VehBrand or VehGas, would have been handled as categoricals by default anyway, but using the C() function never hurts: applied to something that is already a categorical variable, it has no effect beyond the feature name.

[6]:
formula_cat = (
    "PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area) "
    "+ C(DrivAge) + C(VehAge) + C(VehPower) + BonusMalus + Density"
)

t_glm3 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=True,
    formula=formula_cat,
)
t_glm3.fit(df_train_noncat, sample_weight=df["Exposure"].values[train])

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm3.intercept_], t_glm3.coef_))},
    index=["intercept"] + t_glm3.feature_names_,
).T
[6]:
intercept C(VehBrand)[B1] C(VehBrand)[B10] C(VehBrand)[B11] C(VehBrand)[B12] C(VehBrand)[B13] C(VehBrand)[B14] C(VehBrand)[B2] C(VehBrand)[B3] C(VehBrand)[B4] ... C(VehAge)[1] C(VehAge)[2] C(VehPower)[4] C(VehPower)[5] C(VehPower)[6] C(VehPower)[7] C(VehPower)[8] C(VehPower)[9] BonusMalus Density
coefficient 2.88667 -0.064157 0.0 0.231868 -0.211061 0.054979 -0.270346 -0.071453 0.00291 0.059324 ... 0.008117 -0.229906 -0.111796 -0.123388 0.060757 0.005179 -0.021832 0.208158 0.032508 0.000002

1 rows × 60 columns

Finally, prediction works as expected with categorical variables. glum keeps track of the levels present in the training dataset and makes sure that categorical variables in unseen datasets are properly aligned, even if they have missing or unknown levels.[3] Therefore, one can simply use predict, and glum does The Right Thing™ by default.

[3]: This is made possible by glum saving a `ModelSpec object <https://matthewwardrop.github.io/formulaic/guides/model_specs/>`__, which contains all the information necessary for reapplying the transformations that were done during the formula materialization process. This is especially relevant in the case of stateful transforms, such as creating categorical variables.

[7]:
t_glm3.predict(df_test_noncat)
[7]:
array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,
        67.98332028, 297.21717383])
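As a quick sketch of this alignment, we can score a subset of the test data from which one brand is entirely absent; the stored ModelSpec still encodes the data consistently with the training levels, so the coefficients stay aligned:

[ ]:
# Sketch: no B12 rows in the new data, yet prediction works unchanged.
t_glm3.predict(df_test_noncat[df_test_noncat["VehBrand"] != "B12"])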

4. Interactions and Structural Full-Rankness

One of the biggest strengths of Wilkinson formulas lies in their ability to concisely specify interactions between terms. glum implements this as well, and in a very efficient way: the interaction of categorical features is encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. Where this is not possible, for example because a categorical is interacted with a numeric variable, sparse representations are used when appropriate. As with glum’s categorical handling in general, you don’t have to worry too much about the actual implementation and can expect that glum will do the most efficient thing behind the scenes.

Let’s see how that looks on the insurance example! Suppose we expect VehPower to have a different effect depending on DrivAge (e.g., performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows.

[8]:
formula_int = (
    "PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area)"
    " + C(DrivAge) * C(VehPower) + C(VehAge) + BonusMalus + Density"
)

t_glm4 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=True,
    formula=formula_int,
)
t_glm4.fit(df_train, sample_weight=df["Exposure"].values[train])

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm4.intercept_], t_glm4.coef_))},
    index=["intercept"] + t_glm4.feature_names_,
).T
[8]:
intercept C(VehBrand)[B1] C(VehBrand)[B10] C(VehBrand)[B11] C(VehBrand)[B12] C(VehBrand)[B13] C(VehBrand)[B14] C(VehBrand)[B2] C(VehBrand)[B3] C(VehBrand)[B4] ... C(DrivAge)[4]:C(VehPower)[8] C(DrivAge)[5]:C(VehPower)[8] C(DrivAge)[6]:C(VehPower)[8] C(DrivAge)[0]:C(VehPower)[9] C(DrivAge)[1]:C(VehPower)[9] C(DrivAge)[2]:C(VehPower)[9] C(DrivAge)[3]:C(VehPower)[9] C(DrivAge)[4]:C(VehPower)[9] C(DrivAge)[5]:C(VehPower)[9] C(DrivAge)[6]:C(VehPower)[9]
coefficient 2.88023 -0.069076 0.0 0.221037 -0.211854 0.052355 -0.272058 -0.074836 0.0 0.052523 ... -0.147844 -0.03567 0.504407 0.682528 -0.106569 -0.308257 0.173206 0.010684 -0.220273 0.070334

1 rows × 102 columns

Note that, in addition to the interactions, the non-interacted variants of DrivAge and VehPower are also included in the model. This is a result of using the * operator to interact the variables. Using : instead would only include the interactions, not the marginal terms; see the sketch below the feature list. (In short, a * b is equivalent to a + b + a:b.)

[9]:
[name for name in t_glm4.feature_names_ if "VehPower" in name]
[9]:
['C(VehPower)[4]',
 'C(VehPower)[5]',
 'C(VehPower)[6]',
 'C(VehPower)[7]',
 'C(VehPower)[8]',
 'C(VehPower)[9]',
 'C(DrivAge)[0]:C(VehPower)[4]',
 'C(DrivAge)[1]:C(VehPower)[4]',
 'C(DrivAge)[2]:C(VehPower)[4]',
 'C(DrivAge)[3]:C(VehPower)[4]',
 'C(DrivAge)[4]:C(VehPower)[4]',
 'C(DrivAge)[5]:C(VehPower)[4]',
 'C(DrivAge)[6]:C(VehPower)[4]',
 'C(DrivAge)[0]:C(VehPower)[5]',
 'C(DrivAge)[1]:C(VehPower)[5]',
 'C(DrivAge)[2]:C(VehPower)[5]',
 'C(DrivAge)[3]:C(VehPower)[5]',
 'C(DrivAge)[4]:C(VehPower)[5]',
 'C(DrivAge)[5]:C(VehPower)[5]',
 'C(DrivAge)[6]:C(VehPower)[5]',
 'C(DrivAge)[0]:C(VehPower)[6]',
 'C(DrivAge)[1]:C(VehPower)[6]',
 'C(DrivAge)[2]:C(VehPower)[6]',
 'C(DrivAge)[3]:C(VehPower)[6]',
 'C(DrivAge)[4]:C(VehPower)[6]',
 'C(DrivAge)[5]:C(VehPower)[6]',
 'C(DrivAge)[6]:C(VehPower)[6]',
 'C(DrivAge)[0]:C(VehPower)[7]',
 'C(DrivAge)[1]:C(VehPower)[7]',
 'C(DrivAge)[2]:C(VehPower)[7]',
 'C(DrivAge)[3]:C(VehPower)[7]',
 'C(DrivAge)[4]:C(VehPower)[7]',
 'C(DrivAge)[5]:C(VehPower)[7]',
 'C(DrivAge)[6]:C(VehPower)[7]',
 'C(DrivAge)[0]:C(VehPower)[8]',
 'C(DrivAge)[1]:C(VehPower)[8]',
 'C(DrivAge)[2]:C(VehPower)[8]',
 'C(DrivAge)[3]:C(VehPower)[8]',
 'C(DrivAge)[4]:C(VehPower)[8]',
 'C(DrivAge)[5]:C(VehPower)[8]',
 'C(DrivAge)[6]:C(VehPower)[8]',
 'C(DrivAge)[0]:C(VehPower)[9]',
 'C(DrivAge)[1]:C(VehPower)[9]',
 'C(DrivAge)[2]:C(VehPower)[9]',
 'C(DrivAge)[3]:C(VehPower)[9]',
 'C(DrivAge)[4]:C(VehPower)[9]',
 'C(DrivAge)[5]:C(VehPower)[9]',
 'C(DrivAge)[6]:C(VehPower)[9]']
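As a sketch of the difference, replacing * with : in the formula would leave only interaction terms among the VehPower-related features (the name t_glm4b is ours, for illustration):

[ ]:
# Sketch: ':' creates only the interaction terms, without the marginals.
formula_int_only = (
    "PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area)"
    " + C(DrivAge) : C(VehPower) + C(VehAge) + BonusMalus + Density"
)

t_glm4b = GeneralizedLinearRegressor(
    family=TweedieDist, alpha_search=True, l1_ratio=1, formula=formula_int_only
).fit(df_train, sample_weight=df["Exposure"].values[train])

# Every VehPower-related feature is now an interaction:
all(":" in name for name in t_glm4b.feature_names_ if "VehPower" in name)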

The attentive reader might also have noticed that the first level of each categorical variable is omitted from the model. This is a manifestation of the more general concept of ensuring structural full-rankness.[4] By default, glum and formulaic try to make sure that one does not fall into the Dummy Variable Trap, even in the case of (possibly multi-way) interactions involving categorical variables: they always drop the necessary number of levels, and no more. If you want to opt out of this behavior (for example, because you would like to penalize all levels equally), simply set the drop_first parameter to False during model initialization. If you aim to include all levels of certain variables but not others, you can do so with the spans_intercept parameter (e.g., C(VehPower, spans_intercept=False) would include all levels of VehPower even if drop_first is set to True).

[4]: Note that this does not guarantee that the design matrix is actually full rank. For example, two identical numerical variables will still lead to a rank-deficient design matrix.
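As a minimal sketch, opting out of level dropping (e.g., to penalize all levels symmetrically) would look like this; the resulting design matrix is structurally rank-deficient together with the intercept, which the penalty resolves (the name t_glm_full is ours, for illustration):

[ ]:
# Sketch: keep the first level of every categorical as well.
t_glm_full = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=True,
    formula=formula_int,
    drop_first=False,
).fit(df_train, sample_weight=df["Exposure"].values[train])

len(t_glm_full.feature_names_)  # more columns than with the default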

5. Fun with Functions

The previous examples only scratch the surface of what formulas are capable of. For instance, they can evaluate arbitrary Python expressions, which behave as if they saw the columns of the input data frame as local variables (pandas.Series). To tell glum that a part of the formula should be evaluated as a Python expression before the formula grammar is applied to it, enclose it in curly braces. As an example, we can easily do the following within the formula itself:

  1. Create the outcome variable on the fly instead of doing it beforehand.

  2. Include the logarithm of a certain variable in the model.

  3. Include a B-spline basis expansion of a variable to capture non-linearities in its effect.

  1. works because formulas can contain Python operations. 2. and 3. work because formulas are evaluated within a context that is aware of a number of transforms. To be precise, 2. is a regular transform and 3. is a stateful transform; the sketch below illustrates the difference.
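The distinction matters at prediction time: a regular transform such as np.log is simply re-evaluated on new data, whereas a stateful transform such as bs remembers its state (here, the spline knots) from the original materialization. A minimal sketch using formulaic directly, on hypothetical toy data, assuming np is available in the evaluation context (as it is in glum’s):

[ ]:
from formulaic import model_matrix

df_toy = pd.DataFrame({"x": np.linspace(1.0, 10.0, 50)})
df_new = pd.DataFrame({"x": [2.5, 7.5]})

# bs(x, 3) computes its knots on df_toy and stores them in the ModelSpec;
# np.log(x) is simply recomputed element-wise on whatever data it sees.
mm = model_matrix("np.log(x) + bs(x, 3)", df_toy)

# Applying the stored spec to new data reuses the same spline basis.
mm.model_spec.get_model_matrix(df_new)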

Let’s try it out!

[10]:
formula_fun = (
    "{ClaimAmountCut / Exposure} ~ VehBrand + VehGas + Region + Area"
    " + DrivAge + VehAge + VehPower + bs(BonusMalus, 3) + np.log(Density)"
)

t_glm5 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=True,
    formula=formula_fun,
)
t_glm5.fit(df_train, sample_weight=df["Exposure"].values[train])

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm5.intercept_], t_glm5.coef_))},
    index=["intercept"] + t_glm5.feature_names_,
).T
[10]:
intercept VehBrand[B1] VehBrand[B10] VehBrand[B11] VehBrand[B12] VehBrand[B13] VehBrand[B14] VehBrand[B2] VehBrand[B3] VehBrand[B4] ... VehPower[4] VehPower[5] VehPower[6] VehPower[7] VehPower[8] VehPower[9] bs(BonusMalus, 3)[1] bs(BonusMalus, 3)[2] bs(BonusMalus, 3)[3] np.log(Density)
coefficient 3.808829 -0.060201 0.0 0.242194 -0.202517 0.063471 -0.345415 -0.072546 0.00777 0.079391 ... -0.113038 -0.127255 0.060209 0.005577 -0.032114 0.207355 3.178178 0.361951 8.231846 0.121944

1 rows × 62 columns

To allow for even more flexibility, you can use custom transformations that are defined in the context from which the call is made. For example, we can define a transformation that takes the logarithm of VehAge + 1 after casting it to numeric. To make the formula recognize this transform, you need to explicitly set context=0 when calling the fit method (note that this differs from formulaic, where context=0 is already the default).

[11]:
def _log_plus_one(x):
    return np.log(pd.to_numeric(x) + 1)

formula_custom_fun = (
    "{ClaimAmountCut / Exposure} ~ _log_plus_one(VehAge)"
)

t_glm6 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=True,
    formula=formula_custom_fun,
)
t_glm6.fit(df_train, sample_weight=df["Exposure"].values[train], context=0)

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm6.intercept_], t_glm6.coef_))},
    index=["intercept"] + t_glm6.feature_names_,
).T
[11]:
intercept _log_plus_one(VehAge)
coefficient 5.046712 -0.151043

6. Miscellaneous Features

Variable Names

glum’s formula interface provides a lot of control over how the resulting features are named. By default, it follows formulaic’s conventions, but the naming can be customized by setting the interaction_separator and categorical_format parameters.

[12]:
formula_name = "PurePremium ~ DrivAge * VehPower"

t_glm7 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=True,
    formula=formula_name,
    interaction_separator="__x__",
    categorical_format="{name}__{category}",
)
t_glm7.fit(df_train, sample_weight=df["Exposure"].values[train])

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm7.intercept_], t_glm7.coef_))},
    index=["intercept"] + t_glm7.feature_names_,
).T
[12]:
intercept DrivAge__0 DrivAge__1 DrivAge__2 DrivAge__3 DrivAge__4 DrivAge__5 DrivAge__6 VehPower__4 VehPower__5 ... DrivAge__4__x__VehPower__8 DrivAge__5__x__VehPower__8 DrivAge__6__x__VehPower__8 DrivAge__0__x__VehPower__9 DrivAge__1__x__VehPower__9 DrivAge__2__x__VehPower__9 DrivAge__3__x__VehPower__9 DrivAge__4__x__VehPower__9 DrivAge__5__x__VehPower__9 DrivAge__6__x__VehPower__9
coefficient 5.007277 1.497079 0.53565 0.0 -0.152974 -0.210998 -0.205689 0.017896 -0.096153 -0.05484 ... -0.143822 -0.002094 0.512258 0.730534 -0.280869 -0.367669 0.171063 0.022052 -0.270456 0.119634

1 rows × 56 columns

Intercept Term

Just like in the case of the non-formula interface, the presence of an intercept is determined by the fit_intercept argument. If the formula specifies a different behavior (e.g., adding +0 or -1 while fit_intercept=True), an error is raised.

[13]:
formula_noint = "PurePremium ~ DrivAge * VehPower - 1"

with pytest.raises(ValueError, match="The formula sets the intercept to False"):
    t_glm8 = GeneralizedLinearRegressor(
        family=TweedieDist,
        alpha_search=True,
        l1_ratio=1,
        fit_intercept=True,
        formula=formula_noint,
        interaction_separator="__x__",
        categorical_format="{name}__{category}",
    ).fit(df_train, sample_weight=df["Exposure"].values[train])

One-Sided Formulas

Even when using formulas, the outcome variable can be specified as a separate vector, just as in the interface without formulas. In that case, the supplied formula should be one-sided (i.e., not contain a ~) and only describe the right-hand side of the regression.

[14]:
formula_onesie = "DrivAge * VehPower"

t_glm8 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=False,
    formula=formula_onesie,
    interaction_separator="__x__",
    categorical_format="{name}__{category}",
)
t_glm8.fit(
    X=df_train, y=df_train["PurePremium"], sample_weight=df["Exposure"].values[train]
)

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm8.intercept_], t_glm8.coef_))},
    index=["intercept"] + t_glm8.feature_names_,
).T
[14]:
intercept DrivAge__0 DrivAge__1 DrivAge__2 DrivAge__3 DrivAge__4 DrivAge__5 DrivAge__6 VehPower__4 VehPower__5 ... DrivAge__4__x__VehPower__8 DrivAge__5__x__VehPower__8 DrivAge__6__x__VehPower__8 DrivAge__0__x__VehPower__9 DrivAge__1__x__VehPower__9 DrivAge__2__x__VehPower__9 DrivAge__3__x__VehPower__9 DrivAge__4__x__VehPower__9 DrivAge__5__x__VehPower__9 DrivAge__6__x__VehPower__9
coefficient 0.0 1.713298 0.783505 0.205914 0.016085 0.0 0.000094 0.223685 4.66123 4.736272 ... -0.144927 0.001657 0.515373 0.714834 -0.325666 -0.370935 0.20417 0.013222 -0.273913 0.115693

1 rows × 56 columns

Missing Values in Categorical Columns

By default, glum raises a ValueError when it encounters a missing value in a categorical variable (the "raise" option). However, there are two other options for handling these cases: missing values can be treated as all-zero indicators (the "zero" option, which is also how pandas.get_dummies works), or they can be treated as their own separate category (the "convert" option).

Similarly to the non-formula-based interface, glum’s behavior can be set globally using the cat_missing_method parameter during model initialization. However, formulas provide some additional flexibility: the C() function has a missing_method parameter, with which users can select an option on a column-by-column basis. Here is an example of doing that (although our dataset does not have any missing values, so these options have no actual effect here; a toy example with actual missing values follows at the end):

[15]:
formula_missing = "C(DrivAge, missing_method='zero') + C(VehPower, missing_method='convert')"

t_glm9 = GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    fit_intercept=False,
    formula=formula_missing,
)
t_glm9.fit(
    X=df_train, y=df_train["PurePremium"], sample_weight=df["Exposure"].values[train]
)

pd.DataFrame(
    {"coefficient": np.concatenate(([t_glm9.intercept_], t_glm9.coef_))},
    index=["intercept"] + t_glm9.feature_names_,
).T
[15]:
intercept C(DrivAge, missing_method='zero')[0] C(DrivAge, missing_method='zero')[1] C(DrivAge, missing_method='zero')[2] C(DrivAge, missing_method='zero')[3] C(DrivAge, missing_method='zero')[4] C(DrivAge, missing_method='zero')[5] C(DrivAge, missing_method='zero')[6] C(VehPower, missing_method='convert')[4] C(VehPower, missing_method='convert')[5] C(VehPower, missing_method='convert')[6] C(VehPower, missing_method='convert')[7] C(VehPower, missing_method='convert')[8] C(VehPower, missing_method='convert')[9]
coefficient 0.0 1.786703 0.742765 0.239528 0.096531 0.071118 0.0 0.201078 4.637267 4.679391 4.863387 4.77263 4.749673 4.970188
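
Since our dataset has no missing values, here is a toy sketch (the frame df_na and the model names are ours, purely for illustration) showing what the two non-raising options actually do:

[ ]:
# Sketch: a tiny frame with an actual missing value in the categorical.
df_na = pd.DataFrame(
    {"cat": pd.Categorical(["a", "b", None, "a"]), "y": [1.0, 2.0, 3.0, 4.0]}
)

# 'convert': missing values become their own category.
m_convert = GeneralizedLinearRegressor(
    formula="y ~ C(cat, missing_method='convert')", fit_intercept=False
).fit(df_na)
print(m_convert.feature_names_)  # contains a separate missing-value level

# 'zero': the missing row is encoded as all-zero indicators.
m_zero = GeneralizedLinearRegressor(
    formula="y ~ C(cat, missing_method='zero')", fit_intercept=False
).fit(df_na)
print(m_zero.feature_names_)  # only the observed levels appear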