Getting Started: fitting a Lasso model
The purpose of this tutorial is to show the basics of glum. It assumes a working knowledge of Python, regularized linear models, and machine learning. The API is very similar to scikit-learn; after all, glum is based on a fork of scikit-learn.
If you have not done so already, please refer to our installation instructions for installing glum.
[1]:
import pandas as pd
import sklearn.model_selection  # a bare `import sklearn` does not guarantee the submodule is loaded
from sklearn.datasets import fetch_openml
from glum import GeneralizedLinearRegressor, GeneralizedLinearRegressorCV
Data
We start by loading the King County housing dataset from OpenML and splitting it into training and test sets. For simplicity, we don’t go into any details regarding exploration or data cleaning.
[2]:
house_data = fetch_openml(name="house_sales", version=3, as_frame=True)
# Use only select features
X = house_data.data[
    [
        "bedrooms",
        "bathrooms",
        "sqft_living",
        "floors",
        "waterfront",
        "view",
        "condition",
        "grade",
        "yr_built",
    ]
].copy()
# Targets
y = house_data.target
[3]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.3, random_state=5
)
GLM basics: fitting and predicting using the normal family
We’ll use glum.GeneralizedLinearRegressor to predict the house prices using the available predictors.
We set three key parameters:
family: the distributional assumption of the GLM and, as a consequence, the loss function to be minimized. Accepted strings are ‘normal’, ‘poisson’, ‘gamma’, ‘inverse.gaussian’, and ‘binomial’. You can also pass in an instantiated glum distribution (e.g. glum.TweedieDistribution(1.5)).
alpha: the constant multiplying the penalty term that determines regularization strength. (Note: GeneralizedLinearRegressor also has an alpha-search option. See the GeneralizedLinearRegressorCV example below for details on how alpha-search works.)
l1_ratio: the elastic net mixing parameter (0 <= l1_ratio <= 1). For l1_ratio = 0, the penalty is the L2 penalty (ridge). For l1_ratio = 1, it is an L1 penalty (lasso). For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
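To see how l1_ratio blends the two penalties, here is a minimal NumPy sketch. It is an illustration rather than glum's internal code: the helper name elastic_net_penalty and the toy coefficients are made up, and the 0.5 factor on the L2 term is the usual elastic net convention, assumed here.

```python
import numpy as np


def elastic_net_penalty(beta, alpha, l1_ratio):
    """alpha * (l1_ratio * ||beta||_1 + 0.5 * (1 - l1_ratio) * ||beta||_2^2).

    Hypothetical helper for illustration; the 0.5 factor on the L2 term
    is an assumed convention, not taken from this tutorial.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta**2)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


# Toy coefficient vector (made up).
beta_toy = [3.0, -4.0]
lasso_pen = elastic_net_penalty(beta_toy, alpha=0.1, l1_ratio=1.0)  # pure L1
ridge_pen = elastic_net_penalty(beta_toy, alpha=0.1, l1_ratio=0.0)  # pure L2
mixed_pen = elastic_net_penalty(beta_toy, alpha=0.1, l1_ratio=0.5)  # blend
print(lasso_pen, ridge_pen, mixed_pen)
```

With l1_ratio = 1 only the absolute values of the coefficients contribute, which is the setting used for the lasso fit in this tutorial.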
To be precise, we minimize the following function with respect to the parameters \(\beta\), where \(N\) is the number of training observations:
\begin{equation} \frac{1}{N}\|\mathbf{X}\beta - y\|_2^2 + \alpha\|\beta\|_1 \end{equation}
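As a rough illustration, this objective can be evaluated directly with NumPy. The sketch follows the equation as written above and is not glum's internal code; the helper name lasso_objective and the toy data are invented for the example, and the intercept is left out for brevity.

```python
import numpy as np


def lasso_objective(X, y, beta, alpha):
    """Mean squared residual plus an L1 penalty, as in the equation above.

    Hypothetical helper for illustration; not part of glum.
    """
    residuals = X @ beta - y
    data_term = np.sum(residuals**2) / len(y)
    penalty = alpha * np.sum(np.abs(beta))
    return data_term + penalty


# Toy data (made up): 4 observations, 2 features.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
y_toy = np.array([1.0, 2.0, 3.0, 2.0])

# beta = (1, 2) reproduces y_toy exactly, so only the penalty term remains.
value_exact = lasso_objective(X_toy, y_toy, np.array([1.0, 2.0]), alpha=0.1)
# beta = 0 has no penalty, so only the mean squared target remains.
value_zero = lasso_objective(X_toy, y_toy, np.zeros(2), alpha=0.1)
print(value_exact, value_zero)
```

Increasing alpha makes large coefficients more expensive, which is what pushes some of them to exactly zero in a lasso fit.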
[4]:
glm = GeneralizedLinearRegressor(family="normal", alpha=0.1, l1_ratio=1)
The GeneralizedLinearRegressor.fit() method follows typical sklearn API style and accepts two primary inputs:
X: the design matrix with shape (n_samples, n_features).
y: the n_samples length array of target data.
[5]:
glm.fit(X_train, y_train)
[5]:
GeneralizedLinearRegressor(alpha=0.1, l1_ratio=1)
Once the model has been estimated, we can retrieve useful information using an sklearn-style syntax.
[6]:
# retrieve the coefficients and the intercept
coefs = glm.coef_
intercept = glm.intercept_
# use the model to predict on our test data
preds = glm.predict(X_test)
preds[0:5]
[6]:
array([ 482648.22861066, 142902.68859995, 539452.61266391,
569693.78048569, 1042446.90903451])
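One way to judge such predictions is to compare them with the held-out targets. The sketch below computes the root mean squared error and counts the nonzero coefficients, which is of interest because an L1 penalty can set coefficients exactly to zero. The helper names and the toy numbers are invented for illustration; in this tutorial you would pass y_test, preds, and glm.coef_ instead.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def count_nonzero_coefs(coefs, tol=0.0):
    """Number of coefficients whose absolute value exceeds `tol`."""
    return int(np.sum(np.abs(np.asarray(coefs, dtype=float)) > tol))


# Toy values (made up) standing in for y_test, preds, and glm.coef_.
y_true_toy = [3.0, 5.0, 7.0]
y_pred_toy = [2.0, 5.0, 9.0]
coefs_toy = [0.0, 1.5, -0.2, 0.0]

error_toy = rmse(y_true_toy, y_pred_toy)
n_active = count_nonzero_coefs(coefs_toy)
print(error_toy, n_active)
```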
Fitting a GLM with cross validation
Now, we fit using automatic cross validation with glum.GeneralizedLinearRegressorCV. This mirrors the commonly used cv.glmnet function.
Some important parameters:
alphas: for GeneralizedLinearRegressorCV, the best alpha will be found by searching along the regularization path. The regularization path is determined as follows:
    If alphas is an iterable, use it directly. All other parameters governing the regularization path are ignored.
    If min_alpha is set, create a path from min_alpha to the lowest alpha such that all coefficients are zero.
    If min_alpha_ratio is set, create a path where the ratio min_alpha / max_alpha equals min_alpha_ratio.
    If none of the above parameters are set, use a min_alpha_ratio of 1e-6.
l1_ratio: for GeneralizedLinearRegressorCV, if you pass l1_ratio as an array, the fit method will choose the best value of l1_ratio and store it as self.l1_ratio_.
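To make the path construction concrete, here is a sketch of a grid of alphas running from max_alpha (taken as the lowest alpha at which all coefficients are zero) down to the smallest alpha. It is an illustration under assumptions, not glum's actual implementation: the function name alpha_path, the n_alphas argument, and the log spacing are choices made for the example, and max_alpha is simply passed in rather than computed from the data.

```python
import numpy as np


def alpha_path(max_alpha, min_alpha=None, min_alpha_ratio=None, n_alphas=100):
    """Log-spaced alphas from max_alpha down to the smallest alpha.

    Hypothetical helper following the order of the rules described above:
    an explicit min_alpha is used first, then min_alpha_ratio, and
    otherwise a default ratio of 1e-6.
    """
    if min_alpha is None:
        ratio = 1e-6 if min_alpha_ratio is None else min_alpha_ratio
        min_alpha = max_alpha * ratio
    # geomspace returns points evenly spaced on a log scale, endpoints included.
    return np.geomspace(max_alpha, min_alpha, num=n_alphas)


path = alpha_path(max_alpha=10.0, min_alpha_ratio=1e-3, n_alphas=5)
default_path = alpha_path(max_alpha=1.0, n_alphas=3)
print(path)
print(default_path)
```

Starting at the largest alpha and moving downward is a common design for path algorithms, since each fit can be warm-started from the previous, more heavily regularized solution.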
[7]:
glmcv = GeneralizedLinearRegressorCV(
    family="normal",
    alphas=None,  # default
    min_alpha=None,  # default
    min_alpha_ratio=None,  # default
    l1_ratio=[0, 0.5, 1.0],
    fit_intercept=True,
    max_iter=150,
)
glmcv.fit(X_train, y_train)
print(f"Chosen alpha: {glmcv.alpha_}")
print(f"Chosen l1 ratio: {glmcv.l1_ratio_}")
Chosen alpha: 0.0003274549162877732
Chosen l1 ratio: 0.0
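The selection step can be pictured with a small K-fold loop. The sketch below is not how glum implements GeneralizedLinearRegressorCV: it swaps in a closed-form ridge fit (the l1_ratio = 0 case, without an intercept) on synthetic data, and the helper names, the penalty scaling, and the candidate alphas are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form minimizer of (1/N) * ||X b - y||^2 + alpha * ||b||^2."""
    n_samples, n_features = X.shape
    gram = X.T @ X + n_samples * alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)


def cv_select_alpha(X, y, alphas, n_folds=5, seed=0):
    """Return the alpha with the lowest mean validation MSE across folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_errors = []
    for alpha in alphas:
        fold_errors = []
        for k in range(n_folds):
            val_idx = folds[k]
            train_idx = np.concatenate(
                [folds[j] for j in range(n_folds) if j != k]
            )
            beta = ridge_fit(X[train_idx], y[train_idx], alpha)
            fold_errors.append(np.mean((X[val_idx] @ beta - y[val_idx]) ** 2))
        mean_errors.append(float(np.mean(fold_errors)))
    best = int(np.argmin(mean_errors))
    return alphas[best], mean_errors


# Synthetic data (made up): a linear signal plus small noise.
data_rng = np.random.default_rng(42)
X_syn = data_rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + 0.1 * data_rng.normal(size=200)

candidate_alphas = [1e-4, 1e-2, 1.0, 100.0]
best_alpha, cv_errors = cv_select_alpha(X_syn, y_syn, candidate_alphas)
print(best_alpha)
```

Because the synthetic signal is strong relative to the noise, heavy regularization hurts the validation error here, so a small alpha is selected. That is in the same spirit as the small alpha chosen above.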
Congratulations! You have finished our getting started tutorial. If you wish to learn more, please see our other tutorials for more advanced topics like Poisson, Gamma, and Tweedie regression, high-dimensional fixed effects, and spatial smoothing using Tikhonov regularization.