{ "cells": [ { "cell_type": "markdown", "id": "0cbb38dc", "metadata": { "tags": [] }, "source": [ "# Getting Started: fitting a Lasso model\n", "\n", "The purpose of this tutorial is to show the basics of `glum`. It assumes a working knowledge of Python, regularized linear models, and machine learning. The API closely follows scikit-learn's, since `glum` is based on a fork of scikit-learn.\n", "\n", "If you have not done so already, please refer to our [installation instructions](../install.rst) for installing `glum`." ] }, { "cell_type": "code", "execution_count": 1, "id": "0b0a7790", "metadata": { "execution": { "iopub.execute_input": "2026-04-21T09:13:13.935556Z", "iopub.status.busy": "2026-04-21T09:13:13.935348Z", "iopub.status.idle": "2026-04-21T09:13:14.965229Z", "shell.execute_reply": "2026-04-21T09:13:14.964779Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import sklearn\n", "from sklearn.datasets import fetch_openml\n", "from glum import GeneralizedLinearRegressor, GeneralizedLinearRegressorCV" ] }, { "cell_type": "markdown", "id": "3f664566", "metadata": {}, "source": [ "## Data\n", "\n", "We start by loading the King County housing dataset from OpenML and splitting it into training and test sets. For simplicity, we skip data exploration and cleaning."
] }, { "cell_type": "code", "execution_count": 2, "id": "896a2486", "metadata": { "execution": { "iopub.execute_input": "2026-04-21T09:13:14.966705Z", "iopub.status.busy": "2026-04-21T09:13:14.966588Z", "iopub.status.idle": "2026-04-21T09:13:14.995434Z", "shell.execute_reply": "2026-04-21T09:13:14.995074Z" } }, "outputs": [], "source": [ "house_data = fetch_openml(name=\"house_sales\", version=3, as_frame=True)\n", "\n", "# Use only selected features\n", "X = house_data.data[\n", "    [\n", "        \"bedrooms\",\n", "        \"bathrooms\",\n", "        \"sqft_living\",\n", "        \"floors\",\n", "        \"waterfront\",\n", "        \"view\",\n", "        \"condition\",\n", "        \"grade\",\n", "        \"yr_built\",\n", "    ]\n", "].copy()\n", "\n", "# Targets\n", "y = house_data.target" ] }, { "cell_type": "code", "execution_count": 3, "id": "a65eff50", "metadata": { "execution": { "iopub.execute_input": "2026-04-21T09:13:14.996642Z", "iopub.status.busy": "2026-04-21T09:13:14.996574Z", "iopub.status.idle": "2026-04-21T09:13:15.090998Z", "shell.execute_reply": "2026-04-21T09:13:15.090512Z" } }, "outputs": [], "source": [ "# Import the submodule explicitly; a bare `import sklearn` does not guarantee it is loaded\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X, y, test_size=0.3, random_state=5\n", ")" ] }, { "cell_type": "markdown", "id": "31f29f1b", "metadata": {}, "source": [ "## GLM basics: fitting and predicting using the normal family\n", "\n", "We'll use `glum.GeneralizedLinearRegressor` to predict house prices from the available predictors.\n", "\n", "We set three key parameters:\n", "\n", "- `family`: the family parameter specifies the distributional assumption of the GLM and, as a consequence, the loss function to be minimized. Accepted strings are 'normal', 'poisson', 'gamma', 'inverse.gaussian', and 'binomial'. You can also pass in an instantiated `glum` distribution (e.g. `glum.TweedieDistribution(1.5)`).\n", "- `alpha`: the constant multiplying the penalty term that determines regularization strength. 
(*Note*: `GeneralizedLinearRegressor` also has an alpha-search option; see the `GeneralizedLinearRegressorCV` example below for details on how alpha search works.)\n", "- `l1_ratio`: the elastic net mixing parameter (`0 <= l1_ratio <= 1`). For `l1_ratio = 0`, the penalty is the L2 penalty (ridge). For `l1_ratio = 1`, it is an L1 penalty (lasso). For `0 < l1_ratio < 1`, the penalty is a combination of L1 and L2.\n", "\n", "To be precise, with the normal family and `l1_ratio = 1`, we minimize the following objective with respect to the parameters $\\beta$:\n", "\n", "\\begin{equation}\n", "\\frac{1}{N}\\|\\mathbf{X}\\beta - y\\|_2^2 + \\alpha\\|\\beta\\|_1\n", "\\end{equation}" ] }, { "cell_type": "code", "execution_count": 4, "id": "aa90b816", "metadata": { "execution": { "iopub.execute_input": "2026-04-21T09:13:15.092186Z", "iopub.status.busy": "2026-04-21T09:13:15.092119Z", "iopub.status.idle": "2026-04-21T09:13:15.093713Z", "shell.execute_reply": "2026-04-21T09:13:15.093439Z" } }, "outputs": [], "source": [ "glm = GeneralizedLinearRegressor(family=\"normal\", alpha=0.1, l1_ratio=1)" ] }, { "cell_type": "markdown", "id": "b4dee7fb", "metadata": {}, "source": [ "The `GeneralizedLinearRegressor.fit()` method follows the usual scikit-learn API style and accepts two primary inputs:\n", "\n", "1. `X`: the design matrix with shape `(n_samples, n_features)`.\n", "2. `y`: the array of target values, of length `n_samples`." ] }, { "cell_type": "code", "execution_count": 5, "id": "ae60a126", "metadata": { "execution": { "iopub.execute_input": "2026-04-21T09:13:15.094724Z", "iopub.status.busy": "2026-04-21T09:13:15.094667Z", "iopub.status.idle": "2026-04-21T09:13:15.118489Z", "shell.execute_reply": "2026-04-21T09:13:15.118147Z" } }, "outputs": [ { "data": { "text/html": [ "
GeneralizedLinearRegressor(alpha=0.1, l1_ratio=1)
" ], "text/plain": [ "GeneralizedLinearRegressor(alpha=0.1, l1_ratio=1)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glm.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "id": "59ef7916", "metadata": {}, "source": [ "Once the model has been estimated, we can retrieve useful information using an sklearn-style syntax." ] }, { "cell_type": "code", "execution_count": 6, "id": "442345f9", "metadata": { "execution": { "iopub.execute_input": "2026-04-21T09:13:15.119626Z", "iopub.status.busy": "2026-04-21T09:13:15.119554Z", "iopub.status.idle": "2026-04-21T09:13:15.123637Z", "shell.execute_reply": "2026-04-21T09:13:15.123306Z" } }, "outputs": [ { "data": { "text/plain": [ "array([ 482648.22861173, 142902.68860096, 539452.61266271,\n", " 569693.78048478, 1042446.90903338])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# retrieve the coefficients and the intercept\n", "coefs = glm.coef_\n", "intercept = glm.intercept_\n", "\n", "# use the model to predict on our test data\n", "preds = glm.predict(X_test)\n", "\n", "preds[0:5]" ] }, { "cell_type": "markdown", "id": "regularization-section", "metadata": {}, "source": [ "## Regularization\n", "\n", "In the example above, the `alpha` and `l1_ratio` parameters specify the level of regularization, i.e. the amount by which fitted model coefficients are biased towards zero.\n", "The advantage of the regularized model is that one avoids overfitting by controlling the tradeoff between the bias and the variance of the coefficient estimator.\n", "An optimal level of regularization can be obtained data-adaptively through cross-validation. In the `GeneralizedLinearRegressorCV` example below, we show how this can be done by specifying an `alpha_search` parameter.\n", "\n", "To fit an unregularized GLM we set `alpha=0`. 
Note that the default level `alpha=None` results in regularization at the level `alpha=1.0`, which is also the default in scikit-learn's [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html).\n", "\n", "A basic unregularized GLM object is obtained as\n", "```python\n", "glm = GeneralizedLinearRegressor(family=\"normal\", alpha=0)\n", "```\n", "which we interact with as in the example above." ] }, { "cell_type": "markdown", "id": "baf9343c", "metadata": {}, "source": [ "## Fitting a GLM with cross validation\n", "\n", "Next, we fit a model with automatic cross-validation using `glum.GeneralizedLinearRegressorCV`. This mirrors the commonly used `cv.glmnet` function from R's `glmnet` package.\n", "\n", "Some important parameters:\n", "\n", "- `alphas`: for `GeneralizedLinearRegressorCV`, the best `alpha` will be found by searching along the regularization path. The regularization path is determined as follows:\n", "    1. If `alphas` is an iterable, use it directly. All other parameters\n", "       governing the regularization path are ignored.\n", "    2. If `min_alpha` is set, create a path from `min_alpha` to the\n", "       lowest alpha such that all coefficients are zero.\n", "    3. If `min_alpha_ratio` is set, create a path where the ratio of\n", "       `min_alpha / max_alpha = min_alpha_ratio`.\n", "    4. If none of the above parameters are set, use a `min_alpha_ratio`\n", "       of `1e-6`.\n", "- `l1_ratio`: for `GeneralizedLinearRegressorCV`, if you pass `l1_ratio` as an array, the `fit` method will choose the best value of `l1_ratio` and store it as `self.l1_ratio_`."
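, "\n", "\n", "The four path rules above can be sketched in plain NumPy. This is an illustration only, not the code `glum` runs: the helper `sketch_alpha_path`, its `max_alpha` argument (standing in for the lowest alpha at which all coefficients are zero, which `glum` derives from the data), the grid size `n_alphas=100`, and the log-spaced grid are all assumptions made for this example.

```python
import numpy as np


def sketch_alpha_path(max_alpha, alphas=None, min_alpha=None,
                      min_alpha_ratio=None, n_alphas=100):
    # Hypothetical helper for illustration; it is not part of glum.
    # Rule 1: an explicit iterable is used directly, everything else is ignored.
    if alphas is not None:
        return np.asarray(alphas, dtype=float)
    # Rule 2: an explicit min_alpha fixes the lower end of the path.
    if min_alpha is None:
        # Rule 3: otherwise derive it from min_alpha_ratio.
        # Rule 4: fall back to a ratio of 1e-6 when nothing is set.
        ratio = 1e-6 if min_alpha_ratio is None else min_alpha_ratio
        min_alpha = max_alpha * ratio
    # Assumption: a log-spaced grid running from strongest to weakest penalty.
    return np.geomspace(max_alpha, min_alpha, n_alphas)


default_path = sketch_alpha_path(max_alpha=1000.0)
print(default_path[0], default_path[-1])
```

With no path arguments set, the sketch falls through to rule 4, so the smallest alpha on the grid is `1e-6` times the largest. Passing `alphas=[1.0, 0.1]` would instead return exactly those two values, whatever the other arguments are."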
] }, { "cell_type": "code", "execution_count": 7, "id": "6aff410f", "metadata": { "execution": { "iopub.execute_input": "2026-04-21T09:13:15.124669Z", "iopub.status.busy": "2026-04-21T09:13:15.124605Z", "iopub.status.idle": "2026-04-21T09:13:20.255825Z", "shell.execute_reply": "2026-04-21T09:13:20.255429Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chosen alpha: 232.70228992842283\n", "Chosen l1 ratio: 1.0\n" ] } ], "source": [ "glmcv = GeneralizedLinearRegressorCV(\n", " family=\"normal\",\n", " alphas=None, # default\n", " min_alpha=None, # default\n", " min_alpha_ratio=None, # default\n", " l1_ratio=[0, 0.5, 1.0],\n", " fit_intercept=True,\n", " max_iter=150\n", ")\n", "glmcv.fit(X_train, y_train)\n", "print(f\"Chosen alpha: {glmcv.alpha_}\")\n", "print(f\"Chosen l1 ratio: {glmcv.l1_ratio_}\")" ] }, { "cell_type": "markdown", "id": "eb2375f4", "metadata": {}, "source": [ "Congratulations! You have finished our getting started tutorial. If you wish to learn more, please see our other tutorials for more advanced topics like Poisson, Gamma, and Tweedie regression, high dimensional fixed effects, and spatial smoothing using Tikhonov regularization." ] }, { "cell_type": "code", "execution_count": null, "id": "033e34cc-bd94-486c-8301-e85e1eb82969", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "interpreter": { "hash": "f4ccb5736b973816ae00c72a718cb0ac20728fa34095efe3fdd792810ed0340a" }, "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.3" } }, "nbformat": 4, "nbformat_minor": 5 }