# An introduction to the algorithms used in `glum`

Before continuing, please take a look at the sklearn documentation for a high-level intro to generalized linear models.

In addition, please take a look at the tutorials.

For mathematical and algorithmic details, see below:

## What is a GLM?

This package is intended to fit L1 and L2-norm penalized generalized linear models (GLMs). What is a GLM?

A GLM is a linear model (\(\eta = x^\top \beta\)) wrapped in a transformation (link function) and equipped with a response distribution from an exponential family. The choice of link function and response distribution is very flexible. In a GLM, a predictive distribution for the response variable \(Y\) is associated with a vector of observed predictors \(x\). The distribution has the form:

Here \(\beta\) are the parameters; \(\phi\) is a parameter representing dispersion (“variance”); and \(m\), \(T\), \(A\) are characterized by the user-specified model family; and \(g\) is the **link function.** For background on the exponential family, and how common distributions can be represented in the form of the “random component” above, see Wikipedia | Exponential family.

The expected mean of \(Y\) depends on \(x\) by composition of **linear response** \(\eta\) and (inverse) link function, i.e.:

Proving this involves use of moment-generating functions and cumulants.

Defining an offset of \(\gamma\), we can write the linear predictor as \(\eta = X^T \beta + \gamma\).

## Examples of GLMs

### Normal

Many of the “regression” models we commonly work with are GLMs. For example, the normal distribution with a linear link function is a simple GLM: \(y | x \sim N(x^T \beta, \sigma^2)\). If we fit \(\beta\) with maximum likelihood, this is least-squares regression. If we replace the linear link function with an exponential link function, we get a simple multiplicative model that is also a GLM: \(y | x \sim N(e^{x^T \beta}, \sigma^2)\).

### Tweedie

In the insurance context, we usually work with a Tweedie distribution, which is a generalization of Poisson (\(p=1\)) and Gamma (\(p=2\)):

With \(1 < p < 2\), the Tweedie distribution is a compound Poisson-gamma distribution. \(y\) is distributed as if a number \(N\) is drawn from a Poisson distribution, and then \(N\) draws are taken IID from a gamma distribution and added. This distribution might model the total amount of a claim in an insurance context: There are \(N\) incidents, and each incident \(i\) has an amount \(X_i\), and the total amount is the sum of \(X_i\).

## Objective function

To fit a GLM, we minimize the negative log-likelihood (or typically the unit deviance) subject to an elastic net constraint involving a mix of L1 and L2 penalty terms:

## Optimizers/solvers

There are three solvers implemented in `glum`

:

`lbfgs`

- This solver uses the scipy`fmin_l_bfgs_b`

optimizer to minimize L2-penalized GLMs. The L-BFGS solver does not work with L1-penalties. Because L-BFGS does not store the full Hessian, it can be particularly effective for very high dimensional problems with several thousand or more columns. For more details, see the scipy documentation.`irls-cd`

and`irls-ls`

- These solvers are both based on Iteratively Reweighted Least Squares (IRLS). IRLS proceeds by iteratively approximating the objective function with a quadratic, then solving that quadratic for the optimal update. For purely L2-penalized settings, the`irls-ls`

uses a least squares inner solver for each quadratic subproblem. For problems that have any L1-penalty component, the`irls-cd`

uses a coordinate descent inner solver for each quadratic subproblem. The IRLS-LS and IRLS-CD implementations largely follow the algorithm described in`newglmnet`

(see references below).

### Convexity

Our objective function will generally be convex, because the log-likelihoods of members of the exponential family are convex in their “natural parameterizations.” The natural parameterization may not be the most obvious one. Example: The log-likelihood of the normal distribution is convex in one over the variance, which is its natural parameterization. It is not convex in the variance. We can generally assume we are solving convex problems.

An exception is multicollinearity (rank deficiency) in the design matrix, without an L2 component to the penalty. In that case, the problem will be only weakly convex and will have no unique miminum. This is not an arcane consideration, since we frequently generate rank deficiency by constructing multiple sets of one-hot encoded categorical variables. This can make evaluating different optimizers tricky, since they could converge to different equally good optima. The original `glmnet`

paper
suggests using at least a small L2 regularization component to remove “degeneracies and wild behavior.”

### IRLS

We minimize the log likelihood using Iteratively Reweighted Least Squares (IRLS). IRLS can be justified by seeing it as taking a Newton step, but replacing the Hessian with the expected Hessian.

In the `irls-cd`

and `irls-ls`

solvers, the outer loop is an IRLS iteration that forms a quadratic approximation to the negative loglikelihood. That is, we find `w`

and `z`

so that the problem can be expressed as:

```
min sum_i w_i (z_i - x_i beta)^2 + penalty
```

We exit when either the gradient is small (`gradient_tol`

) or the step size is small (`step_size_tol`

). Both of these tolerances are user configurable.

Once we have formed this quadratic approximation, an “inner solver” finds the minimum of the quadratic. In the `irls-ls`

solver, the inner solver is simply a direct least squares solve.

See the `glmintro`

reference for an excellent discussion of IRLS in the context of GLMs.

### Coordinate Descent

With an L1 penalty, we use the `irls-cd`

solver where the inner solver finds the minimum of the quadratic using coordinate descent. We exit the inner loop when the quadratic problem’s gradient is small. The classic reference here is the `glmnet`

paper.

However, coordinate descent is older than the glmnet paper, and is a simple idea. In a problem with data \(y\) and \(x\) and parameters “params”, Coordinate Descent involves repeatedly optimizing one or several parameters while holding the rest fixed. Interestingly, Wikipedia claims that coordinate descent is often overlooked among researchers because it is simple to implement, and they would rather work on something more interesting.

```
[1]:
```

```
"""
Cyclical coordinate descent with Newton-like steps and a
twice-differentiable objective function.
"""
def coordinate_descent_inner(y, x, params, grad_func, hess_func):
for i, elt in enumerate(params):
step = -grad_func(y, x, params) / hess_func(y, x, params)
params[i] += step
return params
def coordinate_descent(y, x, params, grad_func, hess_func,
convergence_func):
while not convergence_func():
params = coordinate_descent_inner(y, x, params, grad_func,
hess_func)
return params
```

In practice, the implementation is a bit more complicated when the objective function is not differentiable. See also the `coordinate_descent`

reference below.

### Active set tracking

When penalizing with an L1-norm, it is common for many coefficients to be exactly zero. And, it is possible to predict during a given iteration which of those coefficients will stay zero. As a result, we track the “active set” consisting of all the coefficients that are either currently non-zero or likely to remain non-zero. We follow the outer loop active set tracking algorithm in the `newglmnet`

reference. That paper refers to the same concept as “shrinkage”, whereas the `glmnet`

reference
calls this the “active set”. Currently, we have not yet implemented the inner loop active set tracking from the `newglmnet`

reference.

## Matrix Types

Along with the GLM solvers, this package supports dense, sparse, categorical matrix types and mixtures of these types. Using the most efficient matrix representations massively improves performances.

For more details, see the README for tabmat

We support dense matrices via standard numpy arrays.

We support sparse CSR and CSC matrices via standard

`scipy.sparse`

objects. However, we have extended these operations with custom matrix-vector and sandwich product routines that are optimized and parallelized. A user does not need to modify their code to take advantage of this optimization. If a`scipy.sparse.csc_matrix`

object is passed in, it will be automatically converted to a`SparseMatrix`

object. This operation is almost free because no data needs to be copied.We implement a

`CategoricalMatrix`

object that efficiently represents these matrices without nearly as much overhead as a normal CSC or CSR sparse matrix.Finally,

`SplitMatrix`

allows mixing different matrix types for different columns to minimize overhead.

## Standardization

Internal to our solvers, all matrix types are wrapped in a `tabmat.StandardizedMatrix`

which offsets columns to have mean zero and standard deviation one without modifying the matrix data itself. This avoids situations where modifying a matrix to have mean zero would result in losing the sparsity structure. It also avoids ever needing to copy or modify the input data matrix. As a result, excess memory usage is very low in `glum`

.

## References

`glmnet`

- Regularization Paths for Generalized Linear Models via Coordinate Descent

`newglmnet`

- An Improved GLMNET for L1-regularized LogisticRegression

`glmintro`

- Bryan Lewis on GLMs

`coordinate_descent`

- Coordinate Descent Algorithms

`glmbook`

- Generalized Linear Models, McCullagh and Nelder

```
[ ]:
```

```
```