The Cox Proportional Hazards Model in Glum
Intro
This tutorial shows how the Cox proportional hazards model (from now on: Cox model) which cannot be represented as an Exponential Dispersion Model (EDM), can still be estimated by a simple data transformation followed by a standard Poisson regression in glum (from now on: Poisson approach). The Poisson approach requires estimating the coefficients of a high-dimensional categorical of time-fixed-effects, which can be done efficiently in glum, leveraging the capabilities of tabmat’s
CategoricalMatrix. The exposition of the Poisson approach here is based on [1], but the equivalence has been described before [2].
Table of Contents
1. Equivalence Between the Cox Likelihood and a Profile Poisson Likelihood
In the Cox model, the rate of event occurrence, \(\lambda(t,x_i)\), factorizes nicely into a linear predictor \(\eta_i=\sum_k \beta_k x_{ik}\) that depends on individual \(i\)’s characteristics but not on time \(t\), and a baseline hazard \(\lambda_0\) that depends only on time: \(\lambda(t,x_i)=\lambda_0(t)\exp(\eta_i)\). This is known as the proportional hazards assumption). The partial log-likelihood of \(\eta_i\) is
where \(\mathcal{R}_t\) is the set of individuals observed at event time \(t\) and \(y_{i,t}\) is one if the individual has an event at \(t\) and zero otherwise.1 This partial log-likelihood cannot be represented as the log-likelihood of an EDM.2 Now consider an alternative Poisson regression with \(y_{i,t}\) as an outcome. Apart from a constant, the log likelihood is
Using the proportional hazards assumption and letting \(\alpha_t = \log(\lambda_0(t))\), this becomes
Solving the first order condition with respect to \(\alpha_t\) yields \(\exp(\hat{\alpha}_t) = \left(\sum_{i \in \mathcal{R}_t} \exp(\eta_i)\right)^{-1}\). This can be plugged back into the log likelihood to yield, after some simplifications,
which is the same as the partial likelihood in the Cox model, apart from the -1 which drops out when taking derivatives. In short, the Cox partial log likelihood is equivalent to a Poisson log likelihood with the estimate for time period effects fed back in (“profiled out”). This means that, to estimate the parameters of the Cox model, one can simply run a Poisson regression with time fixed effects \(\alpha_t\). The data structures for the two objectives are different: the Cox partial log-likelihood operates on data with one row per observed individual, while the Poisson log-likelihood uses one row per individual and time period.
2. Estimating a Cox Model in Glum
We now show that a Poisson approach in glum yields the same parameter estimates as a Cox model. For the latter, we use the lifelines library. We also take the dataset from lifelines, which is from an RCT on recidivism for 432 convicts released from Maryland state prisons with first arrest after release as event. We first load imports and the dataset. The dataset has one row per convict, with two outcome columns: the week until which the
observation lasts and arrest, which indicates whether an arrest event happened or not (censoring).
[1]:
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
import numpy as np
import pandas as pd
import glum
df = load_rossi()
df.head()
[1]:
| week | arrest | fin | age | race | wexp | mar | paro | prio | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 1 | 0 | 27 | 1 | 0 | 0 | 1 | 3 |
| 1 | 17 | 1 | 0 | 18 | 1 | 0 | 0 | 1 | 8 |
| 2 | 25 | 1 | 0 | 19 | 0 | 1 | 0 | 1 | 13 |
| 3 | 52 | 0 | 1 | 23 | 1 | 1 | 1 | 1 | 1 |
| 4 | 52 | 0 | 0 | 19 | 0 | 1 | 0 | 1 | 3 |
Next, we estimate the Cox model with an indicator for financial aid (the treatment), a B-spline in the age at time of release, indicators for race, prior experience, marital status, and parole, and a B-spline in the number of prior convictions:
[2]:
cph = CoxPHFitter()
cph.fit(
df,
duration_col="week",
event_col="arrest",
formula="fin + bs(age, df=4) + race + wexp + mar + paro + bs(prio, df=3)",
)
cph.print_summary()
| model | lifelines.CoxPHFitter |
|---|---|
| duration col | 'week' |
| event col | 'arrest' |
| baseline estimation | breslow |
| number of observations | 432 |
| number of events observed | 114 |
| partial log-likelihood | -656.25 |
| time fit was run | 2024-11-01 19:22:33 UTC |
| coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | cmp to | z | p | -log2(p) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fin | -0.35 | 0.70 | 0.19 | -0.73 | 0.03 | 0.48 | 1.03 | 0.00 | -1.82 | 0.07 | 3.85 |
| bs(age, df=4)[1] | -0.49 | 0.61 | 0.64 | -1.74 | 0.76 | 0.18 | 2.13 | 0.00 | -0.77 | 0.44 | 1.18 |
| bs(age, df=4)[2] | -1.81 | 0.16 | 0.84 | -3.46 | -0.16 | 0.03 | 0.85 | 0.00 | -2.15 | 0.03 | 4.99 |
| bs(age, df=4)[3] | -0.91 | 0.40 | 1.41 | -3.67 | 1.86 | 0.03 | 6.40 | 0.00 | -0.64 | 0.52 | 0.94 |
| bs(age, df=4)[4] | -1.76 | 0.17 | 1.10 | -3.92 | 0.41 | 0.02 | 1.50 | 0.00 | -1.59 | 0.11 | 3.17 |
| race | 0.36 | 1.43 | 0.31 | -0.25 | 0.96 | 0.78 | 2.62 | 0.00 | 1.15 | 0.25 | 2.00 |
| wexp | -0.09 | 0.91 | 0.22 | -0.52 | 0.33 | 0.60 | 1.39 | 0.00 | -0.43 | 0.67 | 0.59 |
| mar | -0.33 | 0.72 | 0.39 | -1.09 | 0.42 | 0.34 | 1.53 | 0.00 | -0.87 | 0.39 | 1.37 |
| paro | -0.14 | 0.87 | 0.20 | -0.53 | 0.25 | 0.59 | 1.28 | 0.00 | -0.70 | 0.48 | 1.05 |
| bs(prio, df=3)[1] | 1.36 | 3.91 | 0.96 | -0.53 | 3.25 | 0.59 | 25.87 | 0.00 | 1.42 | 0.16 | 2.67 |
| bs(prio, df=3)[2] | -0.24 | 0.78 | 1.05 | -2.30 | 1.81 | 0.10 | 6.11 | 0.00 | -0.23 | 0.82 | 0.29 |
| bs(prio, df=3)[3] | 2.74 | 15.47 | 0.81 | 1.16 | 4.32 | 3.19 | 74.97 | 0.00 | 3.40 | <0.005 | 10.54 |
| Concordance | 0.65 |
|---|---|
| Partial AIC | 1336.50 |
| log-likelihood ratio test | 38.26 on 12 df |
| -log2(p) of ll-ratio test | 12.81 |
The results imply that financial aid leads to an about 30% reduction in the hazard of being arrested again, but the effect is just barely significant. We can now go about replicating these results in glum. For that, we first need to define a function that creates a dataset with one row per convict and period:
[3]:
def survival_split(df, time, outcome):
"""Split survival data into one row per observation and period. Inspired by `SurvSplit` in R."""
# table with unique event or censoring times
df_times = df[[time]].drop_duplicates().sort_values(time)
# create table with one row per time and left row
df["temp"] = 1
# optional: add id
df_times["temp"] = 1
df_ = pd.merge(df, df_times, on="temp", how="left", suffixes=["_end", ""]).drop(
columns="temp"
)
df = df.drop(columns="temp")
# remove rows after censoring or end time
df_ = df_.loc[df_[time + "_end"] >= df_[time]].reset_index(drop=True)
# add outcome
df_[outcome] = np.where(
df_[time + "_end"] == df_[time],
df_[outcome],
False if pd.api.types.is_bool_dtype(df_[time]) else 0,
)
return df_.drop(columns=time + "_end")
All that remains is to estimate a Poisson regression on this transformed dataset. Note that the formula here contains the week of the year as a categorical in addition to the regressors of the Cox model:
[4]:
df_split = survival_split(df, "week", "arrest")
model_glum = glum.GeneralizedLinearRegressor(
family="poisson",
formula="arrest ~ fin + bs(age, df=4) + race + wexp + mar + paro + bs(prio, df=3) + C(week)",
fit_intercept=False,
).fit(df_split)
We can check that the Poisson regression yields estimates and standard errors that, for all practical purposes, are the same as those of the Cox regression:
[5]:
cph.summary.reset_index()[["covariate", "coef", "se(coef)"]].merge(
model_glum.coef_table(X=df_split, robust=False)[["coef", "se"]],
left_on="covariate",
right_index=True,
).rename(
columns={
"coef_x": "coef_coxph",
"coef_y": "coef_poisson",
"se(coef)": "se_coxph",
"se": "se_poisson",
}
)
[5]:
| covariate | coef_coxph | se_coxph | coef_poisson | se_poisson | |
|---|---|---|---|---|---|
| 0 | fin | -0.350001 | 0.192611 | -0.349668 | 0.192792 |
| 1 | bs(age, df=4)[1] | -0.490560 | 0.636493 | -0.488086 | 0.637050 |
| 2 | bs(age, df=4)[2] | -1.807838 | 0.840686 | -1.797258 | 0.841228 |
| 3 | bs(age, df=4)[3] | -0.906026 | 1.409066 | -0.905313 | 1.410311 |
| 4 | bs(age, df=4)[4] | -1.756215 | 1.102792 | -1.747288 | 1.103601 |
| 5 | race | 0.356941 | 0.310188 | 0.356586 | 0.310484 |
| 6 | wexp | -0.093144 | 0.215796 | -0.095144 | 0.215862 |
| 7 | mar | -0.334348 | 0.385933 | -0.333904 | 0.386238 |
| 8 | paro | -0.138912 | 0.198157 | -0.138637 | 0.198338 |
| 9 | bs(prio, df=3)[1] | 1.364030 | 0.963899 | 1.355261 | 0.964604 |
| 10 | bs(prio, df=3)[2] | -0.244846 | 1.048481 | -0.234007 | 1.048920 |
| 11 | bs(prio, df=3)[3] | 2.738914 | 0.805202 | 2.714382 | 0.805035 |
3. Speed Considerations
The Poisson model estimates each time fixed effect parameter and, therefore, many more parameters than the Cox model. In our example, there are 61 coefficients in the Poisson model as opposed to 12 in the Cox model:
[6]:
len(cph.summary), len(model_glum.coef_)
[6]:
(12, 61)
One might, therefore, wonder if the Poisson approach is competitive in terms of estimation speed. For the dataset here, the Poisson approach, including the data transformation by survival_split, turns out to be faster than the Cox model. This speedup is aided by tabmat’s optimizations for the high-dimensional week categorical.
[7]:
%timeit cox_model = CoxPHFitter().fit(df, duration_col='week', event_col='arrest', formula="fin + bs(age, df=4) + race + wexp + mar + paro + bs(prio,df=3)")
32.8 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[8]:
%timeit model_glum = glum.GeneralizedLinearRegressor(family="poisson",formula="arrest ~ fin + bs(age, df=4) + race + wexp + mar + paro + bs(prio,df=3) + C(week)").fit(survival_split(df, 'week', 'arrest'))
18 ms ± 73.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Conclusion
This tutorial has shown how the Cox proportional hazards model can be reformulated as a Poisson regression on a transformed dataset, which can be run without specialized survival estimators. The Poisson regression involves estimating a high-dimensional categorical with one category per event time, for which glum’s handling of categoricals is handy. This reformulation also allows for exploration beyond the Cox model, including the option to replace the piecewise constant baseline hazard with a smooth one, and to swap the Poisson objective for another, such as a binomial.
Footnotes
1: The Cox model assumes that at most one individual has an event at any time (“no ties”). Different tie breaking methods available for datasets with duplicate event times. The most popular ones are an exact method, which is expensive to compute, Efron’s method which is reasonably fast and accurate, and the Breslow approximation, which is the fastest. In the case of many ties, an inherently discrete survival model is probably the best option.
2: The log likelihood of an EDM with parameters \(\theta_i\) is
with cumulant function \(b\) and dispersion parameter \(\sigma_i\). The linear predictor is given by \(\eta_i=g((b'(\theta_i))\) for link function \(g\). The Cox partial log-likelihood cannot be broken down into such a sum over individuals \(i\).
References
[1] Carstensen, B. 2023. Who needs the Cox model anyway. December. Available at: http://bendixcarstensen.com/WntCma.pdf.
[2] Whitehead, J., 1980. Fitting Cox’s regression model to survival data using GLIM. Journal of the Royal Statistical Society Series C: Applied Statistics, 29(3), pp.268-275.