{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tikhonov Regularization Tutorial: Seattle-Tacoma Housing Data\n", "\n", "**Intro**\n", "\n", "This tutorial shows how to use variable $L_2$ regularization with glum. The `P2` parameter of the `GeneralizedLinearRegressor` class lets you directly set the matrix $P_2$ in the $L_2$ penalty $w^T P_2 w$. If you pass a 2d array as `P2`, it is used directly; if you pass a 1d array, it is interpreted as the diagonal of $P_2$, and all other entries are assumed to be zero.\n", "\n", "*Note*: Variable $L_1$ regularization is also available by passing an array of length `n_features` to the `P1` parameter.\n", "\n", "\n", "**Background**\n", "\n", "For this tutorial, we will model the selling price of homes in King County, Washington (Seattle-Tacoma metro area) between May 2014 and May 2015. However, in order to demonstrate a Tikhonov regularization-based spatial smoothing technique, we will train on a small, skewed data sample from that region. Specifically, we will show that when we have (a) a fixed effect for each postal code region and (b) only a small number of training observations in a given region, we can improve the predictive power of our model by regularizing the difference between the coefficients of neighboring regions. While we are constructing a somewhat artificial example here in order to demonstrate the spatial smoothing technique, we have found similar techniques to be applicable to real-world problems.\n", "\n", "We will use a gamma distribution for our model. This choice is motivated by two main factors. First, our target variable, home price, is a positive real number, which matches the support of the gamma distribution. 
Second, we expect the factors influencing housing prices to be multiplicative rather than additive, which is better captured by a gamma regression than by, say, OLS.\n", "\n", "\n", "*Note*: a few parts of this tutorial use local helper functions defined outside this notebook. If you wish to run the notebook on your own, you can find the rest of the code [here](https://github.com/Quantco/glum/tree/open-sourcing/docs/tutorials/regularization_housing_data).\n", "\n", "\n", "## Table of Contents\n", "* [1. Load and prepare datasets from Openml](#1.-Load-and-prepare-datasets-from-Openml)\n", "* [2. Visualize Geographic Data with GIS Open Data](#2.-Visualize-Geographic-Data-with-GIS-Open-Data)\n", "* [3. Feature Selection and Transformation](#3.-Feature-Selection-and-Transformation)\n", "* [4. Create P matrix](#4.-Create-P-Matrix)\n", "* [5. Fit Models](#5.-Fit-Models)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "\n", "import geopandas as geopd\n", "import libpysal\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import openml\n", "import pandas as pd\n", "\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from glum import GeneralizedLinearRegressor\n", "\n", "import sys\n", "sys.path.append(\"../\")\n", "from metrics import root_mean_squared_percentage_error\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\", message=\"The weights matrix is not fully connected\")\n", "\n", "import data_prep\n", "import maps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load and prepare datasets from Openml\n", "[back to table of contents](#Table-of-Contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Download and transform\n", "The main dataset is downloaded from OpenML. 
You can find the main page for the dataset [here](https://www.openml.org/d/42092). It is also available through Kaggle [here](https://www.kaggle.com/harlfoxem/housesalesprediction).\n", "\n", "As part of data preparation, we also apply some transformations to the data:\n", "\n", "- We remove some outliers (homes priced above 1.5 million or below 100,000).\n", "- Since we want to focus on geographic features, we also remove a handful of the other features.\n", "\n", "Below, you can see some example rows from the dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | bedrooms | bathrooms | sqft_living | floors | waterfront | view | condition | sqft_basement | yr_built | zipcode | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1.00 | 1180 | 1.0 | 0 | 0 | 3 | 0 | 1955 | 98178 | 221900.0 |
| 1 | 3 | 2.25 | 2570 | 2.0 | 0 | 0 | 3 | 400 | 1951 | 98125 | 538000.0 |
| 2 | 2 | 1.00 | 770 | 1.0 | 0 | 0 | 3 | 0 | 1933 | 98028 | 180000.0 |
| 3 | 4 | 3.00 | 1960 | 1.0 | 0 | 0 | 5 | 910 | 1965 | 98136 | 604000.0 |
| 4 | 3 | 2.00 | 1680 | 1.0 | 0 | 0 | 3 | 0 | 1987 | 98074 | 510000.0 |

| | OBJECTID | ZIP | ZIPCODE | COUNTY | SHAPE_Leng | SHAPE_Area | geometry |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 98031 | 98031 | 033 | 117508.211718 | 2.280129e+08 | POLYGON ((-122.2184228967409 47.4375036485968,... |
| 1 | 2 | 98032 | 98032 | 033 | 166737.664791 | 4.826754e+08 | (POLYGON ((-122.2418694980486 47.4412158004961... |
| 2 | 3 | 98033 | 98033 | 033 | 101363.840369 | 2.566747e+08 | POLYGON ((-122.2057111926017 47.65169738162997... |
| 3 | 4 | 98034 | 98034 | 033 | 98550.452509 | 2.725072e+08 | POLYGON ((-122.1755100327681 47.73706057280546... |
| 4 | 5 | 98030 | 98030 | 033 | 94351.264837 | 2.000954e+08 | POLYGON ((-122.1674637459728 47.38548925033355... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199 | 200 | 98402 | 98402 | 053 | 30734.178112 | 2.612224e+07 | POLYGON ((-122.4427945843513 47.2647926142345,... |
| 200 | 201 | 98403 | 98403 | 053 | 23495.038425 | 2.890938e+07 | POLYGON ((-122.4438167281511 47.26617469660845... |
| 201 | 202 | 98404 | 98404 | 053 | 61572.154365 | 2.160645e+08 | POLYGON ((-122.3889999141967 47.23495303304902... |
| 202 | 203 | 98405 | 98405 | 053 | 50261.100559 | 1.193118e+08 | POLYGON ((-122.4409198889526 47.23639133730699... |
| 203 | 204 | 98406 | 98406 | 053 | 74118.972418 | 1.088373e+08 | (POLYGON ((-122.5212509005256 47.2712095490982... |

204 rows × 7 columns
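
The background section calls for a fixed effect per postal code region, which in practice means one indicator column per zip code. The following is a minimal sketch of that encoding with `pandas.get_dummies`. The five `zipcode` and `price` values are copied from the sample rows of the housing data shown earlier; the variable names are illustrative assumptions and are not taken from the tutorial's helper code.

```python
import pandas as pd

# Five sample rows (zipcode and price only) copied from the housing data preview.
sample = pd.DataFrame(
    {
        "zipcode": [98178, 98125, 98028, 98136, 98074],
        "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    }
)

# One 0/1 indicator column per distinct zip code.
zip_dummies = pd.get_dummies(sample["zipcode"], dtype=int)

# Put the indicator columns in front of the remaining features.
encoded = pd.concat([zip_dummies, sample.drop(columns="zipcode")], axis=1)
print(encoded)
```

Each row has exactly one indicator set to 1, so the coefficient on a zip code column acts as that region's fixed effect.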

| | 98001 | 98002 | 98003 | 98004 | 98005 | 98006 | 98007 | 98008 | 98010 | 98011 | ... | bedrooms | bathrooms | sqft_living | floors | waterfront | view | condition | sqft_basement | yr_built | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 1.00 | 1180 | 1.0 | 0 | 0 | 3 | 0 | 1955 | 221900.0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 2.25 | 2570 | 2.0 | 0 | 0 | 3 | 400 | 1951 | 538000.0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 1.00 | 770 | 1.0 | 0 | 0 | 3 | 0 | 1933 | 180000.0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 3.00 | 1960 | 1.0 | 0 | 0 | 5 | 910 | 1965 | 604000.0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 2.00 | 1680 | 1.0 | 0 | 0 | 3 | 0 | 1987 | 510000.0 |

5 rows × 80 columns
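
With one coefficient per postal code, the idea of regularizing the difference between the coefficients of neighboring regions can be written as a quadratic form $w^T P_2 w$. The sketch below is a toy illustration in NumPy only, under stated assumptions: the four-region adjacency list and the helper name `difference_penalty` are hypothetical and are not part of glum or of this tutorial's helper modules. It builds a matrix whose quadratic form equals the sum of squared differences over neighboring pairs and checks that identity numerically.

```python
import numpy as np

# Hypothetical toy setup: four regions arranged in a line, 0-1-2-3,
# each with a single fixed-effect coefficient.
neighbors = [(0, 1), (1, 2), (2, 3)]
n_regions = 4


def difference_penalty(pairs, n):
    """Return P such that w @ P @ w == sum of (w[i] - w[j]) ** 2 over pairs."""
    P = np.zeros((n, n))
    for i, j in pairs:
        # (w_i - w_j)^2 = w_i^2 + w_j^2 - 2 * w_i * w_j
        P[i, i] += 1.0
        P[j, j] += 1.0
        P[i, j] -= 1.0
        P[j, i] -= 1.0
    return P


P2 = difference_penalty(neighbors, n_regions)

# Arbitrary coefficients, used only to check the identity.
w = np.array([0.5, 0.7, 0.1, 0.1])
quadratic_form = w @ P2 @ w
sum_of_squares = sum((w[i] - w[j]) ** 2 for i, j in neighbors)
print(np.isclose(quadratic_form, sum_of_squares))
```

As described in the intro, a 2d array like this one is the kind of input the `P2` parameter uses directly, whereas a 1d array would only fill the diagonal and so could not express differences between regions.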