---
title: "Texas Housing Sales"
output:
html_document:
params:
spotlight: "Houston"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, results='asis')
# this is an R option that controls the number of digits printed
options(digits=4)
# for dataset
library(dplyr)
# for graphing functions
library(ggplot2)
# for pretty regression tables
library(broom)
# graphical parameters
theme_set(theme_bw())
# Using 2 colors from ColorBrewer scle RdBu
palette <- c("#ef8a62", "#67a9cf")
```
# Background
## Purpose
This is a study of housing sales in Texas cities, with a spotlight on the city of `r params$spotlight`. We will have graphs of time trends and run a linear regression model.
## Quick look at the dataset
Dataset from `ggplot2` package
**Variables:**
* **city:** Name of MLS area
* **year, month, date:** year and month, and date expressed as continuous year variable
* **sales:** Number of sales
* **volume:** Total value of sales
* **median:** Median sale price
* **listings:** Total active listings
* **inventory:** "Months inventory", amount of time it would take to sell all current listings at current pace of sales
Here are the first 20 rows of the spotlight city:
```{r dataset}
txhousing <- subset(txhousing, txhousing$year<=2001)
spotcity <- subset(txhousing, txhousing$city==params$spotlight)
knitr::kable(head(spotcity, n=20), align='c')
txhousing$month <- factor(txhousing$month, labels=c("Jan", "Feb", "Mar",
"Apr", "May", "Jun",
"Jul", "Aug", "Sep",
"Oct", "Nov", "Dec"))
```
# Plots
## Time series of Sales By City, Spotlight on `r params$spotlight`
Sales appear to be rising through 2006, then slow until 2010, and then rise again.
```{r, results='hold'}
ggplot(txhousing, aes(x=date, y=sales, group=city)) +
geom_line(color=palette[2], alpha=.5) +
geom_line(data=spotcity, color=palette[1])
```
## Seasonal trends
We detect seasonal trends in the time series plot, so let's aggregate across months (by city), to examine sales by month.
```{r seasonal_data, include=FALSE}
# aggregate data by city and month
by_month <- group_by(txhousing, city, month) %>%
summarise(sales=mean(sales), volume=mean(volume),
median=mean(median), listings=mean(listings),
inventory=mean(inventory))
```
```{r seasonal-plots}
ggplot(by_month, aes(x=month, y=sales, group=city)) +
geom_line(color=palette[2], alpha=.5) +
geom_line(data=filter(by_month, city==params$spotlight), color=palette[1])
```
# Statistical Model
## Regression models
Sales is very skewed so transforming to log for linear regression
```{r, fig.show='hold', out.width='50%'}
txhousing$logsales <- log(txhousing$sales)
# layout(matrix(c(1,2), nrow=1))
# hist(txhousing$sales)
# hist(txhousing$logsales)
# layout(1)
ggplot(txhousing, aes(x=sales)) + geom_histogram(fill=palette[2])
ggplot(txhousing, aes(x=logsales)) + geom_histogram(fill=palette[2])
```
## Model
Model is $log(sales) = city + year + month + median + \epsilon$, where $city$ and $month$ are factors and $year$ and $median$ are continuous predictors.
Here are the coefficients, not including the city effects (too many):
```{r linear model}
m <- lm(logsales ~ city + I(year-2000) + month + median , data=txhousing)
mycoefs <- tidy(m, conf.int=TRUE)
mycoefs$term[mycoefs$term==("I(year - 2000)")] <- "year"
coefs_small <- filter(mycoefs, !grepl("city", term))
knitr::kable(coefs_small)
```
## Diagnostic plots
### Residuals vs. fitted
Linearity assumption seems to be ok, but maybe not homoskedasticity assumption.
```{r data and rvf}
# use broom::augment() to make a dataset of regression diagnostics
diagnostics <- augment(m)
# add the city variable so we can spotlight
diagnostics$city <- txhousing$city[as.numeric(diagnostics$`.rownames`)]
# residuals vs fitted plot
ggplot(diagnostics, aes(x=.fitted, y=.std.resid)) +
geom_point(color=palette[2], alpha=.5) +
geom_smooth(color="black")+
geom_point(data=subset(diagnostics, diagnostics$city==params$spotlight), color=palette[1])
```
### q-q plot of residuals
Some departures from normality in tails, but we have a very large sample size, so maybe not a big deal.
```{r qq-plot}
ggplot(diagnostics, aes(sample=.std.resid)) +
geom_qq(alpha=.25) +
geom_qq_line()
```
### Influence plot
Residuals vs leverage, sized by Cook's distance.
```{r influence plot}
ggplot(diagnostics, aes(x=.hat, y=.std.resid, size=.cooksd)) +
geom_point(color=palette[2], alpha=.5) +
geom_point(data=subset(diagnostics, diagnostics$city==params$spotlight), color=palette[1])
```