--- title: "Texas Housing Sales" output: html_document: params: spotlight: "Houston" --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, results='asis') # this is an R option that controls the number of digits printed options(digits=4) # for dataset library(dplyr) # for graphing functions library(ggplot2) # for pretty regression tables library(broom) # graphical parameters theme_set(theme_bw()) # Using 2 colors from ColorBrewer scle RdBu palette <- c("#ef8a62", "#67a9cf") ``` # Background ## Purpose This is a study of housing sales in Texas cities, with a spotlight on the city of `r params$spotlight`. We will have graphs of time trends and run a linear regression model. ## Quick look at the dataset Dataset from `ggplot2` package **Variables:** * **city:** Name of MLS area * **year, month, date:** year and month, and date expressed as continuous year variable * **sales:** Number of sales * **volume:** Total value of sales * **median:** Median sale price * **listings:** Total active listings * **inventory:** "Months inventory", amount of time it would take to sell all current listings at current pace of sales Here are the first 20 rows of the spotlight city: ```{r dataset} txhousing <- subset(txhousing, txhousing$year<=2001) spotcity <- subset(txhousing, txhousing$city==params$spotlight) knitr::kable(head(spotcity, n=20), align='c') txhousing$month <- factor(txhousing$month, labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) ``` # Plots ## Time series of Sales By City, Spotlight on `r params$spotlight` Sales appear to be rising through 2006, then slow until 2010, and then rise again. ```{r, results='hold'} ggplot(txhousing, aes(x=date, y=sales, group=city)) + geom_line(color=palette[2], alpha=.5) + geom_line(data=spotcity, color=palette[1]) ``` ## Seasonal trends We detect seasonal trends in the time series plot, so let's aggregate across months (by city), to examine sales by month. ```{r seasonal_data, include=FALSE} # aggregate data by city and month by_month <- group_by(txhousing, city, month) %>% summarise(sales=mean(sales), volume=mean(volume), median=mean(median), listings=mean(listings), inventory=mean(inventory)) ``` ```{r seasonal-plots} ggplot(by_month, aes(x=month, y=sales, group=city)) + geom_line(color=palette[2], alpha=.5) + geom_line(data=filter(by_month, city==params$spotlight), color=palette[1]) ``` # Statistical Model ## Regression models Sales is very skewed so transforming to log for linear regression ```{r, fig.show='hold', out.width='50%'} txhousing$logsales <- log(txhousing$sales) # layout(matrix(c(1,2), nrow=1)) # hist(txhousing$sales) # hist(txhousing$logsales) # layout(1) ggplot(txhousing, aes(x=sales)) + geom_histogram(fill=palette[2]) ggplot(txhousing, aes(x=logsales)) + geom_histogram(fill=palette[2]) ``` ## Model Model is $log(sales) = city + year + month + median + \epsilon$, where $city$ and $month$ are factors and $year$ and $median$ are continuous predictors. Here are the coefficients, not including the city effects (too many): ```{r linear model} m <- lm(logsales ~ city + I(year-2000) + month + median , data=txhousing) mycoefs <- tidy(m, conf.int=TRUE) mycoefs$term[mycoefs$term==("I(year - 2000)")] <- "year" coefs_small <- filter(mycoefs, !grepl("city", term)) knitr::kable(coefs_small) ``` ## Diagnostic plots ### Residuals vs. fitted Linearity assumption seems to be ok, but maybe not homoskedasticity assumption. ```{r data and rvf} # use broom::augment() to make a dataset of regression diagnostics diagnostics <- augment(m) # add the city variable so we can spotlight diagnostics$city <- txhousing$city[as.numeric(diagnostics$`.rownames`)] # residuals vs fitted plot ggplot(diagnostics, aes(x=.fitted, y=.std.resid)) + geom_point(color=palette[2], alpha=.5) + geom_smooth(color="black")+ geom_point(data=subset(diagnostics, diagnostics$city==params$spotlight), color=palette[1]) ``` ### q-q plot of residuals Some departures from normality in tails, but we have a very large sample size, so maybe not a big deal. ```{r qq-plot} ggplot(diagnostics, aes(sample=.std.resid)) + geom_qq(alpha=.25) + geom_qq_line() ``` ### Influence plot Residuals vs leverage, sized by Cook's distance. ```{r influence plot} ggplot(diagnostics, aes(x=.hat, y=.std.resid, size=.cooksd)) + geom_point(color=palette[2], alpha=.5) + geom_point(data=subset(diagnostics, diagnostics$city==params$spotlight), color=palette[1]) ```