Introduction to R

September 25, 2017

Background

`R` as a programming environment

R is a programming environment that:

can serve as a data analysis and storage facility
is designed to perform operations on vectors and matrices
uses a well-developed but simple programming language (a dialect of the S language)
allows for rapid development of new tools according to user demand

These tools are distributed as packages, which any user can download to customize the R environment.

Base `R` and packages

Base R and most R packages are available for download from the Comprehensive R Archive Network (CRAN)

cran.r-project.org
base R comes with a number of basic data management, analysis, and graphical tools
However, R's power and flexibility lie in its array of packages (currently more than 11,000 on CRAN!)

Downloading and installing `R`

RStudio

You can work directly in R, but most users prefer a graphical interface. We highly recommend using RStudio, an integrated development environment (IDE) that features:

a console
a powerful code/script editor featuring
- syntax highlighting
- code completion
- smart indentation
special tools for plotting, viewing R objects and code history
workspace management
cheatsheets for R programming
tab-completion for object names and function arguments (enough reason by itself!)

Seminar packages

For the purposes of this seminar, we will be using the following packages frequently:

installr: easy, automatic updating of R and R packages
tidyverse: a collection of packages designed to work with tidy data, that is, data organized in a way that makes later analysis easier. Includes the following packages:
- dplyr various data management tasks
- readxl reading Excel files
- ggplot2 elegant data visualization (graphics) using the Grammar of Graphics
- haven reading data files from other stats packages

Installing packages

To use packages in R, we must first install them using the install.packages() function, which typically downloads the package from CRAN and installs it for use.

install.packages("installr")
install.packages("tidyverse")

Loading packages

After installing a package, we can load it into the R environment using either library() or require(). Both attach the package; they differ mainly in how they respond when the package is not installed.

Functions and data structures within the package will then be available for use.

library(tidyverse)

library(installr)
## Loading required package: stringr
## 
## Welcome to installr version 0.19.0
## 
## More information is available on the installr project website:
## https://github.com/talgalili/installr/
## 
## Contact: <tal.galili@gmail.com>
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/installr/issues
## 
##          To suppress this message use:
##          suppressPackageStartupMessages(library(installr))
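The difference between library() and require() shows up when a package is missing. The sketch below assumes that no package called notARealPackage123 (a made-up name used only for illustration) is installed: require() returns FALSE with a warning, while library() stops with an error.

```r
# "notARealPackage123" is a hypothetical name, assumed not to be installed
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning, suppressed here) instead of stopping
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
loaded
## [1] FALSE

# library() signals an error for a missing package; tryCatch() captures it
lib_result <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "library() raised an error"
)
lib_result
## [1] "library() raised an error"
```

In scripts, library() is usually the safer choice, because a missing package stops the script immediately rather than causing a confusing failure later.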

Updating `R` and its packages

R is updated quite frequently, and newer versions of R are sometimes incompatible with older versions of packages. So, it is important to keep both R and its packages up to date.

The installr package provides the function updateR(), which will automatically search for and then install new versions of R, and can also update all packages to their newest versions.

updateR()

Basic info on `R` session

To get a description of the version of R and its attached packages used in the current session, we can use the sessionInfo() function:

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] installr_0.19.0 stringr_1.2.0   knitr_1.17      dplyr_0.7.4    
##  [5] purrr_0.2.3     readr_1.1.1     tidyr_0.7.1     tibble_1.3.4   
##  [9] ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     cellranger_1.1.0 compiler_3.4.2   plyr_1.8.4      
##  [5] bindr_0.1        forcats_0.2.0    tools_3.4.2      digest_0.6.12   
##  [9] lubridate_1.6.0  jsonlite_1.5     evaluate_0.10.1  nlme_3.1-131    
## [13] gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.2     
## [17] psych_1.7.8      yaml_2.1.14      parallel_3.4.2   haven_1.1.0     
## [21] bindrcpp_0.2     xml2_1.1.1       httr_1.3.1       hms_0.3         
## [25] rprojroot_1.2    grid_3.4.2       glue_1.1.1       R6_2.2.2        
## [29] readxl_1.0.0     foreign_0.8-69   rmarkdown_1.6    modelr_0.1.1    
## [33] reshape2_1.4.2   magrittr_1.5     backports_1.1.1  scales_0.5.0    
## [37] htmltools_0.3.6  rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5    
## [41] colorspace_1.3-2 stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3   
## [45] broom_0.4.2

Working directory

Without further specification, files will be loaded from and saved to the working directory. The functions getwd() and setwd() will get and set the working directory, respectively.

#get current directory (not run)
getwd()

# set new working directory (not run)
setwd("/path/to/directory")
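As a small runnable sketch of how the working directory affects file access, the code below switches temporarily to R's session temporary directory, writes a file by name only, and then switches back. The file name example.txt is an assumption chosen for illustration.

```r
# setwd() invisibly returns the previous working directory, so we can restore it
old_wd <- setwd(tempdir())

# a file named without a path is written to the working directory
writeLines("hello", "example.txt")   # "example.txt" is a hypothetical file name
found <- file.exists(file.path(getwd(), "example.txt"))
found
## [1] TRUE

# clean up and return to the original working directory
unlink("example.txt")
setwd(old_wd)
```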

R programming 1: Coding

R code can be entered into the command line directly or saved to a script, which can be run inside a session using the source() function.

You can run a command directly from a script by placing the cursor inside the command or highlighting the commands and hitting Ctrl-Enter (Command-Enter on Macs). This will advance the cursor to the next command, where you can hit Ctrl-Enter again to run it, advancing the cursor to the next command…

Commands are separated either by a ; or by a newline.

R is case sensitive.

The # character starts a comment: everything from # to the end of the line is not executed.

Commands can extend beyond one line of text. Put operators like + at the end of lines for multi-line commands.

# Using R as a calculator
2 +
  3
## [1] 5
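The other rules listed above can be tried out in the same way. This is a minimal sketch; the object names are arbitrary.

```r
# two commands on one line, separated by ;
x <- 2; y <- 3
x + y
## [1] 5

# R is case sensitive: total and Total are different objects
total <- 10
Total <- 20
total == Total
## [1] FALSE

# a comment can also follow a command on the same line
total + 1  # this part is not executed
## [1] 11
```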

R programming 2: Objects

R stores both data and output from data analysis (as well as everything else) in objects.

Data are assigned to and stored in objects using the <- or = operator.

To print the contents of an object, specify the object's name alone.

A list of all objects in the current session can be obtained with ls().

# assign the number 3 to object called abc
abc <- 3

# print contents
abc
## [1] 3

# list all objects in current session
ls()
## [1] "abc"              "hook_output"      "my_custom_output"

R programming 3: Functions

Functions perform most of the work on data in R.

Functions in R are much the same as they are in math – they perform some operation on an input and return some output. For example, the mathematical function $f(x) = x^2$ takes an input $x$ and returns its square. Similarly, the mean() function in R takes a vector of numbers and returns its mean.

The inputs to functions are often referred to as arguments.

We have already used a few functions, such as install.packages() and library().
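To make the analogy with math concrete, here is a short sketch that writes $f(x) = x^2$ as an R function and calls mean() on a small vector. The function name square is chosen here for illustration and is not part of R.

```r
# the mathematical function f(x) = x^2 written as an R function
# (`square` is a name chosen for this example)
square <- function(x) {
  x^2
}
square(4)
## [1] 16

# mean() takes a vector of numbers as its argument and returns their mean
mean(c(2, 4, 9))
## [1] 5
```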

R programming 4: Help files for functions

Help files for R functions are accessed by preceding the name of the function with ? (e.g. ?seq).

In the help file, we will find a list of Arguments to the function, in a specific order. Values for arguments to functions can be specified either by name or position.

# seq() creates a sequence of numbers 

# specifying arguments by name
seq(from=1, to=5, by=1)
## [1] 1 2 3 4 5

# specifying arguments by position
seq(10, 0, -2)
## [1] 10  8  6  4  2  0
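Two further patterns follow from the same rule and are sketched here: named arguments can appear in any order, and positional and named arguments can be mixed in one call.

```r
# named arguments may be given in any order
seq(to = 5, from = 1, by = 2)
## [1] 1 3 5

# positional arguments first (from, to), then a named one
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
```

Naming arguments beyond the first one or two tends to make calls easier to read.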

R programming 5: An example help file

In the Usage section a value specified after an argument is its default value. Arguments without values have no defaults and usually need to be supplied by the user.

The Value section specifies what is returned. Usually there are Examples at the bottom.
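As an illustration of default values, the sketch below uses two base R functions whose help files show defaults in the Usage section: round(), where digits defaults to 0, and paste(), where sep defaults to a single space.

```r
# round(): x must be supplied, digits has a default of 0
round(3.14159)
## [1] 3
round(3.14159, digits = 2)   # override the default
## [1] 3.14

# paste(): sep defaults to " "
paste("data", "frame")
## [1] "data frame"
paste("data", "frame", sep = ".")
## [1] "data.frame"
```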

R Programming 6: More help

If you aren't sure what function to use, or want to search for a topic, ??keyword searches R documentation for keyword (e.g. ??logistic)

??logistic

Many packages include vignettes – longer, tutorial style guides for a package.

To see a list of available vignettes for the packages that are installed, use vignette() with no arguments. Then to view a vignette, place its name inside vignette():

# list all available vignettes
vignette()

# View the "Introduction to dplyr" vignette
vignette("introduction")

Data Structures

Vectors

Vectors, the fundamental data structure in R, are one-dimensional and homogeneous.

A single variable can usually be represented by one of the following vector data types:

logical: TRUE or FALSE (1 or 0)
integer: integers only (represented by a number followed by L; e.g. 10L is the integer 10)
double: real numbers, also known as numeric
character: strings

A single value is a vector of length one in R.

The c() function combines values of common type together to form a vector.

The typeof() function identifies a vector's type.

The length() function returns its length.

# create a vector
first_vec <- c(1, 3, 5)
first_vec
## [1] 1 3 5

# vector type
typeof(first_vec)
## [1] "double"

# character vector
char_vec <- c("these", "are", "some", "words")
length(char_vec)
## [1] 4

# the result of this comparison is a logical vector
first_vec > c(2, 2, 2)
## [1] FALSE  TRUE  TRUE
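The four types listed above can be checked with typeof(). The sketch also shows what "homogeneous" means in practice: when values of different types are combined with c(), R converts all of them to a single common type.

```r
typeof(TRUE)
## [1] "logical"
typeof(10L)
## [1] "integer"
typeof(10)
## [1] "double"
typeof("ten")
## [1] "character"

# mixing types in c() coerces everything to one type (here, character)
mixed <- c(1, "a", TRUE)
mixed
## [1] "1"    "a"    "TRUE"
typeof(mixed)
## [1] "character"
```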

`rep()` and `seq()` to generate vectors

To create vectors with a predictable sequence of elements, use rep() to generate repetitive elements and seq() to generate sequential elements.

The expression m:n will generate a vector of integers from m to n.

# second argument is number of repetitions
rep(0, times=3)
## [1] 0 0 0
rep("abc", 4)
## [1] "abc" "abc" "abc" "abc"

# from, to, by
seq(from=1, to=5, by=2)
## [1] 1 3 5
seq(10, 0, -5)
## [1] 10  5  0

# colon operator
3:7
## [1] 3 4 5 6 7

# you can nest functions
rep(seq(1,3,1), times=2)
## [1] 1 2 3 1 2 3

# each vs times
rep(seq(1,3,1), each=2)
## [1] 1 1 2 2 3 3

Vector recycling

If we perform an operation on two or more vectors of unequal length, the values of the shorter vector will be recycled until the vectors are of the same length.

# the single value `1` is a vector of length 1
#  it is recycled to be c(1,1,1)
c(1,2,3) + 1
## [1] 2 3 4

# second vector recycled twice to make c(1,2,1,2,1,2)
c(1,2,3,4,5,6) + c(1,2)
## [1] 2 4 4 6 6 8

# The 2 becomes c(2,2,2)
c(1,2,3) < 2
## [1]  TRUE FALSE FALSE

# what is R complaining about here?
c(2,3,4) + c(10, 20)
## Warning in c(2, 3, 4) + c(10, 20): longer object length is not a multiple
## of shorter object length
## [1] 12 23 14

Subsetting vectors with []

Elements of a vector can be accessed or subset by specifying a vector of numbers (of length 1 or greater) inside [].

# create a vector 10 to 1
# putting () around a command will cause the result to be printed
(a <- seq(10,1,-1))
##  [1] 10  9  8  7  6  5  4  3  2  1

# second element
a[2]
## [1] 9

# first 5 elements
a[seq(1,5)]
## [1] 10  9  8  7  6

# first, third, and fourth elements
a[c(1,3,4)]
## [1] 10  8  7

Vector elements can be named, and then subset by name. Make sure to use "" when subsetting by element name.

scores <- c(John=25, Marge=34, Dan=24, Emily=29)
scores[c("John", "Emily")]
##  John Emily 
##    25    29

Conditional selection - subsetting by value

Vector elements can also be subset with a logical (TRUE/FALSE) vector, known as logical subsetting.

scores[c(FALSE, TRUE, TRUE, FALSE)]
## Marge   Dan 
##    34    24

This allows us to subset a vector by checking if a condition is satisfied:

# this returns a logical vector...
scores < 30
##  John Marge   Dan Emily 
##  TRUE FALSE  TRUE  TRUE

# ...that we can now use to subset
scores[scores<30]
##  John   Dan Emily 
##    25    24    29
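Conditions can also be combined, and because TRUE counts as 1, a logical vector can be summed to count matches. A brief sketch using the same scores vector (recreated here so the block runs on its own):

```r
scores <- c(John=25, Marge=34, Dan=24, Emily=29)

# combine conditions with & (and) or | (or)
scores[scores > 24 & scores < 30]
##  John Emily 
##    25    29

# TRUE counts as 1, so sum() counts how many elements satisfy a condition
sum(scores < 30)
## [1] 3
```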

Lists

Like vectors, lists are "one-dimensional" structures, but the elements can be a mixture of types – often vectors (of any length), but also other lists, matrices and data frames (see below).

Lists can be manually generated with list():

# list accepts a mixture of data types
# a list of a numeric vector, an integer vector, and a 
#   character vector
mylist <- list(1.1, c(1L,3L,7L), c("abc", "def"))
mylist
## [[1]]
## [1] 1.1
## 
## [[2]]
## [1] 1 3 7
## 
## [[3]]
## [1] "abc" "def"

List elements can be named as well:

# list elements can be named as well
mary_info <- list(classes=c("Biology", "Math", "Music",
                            "Physics"),
                  friends=c("John", "Dan", "Emily"),
                  SAT=1450)
mary_info
## $classes
## [1] "Biology" "Math"    "Music"   "Physics"
## 
## $friends
## [1] "John"  "Dan"   "Emily"
## 
## $SAT
## [1] 1450

Accessing list elements

As the output from the previous two sections suggests, there are a couple of ways to access list elements (the vectors).

Use [[]] to access by position number and $ to access by name.

If the list element is a vector, we can access individual elements within the vector by subsetting with [].

# by position
mary_info[[2]]
## [1] "John"  "Dan"   "Emily"

# by name
mary_info$SAT
## [1] 1450

# second element of friends vector
mary_info$friends[2]
## [1] "Dan"
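Two related details are sketched below: [[]] also accepts an element name as a string, and single brackets [] return a smaller list rather than the element itself. The list is recreated so the block is self-contained.

```r
mary_info <- list(classes=c("Biology", "Math", "Music", "Physics"),
                  friends=c("John", "Dan", "Emily"),
                  SAT=1450)

# [[ ]] also accepts a name, given as a string
mary_info[["SAT"]]
## [1] 1450

# single [ ] keeps the result wrapped in a list
class(mary_info[3])
## [1] "list"

# [[ ]] extracts the element itself
class(mary_info[[3]])
## [1] "numeric"
```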

Matrices

Matrices are two-dimensional, homogeneous data structures.

Matrices can be generated manually with matrix(). The input to matrix() is a one-dimensional vector, which is reshaped into a two-dimensional matrix according to the dimensions specified by the user in the arguments nrow and ncol (generally only one is needed).

The matrix is filled down the columns by default, but this can be changed by setting the byrow argument to TRUE.

# create a 2x3 matrix, filling down columns
a <- matrix(1:6, nrow=2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# now fill across rows 
b <- matrix(5:14, nrow=2, byrow=TRUE)
b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5    6    7    8    9
## [2,]   10   11   12   13   14
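The following sketch illustrates two points from the text above: supplying ncol alone is enough to fix the shape, and a matrix is homogeneous, so a single character value turns every element into a character string. The object name m is arbitrary.

```r
# ncol alone is enough; nrow is worked out from the length of the input
m <- matrix(1:6, ncol=3)
dim(m)
## [1] 2 3
identical(m, matrix(1:6, nrow=2))
## [1] TRUE

# matrices are homogeneous: one character value makes every element character
typeof(matrix(c(1, 2, "x", 4), nrow=2))
## [1] "character"
```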

Accessing matrix elements

Matrix elements can be accessed with matrix[row,column] notation.

Omitting row requests all rows, and omitting column requests all columns.

# row 2 column 3
a[2,3]
## [1] 6

# all rows column 2
b[,2]
## [1]  6 11

# all columns row 1
a[1,]
## [1] 1 3 5

Data frames

Datasets for statistical analysis are typically stored in data frames in R.

Data frames combine the features of matrices and lists.

Like matrices, data frames are rectangular, where the columns are variables and the rows are observations of those variables.
Like lists, data frames can have elements (column vectors) of different data types (some double, some character, etc.) – but they must be of equal length.

Real datasets usually combine variables of different types, so data frames are well suited for storage.

Creating with `data.frame()`

Data frames can be manually created with data.frame() . The syntax resembles the syntax for list(), except that the elements are vectors of equal length.

The elements of a data frame are almost always named.

# a logical vector and numeric vector of equal length
mydata <- data.frame(diabetic = c(TRUE, FALSE, TRUE, FALSE), 
                     height = c(65, 69, 71, 73))
mydata
##   diabetic height
## 1     TRUE     65
## 2    FALSE     69
## 3     TRUE     71
## 4    FALSE     73
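To see the equal-length requirement in action, the sketch below tries to build a data frame from columns of length 3 and 2. data.frame() stops with an error, which tryCatch() captures here so that the code runs to completion; the column names x and y are arbitrary.

```r
# columns of length 3 and 2 cannot be combined into a data frame
result <- tryCatch(
  data.frame(x = c(1, 2, 3), y = c("a", "b")),
  error = function(e) conditionMessage(e)   # keep the error message as text
)

# result holds the error message rather than a data frame
is.data.frame(result)
## [1] FALSE
result
```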

Subsetting data frames

Because data frames share features of both matrices and lists, they can be subset by the methods for either.

With a two-dimensional structure, data frames can be subset like matrices [rows, columns].

# row 3 column 2
mydata[3,2]
## [1] 71

# using column name
mydata[1:2, "height"]
## [1] 65 69

# all rows of column "diabetic"
mydata[,"diabetic"]
## [1]  TRUE FALSE  TRUE FALSE

We can subset data frames like lists as well. The columns are considered the list elements, so we can use either [[]] or $ to extract columns.

Extracted columns are vectors.

We will generally use $ throughout the seminar to subset data frame columns, because we often perform operations on column variables.

# subsetting creates a numeric vector
mydata$height[2:3]
## [1] 69 71

# this is a numeric vector
mydata[["height"]]
## [1] 65 69 71 73
mydata[["height"]][2]
## [1] 69
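Logical subsetting, introduced earlier for vectors, works on data frames too. As a sketch, a condition on one column can select rows (note the comma, which leaves the column position empty), or select elements of another column. The data frame is recreated so the block runs on its own.

```r
mydata <- data.frame(diabetic = c(TRUE, FALSE, TRUE, FALSE),
                     height = c(65, 69, 71, 73))

# rows where height is greater than 68, all columns
mydata[mydata$height > 68, ]
##   diabetic height
## 2    FALSE     69
## 3     TRUE     71
## 4    FALSE     73

# heights of the diabetic rows only
mydata$height[mydata$diabetic]
## [1] 65 71
```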

Naming data frame columns

colnames(*data frame*) returns the column names of a data frame (or matrix).

colnames(*data frame*) <- c("some", "names") assigns column names to data frame.

# get column names
colnames(mydata)
## [1] "diabetic" "height"

# assign column names
colnames(mydata) <- c("Diabetic", "Height")
colnames(mydata)
## [1] "Diabetic" "Height"

# to change one variable name, just use indexing
colnames(mydata)[1] <- "Diabetes"
colnames(mydata)
## [1] "Diabetes" "Height"

Examining the structure of an object

Use dim() on two-dimensional objects to get the number of rows and columns.

Use str() to see the structure of the object, including its class (discussed later) and the data types of elements.

# number of rows and columns
dim(mydata)
## [1] 4 2

# mydata is of class "data.frame"
# its variables are of type logical (logi) and numeric (num)
str(mydata)
## 'data.frame':    4 obs. of  2 variables:
##  $ Diabetes: logi  TRUE FALSE TRUE FALSE
##  $ Height  : num  65 69 71 73

R classes

R objects belong to classes. Objects can belong to more than one class.

Many functions only accept objects of a specific class, so it is important to know the classes of our objects.

The class() function lists all classes to which the object belongs. If class() returns a basic data type (e.g. "numeric", "character", "integer"), the object has an implicit class of vector (or matrix for 2-d objects).

Data frames are a class as well.

# mydata is of class data.frame
class(mydata)
## [1] "data.frame"

# Height is a numeric vector
class(mydata$Height)
## [1] "numeric"

# colMeans(), for means of columns, wants input of class data.frame or matrix
colMeans(mydata)
## Diabetes   Height 
##      0.5     69.5

# vector input to colMeans() produces an error
colMeans(mydata$Height)
## Error in colMeans(mydata$Height): 'x' must be an array of at least two dimensions
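As an example of an object with more than one class, the sketch below uses the date-time value returned by Sys.time(). The base R function inherits() tests whether an object belongs to a given class.

```r
now <- Sys.time()

# a date-time object belongs to two classes
class(now)
## [1] "POSIXct" "POSIXt" 

# inherits() checks for membership in a class
inherits(now, "POSIXt")
## [1] TRUE
inherits(now, "data.frame")
## [1] FALSE
```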

Generic functions

Generic functions match object classes to the appropriate function.

Generic functions remove the need for the user to remember the classes of objects that functions support. The function's help file will usually tell you if a function is generic.

Generic functions accept objects from multiple classes. They then pass the object to a specific function (called a method) designed for the object's class.

The various functions for specific classes can have widely diverging purposes.

For example, summary() is a generic function. When a data frame is passed to summary(), the data frame is then passed to a specific function (method) called summary.data.frame(), which provides a numeric summary of all variables in the data frame.

#summary() calls summary.data.frame() if given a data.frame input
summary(mydata)
##   Diabetes           Height    
##  Mode :logical   Min.   :65.0  
##  FALSE:2         1st Qu.:68.0  
##  TRUE :2         Median :70.0  
##                  Mean   :69.5  
##                  3rd Qu.:71.5  
##                  Max.   :73.0

In contrast, passing a regression model object (class=lm) to summary() calls summary.lm() and produces a regression table instead.

# run a regression and save model of class "lm" in object
model1 <- lm(Height ~ Diabetes, data=mydata)
class(model1)
## [1] "lm"

# summary() calls summary.lm() if given an lm object
summary(model1)
## 
## Call:
## lm(formula = Height ~ Diabetes, data = mydata)
## 
## Residuals:
##  1  2  3  4 
## -3 -2  3  2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    71.000      2.550  27.848  0.00129 **
## DiabetesTRUE   -3.000      3.606  -0.832  0.49291   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.606 on 2 degrees of freedom
## Multiple R-squared:  0.2571, Adjusted R-squared:  -0.1143 
## F-statistic: 0.6923 on 1 and 2 DF,  p-value: 0.4929
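The dispatch mechanism behind summary() can be imitated in a few lines. The sketch below defines a made-up generic called describe (not part of R or any package) using UseMethod(), along with methods named in the generic.class pattern; it is meant only to illustrate how a generic picks a method by class.

```r
# `describe` is a hypothetical generic defined only for this illustration
describe <- function(x, ...) UseMethod("describe")

# method used when x is a data frame
describe.data.frame <- function(x, ...) {
  paste("a data frame with", nrow(x), "rows and", ncol(x), "columns")
}

# method used when x is a numeric vector
describe.numeric <- function(x, ...) {
  paste("a numeric vector of length", length(x))
}

# fallback when no class-specific method exists
describe.default <- function(x, ...) {
  paste("an object of class", class(x)[1])
}

describe(data.frame(a = 1:3))
## [1] "a data frame with 3 rows and 1 columns"
describe(c(1.5, 2.5))
## [1] "a numeric vector of length 2"
describe("text")
## [1] "an object of class character"
```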

The `methods()` function

Methods are class-specific functions.

The methods() function lists the methods available in the current R session.

Supply a generic function name to methods() to list all of the specific functions (methods) that the generic can dispatch to.

Supply a class to methods(class=) to list all specific functions that accept that class.

# what classes of objects does generic function summary() accept?
methods(summary)
##  [1] summary.aov                    summary.aovlist*              
##  [3] summary.aspell*                summary.check_packages_in_dir*
##  [5] summary.connection             summary.corAR1*               
##  [7] summary.corARMA*               summary.corCAR1*              
##  [9] summary.corCompSymm*           summary.corExp*               
## [11] summary.corGaus*               summary.corIdent*             
## [13] summary.corLin*                summary.corNatural*           
## [15] summary.corRatio*              summary.corSpher*             
## [17] summary.corStruct*             summary.corSymm*              
## [19] summary.data.frame             summary.Date                  
## [21] summary.default                summary.Duration*             
## [23] summary.ecdf*                  summary.factor                
## [25] summary.ggplot*                summary.glm                   
## [27] summary.gls*                   summary.infl*                 
## [29] summary.Interval*              summary.lm                    
## [31] summary.lme*                   summary.lmList*               
## [33] summary.loess*                 summary.manova                
## [35] summary.matrix                 summary.mlm*                  
## [37] summary.modelStruct*           summary.nls*                  
## [39] summary.nlsList*               summary.packageStatus*        
## [41] summary.pdBlocked*             summary.pdCompSymm*           
## [43] summary.pdDiag*                summary.PDF_Dictionary*       
## [45] summary.PDF_Stream*            summary.pdIdent*              
## [47] summary.pdLogChol*             summary.pdMat*                
## [49] summary.pdNatural*             summary.pdSymm*               
## [51] summary.Period*                summary.POSIXct               
## [53] summary.POSIXlt                summary.ppr*                  
## [55] summary.prcomp*                summary.princomp*             
## [57] summary.proc_time              summary.psych*                
## [59] summary.reStruct*              summary.shingle*              
## [61] summary.srcfile                summary.srcref                
## [63] summary.stepfun                summary.stl*                  
## [65] summary.table                  summary.trellis*              
## [67] summary.tukeysmooth*           summary.varComb*              
## [69] summary.varConstPower*         summary.varExp*               
## [71] summary.varFixed*              summary.varFunc*              
## [73] summary.varIdent*              summary.varPower*             
## see '?methods' for accessing help and source code

# what functions accept data frames as arguments?
methods(class="data.frame")
##   [1] $              $<-            [              [[            
##   [5] [[<-           [<-            aggregate      anti_join     
##   [9] anyDuplicated  arrange        arrange_       as.data.frame 
##  [13] as.list        as.matrix      as.tbl         as.tbl_cube   
##  [17] as_data_frame  as_tibble      by             cbind         
##  [21] coerce         collapse       collect        complete      
##  [25] complete_      compute        dim            dimnames      
##  [29] dimnames<-     distinct       distinct_      do            
##  [33] do_            drop_na        drop_na_       droplevels    
##  [37] duplicated     edit           expand         expand_       
##  [41] extract        extract_       fill           fill_         
##  [45] filter         filter_        format         formula       
##  [49] fortify        full_join      gather         gather_       
##  [53] ggplot         glimpse        group_by       group_by_     
##  [57] group_indices  group_indices_ group_size     groups        
##  [61] head           initialize     inner_join     intersect     
##  [65] is.na          is_vector_s3   knit_print     left_join     
##  [69] Math           merge          mutate         mutate_       
##  [73] n_groups       na.exclude     na.omit        nest          
##  [77] nest_          Ops            plot           print         
##  [81] prompt         pull           rbind          rename        
##  [85] rename_        replace_na     right_join     row.names     
##  [89] row.names<-    rowsum         same_src       sample_frac   
##  [93] sample_n       select         select_        semi_join     
##  [97] separate       separate_      separate_rows  separate_rows_
## [101] setdiff        setequal       show           slice         
## [105] slice_         slotsFromS3    split          split<-       
## [109] spread         spread_        stack          str           
## [113] subset         summarise      summarise_     summary       
## [117] Summary        t              tail           tbl_vars      
## [121] transform      type_sum       ungroup        union         
## [125] union_all      unique         unite          unite_        
## [129] unnest         unnest_        unstack        within        
## see '?methods' for accessing help and source code

Review: Data Structures

HAVE YOU BEEN PAYING ATTENTION???

Now we test what you've learned.

2.1 What will the object a contain here?

a <- c(0, 1)
a

a <- c(0, 1)
a
## [1] 0 1

2.2 Now what will the object a contain?

a <- c(10, seq(5, 1, -1))
a

a <- c(10, seq(5, 1, -1))
a
## [1] 10  5  4  3  2  1

2.3 What about now? What will a contain?

a <- c(rep(0,2), seq(1,5,by=2))
a

a <- c(rep(0,2), seq(1,5,by=2))
a
## [1] 0 0 1 3 5

2.4 What will be the result of dim(b)?

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
dim(b)

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
dim(b)
## [1] 3 2

b
##   letters numbers
## 1       a       1
## 2       b       2
## 3       c       3

2.5 What is the result of b[2,]?

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[2,]

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[2,]
##   letters numbers
## 2       b       2

2.6 What about the result of b[b$numbers<2,]?

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[b$numbers<2,]

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[b$numbers<2,]
##   letters numbers
## 1       a       1

2.7 What are three different ways to access "c" in data frame b?

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))

Also, why does it keep saying "levels" in the output?

b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))

# letters column, element 3 (recommended method)
b$letters[3]
## [1] c
## Levels: a b c

# row 3 column 1
b[3,1]
## [1] c
## Levels: a b c

# element 1 (column 1) of data frame, then element 3 of that
b[[1]][3]
## [1] c
## Levels: a b c

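This happens because data.frame() converted character vectors to factors by default when this output was produced (R 3.4.2). Since R 4.0.0, the default is stringsAsFactors = FALSE, so character columns stay character and the Levels line no longer appears.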
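Whether strings become factors can also be set explicitly with the stringsAsFactors argument of data.frame(), which makes the result independent of the default in the installed R version. A short sketch, with b_chr and b_fct as arbitrary object names:

```r
# stringsAsFactors = FALSE keeps the letters column as character
b_chr <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3),
                    stringsAsFactors=FALSE)
class(b_chr$letters)
## [1] "character"
b_chr$letters[3]
## [1] "c"

# stringsAsFactors = TRUE converts the letters column to a factor
b_fct <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3),
                    stringsAsFactors=TRUE)
class(b_fct$letters)
## [1] "factor"
```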

2.8 We know that b is a data.frame. Why do you think mean(b) does not work? What is R trying to tell us with the warning?

# confirm that it is a data frame
class(b)
## [1] "data.frame"

# NA is not what we want, what is the warning trying to tell us?
mean(b)
## Warning in mean.default(b): argument is not numeric or logical: returning
## NA
## [1] NA

mean() expects a numeric or logical vector as input, not a data frame. It doesn't know how to calculate the mean for several columns.

Subsetting a column of b results in a vector, which works as an input to mean().

methods(mean) shows us that there is no mean function defined specifically for data frames.

# columns of data frames are vectors
class(b$numbers)
## [1] "numeric"

mean(b$numbers)
## [1] 2

# no mean.data.frame
methods(mean)
## [1] mean.Date     mean.default  mean.difftime mean.POSIXct  mean.POSIXlt 
## see '?methods' for accessing help and source code

Importing Data

Dataset files

R works most easily with datasets stored as text files. Typically, values in text files are separated, or delimited, by tabs or spaces:


gender id race ses schtyp prgtype read write math science socst
0 70 4 1 1 general 57 52 41 47 57
1 121 4 2 1 vocati 68 59 53 63 61
0 86 4 3 1 general 44 33 54 58 31
0 141 4 3 1 vocati 63 44 47 53 56

or by commas (CSV file):


gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56

Reading in text data

We recommend the readr package (part of the tidyverse): use read_csv() to read in data stored as CSV and read_delim() to read in text data delimited by other characters.

For read_delim(), specify the delimiter in the delim= argument.

(Base R functions read.csv() and read.delim() have very similar functionality, but have less useful default settings)

Although we are retrieving files over the internet for this class, these functions are typically used for files saved to disk.

Note how we are assigning the loaded data to objects.

# run this if you missed it earlier
library(tidyverse)

In the output for read_csv() and read_delim(), you'll see the data type of each column.

# comma separated values
dat_csv <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   female = col_character(),
##   ses = col_character(),
##   schtyp = col_character(),
##   prog = col_character(),
##   read = col_integer(),
##   write = col_integer(),
##   math = col_integer(),
##   science = col_integer(),
##   socst = col_integer(),
##   honors = col_character(),
##   awards = col_integer(),
##   cid = col_integer()
## )

# tab separated values
dat_tab <- read_delim("https://stats.idre.ucla.edu/stat/data/hsb2.txt",
  delim="\t")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   female = col_integer(),
##   race = col_integer(),
##   ses = col_integer(),
##   schtyp = col_integer(),
##   prog = col_integer(),
##   read = col_integer(),
##   write = col_integer(),
##   math = col_integer(),
##   science = col_integer(),
##   socst = col_integer()
## )
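For files saved to disk, the pattern is the same with a file path in place of the URL. The sketch below is self-contained: it writes two tiny files to a temporary location and reads them back with the base R functions read.csv() and read.delim(), so that it runs without any extra packages. The file contents and object names are made up for illustration; with readr loaded, read_csv(csv_path) and read_delim(tab_path, delim="\t") would take the same paths.

```r
# write a small comma-separated file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,prog,read",
             "70,general,57",
             "121,vocati,68"), csv_path)

# base R reader for comma-delimited files; the first line becomes column names
small_csv <- read.csv(csv_path)
dim(small_csv)
## [1] 2 3

# the same values, tab-delimited; sep sets the delimiter
tab_path <- tempfile(fileext = ".txt")
writeLines(c("id\tprog\tread",
             "70\tgeneral\t57",
             "121\tvocati\t68"), tab_path)
small_tab <- read.delim(tab_path, sep = "\t")
small_tab$read
## [1] 57 68
```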

Tibbles

If you attempt to print a data frame read in by read_csv() or read_delim(), it prints in a special way:

dat_csv
## # A tibble: 200 x 13
##       id female    ses schtyp     prog  read write  math science socst
##    <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
##  1    45 female    low public vocation    34    35    41      29    26
##  2   108   male middle public  general    34    33    41      36    36
##  3    15   male   high public vocation    39    39    44      26    42
##  4    67   male    low public vocation    37    37    42      33    32
##  5   153   male middle public vocation    39    31    40      39    51
##  6    51 female   high public  general    42    36    42      31    39
##  7   164   male middle public vocation    31    36    46      39    46
##  8   133   male middle public vocation    50    31    40      34    31
##  9     2 female middle public vocation    39    41    33      42    41
## 10    53   male middle public vocation    34    37    46      39    31
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## #   awards <int>, cid <int>

This is because read_csv() assigns the data frame to a class called tibble, a tidyverse structure that slightly alters how data.frames behave, such as when they are being created or printed. Tibbles are still data frames, and will work in most functions that require data frame inputs.

Reading in data from other statistical software

We can read in datasets from other statistical analysis software using functions found in the haven package, part of the tidyverse. Note, the haven package is not loaded with library(tidyverse) so must be loaded separately.

require(haven)
## Loading required package: haven
# SPSS files
dat_spss <- read_spss("https://stats.idre.ucla.edu/stat/data/hsb2.sav")
# Stata files
dat_dta <- read_stata("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")

Reading in Excel files

Datasets are often saved as Excel spreadsheets. Here we use the readxl package to read in an Excel file. Because this file is on the internet, we need to download it first.

library(readxl)
# this step only needed to read excel files from the internet
download.file("https://stats.idre.ucla.edu/stat/data/hsb2.xls", "myfile.xls", mode="wb")

dat_xls <- read_excel("myfile.xls")

Viewing data with `head()` and `tail()`

Use head() and tail() to look at a specified number of rows at the beginning or end of a dataset, respectively.

# first 2 rows
head(dat_csv, 2)
## # A tibble: 2 x 13
##      id female    ses schtyp     prog  read write  math science socst
##   <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
## 1    45 female    low public vocation    34    35    41      29    26
## 2   108   male middle public  general    34    33    41      36    36
## # ... with 3 more variables: honors <chr>, awards <int>, cid <int>

# last 8 rows
tail(dat_csv, 8)
## # A tibble: 8 x 13
##      id female    ses  schtyp     prog  read write  math science socst
##   <int>  <chr>  <chr>   <chr>    <chr> <int> <int> <int>   <int> <int>
## 1   174   male middle private academic    68    59    71      66    56
## 2    95   male   high  public academic    73    60    71      61    71
## 3    61 female   high  public academic    76    63    60      67    66
## 4   100 female   high  public academic    63    65    71      69    71
## 5   143   male middle  public vocation    63    63    75      72    66
## 6    68   male middle  public academic    73    67    71      63    66
## 7    57 female middle  public academic    71    65    72      66    56
## 8   132   male middle  public academic    73    62    73      69    66
## # ... with 3 more variables: honors <chr>, awards <int>, cid <int>

Viewing data as a spreadsheet with `View()`

Use View() on a dataset to open a spreadsheet-style view of it. In RStudio, clicking on a dataset in the Environment pane will View() it.

View(dat_csv)

Exporting data

We can export our data in a number of formats, including text, Excel .xlsx, and in other statistical software formats like Stata .dta, using write_ functions that reverse the operations of the read_ functions.

Multiple objects can be stored in an R binary file (usually with extension ".Rdata") with save() and then later loaded with load().

We did not specify realistic pathnames below.

# write a csv file
write_csv(dat_csv, file = "path/to/save/filename.csv")

# Stata .dta file (write_dta() is in the haven package)
write_dta(dat_csv, path = "path/to/save/filename.dta")

# save these objects to an .Rdata file
save(dat_csv, mydata, file="path/to/save/filename.Rdata")
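
The save() and load() round trip can be tried safely with a temporary file. This is a minimal sketch: the objects toy_scores and toy_labels are invented for illustration, and tempfile() stands in for a real pathname.

```r
# two made-up objects to store together
toy_scores <- c(34, 39, 42)
toy_labels <- c("low", "middle", "high")

# a temporary file stands in for a real path
rdata_path <- tempfile(fileext = ".Rdata")
save(toy_scores, toy_labels, file = rdata_path)

# remove the objects, then restore them from the file
rm(toy_scores, toy_labels)
loaded_names <- load(rdata_path)

# load() returns the names of the objects it restored
loaded_names
toy_scores
```

Note that load() restores objects under their original names, so it will silently overwrite any existing objects with the same names.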

Review: Importing Data

3.1 In the file "http://stats.idre.ucla.edu/stat/data/hsbsemi.txt", the fields are separated by ";". What function could I use to read it in?

Here is how the first few rows appear:

id;female;ses;schtyp;prog;read;write;math;science;socst;honors;awards;cid
45;female;low;public;vocation;34;35;41;29;26;not enrolled;0;1
108;male;middle;public;general;34;33;41;36;36;not enrolled;0;1

Use read_delim() with delim=";"

d_semi <- read_delim("http://stats.idre.ucla.edu/stat/data/hsbsemi.txt",
                     delim=";")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   female = col_character(),
##   ses = col_character(),
##   schtyp = col_character(),
##   prog = col_character(),
##   read = col_integer(),
##   write = col_integer(),
##   math = col_integer(),
##   science = col_integer(),
##   socst = col_integer(),
##   honors = col_character(),
##   awards = col_integer(),
##   cid = col_integer()
## )

d_semi
## # A tibble: 200 x 13
##       id female    ses schtyp     prog  read write  math science socst
##    <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
##  1    45 female    low public vocation    34    35    41      29    26
##  2   108   male middle public  general    34    33    41      36    36
##  3    15   male   high public vocation    39    39    44      26    42
##  4    67   male    low public vocation    37    37    42      33    32
##  5   153   male middle public vocation    39    31    40      39    51
##  6    51 female   high public  general    42    36    42      31    39
##  7   164   male middle public vocation    31    36    46      39    46
##  8   133   male middle public vocation    50    31    40      34    31
##  9     2 female middle public vocation    39    41    33      42    41
## 10    53   male middle public vocation    34    37    46     -99   -99
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## #   awards <int>, cid <int>

Data Exploration

Getting to know your data

Once we have data loaded, it is always wise to familiarize ourselves with the variables in the dataset, both individually and in relation to one another.

First we will read in some data and store it in the object we name d. We prefer short names for objects that we will use frequently.

The dataset contains several school, test, and demographic variables for 200 students.

In this section, we will explore data with both numeric summaries and graphical depictions.

d <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbraw.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   female = col_character(),
##   ses = col_character(),
##   schtyp = col_character(),
##   prog = col_character(),
##   read = col_integer(),
##   write = col_integer(),
##   math = col_integer(),
##   science = col_integer(),
##   socst = col_integer(),
##   honors = col_character(),
##   awards = col_integer(),
##   cid = col_integer()
## )

d
## # A tibble: 200 x 13
##       id female    ses schtyp     prog  read write  math science socst
##    <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
##  1    45 female    low public vocation    34    35    41      29    26
##  2   108   male middle public  general    34    33    41      36    36
##  3    15   male   high public vocation    39    39    44      26    42
##  4    67   male    low public vocation    37    37    42      33    32
##  5   153   male middle public vocation    39    31    40      39    51
##  6    51 female   high public  general    42    36    42      31    39
##  7   164   male middle public vocation    31    36    46      39    46
##  8   133   male middle public vocation    50    31    40      34    31
##  9     2 female middle public vocation    39    41    33      42    41
## 10    53   male middle public vocation    34    37    46     -99   -99
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## #   awards <int>, cid <int>

Continuous and categorical variables

We can distinguish generally between variables measured continuously (quantitative) and those measured categorically (membership to a class).

Methods to explore the two types of variables differ somewhat, so we will visit each separately initially.

We first explore the continuous variables in the dataset, which are the academic test score variables, "read", "write", "math", "science", and "socst".

Exploring continuous variables numerically

Common numeric summaries for continuous variables are the mean, median, and variance, obtained with mean(), median(), and var() (sd() for standard deviation), respectively.

summary() on a numeric vector provides the min, max, mean, median, and first and third quartiles (whose difference is the interquartile range).

mean(d$read)
## [1] 52.23
median(d$read)
## [1] 50
var(d$read)
## [1] 105.1227

summary(d$read)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   44.00   50.00   52.23   60.00   76.00
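
The relationships among these summaries can be checked directly. The sketch below uses a made-up vector x (not the seminar data) to show that sd() is the square root of var(), and that IQR() is the distance between the first and third quartiles reported by summary().

```r
# made-up scores, not the seminar data
x <- c(28, 44, 50, 60, 76)

# the standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))

# first and third quartiles, as reported by summary()
quantile(x, probs = c(0.25, 0.75))

# the interquartile range is the distance between them
IQR(x)
```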

Introducing `ggplot2` for graphics

We will be using the package ggplot2, part of the tidyverse, to create plots for exploring our data. Although base R has powerful and flexible graphic capabilities on its own, we prefer the approach that ggplot2 takes.

ggplot2 uses a structured grammar of graphics that provides an intuitive framework for building graphics layer-by-layer, rather than memorizing lots of plotting commands and options
ggplot2 graphics take less work to make beautiful and eye-catching

Basic syntax of a `ggplot2` plot

The basic specification for a ggplot2 plot is to specify which variables are mapped to which aspects of the graph (called aesthetics) and then to choose a shape (called a geom) to display on the graph.

For example, we can choose to map one variable to the x-axis, another variable to the y-axis, and to use geom_point() as the shape to plot, which produces a scatter plot.

Within the ggplot() function we specify (Note that the package is named ggplot2 while this function is called ggplot()):

the dataset
inside an aes() function, we then specify which variables are mapped to which aesthetics, which can include:
- x-axis and y-axis
- color, size, and shape of objects

For a much more detailed explanation of the grammar of graphics underlying ggplot2, see our Introduction to ggplot2 seminar.

# a scatterplot of read vs write
ggplot(data=d, aes(x=write, y=read)) + geom_point()

Exploring continuous Variables: Histograms

We can inspect the distributions of continuous variables with histograms, density plots, and boxplots. Each of these plots has a corresponding ggplot2 geom.

Histograms bin continuous variables into intervals and count the frequency of observations in each interval.

For histograms and density plots, we will map the variable of interest to x.

# use the bins= argument to control the number of intervals
ggplot(d, aes(x=write)) + geom_histogram(bins=10)

We can also look at distributions for a subset of our data. Here we examine the distribution for write for students with math score below the mean math score:

# Requesting the rows where math is less than its mean
ggplot(d[d$math < mean(d$math),], 
       aes(x=write)) + geom_histogram(bins=10)

Exploring continuous vars: Density plots

Density plots smooth out the shape of histograms:

ggplot(d, aes(x = write)) + geom_density()

Exploring continuous vars: boxplots

Boxplots show the median, lower and upper quartiles (the hinges), and outliers.

Unlike with histograms and density plots, we map the variable whose distribution we want to plot to y instead of x. If we are making a single boxplot, we need an arbitrary value for x, just as a placeholder.

# for the overall distribution of one variable, specify x=1 (or any other value)
ggplot(d, aes(x = 1, y = math)) + geom_boxplot()

Data exploration can help us identify suspicious looking values. This value of -99 on science is probably a code for a missing value.

# for the overall distribution of one variable, specify x=1 (or any other value)
ggplot(d, aes(x = 1, y = science)) + geom_boxplot()
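
Suspicious values can also be listed numerically. By default, geom_boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond a hinge, and base R's boxplot.stats() applies a similar rule. The sketch below uses a made-up vector, toy_science, that contains one -99 code.

```r
# made-up science scores with one missing-data code
toy_science <- c(-99, 29, 31, 33, 36, 39, 42)

bs <- boxplot.stats(toy_science)

# lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# values flagged as outliers
bs$out
```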

Exploring categorical variables

The mean, median, and variance cannot be calculated meaningfully for categorical variables (unless the variable has just 2 categories).

Instead, we often present frequency tables of the distribution of membership to each category.

Use table() to produce frequency tables.

Use prop.table() on the tables produced by table() (i.e. the output) to see the frequencies expressed as proportions.

Some of the categorical variables in this dataset are:

prog: educational program, "general", "academic", and "vocation"
female: gender, "male" and "female"
honors: enrollment in honors program, "enrolled" and "not enrolled"
ses: socioeconomic status, "low", "middle", "high"

# table() produces counts
table(d$female)
## 
## female   male 
##    109     91
table(d$ses)
## 
##   high    low middle 
##     58     47     95

# for proportions, use output of table() 
#   as input to prop.table()
prop.table(table(d$female))
## 
## female   male 
##  0.545  0.455
prop.table(table(d$ses))
## 
##   high    low middle 
##  0.290  0.235  0.475
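
The exception for two-category variables mentioned earlier can be illustrated briefly: if we code membership in one category as TRUE/FALSE, the mean of that indicator is the proportion in that category. The vector toy_female below is made up for this sketch.

```r
# made-up two-category variable
toy_female <- c("female", "male", "female", "female", "male")

# logical indicator: TRUE counts as 1, FALSE as 0
is_female <- toy_female == "female"

# the mean of the indicator is the proportion of "female"
mean(is_female)

# the same proportion from prop.table()
prop.table(table(toy_female))
```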

Factors

As you may have noticed in the previous section, table() orders the categories of female and ses alphabetically. Unfortunately, the ordering high-low-middle is not ideal for ses.

Factors in R provide a way to represent categorical variables both numerically and categorically. Basically, factors assign an integer number (beginning with 1) to each distinct category, and then a character label to each category.

We convert character variables to factors with factor(). Specify the names of the categories in the levels= argument, in an order that makes sense to you. If you omit levels=, R will alphabetically sort the categories.

Use levels() on a factor to check the ordering of levels.

Note: In versions of R before 4.0.0, the base R function read.csv() by default read in character variables as factors using alphabetical ordering, which is not always desirable (since R 4.0.0 it leaves them as character unless stringsAsFactors=TRUE is set). We recommend the readr (part of tidyverse) function read_csv(), which always leaves them as character.

# before, ses is a character variable
str(d$ses)
##  chr [1:200] "low" "middle" "high" "low" "middle" "high" "middle" ...

# converting ses to factor
#   we need to specify levels explicitly, otherwise R will
#   sort alphabetically
d$ses <- factor(d$ses, levels=c("low", "middle", "high"))

# Now a factor, notice the integer representation
str(d$ses)
##  Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...

# levels() reveals all levels in order
levels(d$ses)
## [1] "low"    "middle" "high"

Factors are represented both by their integers and their character labels.

Factors are converted to 0/1 variables in regression models.

head(d$ses)
## [1] low    middle high   low    middle high  
## Levels: low middle high
head(as.numeric(d$ses))
## [1] 1 2 3 1 2 3

# the first observation of ses is equal to "low"...
d$ses[1] == "low"
## [1] TRUE

# ...and its underlying integer is equal to 1
as.numeric(d$ses[1]) == 1
## [1] TRUE
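
To see what the levels= argument changes, the sketch below builds the same factor twice from a made-up character vector, ses_chr: once with the default alphabetical levels and once with explicit levels. Both the integer codes and the order used by table() follow the level order.

```r
# made-up character vector
ses_chr <- c("low", "middle", "high", "low", "middle")

# default: levels sorted alphabetically
ses_alpha <- factor(ses_chr)
levels(ses_alpha)        # "high" "low" "middle"
as.integer(ses_alpha)    # 2 3 1 2 3

# explicit levels: low < middle < high
ses_ordered <- factor(ses_chr, levels = c("low", "middle", "high"))
levels(ses_ordered)      # "low" "middle" "high"
as.integer(ses_ordered)  # 1 2 3 1 2

# table() reports categories in level order
table(ses_ordered)
```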

Let's go ahead and convert some of the other character variables into factors:

# alphabetic ordering fine here, so no need to specify levels
d$female <- factor(d$female)
levels(d$female)
## [1] "female" "male"

d$prog <- factor(d$prog)
levels(d$prog)
## [1] "academic" "general"  "vocation"

Exploring categorical vars: Bar graphs

Distributions of categorical variables are often depicted by bar graphs, which are easily made in ggplot2. By default, geom_bar() counts the number of observations for each value of the variable mapped to x.

ggplot(d, aes(x=prog)) + geom_bar()

Exploring relationships between two variables

After inspecting distributions of variables individually, we then proceed to explore relationships between variables. Namely, we are generally interested in whether the values of one variable are independent of the other, or whether they are associated (i.e. correlated or predictive).

We use different numerical and graphical methods for exploration depending on whether the two variables are both continuous, both categorical, or one of each.

Exploring continuous by continuous numerically

Correlations provide quick assessments of whether two continuous variables are linearly related to one another.

The cor() function estimates correlations. If supplied with 2 vectors, cor() will estimate a single correlation. If supplied a data frame with several variables, cor() will estimate a correlation matrix.

# just a single correlation
cor(d$write, d$read)
## [1] 0.5967765

# now isolate all test score variables
scores <- d[, c("read", "write", "math", "science", "socst")]
cor(scores)
##              read     write      math   science     socst
## read    1.0000000 0.5967765 0.6622801 0.1709428 0.1814928
## write   0.5967765 1.0000000 0.6174493 0.1289845 0.1504587
## math    0.6622801 0.6174493 1.0000000 0.2051668 0.1898648
## science 0.1709428 0.1289845 0.2051668 1.0000000 0.9361672
## socst   0.1814928 0.1504587 0.1898648 0.9361672 1.0000000
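
For reference, the Pearson correlation that cor() estimates by default is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on two made-up vectors, toy_read and toy_write.

```r
# made-up scores, not the seminar data
toy_read  <- c(34, 39, 47, 55, 63)
toy_write <- c(35, 39, 52, 61, 60)

# correlation computed from its definition
r_manual <- cov(toy_read, toy_write) / (sd(toy_read) * sd(toy_write))

# agrees with cor() on two vectors
all.equal(cor(toy_read, toy_write), r_manual)

# a data frame input gives a matrix with the same value off the diagonal
cor(data.frame(read = toy_read, write = toy_write))
```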

Exploring continuous by continuous graphically

Scatter plots are an obvious choice to depict the relationship between 2 variables. We can also add a loess smooth layer (geom_smooth()) that provides a "best-fit" curve to the data.

Note that further layers are added with +.

Here we examine the relationship between reading test score and writing test score.

# both scatter plot and loess smooth layers
ggplot(d, aes(x=read, y=write)) + 
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

Exploring continuous by categorical: grouping data frames

When exploring the relationship between a continuous variable and categorical variable, we are often interested in whether the distribution (i.e. mean, variance, etc.) of the continuous variable is the same between the classes of the categorical variable.

For example, we might want to know whether the means and variances of math test scores are the same between males and females.

The dplyr package (part of tidyverse) provides a useful function, group_by(), which converts a data frame into a grouped data frame, grouped by one or more variables. After grouping the data frame, we then use the dplyr function summarize() to calculate statistics by group.

# first we group our data frame, d, by female
by_female <- group_by(d, female)

# notice that it is a grouped_df (data frame) now
class(by_female)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Specify a function to evaluate a variable by groups in summarize(). First specify the (grouped) dataset, then the functions to run on variables in the dataset.

Here we get the means and variances of math by gender. We see that the means are nearly the same, but the variance seems higher in males.

summarize(by_female, mean(math), var(math))
## # A tibble: 2 x 3
##   female `mean(math)` `var(math)`
##   <fctr>        <dbl>       <dbl>
## 1 female     52.39450    83.74108
## 2   male     52.94505    93.40806
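
The default column names such as `mean(math)` are awkward to type later. A minimal sketch of naming the summaries inside summarize() follows; the data frame toy and the column names mean_math, var_math and n_students are made up for illustration, and n() is the dplyr function that counts rows per group.

```r
library(dplyr)

# made-up data, not the seminar data
toy <- data.frame(female = c("female", "female", "male", "male"),
                  math   = c(41, 45, 40, 50))

by_female_toy <- group_by(toy, female)

# name = function(variable) gives each summary column a usable name
res <- summarize(by_female_toy,
                 mean_math  = mean(math),
                 var_math   = var(math),
                 n_students = n())
res
```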

Exploring continuous by categorical graphically

To plot the distributions of the continuous variables by groups defined by the categorical variables, we will plot separate density plots and boxplots of the continuous variables for each group of the categorical variable.

The grouping variable is commonly mapped to aesthetics that take on categories themselves, such as color or shape, but it can also be mapped to x, as in the boxplots below.

The distributions look very similar, with similar means, and a slightly more spread out shape for males.

ggplot(d, aes(x=math, color=female)) +
  geom_density()

Boxplots of math by female show the same similar looking distributions.

ggplot(d, aes(x=female, y=math)) +
  geom_boxplot()

Exploring categorical by categorical

Two-way and multi-way frequency tables are used to explore the relationships between categorical variables.

We can use table() and prop.table() again. Within prop.table(), use margin=1 for row proportions and margin=2 for column proportions. Omitting margin= will give proportions of the total.

Here, we check whether the proportions of observations that fall into each educational program (prog), are about the same across socioeconomic statuses.

# this time saving the freq table to an object
my2way <- table(d$prog, d$ses)

# counts in each crossing of prog and ses
my2way
##           
##            low middle high
##   academic  19     44   42
##   general   16     20    9
##   vocation  12     31    7

There seems to be an association between being in the academic program and being in high ses.

# row proportions, 
#   proportion of prog that falls into ses
prop.table(my2way, margin=1)
##           
##                  low    middle      high
##   academic 0.1809524 0.4190476 0.4000000
##   general  0.3555556 0.4444444 0.2000000
##   vocation 0.2400000 0.6200000 0.1400000

# columns proportions,
#   proportion of ses that falls into prog
prop.table(my2way, margin=2)
##           
##                  low    middle      high
##   academic 0.4042553 0.4631579 0.7241379
##   general  0.3404255 0.2105263 0.1551724
##   vocation 0.2553191 0.3263158 0.1206897
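
A quick way to remember what margin= does is to check which sums equal 1. The sketch below uses two made-up vectors, toy_prog and toy_ses: with margin=1 each row sums to 1, with margin=2 each column sums to 1, and with no margin the whole table sums to 1.

```r
# made-up categorical variables
toy_prog <- c("academic", "academic", "general", "vocation", "academic", "general")
toy_ses  <- c("low", "high", "low", "middle", "high", "middle")

toy2way <- table(toy_prog, toy_ses)

# row proportions: each row sums to 1
rowSums(prop.table(toy2way, margin = 1))

# column proportions: each column sums to 1
colSums(prop.table(toy2way, margin = 2))

# proportions of the total: the whole table sums to 1
sum(prop.table(toy2way))
```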

Exploring categorical by categorical graphically

We can add a categorical variable to the bar graph of the other categorical variable to depict their relationship.

Here we map prog to fill, the color used to fill the bars of the bar graph. (The color aesthetic specifies the color of the outline of the bars)

This produces a stacked bar chart. We again see that "high" ses has a higher proportion of "academic".

ggplot(d, aes(x=ses, fill=prog)) + 
  geom_bar()

The position argument in geom_bar() changes how the bars for each fill color are arranged on the graph. We can specify that the bars should stack (the default), dodge (side-by-side), or fill (uniform height to examine proportions).

ggplot(d, aes(x=ses, fill=prog)) + 
  geom_bar(position="dodge")
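
For completeness, here is a sketch of the third option, position="fill", which stretches every bar to the same height so that the colored segments show proportions within each x category. The data frame toy is made up for this example.

```r
library(ggplot2)

# made-up data, not the seminar data
toy <- data.frame(
  ses  = c("low", "low", "middle", "middle", "middle", "high", "high"),
  prog = c("general", "vocation", "academic", "general", "vocation",
           "academic", "academic")
)

# position = "fill": bars have uniform height, segments show proportions
p_fill <- ggplot(toy, aes(x = ses, fill = prog)) +
  geom_bar(position = "fill")
p_fill
```

With this position the y-axis runs from 0 to 1 rather than showing counts.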

Adding more variables to graphs

One of the great strengths of ggplot2 is how easy it is to map more variables to graphical aspects of the graph.

Graphs of 3 or more variables allow us to assess interactions of variables.

Here we add color by prog to our scatter plot of read vs write. Now we can assess whether the read-write relationship appears the same between programs.

# both scatter plot and loess smooth layers
ggplot(d, aes(x=read, y=write, color=prog)) + 
  geom_point() +
  geom_smooth()

Faceting in ggplot2 is creating multiples (panels) of a plot by a grouping variable.

In the function facet_wrap(), specify a grouping variable by which to split the plots after the ~.

Below we split our plots of prog-by-ses by female.

# all functions after ggplot know
#   to look for variables in dataset "d"
ggplot(d, aes(x=ses, fill=prog)) + 
  geom_bar(position="dodge") +
  facet_wrap(~female)

Base `R` graphics

Base R graphics are easy to create, but not nearly as easy to customize and modernize as ggplot2 graphs. We present some of the graphs created earlier, but now in base R graphics.

Base R histograms

hist(d$write)

Base R scatterplot

plot(d$write, d$read)

Base R bar graph

# barplot wants a table input, not a data frame
#   (ggplot always wants a data.frame)
barplot(table(d$prog))

Coloring a scatter plot by groups

plot(d$write, d$read, col=d$prog)

Review: Data Exploration

4.1 What are a couple of things we can learn from this density plot of awards? Why does it look wrong?

ggplot(d, aes(x=awards)) + geom_density()

Awards has a pretty small range, with most values near 0, and the very negative value is probably a missing data code.

ggplot(d, aes(x=awards)) + geom_density()

4.2 How would I obtain the maximum math score for each prog group? (hint: use group_by() and summarize())

# these are the progs again
table(d_semi$prog)
## 
## academic  general vocation 
##      105       45       50

First create a grouped data frame with group_by(), then use summarize() and function max() on math.

by_prog <- group_by(d_semi, prog)
summarize(by_prog, max(math))
## # A tibble: 3 x 2
##       prog `max(math)`
##      <chr>       <dbl>
## 1 academic          75
## 2  general          63
## 3 vocation          75

4.3 Here are the median and inter-quartile ranges (distance between first and third quartiles) of math scores by prog. Describe how the subsequent graph will appear.

by_prog <- group_by(d_semi, prog)
summarize(by_prog, median(math), IQR(math))
## # A tibble: 3 x 3
##       prog `median(math)` `IQR(math)`
##      <chr>          <dbl>       <dbl>
## 1 academic             57       13.00
## 2  general             49       14.00
## 3 vocation             45       11.75

ggplot(d, aes(x=prog, y=math)) + geom_boxplot()

3 boxplots, academic will be highest, general will have largest spread in the middle, vocation at the bottom with narrowest spread:

ggplot(d, aes(x=prog, y=math)) + geom_boxplot()

Data Management

Preparing our dataset for statistical analysis

Now that we have familiarized ourselves with our dataset variables and their relationships, we should clean up any data entry errors and create additional variables or datasets that we might need for our planned statistical analysis.

Let's begin by reading in our dataset again and storing it in object d.

# read data in
d <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbraw.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   female = col_character(),
##   ses = col_character(),
##   schtyp = col_character(),
##   prog = col_character(),
##   read = col_integer(),
##   write = col_integer(),
##   math = col_integer(),
##   science = col_integer(),
##   socst = col_integer(),
##   honors = col_character(),
##   awards = col_integer(),
##   cid = col_integer()
## )

# load packages for this section (if needed)
library(tidyverse)

Sorting

We can sort the order of rows in our data frame by variable values using the arrange function from the dplyr package (part of tidyverse).

Here we are requesting that arrange sort by science, and then by socst. The function arrange returns the sorted dataset.

Hmmm, we see those -99 values again…

d <- arrange(d, science, socst)
d
## # A tibble: 200 x 13
##       id female    ses  schtyp     prog  read write  math science socst
##    <int>  <chr>  <chr>   <chr>    <chr> <int> <int> <int>   <int> <int>
##  1    53   male middle  public vocation    34    37    46     -99   -99
##  2   191 female   high private academic    47    52    43     -99   -99
##  3     9   male middle  public vocation    48    49    52     -99   -99
##  4   159   male   high  public academic    55    61    54     -99   -99
##  5    71 female middle  public  general    57    62    56     -99   -99
##  6   144   male   high  public  general    60    65    58     -99   -99
##  7    61 female   high  public academic    76    63    60     -99   -99
##  8    15   male   high  public vocation    39    39    44      26    42
##  9    45 female    low  public vocation    34    35    41      29    26
## 10    51 female   high  public  general    42    36    42      31    39
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## #   awards <int>, cid <int>
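
arrange() sorts in ascending order by default. To sort a variable in descending order, wrap it in the dplyr helper desc(). The sketch below uses a made-up data frame, toy, sorted by descending science and then ascending socst.

```r
library(dplyr)

# made-up data, not the seminar data
toy <- data.frame(id      = 1:4,
                  science = c(29, -99, 42, 29),
                  socst   = c(26, -99, 41, 32))

# highest science first; ties broken by ascending socst
toy_sorted <- arrange(toy, desc(science), socst)
toy_sorted
```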

Missing values

Missing values in R are represented by the reserved symbol NA (cannot be used for variable names).

Blank fields in a text file will generally be converted to NA.

We can convert the -99 values in science to NA with conditional selection.

# subset to science values equal to -99, and then change
#  them all to NA
d$science[d$science == -99] <- NA
head(d$science, 10)
##  [1] NA NA NA NA NA NA NA 26 29 31

That works, but what if we suspect there might be -99 values in other variables? If we know beforehand what our missing value codes are, we can specify them in read_csv() with the na= argument and save ourselves the work of conversion.

In the code below, we are specifying that the following should be interpreted as missing for read_csv():

"" (blank field)
-99
"-99" for character variables
"NA"

We can see some NA values in science and socst.

# read in data, specifying missing data codes
d <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbraw.csv",
              na=c("", -99, "-99", "NA"))
d
## # A tibble: 200 x 13
##       id female    ses schtyp     prog  read write  math science socst
##    <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
##  1    45 female    low public vocation    34    35    41      29    26
##  2   108   male middle public  general    34    33    41      36    36
##  3    15   male   high public vocation    39    39    44      26    42
##  4    67   male    low public vocation    37    37    42      33    32
##  5   153   male middle public vocation    39    31    40      39    51
##  6    51 female   high public  general    42    36    42      31    39
##  7   164   male middle public vocation    31    36    46      39    46
##  8   133   male middle public vocation    50    31    40      34    31
##  9     2 female middle public vocation    39    41    33      42    41
## 10    53   male middle public vocation    34    37    46      NA    NA
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## #   awards <int>, cid <int>
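
The effect of na= can be tried on a small file without touching the seminar data. In this sketch a made-up three-row csv is written to a temporary file and read back with read_csv(). Because c() stores mixed inputs as character, writing -99 and "-99" in the na= vector is equivalent, so only the string form is used here.

```r
library(readr)

# write a made-up csv to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,science,honors",
             "1,29,enrolled",
             "2,-99,",
             "3,NA,not enrolled"),
           toy_path)

# blank fields, -99 and NA are all read as missing
toy <- read_csv(toy_path, na = c("", "-99", "NA"))
toy

# number of missing values per column
colSums(is.na(toy))
```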

Missing values are contagious

Most operations involving an NA value will result in NA:

1 + 2 + NA
## [1] NA

c(1, 2, 3, NA) > 2
## [1] FALSE FALSE  TRUE    NA

mean(c(1,2,3,4,NA))
## [1] NA

However, many functions allow the argument na.rm (or something similar) to be set to TRUE, which will first remove any NA values from the operation before calculating the result:

# NA values will be removed first
sum(c(1,2,NA), na.rm=TRUE)
## [1] 3

mean(c(1,2,3,4,NA), na.rm=TRUE)
## [1] 2.5

You cannot check for equality to NA with ==, because NA means "undefined". The comparison will always result in NA.

Use is.na() instead.

x <- c(1, 2, NA)

x == NA
## [1] NA NA NA

is.na(x)
## [1] FALSE FALSE  TRUE
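
Because is.na() returns a logical vector, it combines naturally with sum() and with logical subsetting. The sketch below uses a made-up vector x to count the missing values and to keep only the non-missing ones.

```r
# made-up vector with two missing values
x <- c(29, NA, 42, NA, 31)

# TRUE counts as 1, so the sum is the number of missing values
sum(is.na(x))        # 2

# keep only the non-missing values
x[!is.na(x)]         # 29 42 31

# same mean either way
mean(x[!is.na(x)])
mean(x, na.rm = TRUE)
```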

String functions

Base R comes with several functions useful for manipulating string (character) variables.

String variables are notoriously messy, often with typos and extra spaces. One of the advantages of readr read_csv() over base R read.csv() is that read_csv() will remove leading and trailing spaces by default, while read.csv() will not.

Two common tasks with strings are extracting substrings and concatenating strings together.

Use substr() to extract a part of a character variable, specified by the start= position and the stop= position.

Imagine we needed to abbreviate our prog names, so that they fit well in a graph or table. We can create a variable consisting of the first 3 letters of prog like so:

# extract starting at first character, stopping at third
d$prog_short <- substr(d$prog, start=1, stop=3)

head(d[,c("prog", "prog_short")], n=5)
## # A tibble: 5 x 2
##       prog prog_short
##      <chr>      <chr>
## 1 vocation        voc
## 2  general        gen
## 3 vocation        voc
## 4 vocation        voc
## 5 vocation        voc

For concatenating strings together, use paste(). The sep= argument specifies which character delimits the strings (space by default).

Below we combine the schtyp (school type) and ses variables into a single variable by pasting their contents together.

d$schtyp_ses1 <- paste(d$schtyp, d$ses, sep=" ")
head(d[, c("schtyp", "ses", "schtyp_ses1")], n=5)
## # A tibble: 5 x 3
##   schtyp    ses   schtyp_ses1
##    <chr>  <chr>         <chr>
## 1 public    low    public low
## 2 public middle public middle
## 3 public   high   public high
## 4 public    low    public low
## 5 public middle public middle

# changing the delimiter to comma
d$schtyp_ses2 <- paste(d$schtyp, d$ses, sep=",")
head(d[, c("schtyp", "ses", "schtyp_ses2")], n=5)
## # A tibble: 5 x 3
##   schtyp    ses   schtyp_ses2
##    <chr>  <chr>         <chr>
## 1 public    low    public,low
## 2 public middle public,middle
## 3 public   high   public,high
## 4 public    low    public,low
## 5 public middle public,middle
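
Both substr() and paste() work element by element on whole vectors, so they can be combined in a single step. The sketch below builds short codes from two made-up vectors, toy_schtyp and toy_ses; the underscore separator is an arbitrary choice.

```r
# made-up character vectors
toy_schtyp <- c("public", "private", "public")
toy_ses    <- c("low", "high", "middle")

# first 3 letters of school type, first letter of ses, joined by "_"
toy_code <- paste(substr(toy_schtyp, start = 1, stop = 3),
                  substr(toy_ses, start = 1, stop = 1),
                  sep = "_")
toy_code   # "pub_l" "pri_h" "pub_m"
```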

`grep()` for partial string matching

If you need to find matches of a given pattern within strings of a vector (does not have to be a whole word match), use grep().

By default grep() returns the index number of the matches. Use the argument specification value=TRUE to return the actual strings themselves.

my_char_vec <- c("here", "are", "some", "words", "to", "explore")

# indexes of elements that contain "re"
#   NOTICE that the pattern to be matched goes first, and
#     the input vector goes second
grep(pattern="re", x=my_char_vec)
## [1] 1 2 6

# value=TRUE returns the strings that are matched
grep("re", my_char_vec, value=TRUE)
## [1] "here"    "are"     "explore"
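
A closely related base R function, grepl(), returns a logical vector (TRUE where the pattern matches) instead of index numbers. That form is convenient for logical subsetting, for example when selecting rows of a data frame. A brief sketch with the same vector:

```r
my_char_vec <- c("here", "are", "some", "words", "to", "explore")

# TRUE where the element contains "re"
has_re <- grepl("re", my_char_vec)
has_re                # TRUE TRUE FALSE FALSE FALSE TRUE

# use the logical vector to subset
my_char_vec[has_re]   # "here" "are" "explore"
```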

Transforming variables

Often we need to create variables from other variables. For example, we may want to sum individual test items to form a total score. Or, we may want to convert a continuous scale into several categories, such as letter grades.

Here are some useful functions to transform variables:

log(): logarithm
min_rank(): rank values
cut(): cut a continuous variable into intervals; the new value signifies into which interval the original value falls.
scale(): standardizes a variable (subtracts the mean and divides by the standard deviation)
lag(), lead(): lag and lead a variable
cumsum(): cumulative sum
rowMeans(), rowSums(): means and sums of several columns
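
To get a feel for a few of these functions, here is a small sketch on made-up numbers (the objects x, z and items are hypothetical and not part of the seminar data). It uses only base R functions from the list above.

```r
# made-up values for illustration
x <- c(10, 20, 30, 40)

# cumulative sum
cumsum(x)
## [1]  10  30  60 100

# scale() returns a one-column matrix; as.vector() drops the matrix structure
z <- as.vector(scale(x))

# standardized values have mean 0 and standard deviation 1
# (up to floating-point rounding)
mean(z)
sd(z)

# rowSums() and rowMeans() work across the columns of a matrix or data frame
items <- cbind(item1=c(1, 2), item2=c(3, 4), item3=c(5, 9))
rowSums(items)
## [1]  9 15
rowMeans(items)
## [1] 3 5
```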

Adding new variables to the data frame

You can add variables to data frames by declaring them to be column variables of the data frame as they are created.

Trying to add a column of the wrong length will result in an error.

# this will add a column variable called logwrite to d
d$logwrite <- log(d$write)

# now we see logwrite as a column in d
colnames(d)
##  [1] "id"          "female"      "ses"         "schtyp"      "prog"       
##  [6] "read"        "write"       "math"        "science"     "socst"      
## [11] "honors"      "awards"      "cid"         "prog_short"  "schtyp_ses1"
## [16] "schtyp_ses2" "logwrite"

# d has 200 rows, and the rep vector has 300
d$z <- rep(0, 300)
## Error in `$<-.data.frame`(`*tmp*`, z, value = c(0, 0, 0, 0, 0, 0, 0, 0, : replacement has 300 rows, data has 200
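
One exception to the length rule is a single value, which R recycles to fill every row. A minimal sketch on a made-up data frame (toy is hypothetical, not the seminar data):

```r
# made-up data frame with 4 rows
toy <- data.frame(id=1:4)

# a single value is recycled to fill every row
toy$zero <- 0

# a vector with exactly nrow(toy) elements also works
toy$double_id <- toy$id * 2

toy
##   id zero double_id
## 1  1    0         2
## 2  2    0         4
## 3  3    0         6
## 4  4    0         8
```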

Transforming many variables at once with `mutate()`

The dplyr function mutate() allows us to transform many variables in one step without having to respecify the data frame name over and over.

Below we transform math in 4 different ways.

# create 4 transformations of math
d <- mutate(d,
            logmath = log(math),
            mathrank = min_rank(math),
            mathgrade = cut(math,
                            breaks = c(0, 35, 45, 55, 65, 80),
                            labels = c("F", "D", "C", "B", "A")),
            zmath = scale(math)
            )
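
One detail worth checking when choosing breaks: by default, cut() builds intervals that are open on the left and closed on the right, so a score exactly equal to a break falls into the lower category. The sketch below reuses the breaks and labels from above on a few made-up scores (toy_math is hypothetical).

```r
grade_breaks <- c(0, 35, 45, 55, 65, 80)
grade_labels <- c("F", "D", "C", "B", "A")

# made-up scores chosen to sit on and around the break points
toy_math <- c(35, 36, 45, 46, 80)

# default intervals: (0,35], (35,45], (45,55], (55,65], (65,80]
grades <- cut(toy_math, breaks=grade_breaks, labels=grade_labels)
grades
## [1] F D D C A
## Levels: F D C B A

# right=FALSE closes intervals on the left instead: [35,45), [45,55), ...
grades_left <- cut(c(35, 45), breaks=grade_breaks, labels=grade_labels,
                   right=FALSE)
grades_left
## [1] D C
## Levels: F D C B A
```

With the default settings, a value equal to the lowest break (here 0) or outside the range of the breaks becomes NA.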

Subsetting rows of a data frame with `filter()`

We have already seen how to subset using logical vectors and logical subsetting:

# subset to observations with max reading score
max_read <- d[d$read==max(d$read),]
max_read
## # A tibble: 2 x 21
##      id female   ses schtyp     prog  read write  math science socst
##   <int>  <chr> <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
## 1   103   male  high public academic    76    52    64      64    61
## 2    61 female  high public academic    76    63    60      NA    NA
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## #   prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## #   logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## #   zmath <dbl>

While that works well enough, the code can get unwieldy if there are many conditions that need to be evaluated.

The dplyr function filter() provides a cleaner syntax for subsetting datasets.

# subset to females with high math
d_fem_hi_math <- filter(d, female == "female" & math > 50)
head(d_fem_hi_math, n=3)
## # A tibble: 3 x 21
##      id female    ses schtyp     prog  read write  math science socst
##   <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
## 1     8 female    low public academic    39    44    52      44    48
## 2   142 female middle public vocation    47    42    52      39    51
## 3   151 female middle public vocation    47    46    52      48    46
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## #   prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## #   logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## #   zmath <dbl>

# subset to students with math < 50 in the general or academic programs
d_gen_aca_low_math <- filter(d, (prog == "general" | prog == "academic") & math < 50)
head(d_gen_aca_low_math, n=3)
## # A tibble: 3 x 21
##      id female    ses schtyp     prog  read write  math science socst
##   <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
## 1   108   male middle public  general    34    33    41      36    36
## 2    51 female   high public  general    42    36    42      31    39
## 3   128   male   high public academic    39    33    38      47    41
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## #   prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## #   logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## #   zmath <dbl>
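
When a variable is compared against several values, the %in% operator is a compact alternative to chaining == with |. The sketch below uses base R subsetting on a made-up data frame (toy is hypothetical); the same condition could also be written inside filter().

```r
# made-up data, not the seminar dataset
toy <- data.frame(id=1:5,
                  prog=c("general", "academic", "vocation",
                         "general", "academic"),
                  math=c(41, 60, 38, 55, 42))

# %in% tests whether each prog value appears in the supplied set
in_gen_aca <- toy$prog %in% c("general", "academic")
in_gen_aca
## [1]  TRUE  TRUE FALSE  TRUE  TRUE

# combine with another condition, just as with == and |
low_math <- toy[in_gen_aca & toy$math < 50, ]
low_math
##   id     prog math
## 1  1  general   41
## 5  5 academic   42
```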

Adding Observations (appending by rows)

Sometimes we are given our dataset in parts, with observations spread over many files (collected by different researchers, for example). To create one dataset, we need to append the datasets together row-wise.

The function rbind() appends data frames together. The variables must be the same between datasets.

Here, we rbind() the two datasets we created with filter() above, and check that it was successful by calculating the number of rows.

# rbind works because they have the same variables
d_append <- rbind(d_fem_hi_math, d_gen_aca_low_math)

# dimensions of component datasets
dim(d_fem_hi_math)
## [1] 62 21

dim(d_gen_aca_low_math)
## [1] 47 21

# appended dataset has rows = sum of rows of components
dim(d_append)
## [1] 109  21

Subsetting Variables (columns)

Often, datasets come with many more variables than we want. We can use the dplyr function select() to keep only the variables we need.

# select 4 variables
d_use <- select(d, id, female, read, write)
head(d_use, n=3)
## # A tibble: 3 x 4
##      id female  read write
##   <int>  <chr> <int> <int>
## 1    45 female    34    35
## 2   108   male    34    33
## 3    15   male    39    39

# select everything BUT female, read, write
# note the - preceding c(female...)
d_dropped <- select(d, -c(female, read, write))
head(d_dropped, n=3)
## # A tibble: 3 x 18
##      id    ses schtyp     prog  math science socst       honors awards
##   <int>  <chr>  <chr>    <chr> <int>   <int> <int>        <chr>  <int>
## 1    45    low public vocation    41      29    26 not enrolled      0
## 2   108 middle public  general    41      36    36 not enrolled      0
## 3    15   high public vocation    44      26    42 not enrolled      0
## # ... with 9 more variables: cid <int>, prog_short <chr>,
## #   schtyp_ses1 <chr>, schtyp_ses2 <chr>, logwrite <dbl>, logmath <dbl>,
## #   mathrank <int>, mathgrade <fctr>, zmath <dbl>

Adding columns of data

If we know that the rows of two columns (or two data frames) correspond to the same observations, in the same order, we can use cbind() to combine the columns into a single data frame. Columns combined this way must have the same number of rows.

The rows of the two data frames we just created with select() indeed do correspond to the same observations:

d_all <- cbind(d_use, d_dropped)
head(d_all, n=3)
##    id female read write  id    ses schtyp     prog math science socst
## 1  45 female   34    35  45    low public vocation   41      29    26
## 2 108   male   34    33 108 middle public  general   41      36    36
## 3  15   male   39    39  15   high public vocation   44      26    42
##         honors awards cid prog_short   schtyp_ses1   schtyp_ses2 logwrite
## 1 not enrolled      0   1        voc    public low    public,low 3.555348
## 2 not enrolled      0   1        gen public middle public,middle 3.496508
## 3 not enrolled      0   1        voc   public high   public,high 3.663562
##    logmath mathrank mathgrade      zmath
## 1 3.713572       22         D -1.2430021
## 2 3.713572       22         D -1.2430021
## 3 3.784190       43         D -0.9227783

Adding data columns by merging on a key variable

More often, we receive separate datasets with different variables (columns) that must be merged on a key variable.

Merging is an involved topic, with many different kinds of merges possible, depending on whether every observation in one dataset can be matched to an observation in the other dataset. Sometimes, you'll want to keep all observations in one dataset, even those that are not matched. Other times, you will not.

We will solely demonstrate merges where only matched observations are kept.
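
To make the distinction concrete, here is a small sketch using the base R function merge() on two made-up data frames (students and scores are hypothetical). The first call keeps only matched observations; the second, shown only for contrast, also keeps unmatched rows from the first dataset.

```r
# made-up datasets sharing the key variable id
students <- data.frame(id=c(1, 2, 3), name=c("Ann", "Bo", "Cy"))
scores <- data.frame(id=c(2, 3, 4), math=c(50, 60, 70))

# keep only ids found in both datasets
matched <- merge(students, scores, by="id")
matched
##   id name math
## 1  2   Bo   50
## 2  3   Cy   60

# all.x=TRUE keeps every row of students; unmatched rows get NA for math
keep_all_x <- merge(students, scores, by="id", all.x=TRUE)
keep_all_x
##   id name math
## 1  1  Ann   NA
## 2  2   Bo   50
## 3  3   Cy   60
```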

Earlier in the seminar, we learned how to use the dplyr functions group_by() and summarize() to get statistics by group.

Let's use those tools again to get statistics by class (dataset variable cid), namely the class means and medians on math. This time, we will store the output dataset in an object.

# first group data by cid (there are 20 classes)
by_class <- group_by(d, cid)

# then get mean/median on math by class
class_stats <- summarize(by_class, meanmath=mean(math), medmath=median(math))
class_stats
## # A tibble: 20 x 3
##      cid meanmath medmath
##    <int>    <dbl>   <dbl>
##  1     1 41.36364    41.0
##  2     2 40.50000    39.5
##  3     3 43.77778    44.0
##  4     4 44.72727    43.0
##  5     5 44.45455    44.0
##  6     6 47.11111    46.0
##  7     7 48.11111    49.0
##  8     8 49.81818    50.0
##  9     9 50.77778    51.0
## 10    10 50.80000    51.5
## 11    11 49.91667    50.5
## 12    12 55.40000    54.0
## 13    13 54.44444    53.0
## 14    14 58.00000    57.5
## 15    15 58.70000    59.0
## 16    16 57.18182    57.0
## 17    17 61.14286    62.0
## 18    18 63.30000    63.0
## 19    19 66.00000    66.0
## 20    20 69.80000    71.0

Conveniently, the class_stats dataset includes cid, which we will use as our key variable for merging.

Here, we will use the dplyr function inner_join() to merge the datasets (the base R function merge() is quite similar). inner_join() will search both datasets for any variables with the same name, and will use those as matching variables. If you need to control which variables are used to match, use the by= argument.

In our two datasets, the only variable that appears in both is cid, which we want to use as the key variable, so we do not need by=:

d_merged <- inner_join(d, class_stats)
## Joining, by = "cid"

# showing just a few variables for space
head(select(d_merged, cid, math, meanmath, medmath))
## # A tibble: 6 x 4
##     cid  math meanmath medmath
##   <int> <int>    <dbl>   <dbl>
## 1     1    41 41.36364      41
## 2     1    41 41.36364      41
## 3     1    44 41.36364      41
## 4     1    42 41.36364      41
## 5     1    40 41.36364      41
## 6     1    42 41.36364      41
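
Even when the default matching is what we want, naming the key explicitly documents the intent and protects against accidental matches on other shared variable names. A minimal sketch on made-up data, assuming dplyr is installed (toy_students and toy_stats are hypothetical stand-ins for d and class_stats):

```r
library(dplyr)

# made-up student data with a class id
toy_students <- data.frame(id=1:4,
                           cid=c(1, 1, 2, 2),
                           math=c(40, 44, 50, 58))

# class means, computed the same way as class_stats above
toy_stats <- summarize(group_by(toy_students, cid), meanmath=mean(math))

# naming the key with by= also suppresses the "Joining, by" message
toy_merged <- inner_join(toy_students, toy_stats, by="cid")
toy_merged$meanmath
## [1] 42 42 54 54
```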

Review: Data Management

5.1 If TRUE = 1 and FALSE = 0, what is the result of sum(b<3)?

b <- c(1,2,3,NA)
sum(b<3)

It's NA! Summing with NA results in NA. Remember to use na.rm=TRUE if you want to remove NA first.

b <- c(1,2,3,NA)
sum(b<3)
## [1] NA

# remove NA first
sum(b<3, na.rm=TRUE)
## [1] 2
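
The NA comes from the comparison itself, before any summing happens. A short sketch with the same vector shows this, along with two related tools: is.na() for counting missing values, and na.rm=TRUE in other summary functions such as mean().

```r
b <- c(1, 2, 3, NA)

# comparisons with NA are themselves NA
b < 3
## [1]  TRUE  TRUE FALSE    NA

# is.na() flags missing values, so its sum counts them
sum(is.na(b))
## [1] 1

# mean() also needs na.rm=TRUE to skip missing values
mean(b, na.rm=TRUE)
## [1] 2
```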

5.2 Here is a dataset of names and phone numbers. How do I create a variable that is just the area code (without parentheses)?

# tibble() is basically same as data.frame()
#  but adds class "tbl_df" to data.frame
directory <- 
  tibble(names=c("Leo Smith", "Karen Smith", 
                  "Audrey Jones", "Dylan Jones"),
         phone=c("(323)555-5432", "(323)555-5421",
                 "(213)555-2154", "(213)555-2155"))

Use substr() to extract the second through fourth characters from phone:

directory <- 
  data.frame(names=c("Leo Smith", "Karen Smith", 
                     "Audrey Jones", "Dylan Jones"),
             phone=c("(323)555-5432", "(323)555-5421",
                     "(213)555-2154", "(213)555-2155"))

directory$area_code <- substr(directory$phone, 2, 4)

directory
##          names         phone area_code
## 1    Leo Smith (323)555-5432       323
## 2  Karen Smith (323)555-5421       323
## 3 Audrey Jones (213)555-2154       213
## 4  Dylan Jones (213)555-2155       213

5.3 Imagine directory was much larger and had thousands or millions of rows. How could I subset the data to everyone with the name "Jones"?

directory <- 
  tibble(names=c("Leo Smith", "Karen Smith", 
                  "Audrey Jones", "Dylan Jones"),
         phone=c("(323)555-5432", "(323)555-5421",
                 "(213)555-2154", "(213)555-2155"))

Use grep() for partial matches. Remember that grep() returns the indices of matches, so we can use the results of grep to subset our directory:

directory <- 
  tibble(names=c("Leo Smith", "Karen Smith", 
                  "Audrey Jones", "Dylan Jones"),
         phone=c("(323)555-5432", "(323)555-5421",
                 "(213)555-2154", "(213)555-2155"))

# match "Jones" in names
my_jones <- grep("Jones", directory$names)
my_jones
## [1] 3 4

directory[my_jones,]
## # A tibble: 2 x 2
##          names         phone
##          <chr>         <chr>
## 1 Audrey Jones (213)555-2154
## 2  Dylan Jones (213)555-2155

5.4 These tibble data frames seem to have the same variables. Why doesn't rbind(y1, y2) work?

y1 <- tibble(Names=c("Mary", "Sue"),
            scores=c(36, 78))
y2 <- tibble(names=c("John", "Jack"),
             scores=c(25, 44))

# what happened?
rbind(y1, y2)
## Error in match.names(clabs, names(xi)): names do not match previous names

Case matters in R. "Names" and "names" are considered different. Make them the same to get rbind() to work:

y1 <- tibble(names=c("Mary", "Sue"),
            scores=c(36, 78))
y2 <- tibble(names=c("John", "Jack"),
             scores=c(25, 44))

# now rbind() works
rbind(y1, y2)
## # A tibble: 4 x 2
##   names scores
##   <chr>  <dbl>
## 1  Mary     36
## 2   Sue     78
## 3  John     25
## 4  Jack     44
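
With many columns, a mismatch like this can be hard to spot by eye. One possible approach, sketched below with base R data frames, is to compare the column names before appending and to standardize their case with tolower(). Whether lower-casing is the right fix depends on the data, so treat it as an assumption.

```r
y1 <- data.frame(Names=c("Mary", "Sue"), scores=c(36, 78))
y2 <- data.frame(names=c("John", "Jack"), scores=c(25, 44))

# compare column names before appending
identical(names(y1), names(y2))
## [1] FALSE

# lower-case the column names of y1 so they match y2
names(y1) <- tolower(names(y1))
identical(names(y1), names(y2))
## [1] TRUE

# rbind() now succeeds
y_all <- rbind(y1, y2)
nrow(y_all)
## [1] 4
```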

5.5 In the code below, we split the dataset into two - one that contains the numeric test variables (read, write, math, science, and socst) and another that contains all other variables. We then sort the test variables dataset by math.

Why is running cbind() to re-merge the datasets a bad idea? After all, there is no error message…

# create a dataset of just test scores
test <- select(d, read, write, math, science, socst)

nontest <- select(d, -c(read, write, math, science, socst))

# sort test scores by math
test <- arrange(test, math)

# cbind runs without error
remerged <- cbind(test, nontest)

# but what's wrong here?
head(remerged, n=3)
##   read write math science socst  id female    ses schtyp     prog
## 1   39    41   33      42    41  45 female    low public vocation
## 2   63    49   35      66    41 108   male middle public  general
## 3   36    44   37      42    41  15   male   high public vocation
##         honors awards cid prog_short   schtyp_ses1   schtyp_ses2 logwrite
## 1 not enrolled      0   1        voc    public low    public,low 3.555348
## 2 not enrolled      0   1        gen public middle public,middle 3.496508
## 3 not enrolled      0   1        voc   public high   public,high 3.663562
##    logmath mathrank mathgrade      zmath
## 1 3.713572       22         D -1.2430021
## 2 3.713572       22         D -1.2430021
## 3 3.784190       43         D -0.9227783

The problem is that cbind() does not know you sorted one of the two datasets, so now the order of observations is different between the two. Thus cbind() matches the wrong observations from the 2 datasets together.

Let's see how the observation with id = 1 appears in the original dataset and the remerged dataset:

# the values on the test scores don't match!
rbind(d[d$id==1,], remerged[remerged$id==1,])
## # A tibble: 2 x 21
##      id female   ses schtyp     prog  read write  math science socst
## * <int>  <chr> <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
## 1     1 female   low public vocation    34    44    40      39    41
## 2     1 female   low public vocation    39    54    39      47    36
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## #   prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## #   logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## #   zmath <dbl>

Instead, it is safer to use a merge variable. When first splitting the datasets, we should make sure an id variable appears in both datasets.

# This time, add id to test dataset
test <- select(d, id, read, write, math, science, socst)

nontest <- select(d, -c(read, write, math, science, socst))

# sort test scores by math
test <- arrange(test, math)

# merge() matches rows on the shared id variable
remerged2 <- merge(test, nontest)

# these should match now
rbind(remerged2[remerged2$id==1,], d[d$id==1,])
##   id read write math science socst female ses schtyp     prog       honors
## 1  1   34    44   40      39    41 female low public vocation not enrolled
## 2  1   34    44   40      39    41 female low public vocation not enrolled
##   awards cid prog_short schtyp_ses1 schtyp_ses2 logwrite  logmath mathrank
## 1      0   1        voc  public low  public,low  3.78419 3.688879       12
## 2      0   1        voc  public low  public,low  3.78419 3.688879       12
##   mathgrade     zmath
## 1         D -1.349743
## 2         D -1.349743
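
The same pitfall can be reproduced on a tiny made-up data frame, where the mismatch is easy to see. This sketch uses only base R (order() in place of arrange()); the objects toy, test_toy, nontest_toy, bad and good are hypothetical.

```r
# made-up data with 3 observations
toy <- data.frame(id=c(1, 2, 3), math=c(50, 30, 40), prog=c("a", "b", "c"))

test_toy <- toy[, c("id", "math")]
nontest_toy <- toy[, c("id", "prog")]

# sort only one of the two pieces by math
test_toy <- test_toy[order(test_toy$math), ]

# cbind() pairs rows by position: the two id columns now disagree
bad <- cbind(test_toy, nontest_toy)
all(bad[, 1] == bad[, 3])
## [1] FALSE

# merge() pairs rows by id, so each observation is matched correctly
good <- merge(test_toy, nontest_toy, by="id")
good
##   id math prog
## 1  1   50    a
## 2  2   30    b
## 3  3   40    c
```

Keeping id in both pieces also gives a quick sanity check after cbind(): if the two id columns differ, the rows were not aligned.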

More help

Getting help online

UCLA Stat Consulting, our website with thousands of pages for help with statistical analysis
StackOverflow, a message board where R-related and statistics-related questions are answered
R for Data Science, an online book written by the people who make RStudio and the tidyverse, that discusses how to use R for all steps of data analysis, with special emphasis on using the tidyverse

Books on `R`

There is an extensive list of books on R maintained at http://www.r-project.org/doc/bib/R-books.html.