September 25, 2017
R as a programming environment
R is a programming environment. Its tools are distributed as packages, which any user can download to customize the R environment.
R and packages
Base R and most R packages are available for download from the Comprehensive R Archive Network (CRAN).
R comes with a number of basic data management, analysis, and graphical tools.
R's power and flexibility lie in its array of packages (currently more than 11,000 on CRAN!).
You can work directly in R, but most users prefer a graphical interface. We highly recommend using RStudio, an integrated development environment (IDE).
For the purposes of this seminar, we will be using the following packages frequently:
installr: easy, automatic updating of R and R packages
tidyverse: a collection of packages designed to work with tidy data, that is, data organized in a way that makes later analysis easier. It includes the following packages:
  dplyr: various data management tasks
  readxl: reading Excel files
  ggplot2: elegant data visualization (graphics) using the Grammar of Graphics
  haven: reading data files from other stats packages
To use packages in R, we must first install them using the install.packages() function, which typically downloads the package from CRAN and installs it for use.
install.packages("installr")
install.packages("tidyverse")
After installing a package, we can load it into the R environment using the library() or require() functions, which more or less do the same thing.
Functions and data structures within the package will then be available for use.
library(tidyverse)
library(installr)
## Loading required package: stringr
##
## Welcome to installr version 0.19.0
##
## More information is available on the installr project website:
## https://github.com/talgalili/installr/
##
## Contact: <tal.galili@gmail.com>
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/installr/issues
##
## To suppress this message use:
## suppressPackageStartupMessages(library(installr))
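The practical difference between library() and require() shows up when a package is missing. The sketch below uses a deliberately non-existent package name (notARealPackage123, a placeholder chosen only for illustration) to show that require() returns FALSE with a warning, whereas library() stops with an error.

```r
# require() returns TRUE or FALSE instead of stopping with an error.
# "notARealPackage123" is a made-up name used only for illustration.
found <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
found
## [1] FALSE

# library() signals an error for a missing package; try() catches it here
# so that the rest of a script would keep running.
result <- try(library("notARealPackage123", character.only = TRUE),
              silent = TRUE)
class(result)
## [1] "try-error"
```

Because of this, require() is sometimes used inside an if() test, while library() is the safer choice at the top of a script, where a missing package should stop execution immediately.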
R and its packages
R is updated quite frequently, and newer versions of R are sometimes incompatible with older versions of packages, so it is important to keep both R and its packages up to date.
The installr package provides the function updateR(), which will automatically search for and then install new versions of R, and can also update all packages to their newest versions.
updateR()
R session
To get a description of the version of R and its attached packages used in the current session, we can use the sessionInfo() function.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] installr_0.19.0 stringr_1.2.0 knitr_1.17 dplyr_0.7.4
## [5] purrr_0.2.3 readr_1.1.1 tidyr_0.7.1 tibble_1.3.4
## [9] ggplot2_2.2.1 tidyverse_1.1.1
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13 cellranger_1.1.0 compiler_3.4.2 plyr_1.8.4
##  [5] bindr_0.1 forcats_0.2.0 tools_3.4.2 digest_0.6.12
##  [9] lubridate_1.6.0 jsonlite_1.5 evaluate_0.10.1 nlme_3.1-131
## [13] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.1.2
## [17] psych_1.7.8 yaml_2.1.14 parallel_3.4.2 haven_1.1.0
## [21] bindrcpp_0.2 xml2_1.1.1 httr_1.3.1 hms_0.3
## [25] rprojroot_1.2 grid_3.4.2 glue_1.1.1 R6_2.2.2
## [29] readxl_1.0.0 foreign_0.8-69 rmarkdown_1.6 modelr_0.1.1
## [33] reshape2_1.4.2 magrittr_1.5 backports_1.1.1 scales_0.5.0
## [37] htmltools_0.3.6 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
## [41] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3
## [45] broom_0.4.2
Without further specification, files will be loaded from and saved to the working directory. The functions getwd() and setwd() will get and set the working directory, respectively.
#get current directory (not run)
getwd()
# set new working directory (not run)
setwd("/path/to/directory")
R code can be entered into the command line directly or saved to a script, which can be run inside a session using the source() function.
You can run a command directly from a script by placing the cursor inside the command or highlighting the commands and hitting Ctrl-Enter (Command-Enter on Macs). This will advance the cursor to the next command, where you can hit Ctrl-Enter again to run it, advancing the cursor to the next command…
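As a small illustration of source(), the sketch below writes a two-line script to a temporary file and then runs it. The file contents and the object names x and y are made up for this example; tempfile() is used so that nothing is written to your working directory.

```r
# create a throwaway script file (the path is generated by tempfile())
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 2 + 3", "y <- x * 10"), script_path)

# run every command in the script, in order
source(script_path)

# objects created by the script are now available in the session
y
## [1] 50
```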
Commands are separated either by a ; or by a newline.
R is case sensitive.
The # character signifies a comment: everything from the # to the end of the line is not executed.
Commands can extend beyond one line of text. Put operators like + at the end of lines for multi-line commands.
# Using R as a calculator
2 + 3
## [1] 5
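The syntax rules above can be seen together in a few lines. This is only a sketch; the object names x, y, X and total are arbitrary.

```r
# two commands on one line, separated by ;
x <- 1; y <- 2

# R is case sensitive: X and x are different objects
X <- 100

# a trailing + tells R that the command continues on the next line
total <- x +
  y
total
## [1] 3
X
## [1] 100
```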
R stores both data and output from data analysis (as well as everything else) in objects.
Data are assigned to and stored in objects using the <- or = operator.
To print the contents of an object, specify the object's name alone.
A list of all objects in the current session can be obtained with ls()
# assign the number 3 to object called abc
abc <- 3
# print contents
abc
## [1] 3
# list all objects in current session
ls()
## [1] "abc" "hook_output" "my_custom_output"
Functions perform most of the work on data in R.
Functions in R are much the same as they are in math: they perform some operation on an input and return some output. For example, the mathematical function \(f(x) = x^2\) takes an input \(x\) and returns its square. Similarly, the mean() function in R takes a vector of numbers and returns its mean.
The inputs to functions are often referred to as arguments.
We have already used a few functions, such as install.packages() and library().
Help files for R functions are accessed by preceding the name of the function with ? (e.g. ?seq).
In the help file, we will find a list of Arguments to the function, in a specific order. Values for arguments to functions can be specified either by name or position.
# seq() creates a sequence of numbers
# specifying arguments by name
seq(from=1, to=5, by=1)
## [1] 1 2 3 4 5
# specifying arguments by position
seq(10, 0, -2)
## [1] 10 8 6 4 2 0
In the Usage section a value specified after an argument is its default value. Arguments without values have no defaults and usually need to be supplied by the user.
The Value section specifies what is returned. Usually there are Examples at the bottom.
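To see how defaults and argument matching behave, it can help to write a tiny function of our own. The function power() below is hypothetical (it is not part of R or of any package used in this seminar); its second argument has a default value of 2, just as a help file's Usage section would show.

```r
# power() is a made-up example function; exponent defaults to 2
power <- function(x, exponent = 2) {
  x^exponent
}

# exponent omitted, so the default is used
power(3)
## [1] 9
# arguments matched by position
power(3, 3)
## [1] 27
# arguments matched by name, so their order does not matter
power(exponent = 3, x = 2)
## [1] 8
```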
If you aren't sure what function to use, or want to search for a topic, ??keyword searches R documentation for keyword (e.g. ??logistic)
??logistic
Many packages include vignettes – longer, tutorial style guides for a package.
To see a list of available vignettes for the packages that are loaded, use vignette() with no arguments. Then to view a vignette, place its name inside vignette():
# list all available vignettes
vignette()
# View the "Introduction to dplyr" vignette
vignette("introduction")
Vectors, the fundamental data structure in R, are one-dimensional and homogeneous.
A single variable can usually be represented by one of the following vector data types: logical, integer, double (real numbers), or character.
A single value is a vector of length one in R.
The c() function combines values of common type together to form a vector.
The typeof() function identifies a vector's type.
The length() function returns its length.
# create a vector
first_vec <- c(1, 3, 5)
first_vec
## [1] 1 3 5
# vector type
typeof(first_vec)
## [1] "double"
# character vector
char_vec <- c("these", "are", "some", "words")
length(char_vec)
## [1] 4
# the result of this comparison is a logical vector
first_vec > c(2, 2, 2)
## [1] FALSE TRUE TRUE
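Because vectors are homogeneous, c() converts its inputs to a single common type when they differ. The short sketch below shows which type wins in a few mixed cases; the object names are arbitrary.

```r
# integer and double combine to double
int_dbl <- c(1L, 2.5)
typeof(int_dbl)
## [1] "double"

# a number and a string combine to character
num_chr <- c(1, "a")
typeof(num_chr)
## [1] "character"
num_chr
## [1] "1" "a"

# logical and integer combine to integer (TRUE becomes 1)
lgl_int <- c(TRUE, 2L)
typeof(lgl_int)
## [1] "integer"
```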
rep() and seq() to generate vectors
To create vectors with a predictable sequence of elements, use rep() to generate repetitive elements and seq() to generate sequential elements.
The expression m:n will generate a vector of integers from m to n.
# second argument is number of repetitions
rep(0, times=3)
## [1] 0 0 0
rep("abc", 4)
## [1] "abc" "abc" "abc" "abc"
# from, to, by
seq(from=1, to=5, by=2)
## [1] 1 3 5
seq(10, 0, -5)
## [1] 10 5 0
# colon operator
3:7
## [1] 3 4 5 6 7
# you can nest functions
rep(seq(1,3,1), times=2)
## [1] 1 2 3 1 2 3
# each vs times
rep(seq(1,3,1), each=2)
## [1] 1 1 2 2 3 3
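A few other ways of calling these functions are sometimes handy. The sketch below uses the length.out argument of seq(), the helper seq_len(), and a vector-valued times argument to rep(); the object names are arbitrary.

```r
# ask for a given number of evenly spaced values instead of a step size
even_steps <- seq(0, 1, length.out = 5)
even_steps
## [1] 0.00 0.25 0.50 0.75 1.00

# seq_len(n) is the integers 1 to n
first_four <- seq_len(4)
first_four
## [1] 1 2 3 4

# times can be a vector: repeat "a" twice and "b" three times
mixed_reps <- rep(c("a", "b"), times = c(2, 3))
mixed_reps
## [1] "a" "a" "b" "b" "b"
```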
If we perform an operation on two or more vectors of unequal length, the values of the shorter vector will be recycled until the two vectors are of the same length.
# the single value `1` is a vector of length 1
# it is recycled to be c(1,1,1)
c(1,2,3) + 1
## [1] 2 3 4
# second vector recycled twice to make c(1,2,1,2,1,2)
c(1,2,3,4,5,6) + c(1,2)
## [1] 2 4 4 6 6 8
# The 2 becomes c(2,2,2)
c(1,2,3) < 2
## [1]  TRUE FALSE FALSE
# what is R complaining about here?
c(2,3,4) + c(10, 20)
## Warning in c(2, 3, 4) + c(10, 20): longer object length is not a multiple
## of shorter object length
## [1] 12 23 14
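Recycling can be made explicit with rep_len(), which repeats a vector until it reaches a requested length. The sketch below checks that adding the short vector directly gives the same result as adding its explicitly recycled version; the object names are arbitrary.

```r
long_vec <- c(1, 2, 3, 4, 5, 6)
short_vec <- c(1, 2)

# rep_len() repeats short_vec until it has as many elements as long_vec
recycled <- rep_len(short_vec, length(long_vec))
recycled
## [1] 1 2 1 2 1 2

# implicit recycling and explicit recycling give the same sum
identical(long_vec + short_vec, long_vec + recycled)
## [1] TRUE
```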
Elements of a vector can be accessed or subset by specifying a vector of numbers (of length 1 or greater) inside [].
# create a vector 10 to 1
# putting () around a command will cause the result to be printed
(a <- seq(10,1,-1))
## [1] 10 9 8 7 6 5 4 3 2 1
# second element
a[2]
## [1] 9
# first 5 elements
a[seq(1,5)]
## [1] 10 9 8 7 6
# first, third, and fourth elements
a[c(1,3,4)]
## [1] 10 8 7
Vector elements can be named, and then subset by name. Make sure to use "" when subsetting by element name.
scores <- c(John=25, Marge=34, Dan=24, Emily=29)
scores[c("John", "Emily")]
## John Emily
## 25 29
Vector elements can also be subset with a logical (TRUE/FALSE) vector, known as logical subsetting.
scores[c(FALSE, TRUE, TRUE, FALSE)]
## Marge   Dan
##    34    24
This allows us to subset a vector by checking if a condition is satisfied:
# this returns a logical vector...
scores < 30
##  John Marge   Dan Emily
##  TRUE FALSE  TRUE  TRUE
# ...that we can now use to subset
scores[scores<30]
##  John   Dan Emily
##    25    24    29
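Conditions can be combined before subsetting, and which() reports the positions at which a condition holds. The sketch below recreates the scores vector so that it runs on its own; the names mid_scores and low_positions are arbitrary.

```r
scores <- c(John=25, Marge=34, Dan=24, Emily=29)

# combine conditions with & (and) or | (or)
mid_scores <- scores[scores > 24 & scores < 30]
mid_scores
##  John Emily
##    25    29

# which() converts a logical vector to the positions that are TRUE
low_positions <- which(scores < 30)
low_positions
##  John   Dan Emily
##     1     3     4
```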
Like vectors, lists are "one-dimensional" structures, but the elements can be a mixture of types – often vectors (of any length), but also other lists, matrices and data frames (see below).
Lists can be manually generated with list():
# list accepts a mixture of data types
# a list of a numeric vector, an integer vector, and a
# character vector
mylist <- list(1.1, c(1L,3L,7L), c("abc", "def"))
mylist
## [[1]]
## [1] 1.1
##
## [[2]]
## [1] 1 3 7
##
## [[3]]
## [1] "abc" "def"
List elements can be named as well
# list elements can be named as well
mary_info <- list(classes=c("Biology", "Math", "Music",
"Physics"),
friends=c("John", "Dan", "Emily"),
SAT=1450)
mary_info
## $classes
## [1] "Biology" "Math" "Music" "Physics"
##
## $friends
## [1] "John" "Dan" "Emily"
##
## $SAT
## [1] 1450
As the output of the previous two examples suggests, there are a couple of ways to access list elements (here, vectors).
Use [[]] to access by position number and $ to access by name.
If the list element is a vector, we can access individual elements within it by subsetting with [].
# by position
mary_info[[2]]
## [1] "John"  "Dan"   "Emily"
# by name
mary_info$SAT
## [1] 1450
# second element of friends vector
mary_info$friends[2]
## [1] "Dan"
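Single brackets also work on lists, but they behave differently from [[]]: they return a smaller list rather than the element itself. The sketch below recreates mary_info so that it runs on its own; sub_list, sat_value and element are arbitrary names.

```r
mary_info <- list(classes=c("Biology", "Math", "Music", "Physics"),
                  friends=c("John", "Dan", "Emily"),
                  SAT=1450)

# single brackets return a list containing the selected element
sub_list <- mary_info["SAT"]
class(sub_list)
## [1] "list"

# double brackets return the element itself
sat_value <- mary_info[["SAT"]]
class(sat_value)
## [1] "numeric"

# [[]] also accepts a name stored in another object, which $ does not
element <- "friends"
mary_info[[element]]
## [1] "John"  "Dan"   "Emily"
```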
Matrices are two-dimensional, homogeneous data structures.
Matrices can be generated manually with matrix(). The input to matrix() is a one-dimensional vector, which is reshaped into a two-dimensional matrix according to the dimensions specified by the user in the arguments nrow and ncol (generally only one is needed).
The matrix is filled down the columns by default, but this can be changed by setting the byrow argument to TRUE.
# create a 2x3 matrix, filling down columns
a <- matrix(1:6, nrow=2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# now fill across rows
b <- matrix(5:14, nrow=2, byrow=TRUE)
b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5    6    7    8    9
## [2,]   10   11   12   13   14
Matrix elements can be accessed with matrix[row,column] notation.
Omitting row requests all rows, and omitting column requests all columns.
# row 2 column 3
a[2,3]
## [1] 6
# all rows column 2
b[,2]
## [1] 6 11
# all columns row 1
a[1,]
## [1] 1 3 5
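Note that selecting a single row or column returns a plain vector, not a matrix. If a matrix is needed, the drop argument of [ keeps both dimensions. The sketch below recreates the matrix a; col_vec and col_mat are arbitrary names.

```r
a <- matrix(1:6, nrow=2)

# a single column comes back as a vector, which has no dimensions
col_vec <- a[, 2]
dim(col_vec)
## NULL

# drop = FALSE keeps the result as a 2 x 1 matrix
col_mat <- a[, 2, drop = FALSE]
dim(col_mat)
## [1] 2 1
```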
Datasets for statistical analysis are typically stored in data frames in R.
Data frames combine the features of matrices and lists.
Real datasets usually combine variables of different types, so data frames are well suited for storage.
data.frame()
Data frames can be manually created with data.frame(). The syntax resembles the syntax for list(), except that the elements are vectors of equal length.
The elements of a data frame are almost always named.
# a logical vector and numeric vector of equal length
mydata <- data.frame(diabetic = c(TRUE, FALSE, TRUE, FALSE),
height = c(65, 69, 71, 73))
mydata
## diabetic height
## 1 TRUE 65
## 2 FALSE 69
## 3 TRUE 71
## 4 FALSE 73
Because data frames share features of both matrices and lists, they can be subset by methods for either matrices or lists.
With a two-dimensional structure, data frames can be subset like matrices [rows, columns].
# row 3 column 2
mydata[3,2]
## [1] 71
# using column name
mydata[1:2, "height"]
## [1] 65 69
# all rows of column "diabetic"
mydata[,"diabetic"]
## [1]  TRUE FALSE  TRUE FALSE
We can subset data frames like lists as well. The columns are considered the list elements, so we can use either [[]] or $ to extract columns.
Extracted columns are vectors.
We will generally use $ throughout the seminar to subset data frame columns, because we often perform operations on column variables.
# subsetting creates a numeric vector
mydata$height[2:3]
## [1] 69 71
# this is a numeric vector
mydata[["height"]]
## [1] 65 69 71 73
mydata[["height"]][2]
## [1] 69
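The $ operator can also create new columns, and a logical column can be used to select rows. The sketch below works on a separate copy of the data called patients (a name chosen for this example so that mydata is left unchanged), and assumes purely for illustration that height is recorded in inches.

```r
patients <- data.frame(diabetic = c(TRUE, FALSE, TRUE, FALSE),
                       height = c(65, 69, 71, 73))

# create a new column from an existing one
# (assumes, for illustration only, that height is in inches)
patients$height_cm <- patients$height * 2.54

# keep only the rows where diabetic is TRUE, and all columns
diabetic_rows <- patients[patients$diabetic, ]
diabetic_rows
##   diabetic height height_cm
## 1     TRUE     65    165.10
## 3     TRUE     71    180.34
```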
colnames(*data frame*) returns the column names of a data frame (or matrix).
colnames(*data frame*) <- c("some", "names") assigns column names to data frame.
# get column names
colnames(mydata)
## [1] "diabetic" "height"
# assign column names
colnames(mydata) <- c("Diabetic", "Height")
colnames(mydata)
## [1] "Diabetic" "Height"
# to change one variable name, just use indexing
colnames(mydata)[1] <- "Diabetes"
colnames(mydata)
## [1] "Diabetes" "Height"
Use dim() on two-dimensional objects to get the number of rows and columns.
Use str() to see the structure of the object, including its class (discussed later) and the data types of elements.
# number of rows and columns
dim(mydata)
## [1] 4 2
# mydata is of class "data.frame"
# its variables are of type logical and numeric
str(mydata)
## 'data.frame': 4 obs. of 2 variables:
##  $ Diabetes: logi TRUE FALSE TRUE FALSE
##  $ Height  : num 65 69 71 73
R objects belong to classes. Objects can belong to more than one class.
Many functions only accept objects of a specific class, so it is important to know the classes of our objects.
The class() function lists all classes to which the object belongs. If class() returns a basic data type (e.g. "numeric", "character", "integer"), the object has an implicit class of vector (or matrix for 2-d objects).
Data frames are a class as well.
# mydata is of class data.frame
class(mydata)
## [1] "data.frame"
# Height is a numeric vector
class(mydata$Height)
## [1] "numeric"
# colMeans(), for means of columns, wants input of class data.frame or matrix
colMeans(mydata)
## Diabetes   Height
##      0.5     69.5
# vector input to colMeans() produces an error
colMeans(mydata$Height)
## Error in colMeans(mydata$Height): 'x' must be an array of at least two dimensions
Generic functions match object classes to the appropriate function.
Generic functions remove the need for the user to remember the classes of objects that functions support. The function's help file will usually tell you if a function is generic.
Generic functions accept objects from multiple classes. They then pass the object to a specific function (called a method) designed for the object's class.
The various functions for specific classes can have widely diverging purposes.
For example, summary() is a generic function. When a data frame is passed to summary(), the data frame is then passed to a specific function (method) called summary.data.frame(), which provides a numeric summary of all variables in the data frame.
# summary() calls summary.data.frame() if given a data.frame input
summary(mydata)
##   Diabetes           Height
##  Mode :logical   Min.   :65.0
##  FALSE:2         1st Qu.:68.0
##  TRUE :2         Median :70.0
##                  Mean   :69.5
##                  3rd Qu.:71.5
##                  Max.   :73.0
In contrast, passing a regression model object (class=lm) to summary() calls summary.lm() and produces a regression table instead.
# run a regression and save the model of class "lm" in an object
model1 <- lm(Height ~ Diabetes, data=mydata)
class(model1)
## [1] "lm"
# summary() calls summary.lm() if given an lm object
summary(model1)
##
## Call:
## lm(formula = Height ~ Diabetes, data = mydata)
##
## Residuals:
##  1  2  3  4
## -3 -2  3  2
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    71.000      2.550  27.848  0.00129 **
## DiabetesTRUE   -3.000      3.606  -0.832  0.49291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.606 on 2 degrees of freedom
## Multiple R-squared: 0.2571, Adjusted R-squared: -0.1143
## F-statistic: 0.6923 on 1 and 2 DF, p-value: 0.4929
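The dispatch mechanism behind summary() can be imitated in a few lines. The sketch below defines a made-up generic called describe() (not part of R or any package) with one method for data frames and a default method for everything else; R chooses between them based on the class of the input.

```r
# describe() is a hypothetical generic; UseMethod() does the dispatching
describe <- function(x, ...) UseMethod("describe")

# method used when the object has class "data.frame"
describe.data.frame <- function(x, ...) {
  paste("a data frame with", nrow(x), "rows")
}

# fallback method used for any other class
describe.default <- function(x, ...) {
  "some other kind of object"
}

describe(data.frame(v = 1:3))
## [1] "a data frame with 3 rows"
describe("hello")
## [1] "some other kind of object"
```

The naming pattern generic.class is what links a method to its generic, which is why the data frame method for summary() is called summary.data.frame().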
methods() function
Methods are class-specific functions.
The methods() function lists what methods exist in the current R session.
Supply a generic function name to methods() to list all of the specific functions (methods) that the generic function searches when looking for a class match.
Supply a class to methods(class=) to list all specific functions that accept that class.
# what classes of objects does generic function summary() accept?
methods(summary)
##  [1] summary.aov summary.aovlist*
##  [3] summary.aspell* summary.check_packages_in_dir*
##  [5] summary.connection summary.corAR1*
##  [7] summary.corARMA* summary.corCAR1*
##  [9] summary.corCompSymm* summary.corExp*
## [11] summary.corGaus* summary.corIdent*
## [13] summary.corLin* summary.corNatural*
## [15] summary.corRatio* summary.corSpher*
## [17] summary.corStruct* summary.corSymm*
## [19] summary.data.frame summary.Date
## [21] summary.default summary.Duration*
## [23] summary.ecdf* summary.factor
## [25] summary.ggplot* summary.glm
## [27] summary.gls* summary.infl*
## [29] summary.Interval* summary.lm
## [31] summary.lme* summary.lmList*
## [33] summary.loess* summary.manova
## [35] summary.matrix summary.mlm*
## [37] summary.modelStruct* summary.nls*
## [39] summary.nlsList* summary.packageStatus*
## [41] summary.pdBlocked* summary.pdCompSymm*
## [43] summary.pdDiag* summary.PDF_Dictionary*
## [45] summary.PDF_Stream* summary.pdIdent*
## [47] summary.pdLogChol* summary.pdMat*
## [49] summary.pdNatural* summary.pdSymm*
## [51] summary.Period* summary.POSIXct
## [53] summary.POSIXlt summary.ppr*
## [55] summary.prcomp* summary.princomp*
## [57] summary.proc_time summary.psych*
## [59] summary.reStruct* summary.shingle*
## [61] summary.srcfile summary.srcref
## [63] summary.stepfun summary.stl*
## [65] summary.table summary.trellis*
## [67] summary.tukeysmooth* summary.varComb*
## [69] summary.varConstPower* summary.varExp*
## [71] summary.varFixed* summary.varFunc*
## [73] summary.varIdent* summary.varPower*
## see '?methods' for accessing help and source code
# what functions accept data frames as arguments?
methods(class="data.frame")
##   [1] $ $<- [ [[
##   [5] [[<- [<- aggregate anti_join
##   [9] anyDuplicated arrange arrange_ as.data.frame
##  [13] as.list as.matrix as.tbl as.tbl_cube
##  [17] as_data_frame as_tibble by cbind
##  [21] coerce collapse collect complete
##  [25] complete_ compute dim dimnames
##  [29] dimnames<- distinct distinct_ do
##  [33] do_ drop_na drop_na_ droplevels
##  [37] duplicated edit expand expand_
##  [41] extract extract_ fill fill_
##  [45] filter filter_ format formula
##  [49] fortify full_join gather gather_
##  [53] ggplot glimpse group_by group_by_
##  [57] group_indices group_indices_ group_size groups
##  [61] head initialize inner_join intersect
##  [65] is.na is_vector_s3 knit_print left_join
##  [69] Math merge mutate mutate_
##  [73] n_groups na.exclude na.omit nest
##  [77] nest_ Ops plot print
##  [81] prompt pull rbind rename
##  [85] rename_ replace_na right_join row.names
##  [89] row.names<- rowsum same_src sample_frac
##  [93] sample_n select select_ semi_join
##  [97] separate separate_ separate_rows separate_rows_
## [101] setdiff setequal show slice
## [105] slice_ slotsFromS3 split split<-
## [109] spread spread_ stack str
## [113] subset summarise summarise_ summary
## [117] Summary t tail tbl_vars
## [121] transform type_sum ungroup union
## [125] union_all unique unite unite_
## [129] unnest unnest_ unstack within
## see '?methods' for accessing help and source code
Now we test what you've learned.
2.1 What will the object a contain here?
a <- c(0, 1)
a

a <- c(0, 1)
a
## [1] 0 1
2.2 Now what will the object a contain?
a <- c(10, seq(5, 1, -1))
a

a <- c(10, seq(5, 1, -1))
a
## [1] 10 5 4 3 2 1
2.3 What about now? What will a contain?
a <- c(rep(0,2), seq(1,5,by=2))
a

a <- c(rep(0,2), seq(1,5,by=2))
a
## [1] 0 0 1 3 5
2.4 What will be the result of dim(b)?
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
dim(b)
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
dim(b)
## [1] 3 2
b
## letters numbers
## 1 a 1
## 2 b 2
## 3 c 3
2.5 What is the result of b[2,]?
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[2,]
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[2,]
## letters numbers
## 2 b 2
2.6 What about the result of b[b$numbers<2,]?
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[b$numbers<2,]
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
b[b$numbers<2,]
## letters numbers
## 1 a 1
2.7 What are three different ways to access "c" in data frame b?
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
Also, why does it keep saying "levels" in the output?
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))
# letters column, element 3 (recommended method)
b$letters[3]
## [1] c
## Levels: a b c
# row 3 column 1
b[3,1]
## [1] c
## Levels: a b c
# element 1 (column 1) of data frame, then element 3 of that
b[[1]][3]
## [1] c
## Levels: a b c
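Because by default, character vectors are converted to factors in data.frame() in the version of R used here (3.4.2). Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same code returns a character vector with no levels.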
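The conversion is controlled by the stringsAsFactors argument of data.frame(). Setting it explicitly, as in the sketch below, gives the same result regardless of the default in your version of R; the object names as_factor_df and as_char_df are arbitrary.

```r
# request factors explicitly
as_factor_df <- data.frame(letters = c("a", "b", "c"),
                           stringsAsFactors = TRUE)
class(as_factor_df$letters)
## [1] "factor"
levels(as_factor_df$letters)
## [1] "a" "b" "c"

# keep the column as plain character strings
as_char_df <- data.frame(letters = c("a", "b", "c"),
                         stringsAsFactors = FALSE)
class(as_char_df$letters)
## [1] "character"
```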
2.8 We know that b is a data.frame. Why do you think mean(b) does not work? What is R trying to tell us with the warning?
# confirm that it is a data frame
class(b)
## [1] "data.frame"
# NA is not what we want, what is the warning trying to tell us?
mean(b)
## Warning in mean.default(b): argument is not numeric or logical: returning
## NA
## [1] NA
mean() expects a numeric or logical vector as input, not a data frame. It doesn't know how to calculate the mean for several columns.
Subsetting a column of b results in a vector, which works as an input to mean().
methods(mean) shows us that there is no mean function defined specifically for data frames.
# columns of data frames are vectors
class(b$numbers)
## [1] "numeric"
mean(b$numbers)
## [1] 2
# no mean.data.frame
methods(mean)
## [1] mean.Date mean.default mean.difftime mean.POSIXct mean.POSIXlt
## see '?methods' for accessing help and source code
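If the goal is the mean of every numeric column at once, one possible approach is to pick out the numeric columns first and then pass them to colMeans(). This is a sketch of one way to do it, not the only one; is_num and col_means are arbitrary names.

```r
b <- data.frame(letters=c("a", "b", "c"), numbers=c(1,2,3))

# TRUE for each column that holds numbers
is_num <- sapply(b, is.numeric)
is_num
## letters numbers
##   FALSE    TRUE

# drop = FALSE keeps the result a data frame even with one column
col_means <- colMeans(b[, is_num, drop = FALSE])
col_means
## numbers
##       2
```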
R works most easily with datasets stored as text files. Typically, values in text files are separated, or delimited, by tabs or spaces:
gender id race ses schtyp prgtype read write math science socst
0 70 4 1 1 general 57 52 41 47 57
1 121 4 2 1 vocati 68 59 53 63 31
0 86 4 3 1 general 44 33 54 58 31
0 141 4 3 1 vocati 63 44 47 53 56
or by commas (CSV file):
gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56
We recommend the readr package (part of the tidyverse), whose function read_csv() reads data stored as CSV and whose function read_delim() reads text data delimited by other characters.
For read_delim(), specify the delimiter in the delim= argument.
(Base R functions read.csv() and read.delim() have very similar functionality, but have less useful default settings)
Although we are retrieving files over the internet for this class, these functions are typically used for files saved to disk.
Note how we are assigning the loaded data to objects.
# run this if you missed it earlier
library(tidyverse)
In the output for read_csv() and read_delim(), you'll see the data type of each column.
# comma separated values
dat_csv <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
## Parsed with column specification:
## cols(
## id = col_integer(),
## female = col_character(),
## ses = col_character(),
## schtyp = col_character(),
## prog = col_character(),
## read = col_integer(),
## write = col_integer(),
## math = col_integer(),
## science = col_integer(),
## socst = col_integer(),
## honors = col_character(),
## awards = col_integer(),
## cid = col_integer()
## )
# tab separated values
dat_tab <- read_delim("https://stats.idre.ucla.edu/stat/data/hsb2.txt",
delim="\t")
## Parsed with column specification:
## cols(
## id = col_integer(),
## female = col_integer(),
## race = col_integer(),
## ses = col_integer(),
## schtyp = col_integer(),
## prog = col_integer(),
## read = col_integer(),
## write = col_integer(),
## math = col_integer(),
## science = col_integer(),
## socst = col_integer()
## )
If you attempt to print a data frame read in by read_csv() or read_delim(), it prints in a special way:
dat_csv
## # A tibble: 200 x 13
##       id female    ses schtyp     prog  read write  math science socst
##    <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
##  1    45 female    low public vocation    34    35    41      29    26
##  2   108   male middle public  general    34    33    41      36    36
##  3    15   male   high public vocation    39    39    44      26    42
##  4    67   male    low public vocation    37    37    42      33    32
##  5   153   male middle public vocation    39    31    40      39    51
##  6    51 female   high public  general    42    36    42      31    39
##  7   164   male middle public vocation    31    36    46      39    46
##  8   133   male middle public vocation    50    31    40      34    31
##  9     2 female middle public vocation    39    41    33      42    41
## 10    53   male middle public vocation    34    37    46      39    31
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## #   awards <int>, cid <int>
This is because read_csv() assigns the data frame to a class called tibble, a tidyverse structure that slightly alters how data.frames behave, such as when they are being created or printed. Tibbles are still data frames, and will work in most functions that require data frame inputs.
We can create tibbles manually in a nearly identical manner to data frames with tibble().
To convert a tibble to a regular data frame, use as.data.frame().
# dat_csv is of class tibble (tbl_df), class table (tbl) and class data.frame
class(dat_csv)
## [1] "tbl_df"     "tbl"        "data.frame"
# now just a data.frame
class(as.data.frame(dat_csv))
## [1] "data.frame"
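The base R readers mentioned earlier, read.csv() and read.delim(), return ordinary data frames rather than tibbles. The sketch below reads from short text strings through the text argument so that it runs without any download; the values are abbreviated from the sample rows shown earlier, and the "|" delimiter is chosen only for illustration.

```r
# comma separated values, supplied as text instead of a file
csv_text <- "gender,id,race
0,70,4
1,121,4"
dat_small <- read.csv(text = csv_text)
class(dat_small)
## [1] "data.frame"
dim(dat_small)
## [1] 2 3

# another delimiter: set sep to the character that separates the fields
pipe_text <- "id|prog|read
45|vocation|34
108|general|34"
dat_pipe <- read.delim(text = pipe_text, sep = "|",
                       stringsAsFactors = FALSE)
dat_pipe$prog
## [1] "vocation" "general"
```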
We can read in datasets from other statistical analysis software using functions found in the haven package, part of the tidyverse. Note, the haven package is not loaded with library(tidyverse) so must be loaded separately.
require(haven)
## Loading required package: haven
# SPSS files
dat_spss <- read_spss("https://stats.idre.ucla.edu/stat/data/hsb2.sav")
# Stata files
dat_dta <- read_stata("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
Datasets are often saved as Excel spreadsheets. Here we use the readxl package to read in the Excel file. We need to download the file first.
library(readxl)
# this step only needed to read excel files from the internet
download.file("https://stats.idre.ucla.edu/stat/data/hsb2.xls", "myfile.xls", mode="wb")
dat_xls <- read_excel("myfile.xls")
head() and tail()
Use head() and tail() to look at a specified number of rows at the beginning or end of a dataset, respectively.
# first 2 rows
head(dat_csv, 2)
## # A tibble: 2 x 13
##      id female    ses schtyp     prog  read write  math science socst
##   <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
## 1    45 female    low public vocation    34    35    41      29    26
## 2   108   male middle public  general    34    33    41      36    36
## # ... with 3 more variables: honors <chr>, awards <int>, cid <int>
# last 8 rows
tail(dat_csv, 8)
## # A tibble: 8 x 13
##      id female    ses  schtyp     prog  read write  math science socst
##   <int>  <chr>  <chr>   <chr>    <chr> <int> <int> <int>   <int> <int>
## 1   174   male middle private academic    68    59    71      66    56
## 2    95   male   high  public academic    73    60    71      61    71
## 3    61 female   high  public academic    76    63    60      67    66
## 4   100 female   high  public academic    63    65    71      69    71
## 5   143   male middle  public vocation    63    63    75      72    66
## 6    68   male middle  public academic    73    67    71      63    66
## 7    57 female middle  public academic    71    65    72      66    56
## 8   132   male middle  public academic    73    62    73      69    66
## # ... with 3 more variables: honors <chr>, awards <int>, cid <int>
View()
Use View() on a dataset to open a spreadsheet-style view of it. In RStudio, clicking on a dataset in the Environment pane will View() it.
View(dat_csv)
We can export our data in a number of formats, including text, Excel .xlsx, and in other statistical software formats like Stata .dta, using write_ functions that reverse the operations of the read_ functions.
Multiple objects can be stored in an R binary file (usually with the extension ".Rdata") with save() and then later loaded with load().
We did not specify realistic pathnames below.
# write a csv file
write_csv(dat_csv, "path/to/save/filename.csv")
# Stata .dta file (write_dta() is in the haven package)
write_dta(dat_csv, "path/to/save/filename.dta")
# save these objects to an .Rdata file
save(dat_csv, mydata, file="path/to/save/filename.Rdata")
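To try exporting without choosing a real path, the sketch below writes to temporary files created by tempfile() and reads the results back. It uses the base R functions write.csv() and read.csv() so that it runs without extra packages; the data frame small and its values are made up for the example.

```r
# a small made-up data frame to export
small <- data.frame(id = c(45, 108), read = c(34, 34))

# write to a temporary csv file, then read it back in
out_csv <- tempfile(fileext = ".csv")
write.csv(small, out_csv, row.names = FALSE)
reread <- read.csv(out_csv)
reread
##    id read
## 1  45   34
## 2 108   34

# save two objects to one .Rdata file, remove them, then restore them
out_rdata <- tempfile(fileext = ".Rdata")
save(small, reread, file = out_rdata)
rm(small, reread)
load(out_rdata)
exists("small")
## [1] TRUE
```

Note that load() restores objects under the names they had when saved, so it can silently overwrite objects of the same name in the current session.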
3.1 In the file "http://stats.idre.ucla.edu/stat/data/hsbsemi.txt", the fields are separated by ";". What function could I use to read it in?
Here is how the first few rows appear:
id;female;ses;schtyp;prog;read;write;math;science;socst;honors;awards;cid
45;female;low;public;vocation;34;35;41;29;26;not enrolled;0;1
108;male;middle;public;general;34;33;41;36;36;not enrolled;0;1
Use read_delim() with delim=";"
d_semi <- read_delim("http://stats.idre.ucla.edu/stat/data/hsbsemi.txt",
delim=";")
## Parsed with column specification:
## cols(
## id = col_integer(),
## female = col_character(),
## ses = col_character(),
## schtyp = col_character(),
## prog = col_character(),
## read = col_integer(),
## write = col_integer(),
## math = col_integer(),
## science = col_integer(),
## socst = col_integer(),
## honors = col_character(),
## awards = col_integer(),
## cid = col_integer()
## )
# more on next page
d_semi
## # A tibble: 200 x 13
##       id female    ses schtyp     prog  read write  math science socst
##    <int>  <chr>  <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
##  1    45 female    low public vocation    34    35    41      29    26
##  2   108   male middle public  general    34    33    41      36    36
##  3    15   male   high public vocation    39    39    44      26    42
##  4    67   male    low public vocation    37    37    42      33    32
##  5   153   male middle public vocation    39    31    40      39    51
##  6    51 female   high public  general    42    36    42      31    39
##  7   164   male middle public vocation    31    36    46      39    46
##  8   133   male middle public vocation    50    31    40      34    31
##  9     2 female middle public vocation    39    41    33      42    41
## 10    53   male middle public vocation    34    37    46     -99   -99
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## #   awards <int>, cid <int>
Once we have data loaded, it is always wise to familiarize ourselves with variables in the dataset, both individually and their relationships.
First we will read in some data and store it in the object we name d. We prefer short names for objects that we will use frequently.
The dataset contains several school, test, and demographic variables for 200 students.
In this section, we will explore data with both numeric summaries and graphical depictions.
d <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbraw.csv")
## Parsed with column specification:
## cols(
## id = col_integer(),
## female = col_character(),
## ses = col_character(),
## schtyp = col_character(),
## prog = col_character(),
## read = col_integer(),
## write = col_integer(),
## math = col_integer(),
## science = col_integer(),
## socst = col_integer(),
## honors = col_character(),
## awards = col_integer(),
## cid = col_integer()
## )
d
## # A tibble: 200 x 13
## id female ses schtyp prog read write math science socst
## <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 45 female low public vocation 34 35 41 29 26
## 2 108 male middle public general 34 33 41 36 36
## 3 15 male high public vocation 39 39 44 26 42
## 4 67 male low public vocation 37 37 42 33 32
## 5 153 male middle public vocation 39 31 40 39 51
## 6 51 female high public general 42 36 42 31 39
## 7 164 male middle public vocation 31 36 46 39 46
## 8 133 male middle public vocation 50 31 40 34 31
## 9 2 female middle public vocation 39 41 33 42 41
## 10 53 male middle public vocation 34 37 46 -99 -99
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## # awards <int>, cid <int>
We can distinguish generally between variables measured continuously (quantitative) and those measured categorically (membership in a class).
Methods to explore the two types of variables differ somewhat, so we will visit each separately initially.
We first explore the continuous variables in the dataset, which are the academic test score variables, "read", "write", "math", "science", and "socst".
Common numeric summaries for continuous variables are the mean, median, and variance, obtained with mean(), median(), and var() (sd() for standard deviation), respectively.
summary() on a numeric vector provides the min, max, mean, median, and first and third quartiles (the bounds of the interquartile range).
mean(d$read)
## [1] 52.23
median(d$read)
## [1] 50
var(d$read)
## [1] 105.1227
summary(d$read)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.00 44.00 50.00 52.23 60.00 76.00
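To see how these summaries relate to one another, here is a minimal sketch on a made-up vector of five scores (the values are invented for illustration and are not from the seminar dataset). It shows that sd() is the square root of var(), and that quantile() and IQR() give the quartiles and the distance between them.

```r
# hypothetical scores, invented for illustration
scores <- c(34, 39, 42, 50, 60)

mean(scores)    # arithmetic mean
median(scores)  # middle value
var(scores)     # sample variance (divides by n - 1)
sd(scores)      # standard deviation

# sd() is the square root of var()
all.equal(sd(scores), sqrt(var(scores)))

# first and third quartiles, the same ones summary() reports
quantile(scores, probs = c(0.25, 0.75))

# IQR() is the distance between those two quartiles
IQR(scores)
```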
ggplot2 for graphics
We will be using the package ggplot2, part of the tidyverse, to create plots for exploring our data. Although base R has powerful and flexible graphic capabilities on its own, we prefer the approach that ggplot2 takes.
ggplot2 uses a structured grammar of graphics that provides an intuitive framework for building graphics layer-by-layer, rather than memorizing lots of plotting commands and options
ggplot2 graphics take less work to make beautiful and eye-catching
ggplot2 plot
The basic specification for a ggplot2 plot is to specify which variables are mapped to which aspects of the graph (called aesthetics) and then to choose a shape (called a geom) to display on the graph.
For example, we can choose to map one variable to the x-axis, another variable to the y-axis, and to use geom_point() as the shape to plot, which produces a scatter plot.
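Within the ggplot() function (note that the package is named ggplot2 while this function is called ggplot()), we specify the dataset and, inside the aes() function, which variables are mapped to which aesthetics. Aesthetics can include the x-axis and y-axis positions (x, y), the outline color (color), the fill color (fill), and the shape of points (shape).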
For a much more detailed explanation of the grammar of graphics underlying ggplot2, see our Introduction to ggplot2 seminar.
# a scatterplot of read vs write
ggplot(data=d, aes(x=write, y=read)) + geom_point()
We can inspect the distributions of continuous variables with histograms, density plots, and boxplots. Each of these plots has a corresponding ggplot2 geom.
Histograms bin continuous variables into intervals and count the frequency of observations in each interval.
For histograms and density plots, we will map the variable of interest to x.
# use the bins= argument to control the number of intervals
ggplot(d, aes(x=write)) + geom_histogram(bins=10)
We can also look at distributions for a subset of our data. Here we examine the distribution for write for students with math score below the mean math score:
# Requesting the rows where math is less than its mean
ggplot(d[d$math < mean(d$math),],
aes(x=write)) + geom_histogram(bins=10)
Density plots smooth out the shape of histograms:
ggplot(d, aes(x = write)) + geom_density()
Boxplots show the median, lower and upper quartiles (the hinges), and outliers.
Unlike with histograms and density plots, we map the variable whose distribution we want to plot to y instead of x. If we are making a single boxplot, we need an arbitrary value for x, just as a placeholder.
# for the overall distribution of one variable, specify x=1 (or any other value)
ggplot(d, aes(x = 1, y = math)) + geom_boxplot()
Data exploration can help us identify suspicious-looking values. For example, the value of -99 on science is probably a code for a missing value.
# for the overall distribution of one variable, specify x=1 (or any other value)
ggplot(d, aes(x = 1, y = science)) + geom_boxplot()
The mean, median, and variance cannot be calculated meaningfully for categorical variables (unless there are just 2 categories).
Instead, we often present frequency tables showing how many observations fall into each category.
Use table() to produce frequency tables.
Use prop.table() on the tables produced by table() (i.e. the output) to see the frequencies expressed as proportions.
Some of the categorical variables in this dataset are female, ses, schtyp, prog, and honors.
# table() produces counts
table(d$female)
##
## female male
## 109 91
table(d$ses)
##
## high low middle
## 58 47 95
# for proportions, use output of table()
# as input to prop.table()
prop.table(table(d$female))
##
## female male
## 0.545 0.455
prop.table(table(d$ses))
##
## high low middle
## 0.290 0.235 0.475
As you may have noticed in the previous section, table() orders the categories of prog and ses alphabetically. Unfortunately, the ordering high-low-middle is not ideal for ses.
Factors in R provide a way to represent categorical variables both numerically and categorically. Basically, factors assign an integer number (beginning with 1) to each distinct category, and then a character label to each category.
We convert character variables to factors with factor(). Specify the names of the categories in the levels= argument, in an order that makes sense to you. If you omit levels=, R will alphabetically sort the categories.
Use levels() on a factor to check the ordering of levels.
Note: In versions of R before 4.0.0, the base R function read.csv() by default reads in character variables as factors using alphabetical ordering, which is not always desirable. Thus we recommend the readr (part of tidyverse) function read_csv(), which leaves them as character.
# before, ses is a character variable
str(d$ses)
## chr [1:200] "low" "middle" "high" "low" "middle" "high" "middle" ...
# converting ses to factor
# we need to specify levels explicitly, otherwise R will
# sort alphabetically
d$ses <- factor(d$ses, levels=c("low", "middle", "high"))
# Now a factor, notice the integer representation
str(d$ses)
## Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
# levels() reveals all levels in order
levels(d$ses)
## [1] "low" "middle" "high"
Factors are represented both by their integers and their character labels.
Factors are converted to 0/1 variables in regression models.
head(d$ses)
## [1] low middle high low middle high
## Levels: low middle high
head(as.numeric(d$ses))
## [1] 1 2 3 1 2 3
# the first observation of ses is equal to "low"...
d$ses[1] == "low"
## [1] TRUE
# ...and its underlying integer is equal to 1
as.numeric(d$ses[1]) == 1
## [1] TRUE
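Returning to the ordering problem that motivated factors, the following small sketch (using a made-up vector of ses values, not the seminar data) illustrates that table() follows the level order of a factor, whereas a character vector is tabulated alphabetically.

```r
# hypothetical ses values, invented for illustration
ses_chr <- c("low", "middle", "high", "low", "middle", "middle")

# character vector: categories are tabulated alphabetically
table(ses_chr)

# factor with explicit levels: categories follow the level order
ses_fct <- factor(ses_chr, levels = c("low", "middle", "high"))
table(ses_fct)
```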
Let's go ahead and convert some of the other character variables into factors:
# alphabetic ordering fine here, so no need to specify levels
d$female <- factor(d$female)
levels(d$female)
## [1] "female" "male"
d$prog <- factor(d$prog)
levels(d$prog)
## [1] "academic" "general" "vocation"
Distributions of categorical variables are often depicted by bar graphs, which are easily made in ggplot2. By default, geom_bar() counts the number of observations for each value of the variable mapped to x.
ggplot(d, aes(x=prog)) + geom_bar()
After inspecting distributions of variables individually, we then proceed to explore relationships between variables. Namely, we are generally interested in whether the values of one variable are independent of the other, or whether they are associated (i.e. correlated or predictive).
We use different numerical and graphical methods for exploration depending on whether the two variables are both continuous, both categorical, or one of each.
Correlations provide quick assessments of whether two continuous variables are linearly related to one another.
The cor() function estimates correlations. If supplied with 2 vectors, cor() will estimate a single correlation. If supplied a data frame with several variables, cor() will estimate a correlation matrix.
# just a single correlation
cor(d$write, d$read)
## [1] 0.5967765
# now isolate all test score variables
scores <- d[, c("read", "write", "math", "science", "socst")]
cor(scores)
## read write math science socst
## read 1.0000000 0.5967765 0.6622801 0.1709428 0.1814928
## write 0.5967765 1.0000000 0.6174493 0.1289845 0.1504587
## math 0.6622801 0.6174493 1.0000000 0.2051668 0.1898648
## science 0.1709428 0.1289845 0.2051668 1.0000000 0.9361672
## socst 0.1814928 0.1504587 0.1898648 0.9361672 1.0000000
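As a rough illustration of what cor() estimates by default (the Pearson correlation), the sketch below recomputes it from its definition, the covariance divided by the product of the two standard deviations. The two vectors are made up for the example.

```r
# hypothetical paired scores, invented for illustration
x <- c(34, 39, 42, 50, 60)
y <- c(35, 41, 36, 52, 59)

# Pearson correlation from cor() (its default method)
r_builtin <- cor(x, y)

# the same quantity from its definition:
# covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# the two agree up to floating-point error
all.equal(r_builtin, r_manual)
```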
Scatter plots are an obvious choice to depict the relationship between 2 variables. We can also add a loess smooth layer (geom_smooth()) that provides a "best-fit" curve to the data.
Note that further layers are added with +.
Here we examine the relationship between reading test score and writing test score.
# both scatter plot and loess smooth layers
ggplot(d, aes(x=read, y=write)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess'
When exploring the relationship between a continuous variable and categorical variable, we are often interested in whether the distribution (i.e. mean, variance, etc.) of the continuous variable is the same between the classes of the categorical variable.
For example, we might want to know whether the means and variances of math test scores are the same between males and females.
The dplyr package (part of tidyverse) provides a useful function, group_by(), which converts a data frame into a grouped data frame, grouped by one or more variables. After grouping the data frame, we then use the dplyr function summarize() to calculate statistics by group.
# first we group our data frame, d, by female
by_female <- group_by(d, female)
# notice that it is a grouped_df (data frame) now
class(by_female)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Specify a function to evaluate a variable by groups in summarize(). First specify the (grouped) dataset, then the functions to run on variables in the dataset.
Here we get the means and variances of math by gender. We see that the means are nearly the same, but the variance seems higher in males.
summarize(by_female, mean(math), var(math))
## # A tibble: 2 x 3
## female `mean(math)` `var(math)`
## <fctr> <dbl> <dbl>
## 1 female 52.39450 83.74108
## 2 male 52.94505 93.40806
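The default column names above, such as `mean(math)`, are awkward to type later. A minimal sketch of one way around this, on made-up data, is to name each summary inside summarize(); the dplyr helper n() is also used here to count the rows in each group. The data frame toy and the output column names are assumptions for illustration.

```r
library(dplyr)

# hypothetical data, invented for illustration
toy <- data.frame(
  female = c("female", "female", "male", "male", "male"),
  math   = c(40, 50, 45, 55, 65)
)

# name = expression gives the output column a readable name;
# n() counts the rows in each group
stats <- summarize(group_by(toy, female),
                   mean_math = mean(math),
                   var_math  = var(math),
                   n         = n())
stats
```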
To plot the distributions of the continuous variables by groups defined by the categorical variables, we will plot separate density plots and boxplots of the continuous variables for each group of the categorical variable.
The grouping variable is commonly mapped to aesthetics that take on categories themselves, such as color or shape, but it can be mapped to x as well, as we do for the boxplots below.
The distributions look very similar, with similar means, and a slightly more spread out shape for males.
ggplot(d, aes(x=math, color=female)) + geom_density()
Boxplots of math by female show the same similar-looking distributions.
ggplot(d, aes(x=female, y=math)) + geom_boxplot()
Two-way and multi-way frequency tables are used to explore the relationships between categorical variables.
We can use table() and prop.table() again. Within prop.table(), use margin=1 for row proportions and margin=2 for column proportions. Omitting margin= will give proportions of the total.
Here, we check whether the proportions of observations that fall into each educational program (prog) are about the same across socioeconomic statuses.
# this time saving the freq table to an object
my2way <- table(d$prog, d$ses)
# counts in each crossing of prog and ses
my2way
##
## low middle high
## academic 19 44 42
## general 16 20 9
## vocation 12 31 7
There seems to be an association between being in the academic program and being in high ses.
# row proportions,
# proportion of prog that falls into ses
prop.table(my2way, margin=1)
##
## low middle high
## academic 0.1809524 0.4190476 0.4000000
## general 0.3555556 0.4444444 0.2000000
## vocation 0.2400000 0.6200000 0.1400000
# columns proportions,
# proportion of ses that falls into prog
prop.table(my2way, margin=2)
##
## low middle high
## academic 0.4042553 0.4631579 0.7241379
## general 0.3404255 0.2105263 0.1551724
## vocation 0.2553191 0.3263158 0.1206897
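One quick way to confirm which margin we asked for is to check what the proportions sum to. The sketch below rebuilds the prog-by-ses counts shown above as a standalone table (so that it runs without the dataset; the object names counts and tab are ours) and checks the sums for each choice of margin.

```r
# counts copied from the prog-by-ses table shown above
# (matrix() fills column by column: low, then middle, then high)
counts <- matrix(c(19, 16, 12,
                   44, 20, 31,
                   42,  9,  7),
                 nrow = 3,
                 dimnames = list(prog = c("academic", "general", "vocation"),
                                 ses  = c("low", "middle", "high")))
tab <- as.table(counts)

# margin=1: the proportions in each row sum to 1
rowSums(prop.table(tab, margin = 1))

# margin=2: the proportions in each column sum to 1
colSums(prop.table(tab, margin = 2))

# no margin: the proportions in the whole table sum to 1
sum(prop.table(tab))
```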
We can add a categorical variable to the bar graph of the other categorical variable to depict their relationship.
Here we map prog to fill, the color used to fill the bars of the bar graph. (The color aesthetic specifies the color of the outline of the bars.)
This produces a stacked bar chart. We again see that "high" ses has a higher proportion of "academic".
ggplot(d, aes(x=ses, fill=prog)) + geom_bar()
The position argument in geom_bar() changes how the colors are sorted on the graph. We can specify that the color positions should stack (the default), dodge (side-by-side), or fill (uniform height to examine proportions).
ggplot(d, aes(x=ses, fill=prog)) + geom_bar(position="dodge")
One of the great strengths of ggplot2 is how easy it is to map more variables to graphical aspects of the graph.
Graphs of 3 or more variables allow us to assess interactions of variables.
Here we add color by prog to our scatter plot of read vs write. Now we can assess whether the read-write relationship appears the same between programs.
# both scatter plot and loess smooth layers
ggplot(d, aes(x=read, y=write, color=prog)) + geom_point() + geom_smooth()
Faceting in ggplot2 is creating multiples (panels) of a plot by a grouping variable.
In the function facet_wrap(), specify a grouping variable by which to split the plots after the ~.
Below we split our plots of prog-by-ses by female.
# all functions after ggplot know
# to look for variables in dataset "d"
ggplot(d, aes(x=ses, fill=prog)) + geom_bar(position="dodge") + facet_wrap(~female)
R graphics
Base R graphics are easy to create, but not nearly as easy to customize and modernize as ggplot2 graphs. We present some of the graphs created earlier, but now in base R graphics.
Base R histograms
hist(d$write)
Base R scatterplot
plot(d$write, d$read)
Base R bar graph
# barplot wants a table input, not a data frame
# (ggplot always wants a data.frame)
barplot(table(d$prog))
Coloring a scatter plot by groups
plot(d$write, d$read, col=d$prog)
4.1 What are a couple of things we can learn from this density plot of awards? Why does it look wrong?
ggplot(d, aes(x=awards)) + geom_density()
Awards has a pretty small range, with most values near 0, and the very negative value is probably a missing data code.
ggplot(d, aes(x=awards)) + geom_density()
4.2 How would I obtain the maximum math score for each prog group? (hint: use group_by() and summarize())
# these are the progs again
table(d_semi$prog)
##
## academic general vocation
## 105 45 50
First create a grouped data frame with group_by(), then use summarize() and function max() on math.
by_prog <- group_by(d_semi, prog)
summarize(by_prog, max(math))
## # A tibble: 3 x 2
## prog `max(math)`
## <chr> <dbl>
## 1 academic 75
## 2 general 63
## 3 vocation 75
4.3 Here are the median and inter-quartile ranges (distance between first and third quartiles) of math scores by prog. Describe how the subsequent graph will appear.
by_prog <- group_by(d_semi, prog)
summarize(by_prog, median(math), IQR(math))
## # A tibble: 3 x 3
## prog `median(math)` `IQR(math)`
## <chr> <dbl> <dbl>
## 1 academic 57 13.00
## 2 general 49 14.00
## 3 vocation 45 11.75
ggplot(d, aes(x=prog, y=math)) + geom_boxplot()
There will be 3 boxplots: academic will sit highest, general will be in the middle with the largest spread, and vocation will be at the bottom with the narrowest spread:
ggplot(d, aes(x=prog, y=math)) + geom_boxplot()
Now that we have familiarized ourselves with our dataset variables and their relationships, we should clean up any data entry errors and create additional variables or datasets that we might need for our planned statistical analysis.
Let's begin by reading in our dataset again and storing it in object d.
# load packages for this section (if needed)
library(tidyverse)
# read data in
d <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbraw.csv")
## Parsed with column specification:
## cols(
## id = col_integer(),
## female = col_character(),
## ses = col_character(),
## schtyp = col_character(),
## prog = col_character(),
## read = col_integer(),
## write = col_integer(),
## math = col_integer(),
## science = col_integer(),
## socst = col_integer(),
## honors = col_character(),
## awards = col_integer(),
## cid = col_integer()
## )
We can sort the order of rows in our data frame by variable values using the arrange function from the dplyr package (part of tidyverse).
Here we are requesting that arrange sort by science, and then by socst. The function arrange returns the sorted dataset.
Hmmm, we see those -99 values again…
d <- arrange(d, science, socst)
d
## # A tibble: 200 x 13
## id female ses schtyp prog read write math science socst
## <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 53 male middle public vocation 34 37 46 -99 -99
## 2 191 female high private academic 47 52 43 -99 -99
## 3 9 male middle public vocation 48 49 52 -99 -99
## 4 159 male high public academic 55 61 54 -99 -99
## 5 71 female middle public general 57 62 56 -99 -99
## 6 144 male high public general 60 65 58 -99 -99
## 7 61 female high public academic 76 63 60 -99 -99
## 8 15 male high public vocation 39 39 44 26 42
## 9 45 female low public vocation 34 35 41 29 26
## 10 51 female high public general 42 36 42 31 39
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## # awards <int>, cid <int>
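arrange() sorts in ascending order by default. As a small sketch on made-up data (the data frame toy is invented for illustration), wrapping a variable in the dplyr helper desc() sorts that variable in descending order instead, while ties are still broken by the next variable listed.

```r
library(dplyr)

# hypothetical data, invented for illustration
toy <- data.frame(id = 1:4,
                  science = c(50, 42, 50, 61),
                  socst   = c(40, 55, 35, 48))

# ascending by science, ties broken by socst (ascending)
sorted_asc <- arrange(toy, science, socst)
sorted_asc

# descending by science: wrap the variable in desc()
sorted_desc <- arrange(toy, desc(science), socst)
sorted_desc
```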
Missing values in R are represented by the reserved symbol NA (cannot be used for variable names).
Blank fields in a text file will generally be converted to NA.
We can convert the -99 values in science to NA with conditional selection.
# subset to science values equal to -99, and then change
# them all to NA
d$science[d$science == -99] <- NA
head(d$science, 10)
## [1] NA NA NA NA NA NA NA 26 29 31
That works, but what if we suspect there might be -99 values in other variables? If we know beforehand what our missing value codes are, we can specify them in read_csv() with the na= argument and save ourselves the work of conversion.
In the code below, we are specifying that the following should be interpreted as missing for read_csv(): empty fields (""), the value -99, and the string "NA".
We can see some NA values in science and socst.
# read in data, specifying missing data codes
d <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbraw.csv",
na=c("", "-99", "NA"))
d
## # A tibble: 200 x 13
## id female ses schtyp prog read write math science socst
## <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 45 female low public vocation 34 35 41 29 26
## 2 108 male middle public general 34 33 41 36 36
## 3 15 male high public vocation 39 39 44 26 42
## 4 67 male low public vocation 37 37 42 33 32
## 5 153 male middle public vocation 39 31 40 39 51
## 6 51 female high public general 42 36 42 31 39
## 7 164 male middle public vocation 31 36 46 39 46
## 8 133 male middle public vocation 50 31 40 34 31
## 9 2 female middle public vocation 39 41 33 42 41
## 10 53 male middle public vocation 34 37 46 NA NA
## # ... with 190 more rows, and 3 more variables: honors <chr>,
## # awards <int>, cid <int>
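If the data have already been read in, another option is to replace the code in every column at once. The following is a minimal base R sketch on a made-up data frame (toy is invented for illustration); note that it treats every -99 as missing, so it is only appropriate when -99 is never a legitimate value.

```r
# hypothetical data with -99 used as a missing-value code
toy <- data.frame(id = 1:4,
                  science = c(47, -99, 53, 61),
                  socst   = c(-99, -99, 46, 58))

# how many -99 codes does each column contain?
colSums(toy == -99)

# replace every -99 in the data frame with NA
toy[toy == -99] <- NA
toy

# count the NA values per column afterwards
na_counts <- colSums(is.na(toy))
na_counts
```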
Most operations involving an NA value will result in NA:
1 + 2 + NA
## [1] NA
c(1, 2, 3, NA) > 2
## [1] FALSE FALSE TRUE NA
mean(c(1,2,3,4,NA))
## [1] NA
However, many functions allow the argument na.rm (or something similar) to be set to TRUE, which will first remove any NA values from the operation before calculating the result:
# NA values will be removed first
sum(c(1,2,NA), na.rm=TRUE)
## [1] 3
mean(c(1,2,3,4,NA), na.rm=TRUE)
## [1] 2.5
You cannot check for equality to NA, as NA means "undefined". The comparison will always result in NA.
Use is.na() instead.
x <- c(1, 2, NA)
x == NA
## [1] NA NA NA
is.na(x)
## [1] FALSE FALSE TRUE
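Because is.na() returns a logical vector, it combines naturally with sum(), mean(), and logical subsetting. A short sketch on a made-up vector:

```r
# hypothetical vector with two missing values
x <- c(1, 2, NA, 4, NA)

# TRUE counts as 1 and FALSE as 0
sum(is.na(x))    # number of missing values
mean(is.na(x))   # proportion of missing values

# keep only the non-missing values
x_complete <- x[!is.na(x)]
x_complete
```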
Base R comes with several functions useful for manipulating string (character) variables.
String variables are notoriously messy, often with typos and extra spaces. One of the advantages of readr's read_csv() over base R's read.csv() is that read_csv() will remove leading and trailing spaces by default, while read.csv() will not.
Two common tasks with strings are extracting substrings and concatenating strings together.
Use substr() to extract a part of a character variable, specified by the start= position and the stop= position.
Imagine we needed to abbreviate our prog names, so that they fit well in a graph or table. We can create a variable consisting of the first 3 letters of prog like so:
# extract starting at first character, stopping at third
d$prog_short <- substr(d$prog, start=1, stop=3)
head(d[,c("prog", "prog_short")], n=5)
## # A tibble: 5 x 2
## prog prog_short
## <chr> <chr>
## 1 vocation voc
## 2 general gen
## 3 vocation voc
## 4 vocation voc
## 5 vocation voc
For concatenating strings together, use paste(). The sep= argument specifies which character delimits the strings (space by default).
Below we combine the schtyp (school type) and ses variables into a single variable by pasting their contents together.
d$schtyp_ses1 <- paste(d$schtyp, d$ses, sep=" ")
head(d[, c("schtyp", "ses", "schtyp_ses1")], n=5)
## # A tibble: 5 x 3
## schtyp ses schtyp_ses1
## <chr> <chr> <chr>
## 1 public low public low
## 2 public middle public middle
## 3 public high public high
## 4 public low public low
## 5 public middle public middle
# changing the delimiter to comma
d$schtyp_ses2 <- paste(d$schtyp, d$ses, sep=",")
head(d[, c("schtyp", "ses", "schtyp_ses2")], n=5)
## # A tibble: 5 x 3
## schtyp ses schtyp_ses2
## <chr> <chr> <chr>
## 1 public low public,low
## 2 public middle public,middle
## 3 public high public,high
## 4 public low public,low
## 5 public middle public,middle
grep() for partial string matching
If you need to find matches of a given pattern within the strings of a vector (the match does not have to be a whole word), use grep().
By default grep() returns the index number of the matches. Use the argument specification value=TRUE to return the actual strings themselves.
my_char_vec <- c("here", "are", "some", "words", "to", "explore")
# indexes of elements that contain "re"
# NOTICE that the pattern to be matched goes first, and
# the input vector goes second
grep(pattern="re", x=my_char_vec)
## [1] 1 2 6
# value=TRUE returns the strings that are matched
grep("re", my_char_vec, value=TRUE)
## [1] "here" "are" "explore"
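A closely related base R function is grepl(), which returns TRUE or FALSE for each element rather than the indices of the matches. The sketch below reuses the vector from above; the phone strings in the last call are made up for illustration.

```r
my_char_vec <- c("here", "are", "some", "words", "to", "explore")

# grepl() returns TRUE/FALSE for each element instead of indices
has_re <- grepl("re", my_char_vec)
has_re

# the logical vector can be used directly for subsetting
my_char_vec[has_re]

# fixed=TRUE treats the pattern as literal text rather than a
# regular expression, e.g. to match a literal opening parenthesis
has_paren <- grepl("(", c("(323)555-5432", "555-5432"), fixed = TRUE)
has_paren
```

The logical output is convenient when the match is one of several conditions, since it can be combined with other logical vectors using & and |.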
Often we need to create variables from other variables. For example, we may want to sum individual test items to form a total score. Or, we may want to convert a continuous scale into several categories, such as letter grades.
Here are some useful functions to transform variables:
log(): logarithm
min_rank(): rank values
cut(): cut a continuous variable into intervals; the new value signifies into which interval the original value falls
scale(): standardizes a variable (subtracts the mean and divides by the standard deviation)
lag(), lead(): lag and lead a variable
cumsum(): cumulative sum
rowMeans(), rowSums(): means and sums of several columns
You can add variables to data frames by declaring them to be column variables of the data frame as they are created.
Trying to add a column of the wrong length will result in an error.
# this will add a column variable called logwrite to d
d$logwrite <- log(d$write)
# now we see logwrite as a column in d
colnames(d)
## [1] "id" "female" "ses" "schtyp" "prog"
## [6] "read" "write" "math" "science" "socst"
## [11] "honors" "awards" "cid" "prog_short" "schtyp_ses1"
## [16] "schtyp_ses2" "logwrite"
# d has 200 rows, and the rep vector has 300
d$z <- rep(0, 300)
## Error in `$<-.data.frame`(`*tmp*`, z, value = c(0, 0, 0, 0, 0, 0, 0, 0, : replacement has 300 rows, data has 200
mutate()
The dplyr function mutate() allows us to transform many variables in one step without having to respecify the data frame name over and over.
Below we transform math in 4 different ways.
# create 4 transformations of math
d <- mutate(d,
logmath = log(math),
mathrank = min_rank(math),
mathgrade = cut(math,
breaks = c(0, 35, 45, 55, 65, 80),
labels = c("F", "D", "C", "B", "A")),
zmath = scale(math)
)
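It can help to see exactly where cut() places values that sit on a break point. The sketch below applies the same breaks and labels to a few made-up scores. By default the intervals are open on the left and closed on the right, so a score of 35 is graded "F" and a score of exactly 0 falls into no interval at all.

```r
# hypothetical scores chosen to sit on or near the break points
scores <- c(0, 35, 36, 45, 80)

grades <- cut(scores,
              breaks = c(0, 35, 45, 55, 65, 80),
              labels = c("F", "D", "C", "B", "A"))

# default intervals are (0,35], (35,45], (45,55], (55,65], (65,80],
# so 0 is outside every interval and becomes NA
data.frame(scores, grades)

# include.lowest=TRUE closes the first interval on the left: [0,35]
lowest <- cut(0,
              breaks = c(0, 35, 45, 55, 65, 80),
              labels = c("F", "D", "C", "B", "A"),
              include.lowest = TRUE)
lowest
```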
filter()
We have already seen how to subset using logical vectors and logical subsetting:
# subset to observations with max reading score
max_read <- d[d$read==max(d$read),]
max_read
## # A tibble: 2 x 21
## id female ses schtyp prog read write math science socst
## <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 103 male high public academic 76 52 64 64 61
## 2 61 female high public academic 76 63 60 NA NA
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## # prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## # logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## # zmath <dbl>
While that works fine enough, the code can get unwieldy if there are many conditions that need to be evaluated.
The dplyr function filter() provides a cleaner syntax for subsetting datasets.
# subset to females with high math
d_fem_hi_math <- filter(d, female == "female" & math > 50)
head(d_fem_hi_math, n=3)
## # A tibble: 3 x 21
## id female ses schtyp prog read write math science socst
## <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 8 female low public academic 39 44 52 44 48
## 2 142 female middle public vocation 47 42 52 39 51
## 3 151 female middle public vocation 47 46 52 48 46
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## # prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## # logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## # zmath <dbl>
# subset to students with math < 50 in the general or academic programs
d_gen_aca_low_math <- filter(d, (prog == "general" | prog == "academic") & math < 50)
head(d_gen_aca_low_math, n=3)
## # A tibble: 3 x 21
## id female ses schtyp prog read write math science socst
## <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 108 male middle public general 34 33 41 36 36
## 2 51 female high public general 42 36 42 31 39
## 3 128 male high public academic 39 33 38 47 41
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## # prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## # logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## # zmath <dbl>
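When a variable is compared against several values, chains of == joined by | get long quickly. One alternative, sketched here on made-up data (toy is invented for illustration), is the base R operator %in%. Conditions separated by commas inside filter() are combined with "and".

```r
library(dplyr)

# hypothetical data, invented for illustration
toy <- data.frame(
  prog = c("general", "academic", "vocation", "academic", "general"),
  math = c(41, 62, 45, 38, 55)
)

# two equality checks joined with |
a <- filter(toy, (prog == "general" | prog == "academic") & math < 50)

# the same subset using %in%; the comma acts like &
b <- filter(toy, prog %in% c("general", "academic"), math < 50)

a
all.equal(a, b)
```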
Sometimes we are given our dataset in parts, with observations spread over many files (collected by different researchers, for example). To create one dataset, we need to append the datasets together row-wise.
The function rbind() appends data frames together. The variables must be the same between datasets.
Here, we rbind() the two datasets we created with filter() above, and check that it was successful by calculating the number of rows.
# rbind works because they have the same variables
d_append <- rbind(d_fem_hi_math, d_gen_aca_low_math)
# dimensions of component datasets
dim(d_fem_hi_math)
## [1] 62 21
dim(d_gen_aca_low_math)
## [1] 47 21
# appended dataset has rows = sum of rows of components
dim(d_append)
## [1] 109 21
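To illustrate what happens when the variables are not the same, here is a small sketch with two made-up data frames whose second columns have different names (part1, part2, and their columns are invented for illustration). rbind() refuses to combine them, while the dplyr function bind_rows() keeps every column and fills the gaps with NA. Which behavior is preferable depends on whether the mismatch is intentional or a naming error.

```r
library(dplyr)

# hypothetical data frames whose column names differ (score vs scores)
part1 <- data.frame(id = 1:2, score  = c(50, 60))
part2 <- data.frame(id = 3:4, scores = c(55, 65))

# rbind() stops with an error when the names do not match;
# tryCatch() captures the message instead of halting
rbind_msg <- tryCatch(rbind(part1, part2),
                      error = function(e) conditionMessage(e))
rbind_msg

# bind_rows() keeps all columns and fills the gaps with NA
stacked <- bind_rows(part1, part2)
stacked

# if the mismatch is a naming error, rename first and rbind() works
names(part2)[names(part2) == "scores"] <- "score"
fixed <- rbind(part1, part2)
fixed
```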
Often, datasets come with many more variables than we want. We can use the dplyr function select() to keep only the variables we need.
# select 4 variables
d_use <- select(d, id, female, read, write)
head(d_use, n=3)
## # A tibble: 3 x 4
## id female read write
## <int> <chr> <int> <int>
## 1 45 female 34 35
## 2 108 male 34 33
## 3 15 male 39 39
# select everything BUT female, read, write
# note the - preceding c(female...)
d_dropped <- select(d, -c(female, read, write))
head(d_dropped, n=3)
## # A tibble: 3 x 18
## id ses schtyp prog math science socst honors awards
## <int> <chr> <chr> <chr> <int> <int> <int> <chr> <int>
## 1 45 low public vocation 41 29 26 not enrolled 0
## 2 108 middle public general 41 36 36 not enrolled 0
## 3 15 high public vocation 44 26 42 not enrolled 0
## # ... with 9 more variables: cid <int>, prog_short <chr>,
## # schtyp_ses1 <chr>, schtyp_ses2 <chr>, logwrite <dbl>, logmath <dbl>,
## # mathrank <int>, mathgrade <fctr>, zmath <dbl>
If we know that the rows of data of 2 columns (or two data frames) correspond to the same observations, we can use cbind() to combine the columns into a single data frame. Columns combined this way must have the same number of rows.
The rows of the two data frames we just created with select() indeed do correspond to the same observations:
d_all <- cbind(d_use, d_dropped)
head(d_all, n=3)
## id female read write id ses schtyp prog math science socst
## 1 45 female 34 35 45 low public vocation 41 29 26
## 2 108 male 34 33 108 middle public general 41 36 36
## 3 15 male 39 39 15 high public vocation 44 26 42
## honors awards cid prog_short schtyp_ses1 schtyp_ses2 logwrite
## 1 not enrolled 0 1 voc public low public,low 3.555348
## 2 not enrolled 0 1 gen public middle public,middle 3.496508
## 3 not enrolled 0 1 voc public high public,high 3.663562
## logmath mathrank mathgrade zmath
## 1 3.713572 22 D -1.2430021
## 2 3.713572 22 D -1.2430021
## 3 3.784190 43 D -0.9227783
More often, we receive separate datasets with different variables (columns) that must be merged on a key variable.
Merging is an involved topic, with many different kinds of merges possible, depending on whether every observation in one dataset can be matched to an observation in the other dataset. Sometimes, you'll want to keep observations in one dataset even if they are not matched. Other times, you will not.
We will solely demonstrate merges where only matched observations are kept.
Earlier in the seminar, we learned how to use the dplyr functions group_by() and summarize() to get statistics by group.
Let's use those tools again to get statistics by class (dataset variable cid), namely the class means and medians on math. This time, we will store the output dataset in an object.
# first group data by cid (there are 20 classes)
by_class <- group_by(d, cid)
# then get mean/median on math by class
class_stats <- summarize(by_class, meanmath=mean(math), medmath=median(math))
class_stats
## # A tibble: 20 x 3
## cid meanmath medmath
## <int> <dbl> <dbl>
## 1 1 41.36364 41.0
## 2 2 40.50000 39.5
## 3 3 43.77778 44.0
## 4 4 44.72727 43.0
## 5 5 44.45455 44.0
## 6 6 47.11111 46.0
## 7 7 48.11111 49.0
## 8 8 49.81818 50.0
## 9 9 50.77778 51.0
## 10 10 50.80000 51.5
## 11 11 49.91667 50.5
## 12 12 55.40000 54.0
## 13 13 54.44444 53.0
## 14 14 58.00000 57.5
## 15 15 58.70000 59.0
## 16 16 57.18182 57.0
## 17 17 61.14286 62.0
## 18 18 63.30000 63.0
## 19 19 66.00000 66.0
## 20 20 69.80000 71.0
Conveniently, the class_stats dataset includes cid, which we will use as our key variable for merging.
Here, we will use the dplyr function inner_join() to merge the datasets (the base R function merge() is quite similar). inner_join() will search both datasets for any variables with the same name, and will use those as matching variables. If you need to control which variables are used to match, use the by= argument.
In our two datasets, the only variable that appears in both is cid, which we want to use as the key variable, so we do not need by=:
d_merged <- inner_join(d, class_stats)
## Joining, by = "cid"
# showing just a few variables for space
head(select(d_merged, cid, math, meanmath, medmath))
## # A tibble: 6 x 4
## cid math meanmath medmath
## <int> <int> <dbl> <dbl>
## 1 1 41 41.36364 41
## 2 1 41 41.36364 41
## 3 1 44 41.36364 41
## 4 1 42 41.36364 41
## 5 1 40 41.36364 41
## 6 1 42 41.36364 41
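Although this seminar only demonstrates merges that keep matched observations, it may help to see how that differs from a merge that keeps unmatched ones. The sketch below uses two made-up data frames (students and classes are invented for illustration) in which one student's class has no row in the class-level data. It names the key explicitly with by= and compares inner_join() with the dplyr function left_join().

```r
library(dplyr)

# hypothetical student-level and class-level data;
# class 3 has no row in the class-level data
students <- data.frame(id = 1:3, cid = c(1, 2, 3), math = c(41, 52, 60))
classes  <- data.frame(cid = c(1, 2), meanmath = c(41.4, 50.8))

# inner_join(): only students whose cid appears in classes are kept
inner <- inner_join(students, classes, by = "cid")
inner

# left_join(): all students are kept; unmatched ones get NA
left <- left_join(students, classes, by = "cid")
left
```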
5.1 If TRUE = 1 and FALSE = 0, what is the result of sum(b<3)?
b <- c(1,2,3,NA)
sum(b<3)
It's NA! Summing with NA results in NA. Remember to use na.rm=TRUE if you want to remove NA first.
b <- c(1,2,3,NA)
sum(b<3)
## [1] NA
# remove NA first
sum(b<3, na.rm=TRUE)
## [1] 2
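To see where the NA comes from, it can help to inspect the comparison before summing. This is a small sketch; the names cmp, n_missing, and n_below are introduced here only for illustration.

```r
b <- c(1, 2, 3, NA)

# comparing NA with a number yields NA, not FALSE
cmp <- b < 3
cmp

# count the missing values separately
n_missing <- sum(is.na(b))

# count values below 3, ignoring the missing comparison
n_below <- sum(cmp, na.rm = TRUE)
```

Checking n_missing alongside n_below makes it clear how many observations were silently dropped by na.rm=TRUE.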
5.2 Here is a dataset of names and phone numbers. How do I create a variable that is just the area code (without parentheses)?
# tibble() is basically same as data.frame()
# but adds class "tbl_df" to data.frame
directory <-
tibble(names=c("Leo Smith", "Karen Smith",
"Audrey Jones", "Dylan Jones"),
phone=c("(323)555-5432", "(323)555-5421",
"(213)555-2154", "(213)555-2155"))
Use substr() to extract the second through fourth characters of phone:
directory <-
data.frame(names=c("Leo Smith", "Karen Smith",
"Audrey Jones", "Dylan Jones"),
phone=c("(323)555-5432", "(323)555-5421",
"(213)555-2154", "(213)555-2155"))
directory$area_code <- substr(directory$phone, 2, 4)
directory
## names phone area_code
## 1 Leo Smith (323)555-5432 323
## 2 Karen Smith (323)555-5421 323
## 3 Audrey Jones (213)555-2154 213
## 4 Dylan Jones (213)555-2155 213
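substr() relies on every phone number starting with "(" followed by exactly three digits. If that layout were not guaranteed, one possible alternative is a regular expression with sub(). This is only a sketch; the vector phone below is a stand-in for directory$phone.

```r
# stand-in for directory$phone
phone <- c("(323)555-5432", "(323)555-5421",
           "(213)555-2154", "(213)555-2155")

# capture the three digits between the opening parentheses,
# then replace the whole string with the captured group
area_code <- sub("^\\(([0-9]{3})\\).*$", "\\1", phone)
area_code
```

Strings that do not match the pattern are returned unchanged by sub(), so it is worth checking the result for leftover parentheses.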
5.3 Imagine directory were much larger, with thousands or millions of rows. How could I subset the data to everyone with the name "Jones"?
directory <-
tibble(names=c("Leo Smith", "Karen Smith",
"Audrey Jones", "Dylan Jones"),
phone=c("(323)555-5432", "(323)555-5421",
"(213)555-2154", "(213)555-2155"))
Use grep() for partial matches. Remember that grep() returns the indices of matches, so we can use the results of grep to subset our directory:
directory <-
tibble(names=c("Leo Smith", "Karen Smith",
"Audrey Jones", "Dylan Jones"),
phone=c("(323)555-5432", "(323)555-5421",
"(213)555-2154", "(213)555-2155"))
# match "Jones" in names
my_jones <- grep("Jones", directory$names)
my_jones
## [1] 3 4
directory[my_jones,]
## # A tibble: 2 x 2
## names phone
## <chr> <chr>
## 1 Audrey Jones (213)555-2154
## 2 Dylan Jones (213)555-2155
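A plain "Jones" pattern would also match "Jones" used as a first name. If only last names should match, one option is to anchor the pattern to the end of the string with $. The sketch below uses grepl(), which returns TRUE/FALSE rather than indices; the extra row "Jones Miller" is hypothetical and added only to show the difference.

```r
# small directory with one hypothetical extra row
directory2 <- data.frame(
  names = c("Leo Smith", "Karen Smith", "Audrey Jones",
            "Dylan Jones", "Jones Miller"),
  stringsAsFactors = FALSE
)

# TRUE only when the name ends in "Jones"
is_jones <- grepl("Jones$", directory2$names)

# logical vectors can be used to subset rows, just like indices
jones_only <- directory2[is_jones, , drop = FALSE]
jones_only
```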
5.4 These tibble data frames seem to have the same variables. Why doesn't rbind(y1, y2) work?
y1 <- tibble(Names=c("Mary", "Sue"),
scores=c(36, 78))
y2 <- tibble(names=c("John", "Jack"),
scores=c(25, 44))
# what happened?
rbind(y1, y2)
## Error in match.names(clabs, names(xi)): names do not match previous names
Case matters in R. "Names" and "names" are considered different. Make them the same to get rbind() to work:
y1 <- tibble(names=c("Mary", "Sue"),
scores=c(36, 78))
y2 <- tibble(names=c("John", "Jack"),
scores=c(25, 44))
# names now match, so rbind() works
rbind(y1, y2)
## # A tibble: 4 x 2
## names scores
## <chr> <dbl>
## 1 Mary 36
## 2 Sue 78
## 3 John 25
## 4 Jack 44
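With many columns, a name mismatch can be hard to spot by eye. One way to find and fix it is sketched below, using base data.frame() objects for simplicity. Lowercasing all names with tolower() is an assumption that suits this example and may not suit every dataset.

```r
y1 <- data.frame(Names = c("Mary", "Sue"), scores = c(36, 78))
y2 <- data.frame(names = c("John", "Jack"), scores = c(25, 44))

# which column names in y1 are missing from y2?
mismatched <- setdiff(names(y1), names(y2))
mismatched

# make the names consistent, then stack
names(y1) <- tolower(names(y1))
stacked <- rbind(y1, y2)
stacked
```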
5.5 In the code below, we split the dataset into two - one that contains the numeric test variables (read, write, math, science, and socst) and another that contains all other variables. We then sort the test variables dataset by math.
Why is running cbind() to re-merge the datasets a bad idea? After all, there is no error message…
# create a dataset of just test scores
test <- select(d, read, write, math, science, socst)
nontest <- select(d, -c(read, write, math, science, socst))
# sort test scores by math
test <- arrange(test, math)
# cbind runs without error
remerged <- cbind(test, nontest)
# but what's wrong here?
head(remerged, n=3)
##   read write math science socst  id female    ses schtyp     prog
## 1   39    41   33      42    41  45 female    low public vocation
## 2   63    49   35      66    41 108   male middle public  general
## 3   36    44   37      42    41  15   male   high public vocation
##         honors awards cid prog_short   schtyp_ses1   schtyp_ses2 logwrite
## 1 not enrolled      0   1        voc    public low    public,low 3.555348
## 2 not enrolled      0   1        gen public middle public,middle 3.496508
## 3 not enrolled      0   1        voc   public high   public,high 3.663562
##    logmath mathrank mathgrade      zmath
## 1 3.713572       22         D -1.2430021
## 2 3.713572       22         D -1.2430021
## 3 3.784190       43         D -0.9227783
The problem is that cbind() does not know you sorted one of the two datasets, so now the order of observations is different between the two. Thus cbind() matches the wrong observations from the 2 datasets together.
Let's see how the observation with id = 1 appears in the original dataset and the remerged dataset:
# the values on the test scores don't match!
rbind(d[d$id==1,], remerged[remerged$id==1,])
## # A tibble: 2 x 21
##      id female   ses schtyp     prog  read write  math science socst
## * <int>  <chr> <chr>  <chr>    <chr> <int> <int> <int>   <int> <int>
## 1     1 female   low public vocation    34    44    40      39    41
## 2     1 female   low public vocation    39    54    39      47    36
## # ... with 11 more variables: honors <chr>, awards <int>, cid <int>,
## #   prog_short <chr>, schtyp_ses1 <chr>, schtyp_ses2 <chr>,
## #   logwrite <dbl>, logmath <dbl>, mathrank <int>, mathgrade <fctr>,
## #   zmath <dbl>
Instead, it is safer to use a merge variable. When first splitting the datasets, we should make sure an id variable appears in both datasets.
# This time, add id to test dataset
test <- select(d, id, read, write, math, science, socst)
nontest <- select(d, -c(read, write, math, science, socst))
# sort test scores by math
test <- arrange(test, math)
# merge() matches rows on the shared variable id
remerged2 <- merge(test, nontest)
# these should match now
rbind(remerged2[remerged2$id==1,], d[d$id==1,])
##   id read write math science socst female ses schtyp     prog       honors
## 1  1   34    44   40      39    41 female low public vocation not enrolled
## 2  1   34    44   40      39    41 female low public vocation not enrolled
##   awards cid prog_short schtyp_ses1 schtyp_ses2 logwrite  logmath mathrank
## 1      0   1        voc  public low  public,low  3.78419 3.688879       12
## 2      0   1        voc  public low  public,low  3.78419 3.688879       12
##   mathgrade     zmath
## 1         D -1.349743
## 2         D -1.349743
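The same contrast between cbind() and merge() can be reproduced on a tiny dataset, where the mismatch is easy to verify by hand. This is a sketch in base R; d_toy and its values are made up for illustration.

```r
# hypothetical four-row dataset
d_toy <- data.frame(id = 1:4,
                    math = c(50, 40, 60, 45),
                    ses = c("low", "high", "middle", "low"),
                    stringsAsFactors = FALSE)

# split, keeping id in both pieces, and sort one piece by math
test_toy <- d_toy[order(d_toy$math), c("id", "math")]
nontest_toy <- d_toy[, c("id", "ses")]

# cbind() pairs rows by position, so id 1 gets the wrong math score
bad <- cbind(test_toy["math"], nontest_toy)
cbind_wrong <- bad$math[bad$id == 1] != d_toy$math[d_toy$id == 1]

# merge() pairs rows by id, so every score lines up again
good <- merge(test_toy, nontest_toy, by = "id")
merge_ok <- all(good$math == d_toy$math) && all(good$ses == d_toy$ses)
```

Checking a whole column this way is a more thorough test than looking at a single id, since it covers every observation.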
There is an extensive list of books on R maintained at http://www.r-project.org/doc/bib/R-books.html.