dplyr
dplyr
packagefilter()
R
Logical operators and functionsselect()
select()
to reorder columnsarrange()
mutate()
*pivot_longer()
to create a variable out of column headings and restructure the datasetpivot_longer(names_to="year", values_to="grad", -id)
names_to="year"
values_to="grad"
-id
pivot_wider()
to spread a single column into multiple columnspivot_wider(names_from=feature, values_from=measure)
separate()
and unite()
to split and concatenate columns, respectivelytidyverse
cheatsheetsstringr
paste()
to concatenate stringstr_c()
from stringr
to concatenate stringsunite()
from dplyr
to concatenate stringsgrep()
sub()
This seminar introduces tools in R
useful for data management.
The seminar is focused on the tools found in the tidyverse
collection of packages, but also includes various tools pulled from other packages that complement tidyverse
tools.
tidyverse
: a collection of packages with tools for most aspects of data analysis, particularly strong in data import, management, and visualization. We will focus on the following packages within tidyverse
:
readr
- updated data import functions like read_csv()
dplyr
- subsetting, sorting, transforming variables, groupingtidyr
- restructuring rows and columnsmagrittr
- piping a chain of commandslubridate
- date and time variable processingstringr
- string variable manipulationHmisc
: another large collection of tools for most aspects of data analysis, but we use it here for describe()
, a dataset summary function
install.packages("tidyverse", dependencies=TRUE)
install.packages("Hmisc", dependencies=TRUE)
library(tidyverse)
library(Hmisc)
Datasets for statistical analysis are typically stored in data frames in R
.
Data frames are rectangular, where the columns are variables and the rows are observations of those variables. Columns can be of different types (double, character, etc.) but must be of equal length.
When we import data from a file, such as with read.csv()
, the resulting structure is typically a data frame.
Most of the tools discussed in this seminar work on data frames.
We begin our data mangement work on a dataset of 120 observations of 25 variables describing simulated cancer patient data. Each patient is also linked to a doctor in the dataset. We will merge this doctor dataset into this dataset later in the seminar.
We will be using the function read_csv()
, from the tidyverse
package readr
to import our data for this seminar. Although very similar to the base R function read.csv()
, read_csv()
has slightly different defaults and options, but the two can be used interchangeably for the most part. One of the primary differences is that read_csv()
will import the data as a tibble
, an updated class of data frames, which we will introduce shortly.
d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt1_dm.csv")
View()
Use View()
on a dataset to open a spreadsheet-style view of a dataset. In RStudio, clicking on a dataset in the Environment pane will View()
it.
View(d)
Throughout this seminar, we will be using tidyverse
functions for data management. These functions typically accept data frames as inputs, and output tibbles
, a special class of data frames.
Tibbles print to screen a bit differently from regular data frames – usually up to 10 rows and only as many columns as can fit on the screen. We also get each column’s “type”:
d
## # A tibble: 120 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCLA 1 1-1 6-Sep-09 male 65.0 3.70 8.09 4 68.0
## 2 UCLA 1 1-1 7-Jan-11 fema~ 53.9 2.63 0.803 2 64.7
## 3 UCLA 1 1-1 4-Sep-10 male 41.4 -99 2.13 3 86.4
## 4 UCLA 1 1-1 25-Jun-~ male 46.8 3.89 1.35 3 53.4
## 5 UCLA 1 1-1 1-Jul-09 male 51.9 1.42 2.19 4 51.7
## 6 UCLA 1 1-1 6-Mar-09 fema~ 53.8 2.29 8.61 3 78.9
## 7 UCLA 1 1-1 15-Apr-~ male 54.4 8.03 7.23 4 62.9
## 8 UCLA 1 1-11 12-Jul-~ fema~ 47.1 0.810 2.60 5 81.2
## 9 UCLA 1 1-11 25-Jul-~ fema~ 59.3 -99 5.18 4 73.2
## 10 UCLA 1 1-11 12-Jul-~ fema~ 47.1 0.810 2.60 5 81.2
## # ... with 110 more rows, and 14 more variables: co2 <dbl>, wound <dbl>,
## # mobility <dbl>, ntumors <dbl>, remission <dbl>, lungcapacity <dbl>,
## # married <dbl>, familyhx <chr>, smokinghx <chr>, cancerstage <chr>,
## # lengthofstay <dbl>, wbc <chr>, rbc <dbl>, bmi <dbl>
Tibbles are still data frames, and will work in nearly all functions that require data frames as inputs. For this seminar, anything we say about data frames will also apply to tibbles.
To convert a tibble to a regular data frame, use as.data.frame()
.
# d is of class tibble (tbl_df), class table (tbl) and class data.frame
class(d)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# now just a data.frame
class(as.data.frame(d))
## [1] "data.frame"
Read more about tibbles.
We can select a particular row and column of data frame x
, with the syntax x[rows, columns]
, where rows
and columns
are one of the following: a number, a vector of numbers, the name of a variable that stores numbers, or omitted.
Omitting the row value x[, columns]
selects all rows, and omitting the column value, x[rows, ]
, selects all columns. (Using this syntax to extract a row or column will result in a data.frame, not a vector)
The syntax x$colname
extracts the column with name colname
as a vector. We recommend this syntax for extracting columns.
The syntax x[[“colname”]]
and x[[colnum]]
can be used to extract column vectors by name or by number, respectively. (Using this syntax to extract a column will result in a vector)
# row 3 column 2
d[3,2]
## # A tibble: 1 x 1
## hospid
## <dbl>
## 1 1
# rows 2 and 4 of columns age
d[c(2,4), "age"]
## # A tibble: 2 x 1
## age
## <dbl>
## 1 53.9
## 2 46.8
# all columns for row 10
d[10,]
## # A tibble: 1 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCLA 1 1-11 12-Jul-~ fema~ 47.1 0.810 2.60 5 81.2 1.55
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
# all rows of column "pain"
d[,"pain"]
## # A tibble: 120 x 1
## pain
## <dbl>
## 1 4
## 2 2
## 3 3
## 4 3
## 5 4
## 6 3
## 7 4
## 8 5
## 9 4
## 10 5
## # ... with 110 more rows
# subsetting with $ creates a numeric vector, which itself can be subset
d$age[2:3]
## [1] 53.91714 41.36804
We can also use logical expressions to subset. Logical comparisons such as x > 0
return TRUE/FALSE vectors.
We can place such an expression inside of []
and values with TRUE will be selected.
# logical comparison results in a logical vector
d$age > 60
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# when placed in [] before the comma, rows with TRUE are selected
# this returns columns age and pain for observations where age > 60
d[d$age>60, c("age", "pain")]
## # A tibble: 5 x 2
## age pain
## <dbl> <dbl>
## 1 65.0 4
## 2 65.8 4
## 3 358. 6
## 4 63.9 8
## 5 64.2 7
Functions perform most of the work on data in R
.
Functions in R
are much the same as they are in math – they perform some operation on an input and return some output. For example, the mathematical function \(f(x) = x^2\), takes an input \(x\), and returns its square. Similarly, the mean()
function in R
takes a vector of numbers and returns its mean.
When we load a new package into R
with library()
, the set of functions associated with that library are loaded into the R
session.
In the help file for a function (?function_name
), we will find a list of Arguments to the function, in a specific order. Arguments are the inputs to the function. Values for arguments to functions can be specified either by name or position.
In the Usage section, a value specified after an argument is its default value. Arguments without values have no defaults and usually need to be supplied by the user.
The Value section specifies what is returned. Usually there are Examples at the bottom.
Input arguments to functions can be specified by either name or position.
# specifying arguments by name
seq(from=1, to=5, by=1)
## [1] 1 2 3 4 5
# specifying arguments by position
seq(10, 0, -2)
## [1] 10 8 6 4 2 0
We highly recommend familiarizing yourself with a quick summary of the distribution of each of your variables to:
The base R
function summary()
provides summary statistics for numeric variables and frequencies for factor variables (we have no factors yet). Character variables are left unsummarized.
summary(d)
## hospital hospid docid dis_date
## Length:120 Min. :1.000 Length:120 Length:120
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :1.000 Mode :character Mode :character
## Mean :1.483
## 3rd Qu.:2.000
## Max. :2.000
## sex age test1 test2
## Length:120 Min. : 34.19 Min. :-99.000 Min. :-99.000
## Class :character 1st Qu.: 47.75 1st Qu.: 1.560 1st Qu.: 2.249
## Mode :character Median : 51.83 Median : 3.107 Median : 4.162
## Mean : 53.59 Mean : -1.989 Mean : -1.226
## 3rd Qu.: 55.01 3rd Qu.: 5.707 3rd Qu.: 6.166
## Max. :357.89 Max. : 12.416 Max. : 17.228
## pain tumorsize co2 wound
## Min. :1.000 Min. : 49.10 Min. :-98.0000 Min. :2.000
## 1st Qu.:4.000 1st Qu.: 62.18 1st Qu.: 1.4954 1st Qu.:5.000
## Median :5.000 Median : 68.23 Median : 1.5859 Median :6.000
## Mean :5.325 Mean : 70.08 Mean : -0.8901 Mean :5.592
## 3rd Qu.:6.000 3rd Qu.: 77.48 3rd Qu.: 1.6860 3rd Qu.:7.000
## Max. :9.000 Max. :109.01 Max. : 1.9424 Max. :9.000
## mobility ntumors remission lungcapacity
## Min. :2.000 Min. :0.00 Min. :0.00 Min. :-99.0000
## 1st Qu.:5.000 1st Qu.:1.00 1st Qu.:0.00 1st Qu.: 0.5141
## Median :6.000 Median :2.00 Median :0.00 Median : 0.7457
## Mean :6.033 Mean :3.15 Mean :0.35 Mean :-18.3280
## 3rd Qu.:7.000 3rd Qu.:5.00 3rd Qu.:1.00 3rd Qu.: 0.8721
## Max. :9.000 Max. :9.00 Max. :1.00 Max. : 0.9982
## married familyhx smokinghx cancerstage
## Min. :0.0000 Length:120 Length:120 Length:120
## 1st Qu.:0.0000 Class :character Class :character Class :character
## Median :1.0000 Mode :character Mode :character Mode :character
## Mean :0.6333
## 3rd Qu.:1.0000
## Max. :1.0000
## lengthofstay wbc rbc bmi
## Min. :3.000 Length:120 Min. :4.360 Min. :18.45
## 1st Qu.:4.000 Class :character 1st Qu.:4.825 1st Qu.:24.51
## Median :5.000 Mode :character Median :4.978 Median :27.82
## Mean :5.308 Mean :4.970 Mean :29.38
## 3rd Qu.:6.000 3rd Qu.:5.150 3rd Qu.:34.31
## Max. :8.000 Max. :5.535 Max. :58.00
describe()
from package Hmisc
for more detailed summariesThe describe()
function from package Hmisc
will give differents summary statistics for all variables in a data frame. The information provided depends on the variable’s type and number of distinct values.
Important Features:
NA
) for all variablesThe results of describe()
can be used to determine quickly whether variables have plausible values and distributions:
library(Hmisc)
# detailed summaries of variables
describe(d)
## d
##
## 24 Variables 120 Observations
## --------------------------------------------------------------------------------
## hospital
## n missing distinct
## 120 0 2
##
## Value UCLA UCSF
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## hospid
## n missing distinct Info Mean Gmd
## 120 0 2 0.749 1.483 0.5036
##
## Value 1 2
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## docid
## n missing distinct
## 120 0 22
##
## lowest : 1-1 1-100 1-11 1-21 1-22 , highest: 2-177 2-178 2-188 2-201 2-216
## --------------------------------------------------------------------------------
## dis_date
## n missing distinct
## 120 0 104
##
## lowest : 1-Jul-09 10-Jun-09 10-Jun-10 11-Apr-10 11-Dec-09
## highest: 9-Apr-10 9-Feb-09 9-Feb-10 9-Jun-10 9-May-10
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 120 0 3
##
## Value 12.2 female male
## Frequency 1 74 45
## Proportion 0.008 0.617 0.375
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 53.59 11.92 40.86 41.52
## .25 .50 .75 .90 .95
## 47.75 51.83 55.01 58.61 59.65
##
## lowest : 34.19229 35.31930 37.25225 39.61641 40.03724
## highest: 63.93238 64.16432 64.96824 65.80417 357.89001
##
## Value 35 40 45 50 55 60 65 360
## Frequency 3 10 16 38 35 13 4 1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
##
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## test1
## n missing distinct Info Mean Gmd .05 .10
## 120 0 111 1 -1.989 14.11 -99.0000 0.5809
## .25 .50 .75 .90 .95
## 1.5597 3.1065 5.7067 7.9931 9.5583
##
## lowest : -99.0000000 0.1048958 0.1927608 0.4293420 0.5155185
## highest: 9.7981329 10.2903990 10.4685400 11.0714440 12.4163920
##
## Value -99 0 1 2 3 4 5 6 7 8 9
## Frequency 7 3 20 17 21 12 8 11 6 7 1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##
## Value 10 11 12
## Frequency 5 1 1
## Proportion 0.042 0.008 0.008
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## test2
## n missing distinct Info Mean Gmd .05 .10
## 120 0 111 1 -1.226 14.59 -99.0000 0.7881
## .25 .50 .75 .90 .95
## 2.2494 4.1620 6.1657 8.9973 10.9861
##
## lowest : -99.0000000 0.5807591 0.6179262 0.6571499 0.7485080
## highest: 11.4315750 12.4493730 14.2252270 14.5365420 17.2275810
##
## Value -99 1 2 3 4 5 6 7 8 9 10
## Frequency 7 15 12 19 16 15 9 6 6 6 1
## Proportion 0.058 0.125 0.100 0.158 0.133 0.125 0.075 0.050 0.050 0.050 0.008
##
## Value 11 12 14 15 17
## Frequency 4 1 1 1 1
## Proportion 0.033 0.008 0.008 0.008 0.008
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## pain
## n missing distinct Info Mean Gmd
## 120 0 9 0.961 5.325 1.742
##
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##
## Value 1 2 3 4 5 6 7 8 9
## Frequency 1 2 11 21 33 25 18 5 4
## Proportion 0.008 0.017 0.092 0.175 0.275 0.208 0.150 0.042 0.033
## --------------------------------------------------------------------------------
## tumorsize
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 70.08 12.92 53.48 55.97
## .25 .50 .75 .90 .95
## 62.18 68.23 77.48 84.06 92.54
##
## lowest : 49.09861 50.28009 50.50217 51.65727 53.19937
## highest: 98.05510 98.32850 98.70570 102.69671 109.00956
## --------------------------------------------------------------------------------
## co2
## n missing distinct Info Mean Gmd .05 .10
## 120 0 115 1 -0.8901 5.03 1.390 1.440
## .25 .50 .75 .90 .95
## 1.495 1.586 1.686 1.756 1.793
##
## lowest : -98.000000 1.327440 1.362927 1.372113 1.391310
## highest: 1.810847 1.820266 1.905606 1.920025 1.942401
##
## Value -98 1 2
## Frequency 3 29 88
## Proportion 0.025 0.242 0.733
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## wound
## n missing distinct Info Mean Gmd
## 120 0 8 0.958 5.592 1.767
##
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##
## Value 2 3 4 5 6 7 8 9
## Frequency 5 10 12 23 36 22 10 2
## Proportion 0.042 0.083 0.100 0.192 0.300 0.183 0.083 0.017
## --------------------------------------------------------------------------------
## mobility
## n missing distinct Info Mean Gmd
## 120 0 8 0.969 6.033 2.264
##
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##
## Value 2 3 4 5 6 7 8 9
## Frequency 6 4 19 19 28 15 5 24
## Proportion 0.050 0.033 0.158 0.158 0.233 0.125 0.042 0.200
## --------------------------------------------------------------------------------
## ntumors
## n missing distinct Info Mean Gmd .05 .10
## 120 0 10 0.981 3.15 3.101 0 0
## .25 .50 .75 .90 .95
## 1 2 5 8 9
##
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 23 21 19 9 12 11 7 5 5 8
## Proportion 0.192 0.175 0.158 0.075 0.100 0.092 0.058 0.042 0.042 0.067
## --------------------------------------------------------------------------------
## remission
## n missing distinct Info Sum Mean Gmd
## 120 0 2 0.683 42 0.35 0.4588
##
## --------------------------------------------------------------------------------
## lungcapacity
## n missing distinct Info Mean Gmd .05 .10
## 120 0 96 0.995 -18.33 31.26 -99.0000 -99.0000
## .25 .50 .75 .90 .95
## 0.5142 0.7457 0.8721 0.9573 0.9790
##
## lowest : -99.0000000 -98.0000000 0.2949074 0.3264440 0.3450253
## highest: 0.9856416 0.9864109 0.9924814 0.9940955 0.9982018
##
## Value -99.0 -98.0 0.2 0.4 0.6 0.8 1.0
## Frequency 20 3 1 6 21 46 23
## Proportion 0.167 0.025 0.008 0.050 0.175 0.383 0.192
##
## For the frequency table, variable is rounded to the nearest 0.2
## --------------------------------------------------------------------------------
## married
## n missing distinct Info Sum Mean Gmd
## 120 0 2 0.697 76 0.6333 0.4683
##
## --------------------------------------------------------------------------------
## familyhx
## n missing distinct
## 120 0 3
##
## Value -99 no yes
## Frequency 6 97 17
## Proportion 0.050 0.808 0.142
## --------------------------------------------------------------------------------
## smokinghx
## n missing distinct
## 120 0 4
##
## Value -99 current former never
## Frequency 6 26 22 66
## Proportion 0.050 0.217 0.183 0.550
## --------------------------------------------------------------------------------
## cancerstage
## n missing distinct
## 120 0 4
##
## Value I II III IV
## Frequency 40 54 17 9
## Proportion 0.333 0.450 0.142 0.075
## --------------------------------------------------------------------------------
## lengthofstay
## n missing distinct Info Mean Gmd
## 120 0 6 0.933 5.308 1.271
##
## lowest : 3 4 5 6 7, highest: 4 5 6 7 8
##
## Value 3 4 5 6 7 8
## Frequency 5 29 29 40 15 2
## Proportion 0.042 0.242 0.242 0.333 0.125 0.017
## --------------------------------------------------------------------------------
## wbc
## n missing distinct
## 120 0 116
##
## lowest : 3671.880371 4176.054199 4201.741211 4238.355957 4331.902344
## highest: 7999.091309 8340.71582 8415.605469 8567.246094 not assessed
## --------------------------------------------------------------------------------
## rbc
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 4.97 0.2885 4.526 4.606
## .25 .50 .75 .90 .95
## 4.825 4.978 5.150 5.285 5.353
##
## lowest : 4.359662 4.436482 4.456108 4.465468 4.470100
## highest: 5.441604 5.442614 5.459067 5.502604 5.535052
## --------------------------------------------------------------------------------
## bmi
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 29.38 7.278 20.71 22.07
## .25 .50 .75 .90 .95
## 24.51 27.82 34.31 37.75 40.71
##
## lowest : 18.44992 18.68505 20.07485 20.17994 20.49195
## highest: 42.84858 44.04223 46.50746 52.30723 58.00000
## --------------------------------------------------------------------------------
Here are some variables with values that we are not sure are correct:
describe(d[,c("age", "sex", "test1")])
## d[, c("age", "sex", "test1")]
##
## 3 Variables 120 Observations
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 53.59 11.92 40.86 41.52
## .25 .50 .75 .90 .95
## 47.75 51.83 55.01 58.61 59.65
##
## lowest : 34.19229 35.31930 37.25225 39.61641 40.03724
## highest: 63.93238 64.16432 64.96824 65.80417 357.89001
##
## Value 35 40 45 50 55 60 65 360
## Frequency 3 10 16 38 35 13 4 1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
##
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 120 0 3
##
## Value 12.2 female male
## Frequency 1 74 45
## Proportion 0.008 0.617 0.375
## --------------------------------------------------------------------------------
## test1
## n missing distinct Info Mean Gmd .05 .10
## 120 0 111 1 -1.989 14.11 -99.0000 0.5809
## .25 .50 .75 .90 .95
## 1.5597 3.1065 5.7067 7.9931 9.5583
##
## lowest : -99.0000000 0.1048958 0.1927608 0.4293420 0.5155185
## highest: 9.7981329 10.2903990 10.4685400 11.0714440 12.4163920
##
## Value -99 0 1 2 3 4 5 6 7 8 9
## Frequency 7 3 20 17 21 12 8 11 6 7 1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##
## Value 10 11 12
## Frequency 5 1 1
## Proportion 0.042 0.008 0.008
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
describe()
The results of describe()
can be plotted:
These plots are also useful to inspect distributions of variables and to look for suspicious values.
In the dot plots, look for suspicious labels for categorical variables (see variables sex
, familyhx
and smokinghx
below).
In the histogram plots, each histogram has its own scale, so bunched-up histograms often denote the presence of extreme values (see variables age
, test1
, test2
, co2
and lungcapacity
).
plot(describe(d))
## $Categorical
##
## $Continuous
A look at the output of our dataset summaries, as well as the plot of describe()
identified several suspicious values:
age
has value 357.89001, likely a data entry errorsex
has value “12.2”, also a data entry errortest1
, test2
, lungcapacity
, smokinghx
, familyhx
have value -99, a missing value codeco2
and lungcapacity
have value -98, another missing value codewbc
seems to be a numeric variable, but has the value “not assessed”, which causes R
to read it in as characterBecasue we don’t know any replacement values, we will change suspicous values to NA
, the missing value indicator in R
.
Note: Many times, datasets differentiate between different types of missing. For example, -99=“absent”, while -98=“refused to answer”. For analysis purposes, using NA
for both will usually suffice. For users that need to distinguish between different kinds of missing, we suggest looking into the as.item()
function from package memisc
.
We can use logical subsetting to assign NA
to the data entry errors:
# change all impossible age values to NA
# assume all adults in this dataset
d$age[d$age<18|d$age>120] <- NA
# remember to use quotes for character variables
d$sex[d$sex == "12.2"] <- NA
Now we want to translate all -99, -98, and “not assessed” values to NA
. The following syntax can be used to translate all values of a
to b
in dataset df
:
df[df==a] <- b
d[d==-99] <- NA
d[d==-98] <- NA
d[d=="not assessed"] <- NA
complete.cases(x)
returns a logical (TRUE/FALSE) vector of whether each row of x
has no missing. Since TRUE=1 and FALSE=0, a sum of this vector is a count of complete cases.
# which cases are complete
complete.cases(d)
## [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [13] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [25] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [37] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [49] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## [73] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [85] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [97] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE
## [109] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
# number of complete cases
sum(complete.cases(d))
## [1] 82
# create a dataset of complete cases
d_comp <- d[complete.cases(d),]
describe()
plot after cleaning dataThe plot of describe()
looks better now – no suspcious labels in the dot plots and nicely spread out histograms.
Color indicates the number of missing values for that variable.
plot(describe(d))
## $Categorical
##
## $Continuous
plot(describe(d), which="continuous")
dplyr
dplyr
packageThe dplyr
package provides tools for some of the most common data management tasks. Its primary functions are “verbs” to help you think about what you need to do to your dataset:
filter()
: select rows according to conditionsselect()
: select columns (you can rename as you select)arrange()
: sort rowsmutate()
: add new columnssummarize()
: calculate statistics across groupsTwo great conveniences of using dplyr
functions:
dplyr
functions is a data frame, and all variables supplied as arguments are assumed to be column of that data frame, so you only need to specify the dataset name once.The dplyr
package is automatically loaded with library(tidyverse)
.
filter()
We have already seen how to subset using logical vectors and logical subsetting:
# all variables for patients with pain>=9
pain9 <- d[d$pain>=9,]
pain9
## # A tibble: 4 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-177 9-Jun-10 male 58.7 7.62 9.31 9 63.5 1.48
## 2 UCSF 2 2-188 15-Jul-~ fema~ 41.5 4.00 7.57 9 98.3 1.82
## 3 UCSF 2 2-216 11-Apr-~ fema~ 52.0 6.70 9.37 9 78.5 1.70
## 4 UCSF 2 2-177 9-Jun-10 male 58.7 7.62 9.31 9 63.5 1.48
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
While that works fine enough, the code can get unwieldy if there are many conditions that need to be evaluated, especially witha long data set name.
The dplyr
function filter()
provides a cleaner syntax for subsetting datasets. Conditions separated by ,
are joined by &
(logical AND).
# subset to females with pain >= 9
dfhp <- filter(d, sex=="female", pain>=9)
dfhp
## # A tibble: 2 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-188 15-Jul-~ fema~ 41.5 4.00 7.57 9 98.3 1.82
## 2 UCSF 2 2-216 11-Apr-~ fema~ 52.0 6.70 9.37 9 78.5 1.70
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
R
Logical operators and functionsHere are some operators and functions to help with selection:
==
: equality>
, >=
: greater than, greater than or equal to!
: not&
: AND|
: OR%in%
: matches any of (2 %in% c(1,2,3)
= TRUE)is.na()
: equality to NA
near()
: checking for equality for floating point (decimal) numbers, has a built-in tolerance# select those not in UCLA, who are either younger than 40 or older than 60
filter(d, hospital!="UCLA", age<40 | age>60)
## # A tibble: 3 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-113 4-Feb-10 male 63.9 4.77 1.50 8 75.1 1.56
## 2 UCSF 2 2-178 19-Apr-~ male 34.2 4.63 3.26 3 65.5 1.64
## 3 UCSF 2 2-201 21-Nov-~ fema~ 64.2 2.03 6.20 7 79.7 1.72
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
select()
Use dplyr
function select()
to keep only the variables you need.
# select 4 variables
d_small <- select(d, hospid, docid, age, cancerstage)
head(d_small, n=3)
## # A tibble: 3 x 4
## hospid docid age cancerstage
## <dbl> <chr> <dbl> <chr>
## 1 1 1-1 65.0 II
## 2 1 1-1 53.9 II
## 3 1 1-1 41.4 I
Often we want to select a group of related variables with similar names. dplyr
supplies helper functions to find columns whose names match a specified pattern:
starts_with(x)
: matches names that begin with the string x
ends_with(x)
: matches names that end with the string x
contains(x)
: matches names that contain the string x
matches(re)
: matches regular expression re
num_range(prefix, range)
matches names that contain prefix
and one element of range
# get hospital variables
dhosp <- select(d, starts_with("hosp"))
names(dhosp)
## [1] "hospital" "hospid"
# get "test1" and "test2" variables using num_range
tests <- select(d, num_range("test", 1:2))
names(tests)
## [1] "test1" "test2"
select()
to reorder columnsYou can specify a particular ordering of columns with select()
. To place the remaining columns in their current ordering, use everything()
. You can also rename variables with select.
# put the two test vars first while renaming them,
# then everything else
tests_first <- select(d, t1=test1, t2=test2, everything())
names(tests_first)
## [1] "t1" "t2" "hospital" "hospid" "docid"
## [6] "dis_date" "sex" "age" "pain" "tumorsize"
## [11] "co2" "wound" "mobility" "ntumors" "remission"
## [16] "lungcapacity" "married" "familyhx" "smokinghx" "cancerstage"
## [21] "lengthofstay" "wbc" "rbc" "bmi"
arrange()
Sort the order of rows by variable values using arrange()
from dplyr
.
Be default, ascending order will be used. Surround a sorting variable with desc()
to sort by descending order instead.
# sort, with males before females, then by age, youngest first
arrange(d, desc(sex), age)
## # A tibble: 120 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-178 19-Apr-~ male 34.2 4.63 3.26 3 65.5 1.64
## 2 UCLA 1 1-58 9-Feb-10 male 35.3 2.14 12.4 6 60.5 1.56
## 3 UCSF 2 2-121 20-Apr-~ male 40.1 NA NA 5 73.5 1.65
## 4 UCLA 1 1-1 4-Sep-10 male 41.4 NA 2.13 3 86.4 1.45
## 5 UCSF 2 2-121 22-Aug-~ male 41.5 2.88 6.80 5 60.2 1.50
## 6 UCSF 2 2-163 19-Apr-~ male 42.5 1.63 0.749 4 54.0 1.72
## 7 UCLA 1 1-11 19-Aug-~ male 43.7 10.5 5.89 7 61.3 1.49
## 8 UCLA 1 1-1 25-Jun-~ male 46.8 3.89 1.35 3 53.4 1.57
## 9 UCLA 1 1-21 22-Apr-~ male 48.0 3.39 NA 8 79.6 1.55
## 10 UCSF 2 2-178 12-Mar-~ male 48.9 4.15 3.04 7 71.8 1.47
## # ... with 110 more rows, and 13 more variables: wound <dbl>, mobility <dbl>,
## # ntumors <dbl>, remission <dbl>, lungcapacity <dbl>, married <dbl>,
## # familyhx <chr>, smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>,
## # wbc <chr>, rbc <dbl>, bmi <dbl>
mutate()
*The function mutate()
allows us to transform many variables in one step without having to respecify the data frame name over and over.
Useful R
functions for transforming:
log()
: logarithmmin_rank()
: rank valuescut()
: cut a continuous variable into intervals with new integer value signifying into which interval original value fallsscale()
: standardizes variable (substracts mean and divides by standard deviation)lag()
, lead()
: lag and lead a variable (from dplyr
)cumsum()
: cumulative sumrowMeans()
, rowSums()
: means and sums of several columnsrecode()
: recode values (from dplyr
)# create age category variable, and highpain binary variable
d <- mutate(d,
agecat = cut(age, breaks=c(30,40,50,60,70,120)),
highpain = pain > mean(pain))
table(d$agecat, d$highpain)
##
## FALSE TRUE
## (30,40] 2 2
## (40,50] 27 15
## (50,60] 37 32
## (60,70] 2 2
## (70,120] 0 0
For many datasets we want to calculate statistics by group, rather than over the whole dataset.
The dplyr
functions group_by()
and summarise()
together greatly simplify group operations.
First, we create a grouped data frame (tibble) with group_by()
. We will group our patient dataset by doctor id.
# group_by creates a grouped_df (grouped data frame) class structure
by_doc <- group_by(d, docid)
class(by_doc)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Next we use summarise()
and statistical or numerical functions to summarize variables by group.
Example of useful functions for summarise()
:
min()
, max()
, mean()
, sum()
, var()
, sd()
n()
: number of observations in the groupn_distinct(x)
: number of distinct values in variable x
# Create summaries of patients by doctor
pat_summ <- summarise(by_doc,
n_pat = n(), # number of patients
longest_los = max(lengthofstay), # longest length of stay
avg_age = mean(age), # average age
n_stage_four = sum(cancerstage=="IV") # number of stage IV patients
)
pat_summ
## # A tibble: 22 x 5
## docid n_pat longest_los avg_age n_stage_four
## <chr> <int> <dbl> <dbl> <int>
## 1 1-1 7 6 52.5 0
## 2 1-100 9 7 NA 0
## 3 1-11 8 7 51.7 1
## 4 1-21 1 4 48.0 0
## 5 1-22 4 6 48.8 0
## 6 1-33 6 7 49.0 0
## 7 1-48 7 7 51.3 1
## 8 1-57 3 5 45.6 0
## 9 1-58 3 7 51.1 0
## 10 1-72 8 8 50.3 0
## # ... with 12 more rows
The dplyr
package provides summarize()
for those who prefer an American spelling to summarise()
. However, the package Hmisc
also contains a summarize()
function, and if loaded into the same session as dplyr()
, a conflict arises between the two summarize()
functions. The function from the package that was loaded first will be masked when the second package is loaded into the session (with an accompanying red warning message). The function loaded second will be the default function used when summarize()
is specified.
So, if Hmisc
is loaded first and dplyr
loaded second, specifying summarize(x)
will call summarize()
from dplyr
.
One can still access the function loaded from the first package in a couple of ways.
package::function()
, for example Hmisc::summarize(x)
detach(package:name)
, for example detach(package:dplyr)
%>%
A data management task may involve many steps to reach the final desired dataset. Often, during intermediate steps, datasets are generated that we don’t care about or plan on keeping. For these multi-step tasks, the pipe operator provides a useful, time-saving and code-saving shorthand.
Naming datasets takes time to think about and clutters code. Piping makes your code more readable by focusing on the functions used rather than the name of datasets.
The pipe operator is %>%
, , which can be typed with CTRL-SHIFT-M (CMD-SHIFT-M on a Mac). The pipe operator can be expressed as “then” when reading code. The operator is loaded with pacakage magrittr
(automatically loaded with library(tidyverse)
).
The pipe operator “pipes” the dataset on the left of the %>%
operator to the function on the right of the operator.
The code x %>% f(y)
translates to f(x,y)
, that is, x
is treated by default as the first argument of f()
. If the function returns a data frame, we can then pipe this data frame into another function. Thus x %>% f(y) %>% g(z)
translates to g(f(x,y), z)
.
As a first example, perhaps we want to create a dataset of just females under 40, with only the age and pain variables selected. We could do this in 2 steps, like so:
# a dataset of age and pain for females younger than 40
f40 <- filter(d, sex=="female" & age<40)
f40_small <- select(f40, age, pain)
f40_small
## # A tibble: 2 x 2
## age pain
## <dbl> <dbl>
## 1 37.3 8
## 2 39.6 5
While that works fine, the intermediate dataset f40 is not of interest and is cluttering up memory and the workspace unnecessarily.
We could use %>%
instead:
# start with d, then filter rows, then select variables
f40_small <- d %>%
filter(sex=="female" & age<40) %>%
select(age, pain)
f40_small
## # A tibble: 2 x 2
## age pain
## <dbl> <dbl>
## 1 37.3 8
## 2 39.6 5
No intermediate dataset was created, and dataset names are rarely needed in intermediate steps. The dataset transformations, filtering and selecting, are easy to see.
Piping really helps make a long chain of commands more immediately understandable.
# create a plot of average age vs tumor size
# by doctors, for doctors with more than 5 patients
g1 <- d %>%
group_by(docid) %>%
summarise(n_pat=n(),
avg_age=mean(age),
avg_tumor=mean(tumorsize)) %>%
filter(n_pat > 5) %>%
ggplot(aes(x=avg_age, y=avg_tumor)) + geom_point() +
geom_smooth()
g1
.
in piped commandsIf the dataset is not to be specified as the first argument of the function on the right of the %>%
operator, use .
to specify where the dataset should appear as the argument.
x %>% f(y,.)
translates to f(y,x)
.
For example, for modeling functions such as lm()
, the model formula is typically the first argument rather than the dataset, but we can use .
to stand in for the dataset in the piping chain:
d %>%
filter(age < 40) %>%
lm(tumorsize ~ age, data=.) # the . is the filtered dataset
##
## Call:
## lm(formula = tumorsize ~ age, data = .)
##
## Coefficients:
## (Intercept) age
## -78.219 4.116
Often datasets are split into multiple files, perhaps because data are collected in several waves or by different researchers. When files share the same variables (hopefully!), we can append the datasets, or bind their rows together.
rbind()
and bind_rows()
We can use either the base R
function rbind()
or the dplyr
function bind_rows()
to append. They differ in how they handle the situation where the datasets being appended have non-matching columns.
rbind()
will produce and errorbind_rows()
will append the datasets, and fill in with NA
values for missing values in unmatched columnsThey also differ in how they handle the situation where the same column in the two datasets has two different types (e.g. one character the other numeric).
rbind()
will coerce one type to the other (direction of coercion is logical -> integer -> double -> character)bind_rows()
will produce an errorLet’s load a second dataset that contains the same variables as the original version of our first dataset but with 2 variables omitted.
# new data set that contains the same variables as d, except is missing 2 of them
d2 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt2_dm.csv")
# rows and columns of d2 and d
dim(d2)
## [1] 111 24
dim(d)
## [1] 120 26
Because our first data set d
has a few extra variables we created earlier with mutate()
, the two datasets do not have completely matching columns.
rbind()
on the datasets as they are will produce an error.bind_rows()
will produce NA
values for the observation on the second datasets on the variables that are exclusively in the first dataset.The best solution would be to calculate the new variables for the second dataset and then use either function to append…
However, to demonstrate the behavior of bind_rows()
, we append the datasets as they are.
Use the .id=
argument in bind_rows
to create a variable that identifies source dataset.
# a new variable called source is added to the beginning of the dataset
d3 <- bind_rows(d, d2, .id="source")
# these are the rows where the datasets meet
# hospital is found in both datasets, agecat and highpain are not
select(d3, source, hospital, agecat, highpain)[118:123,]
## # A tibble: 6 x 4
## source hospital agecat highpain
## <chr> <chr> <fct> <lgl>
## 1 1 UCSF (50,60] TRUE
## 2 1 UCSF (50,60] TRUE
## 3 1 UCSF (50,60] TRUE
## 4 2 Cedars-Sinai <NA> NA
## 5 2 Cedars-Sinai <NA> NA
## 6 2 Cedars-Sinai <NA> NA
# this will work because we restrict d to only variables common to both
drbind <- rbind(d[,1:24], d2)
dplyr
joinsAppending adds more rows of observations, whereas merging adds more columns of variables. Datasets to be merged should be matched on some id variable(s).
The dplyr
“join” functions perform such merges and will use any same-named variables between the datasets as the id variables by default. Use the by=
argument to specify specific matching id variables.
These joins all return a table with all columns from x
and y
, but differ in how they deal with mismatched rows:
inner_join(x, y)
: returns all rows from x
where there is a matching value in y
(returns only matching rows).
left_join(x, y)
: returns all rows from x
, unmatched rows in x
will have NA in the columns from y
. Unmatched rows in y
not returned.
full_join(x, y)
: returns all rows from x
and from y
; unmatched rows in either will have NA in new columns
Each of these joins is possible with base R
function merge()
, but merge()
is many times slower than the joins.
inner_join()
We begin our demonstration of merging by loading a dataset of doctor variables, that describes the doctor’s years of experience, schooling, number of lawsuits, etc.
d_doc <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/doctor_dm.csv")
It is easiest to show how the joins differ if we work on smaller, more manageable datasets.
We will be merging a subset of the d_doc
dataset we just imported into a subset the d3
dataset we created earlier (the two patient datasets appended together). We will select two doctors from d3
, doctors “1-21” and “2-178”, and one matching doctor and one non-matching doctor from d_doc
, “2-178” and “3-407”.
# select one non-matching and one matching doctor from each to demo joins
# just a few variables from d3
d3.1 <- select(filter(d3, docid == "1-21" | docid == "2-178"),
docid, sex, age, test1, test2)
d_doc.1 <- filter(d_doc, docid == "3-407" | docid == "2-178")
Let’s take a look at our small datasets
# first a look at the two datasets to be merged
# 2 rows for docid 2-178 in d3.1
d3.1
## # A tibble: 3 x 5
## docid sex age test1 test2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 1-21 male 48.0 3.39 NA
## 2 2-178 male 34.2 4.63 3.26
## 3 2-178 male 48.9 4.15 3.04
d_doc.1
## # A tibble: 2 x 5
## docid experience school lawsuits medicaid
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 2-178 15 average 4 0.817
## 2 3-407 23 average 3 0.343
The merging id variable “docid” appears in both datasets, so we do not need to specify a merging variables.
Notice how inner_join()
, left_join()
and full_join()
differ in their behavior:
# only matching rows returned
# 2-178 from d_doc.1 matched twice to 2-178 in d3.1
inner_join(d3.1, d_doc.1)
## # A tibble: 2 x 9
## docid sex age test1 test2 experience school lawsuits medicaid
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2-178 male 34.2 4.63 3.26 15 average 4 0.817
## 2 2-178 male 48.9 4.15 3.04 15 average 4 0.817
# all rows from d3.1 returned
left_join(d3.1, d_doc.1)
## # A tibble: 3 x 9
## docid sex age test1 test2 experience school lawsuits medicaid
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1-21 male 48.0 3.39 NA NA <NA> NA NA
## 2 2-178 male 34.2 4.63 3.26 15 average 4 0.817
## 3 2-178 male 48.9 4.15 3.04 15 average 4 0.817
# all rows from both returned
full_join(d3.1, d_doc.1)
## # A tibble: 4 x 9
## docid sex age test1 test2 experience school lawsuits medicaid
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1-21 male 48.0 3.39 NA NA <NA> NA NA
## 2 2-178 male 34.2 4.63 3.26 15 average 4 0.817
## 3 2-178 male 48.9 4.15 3.04 15 average 4 0.817
## 4 3-407 <NA> NA NA NA 23 average 3 0.343
We now explore the tools in the tidyverse
package tidyr
(loaded with library(tidyverse)
), which make datasets “tidy”.
Tidy datasets adhere to a simple set of structural rules:
Tidy data conventions provide a standard organization that prevents having to “rethink the wheel” each time data need to be organized.
Doing a little more work upfront to make data “tidy” will pay off with time saved later on.
Each row of the dataset should represent one unit of analysis. Units may be subjects, or trials within subjects, or whole groups of subjects.
Some analyses may involve both variables describing individual units and aggregated units. For example, a student-level analysis may involve school-level variables. Longitudinal studies may involve within-subject and between-subject variables. For multilevel analyses, the unit of analysis is generally the lowest level.
# data at multiple levels - patients nested in doctors
d %>%
inner_join(pat_summ) %>%
select(sex, age, lengthofstay, docid, avg_age, longest_los) %>%
head(n=3)
## # A tibble: 3 x 6
## sex age lengthofstay docid avg_age longest_los
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 male 65.0 6 1-1 52.5 6
## 2 female 53.9 6 1-1 52.5 6
## 3 male 41.4 5 1-1 52.5 6
The use of data not collected by the analyst makes encountering untidy data more common than desired.
Data can be messy in many ways, but we focus on two ways that can be tidied with tidyr
tools:
Let’s take a look at some untidy datsets and see how the tidyr
functions pivot_longer()
and pivot_wider()
will help us to tidy them.
We first load a dataset of the number of graduates from 3 academic departments across 3 years. The planned analysis will examine whether graduation has grown over the years.
The unit of analysis here is the graduation rate in a particular department in a particular year. Each row should thus represents one year in one department.
dept <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dept1.csv")
dept
## # A tibble: 3 x 4
## id `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl>
## 1 biology 207 211 259
## 2 math 96 75 99
## 3 physics 112 126 125
Year is a predictor in our analysis, so should be a column variable. Each department has 3 years of data so each requires 3 rows of data. Additionally, the entire table of values measures the same thing, graduation numbers in one department in one year, so we can put all values in one column.
pivot_longer()
to create a variable out of column headings and restructure the datasetTo use pivot_longer()
, we select a set of column variable for reshaping:
This process is also known as “reshaping long”.
Arguments to pivot_longer()
:
names_to=
: name of the new column variable that will hold the stacked selected column headingsvalues_to=
: the name of the new column variable that will hold the stacked values of the selected columnsBy default, pivot_longer
will select all columns for stacking. However, columns that do not vary with values in the new names_to=
column should not be stacked.
Here again is the data set we want to reshape:
dept
## # A tibble: 3 x 4
## id `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl>
## 1 biology 207 211 259
## 2 math 96 75 99
## 3 physics 112 126 125
In the newly reshaped dataset, we want to put years and graudation rates into 2 columns. Since years are the column headings, we will name the names_to=
variable “year”. The graudation rate values in the columns 2015
, 2016
, and 2017
will be stacked into another new variable whose name we specify in values_to=
as “grad”.
Department name (id
) does not vary with year, so we do not need to stack it. We can specify which columns are to be reshaped by specifying only the columns that need to be collapsed as c(`2015`, `2016`, `2017`)
* or by specifying the columns not to be collapsed after the symbol -
, as in -id
.
*Note that the ` are needed to specify the column names here because they don’t start with a letter.
# the new column "year" uses the column headings as values,
# the new column "graduates" will be the collapsed values
# we do not want to collapse id
dept_by_year <- dept %>%
pivot_longer(names_to="year", values_to="grad", -id)
pivot_longer(names_to="year", values_to="grad", -id)
names_to="year"
values_to="grad"
-id
Columns should contain values that represent one variable, but we often encounter datasets where multiple variables are stored in the same column.
Let’s take a look at a dataset of worms, who have had their age, length, and weight measured, but all stored in one column.
worms <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/worms.csv")
worms
## # A tibble: 9 x 3
## worm feature measure
## <dbl> <chr> <dbl>
## 1 1 age 5
## 2 1 length 3.2
## 3 1 weight 4.1
## 4 2 age 4
## 5 2 length 2.6
## 6 2 weight 3.5
## 7 3 age 5
## 8 3 length 3.6
## 9 3 weight 5.5
We are interested in whether age predicts length and weight, which would make the unit of analysis the worm. So each row should be one worm.
We would prefer that the feature values be spread across three columns.
pivot_wider()
to spread a single column into multiple columnsThis problem reverses the previous problem solved by pivot_longer()
where the column headings should actually be values of a variable. Here, we want “age”, “length”, and “weight” to be the column headings. In this reverse procedure, the names_from=
column variable is converted to column headings, and the value_from=
column is subsequently spread across the new columns.
This process is sometimes known as “reshaping wide”.
Here, we want to use the values in the feature
variable as the new column headings, so we specify it in names_from=
. We want the values in the measure
column to fill the rows of data in the newly formed columns, so we specify it in values_from=
.
by_worm <- worms %>%
pivot_wider(names_from=feature, values_from=measure)
by_worm
## # A tibble: 3 x 4
## worm age length weight
## <dbl> <dbl> <dbl> <dbl>
## 1 1 5 3.2 4.1
## 2 2 4 2.6 3.5
## 3 3 5 3.6 5.5
pivot_wider(names_from=feature, values_from=measure)
Two other issues that make datasets untidy are columns where multiple variables are stored in the same cell, or where one variable is stored across multiple cells.
The table5
data that is loaded with the tidyr
package exhibits both of these problems. The rate variable is actually two variables, “cases”, which is divided by “population”. The year variable is split across two columns, “century” and “year”.
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
separate()
and unite()
to split and concatenate columns, respectivelyseparate()
and unite()
are complements of each other.
For separate()
:
col=
is the name of the column to splitinto=
are the names of the new character variable columnssep=
separator character between columnsremove=
remove original column from data frame; TRUE,For unite()
:
col=
name of new, concatenated columnsep=
separator, "_" by defaulttable5 %>%
separate(col=rate, into=c("cases", "population"), sep="/") %>%
unite(col="year", century, year, sep="")
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
tidyverse
cheatsheetsIn the Help menu for RStudio, there is a submenu for “Cheatsheets”, which are documents that summarize the usage and syntax of the various tools in tidyverse
. You’ll find sheets for dplyr
and tidyr
, and a separate cheatsheet for ggplot2
.
We will begin with dataset that contains test scores for two classes that each contain one female and one male student from each of three socioeconomic (ses) statuses, low, middle and high (12 students total). We are interested in whether scores are related to ses, but only comparing the mean test scores of female students of low and middle ses. The data are untidy.
From this dataset, we want to create a new dataset that
Then we will create a dataset that contains the mean scores of the 2 students in each ses category.
We will try to make as much use of %>%
as possible.
1.1 Load the dataset. How are the data untidy?
ex1 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/exercise_1.csv")
ex1
## # A tibble: 2 x 7
## class f_low f_mid f_high m_low m_mid m_high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 52 48 52 50 58 56
## 2 2 56 47 42 50 46 49
1.2 How do we make the column headings a variable?
pivot_longer()
to create a new variable from the column headings, and a new variable from the stacked data.We will put the column headings into a variable called group
and the test scores in a variable called score
. Remember that because these are new variable names, they must go inside quotes.
ex1 %>%
pivot_longer(names_to="group", values_to="score", -class)
## # A tibble: 12 x 3
## class group score
## <dbl> <chr> <dbl>
## 1 1 f_low 52
## 2 1 f_mid 48
## 3 1 f_high 52
## 4 1 m_low 50
## 5 1 m_mid 58
## 6 1 m_high 56
## 7 2 f_low 56
## 8 2 f_mid 47
## 9 2 f_high 42
## 10 2 m_low 50
## 11 2 m_mid 46
## 12 2 m_high 49
1.3 Now we see that there might be two variables in group
. How do we fix that?
separate()
to split a string into columns.ex1 %>%
pivot_longer(names_to="group", values_to="score", -class) %>%
separate(group, into=c("gender", "ses"), sep="_")
## # A tibble: 12 x 4
## class gender ses score
## <dbl> <chr> <chr> <dbl>
## 1 1 f low 52
## 2 1 f mid 48
## 3 1 f high 52
## 4 1 m low 50
## 5 1 m mid 58
## 6 1 m high 56
## 7 2 f low 56
## 8 2 f mid 47
## 9 2 f high 42
## 10 2 m low 50
## 11 2 m mid 46
## 12 2 m high 49
1.4 Doing great so far! Now, how can we select only female students from either the “low” or “mid” ses?
filter()
to select rows by condition.ex1 %>%
pivot_longer(names_to="group", values_to="score", -class) %>%
separate(group, into=c("gender", "ses"), sep="_") %>%
filter(gender=="f", ses=="low"|ses=="mid")
## # A tibble: 4 x 4
## class gender ses score
## <dbl> <chr> <chr> <dbl>
## 1 1 f low 52
## 2 1 f mid 48
## 3 2 f low 56
## 4 2 f mid 47
1.5 Ok, finally we need to get the means of score
by ses
. How can we get the means by ses
?
group_by()
to create a dataset groupsed by ses, and then summarise()
to create summary statistics by sesex1 %>%
pivot_longer(names_to="group", values_to="score", -class) %>%
separate(group, into=c("gender", "ses"), sep="_") %>%
filter(gender=="f", ses=="low"|ses=="mid") %>%
group_by(ses) %>%
summarise(meanscore=mean(score))
## # A tibble: 2 x 2
## ses meanscore
## <chr> <dbl>
## 1 low 54
## 2 mid 47.5
Before dates in a dataset are loaded into R
, typically they are stored as a column of character (string) values. However, dates are inherently numeric and we lose this information when they are stored as strings.
# string dates have no numeric value, so this errors
"2018-03-05" + 1
## Error in "2018-03-05" + 1: non-numeric argument to binary operator
In R
, we want to convert our string dates to R
class Date
, which preserves the dates’ numeric values and allows us to take advantage of the many date-related functions in R
.
Once converted to class Date
, the numeric value for the date represents the number of days since January 1, 1970 (1970-01-01).
as.Date()
Base R
provides as.Date()
to convert strings to dates, but it can be intimidating and unwieldy to use, particularly if your dates are not stored in one of 2 default formats (see ?strptime
for more about date formats).
# as.Date only accepts a couple of formats by default
# good
as.Date("2015-02-14")
## [1] "2015-02-14"
# bad
as.Date("02/14/2014")
## Error in charToDate(x): character string is not in a standard unambiguous format
# specify a format to fix
as.Date("02/14/2014", format="%m/%d/%Y")
## [1] "2014-02-14"
Once our dates are class Date
, we can perform date arithmetic.
a <- as.Date("1971-01-01")
class(a)
## [1] "Date"
# days since 1970-01-01
as.numeric(a)
## [1] 365
# date arithmetic
a - as.Date("1970/12/31")
## Time difference of 1 days
a + 2
## [1] "1971-01-03"
We just need something easier to use to convert strings of various formats to class Date
!
lubridate
tidyverse
provides the amusingly named package lubridate
to help R
users convert their string dates to R Date
format more easily as well as functions to process those dates.
The Date
conversion functions in lubridate
accept a wide variety of date formats, removing the need for us to remember all of those format specifications. Instead, just take the letters y, m, and d and place them in the order of the year, month and day, respectively, as they are stored in the date column. That ordering produces the name of the function to convert that column to Date
(e.g. ymd()
, mdy()
, dmy()
).
lubridate
is not automatically loaded with library(tidyverse)
, though is installed with tidyverse
, so we load it now.
library(lubridate)
lubridate
We’ll first load a dataset with various date formats to demonstrate the flexibility of lubridate
functions.
d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dates.csv")
d
## # A tibble: 3 x 5
## fmt1 fmt2 fmt3 fmt4 decision_time
## <chr> <chr> <date> <dbl> <chr>
## 1 01/15/89 December 8, 2015 2015-02-27 20090101 Approved 10-10-15 07:15:55
## 2 02/13/92 January 22, 2012 2016-11-15 20080819 Denied 09-27-11 14:57:23
## 3 03/15/84 March 3, 2010 2017-12-25 20071011 Approved 04-24-15 02:03:03
Currently, the first 4 columns are stored as chr
, chr
, date
, and int
. The third column is actually read in as Date
by read_csv()
because of its particular formatting signals a date to read_csv()
. (We’ll look at the fifth column later)
The first pair of columns use the order month-day-year, while the second pair of columns use year-month-day.
So, we want mdy()
and ymd()
.
Marvel at the flexibility!
# no format specifications needed
# just ordering y,m, and d
dates <- data.frame(f1=mdy(d$fmt1), f2=mdy(d$fmt2),
f3=ymd(d$fmt3), f4=ymd(d$fmt4))
dates
## f1 f2 f3 f4
## 1 1989-01-15 2015-12-08 2015-02-27 2009-01-01
## 2 1992-02-13 2012-01-22 2016-11-15 2008-08-19
## 3 1984-03-15 2010-03-03 2017-12-25 2007-10-11
If your date column additionally contains time information (i.e. is date-time), you can add one or more of h, m, and s to y, m, and d to form the function name to convert the string to class POSIXct
. Class POSIXct
stores date-time variables as a number representing the number of seconds since the beginning of 1970.
The fifth column of our dates
dataset, decision_time
, is a date-time variable. (It also contains an “Approved/Denied” string at the beginning that the lubridate()
function will ignore!). Specifically, the date-time is recorded as month, day, year, hour, minute, second, so we want mdy_hms()
:
# month day year hour minute second
mdy_hms(d$decision_time)
## [1] "2015-10-10 07:15:55 UTC" "2011-09-27 14:57:23 UTC"
## [3] "2015-04-24 02:03:03 UTC"
# POSIXct is a standard way of representing calendar time
class(mdy_hms(d$decision_time))
## [1] "POSIXct" "POSIXt"
The “time zone” standard UTC
(Coordinated Universal Time) is used as the default zone. Use the argument, tz=
to assign a zone. To see a list of valid time zone specifications, run OlsonNames()
.
# we'll use this for our dates variable
dates$decision_time <- mdy_hms(d$decision_time, tz="US/Pacific")
dates$decision_time
## [1] "2015-10-10 07:15:55 PDT" "2011-09-27 14:57:23 PDT"
## [3] "2015-04-24 02:03:03 PDT"
# first 20 valid time zones
head(OlsonNames(), n=20)
## [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
## [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"
## [7] "Africa/Bamako" "Africa/Bangui" "Africa/Banjul"
## [10] "Africa/Bissau" "Africa/Blantyre" "Africa/Brazzaville"
## [13] "Africa/Bujumbura" "Africa/Cairo" "Africa/Casablanca"
## [16] "Africa/Ceuta" "Africa/Conakry" "Africa/Dakar"
## [19] "Africa/Dar_es_Salaam" "Africa/Djibouti"
Date
variableslubridate
provides several functions to extract specific information from Date
variables including:
day()
: day of the monthwday()
: weekdayyday()
: day of the yearmonth()
: month of the yearyear()
: yearSome examples of extracting information from Date
variables.
# we'll use the first column of our dates dataset
dates$f1
## [1] "1989-01-15" "1992-02-13" "1984-03-15"
# day of the month
day(dates$f1)
## [1] 15 13 15
# day of the year
yday(dates$f1)
## [1] 15 44 75
Some more date information extraction:
# weekday as numbers
wday(dates$f1)
## [1] 1 5 5
# weekday with labels
wday(dates$f1, label=TRUE)
## [1] Sun Thu Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
# month of the year
month(dates$f1)
## [1] 1 2 3
POSIXct
variableslubridate
also has functions to extract time information from POSIXct
date-time variables.
hour()
minute()
second()
#break up the time variable decision time
#display as a data.frame with 4 columns
with(dates, ## with() tells R to look for variables in object "dates"
data.frame(time=decision_time, h=hour(decision_time),
m=minute(decision_time), s=second(decision_time)))
## time h m s
## 1 2015-10-10 07:15:55 7 15 55
## 2 2011-09-27 14:57:23 14 57 23
## 3 2015-04-24 02:03:03 2 3 3
If you need to add or subtract time to your date variables, lubridate
provides 2 sets of functions.
One set gives “intuitive” results, and will ignore conventions like leap year. These include seconds()
, minutes()
, hours()
, days()
, weeks()
, years()
.
The other set will adhere to those conventions, and are named by adding d
to the names of the previous functions, dseconds()
, dminutes()
, dhours()
, ddays()
, dweeks()
, dyears()
.
#2016 is a leap year
# the intuitive result
ymd("2015-02-14") + years(2)
## [1] "2017-02-14"
# the exact result
ymd("2015-02-14") + dyears(2)
## [1] "2017-02-13 12:00:00 UTC"
Strings are character values. Columns in external datasets that contain any non-number value will generally be read in as strings. You can generate string variables manually with either double quotes, ""
, or single quotes, ''
.
Base R
provides a number of functions to process string variables. We will cover functions that:
stringr
The tidyverse
package stringr
provides many functions that have nearly the same functionality as base R
string functions.
Like the rest of tidyverse
, the stringr
functions have consistent names (they all begin with str_
) and consistent syntax, unlike the base R
functions. However, because the base R
string functions are so widely used we will cover them primarily and point out tidyverse
alternatives.
The package stringr
is automatically loaded with library(tidyverse)
.
We will use a dataset of locations of partially fictitious statistical consulting centers.
centers <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/stats_centers.csv")
centers
## # A tibble: 4 x 6
## NAME ROOM BUILDING UNIVERSITY CITY STATE
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 IDRE STAT CONSULT X4919 MATH SCIENCE UCLA LOS ANGELES CA
## 2 OIT CONSULTING Y521 HOUSTON HALL LONE STAR UNIV. DALLAS TX
## 3 ATS STAT HELP M234 ORANGE BUILDING SUNSHINE COLLEGE MIAMI FL
## 4 OAC STAT CENTER B2431 SNOW HALL GRIZZLY UNIVERSITY JUNEAU AK
At least 3 functions loaded in the current session (base R
and tidyverse
) can concatenate strings together. We highlight what makes each unique.
paste()
from base R
str_c()
from stringr
(tidyverse
)unite()
from dplyr
(tidyverse
)paste()
to concatenate stringWhen used on vectors, paste()
concatenates element-wise. By default, joined elements are separated by a space.
# element-by-element concatenation
paste(centers$ROOM, centers$BUILDING)
## [1] "X4919 MATH SCIENCE" "Y521 HOUSTON HALL" "M234 ORANGE BUILDING"
## [4] "B2431 SNOW HALL"
You can change the separating character with sep=
.
# separate with comma and space
paste(centers$CITY, centers$STATE, sep=", ")
## [1] "LOS ANGELES, CA" "DALLAS, TX" "MIAMI, FL" "JUNEAU, AK"
Vectors of numbers used in paste()
are converted to character before pasting.
# separate with comma and space
paste(1:4, centers$UNIVERSITY)
## [1] "1 UCLA" "2 LONE STAR UNIV." "3 SUNSHINE COLLEGE"
## [4] "4 GRIZZLY UNIVERSITY"
Interestingly, paste()
will coerce the missing value NA
to the string "NA"
. This usually is not desirable.
# separate with comma and space
paste(centers$CITY, centers$STATE, c("USA", NA), sep=", ")
## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, NA" "MIAMI, FL, USA"
## [4] "JUNEAU, AK, NA"
If you need to paste the elements within a vector together into a single string, specify a separating character with collapse=
instead of sep=
.
# put everything into a single string with collapse
paste(1:4, centers$UNIVERSITY, collapse="; ")
## [1] "1 UCLA; 2 LONE STAR UNIV.; 3 SUNSHINE COLLEGE; 4 GRIZZLY UNIVERSITY"
str_c()
from stringr
to concatenate stringsstr_c()
from the package stringr
is quite similar to paste()
. It has both sep=
and collapse=
arguments that function that same as in paste()
. It differs in a couple of ways.
First, its default separator is the empty character "":
# default is no separating character
str_c(centers$CITY, centers$STATE)
## [1] "LOS ANGELESCA" "DALLASTX" "MIAMIFL" "JUNEAUAK"
Second, concatenating a NA
into a string will result in NA
:
# separate with comma and space
str_c(centers$CITY, centers$STATE, c("USA", "USA", "USA", NA), sep=", ")
## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, USA" "MIAMI, FL, USA"
## [4] NA
unite()
from dplyr
to concatenate stringsWe previously encountered unite()
when tidying data, and its important distinction from paste()
and str_c()
is that it is strictly designed to concatenate columns stored in the same dataset. It will not accept a vector from outside the dataset. Indeed, the dataset is the first argument to unite()
. It is best used to add a column of concatenated existing variables to a dataset (and remove the original columns).
# concatenate CITY and STATE as citystate, then remove them
centers <- unite(centers, col="citystate", CITY, STATE, sep=", ")
centers
## # A tibble: 4 x 5
## NAME ROOM BUILDING UNIVERSITY citystate
## <chr> <chr> <chr> <chr> <chr>
## 1 IDRE STAT CONSULT X4919 MATH SCIENCE UCLA LOS ANGELES, CA
## 2 OIT CONSULTING Y521 HOUSTON HALL LONE STAR UNIV. DALLAS, TX
## 3 ATS STAT HELP M234 ORANGE BUILDING SUNSHINE COLLEGE MIAMI, FL
## 4 OAC STAT CENTER B2431 SNOW HALL GRIZZLY UNIVERSITY JUNEAU, AK
To extract characters, or a “substring”, from a string by position, we can use either of two similar functions:
substr()
from base R
str_sub()
from stringr
For both functions, we specify a string variable, a start position, and an end position.
# start at first, end at third character
substr(centers$BUILDING, 1, 3)
## [1] "MAT" "HOU" "ORA" "SNO"
# specifying an end longer than the entirety of the string
# results in truncation
str_sub(centers$UNIVERSITY, 2, 6)
## [1] "CLA" "ONE S" "UNSHI" "RIZZL"
One difference between substr()
and str_sub()
is that str_sub()
accepts negative numbers, which count from the end of the string.
# start from third to last, end at last
str_sub(centers$UNIVERSITY, -3, -1)
## [1] "CLA" "IV." "EGE" "ITY"
Using the assignment operator <-
with substr()
and str_sub()
allows us to replace a substring in string variables by position.
# change the first character of the ROOM variable to "R"
substr(centers$ROOM, 1, 1) <- "R"
centers$ROOM
## [1] "R4919" "R521" "R234" "R2431"
# change the last 2 characters of citystate to US
str_sub(centers$citystate, -2, -1) <- "US"
centers$citystate
## [1] "LOS ANGELES, US" "DALLAS, US" "MIAMI, US" "JUNEAU, US"
Use base R
functions tolower()
and toupper()
to change the case of string variables.
# lower case city names
tolower(centers$citystate)
## [1] "los angeles, us" "dallas, us" "miami, us" "juneau, us"
# change variable names to lower case
names(centers) <- tolower(names(centers))
names(centers)
## [1] "name" "room" "building" "university" "citystate"
The stringr
functions str_to_lower()
and str_to_upper()
work very similarly.
grep()
Often, we need to scan through a vector of strings and find values that match a text pattern (e.g. finding all phone numbers beginning with area code “310”)
Base R
and stringr
provide several functions for powerful and flexible pattern matching.
The primary base R
function for pattern matching is grep()
. Several other related functions also match patterns, but return slightly different output (e.g. grepl()
and regexpr()
).
The basic syntax is grep(pattern, x)
, where pattern
is the text pattern to match in character vector x
.
By default, grep()
returns the index (row) number of the matches, but we can get the strings themselves returned with value=TRUE
.
# stat center names
centers$name
## [1] "IDRE STAT CONSULT" "OIT CONSULTING" "ATS STAT HELP"
## [4] "OAC STAT CENTER"
# row numbers of names with "STAT"
grep(pattern="STAT", centers$name)
## [1] 1 3 4
# we can use those indices to subset
centers$name[grep(pattern="STAT", centers$name)]
## [1] "IDRE STAT CONSULT" "ATS STAT HELP" "OAC STAT CENTER"
# or just use value=TRUE to get the words directly
grep(pattern="STAT", centers$name, value=TRUE)
## [1] "IDRE STAT CONSULT" "ATS STAT HELP" "OAC STAT CENTER"
sub()
We can also substitute based on pattern matches with base R
function sub()
. Now the syntax is sub(pattern, replacement, x)
, where pattern
is the pattern to be matched, replacement
is the replacement text, and x
is the string vector.
# change STAT to STATISTICAL
centers$name <- sub("STAT", "STATISTICAL", centers$name)
centers$name
## [1] "IDRE STATISTICAL CONSULT" "OIT CONSULTING"
## [3] "ATS STATISTICAL HELP" "OAC STATISTICAL CENTER"
Regular expressions are a “language” in which a sequence of character codes forms a search pattern for string matching and substitution. For example, the expression “[0-9]” matches any single number. The expression “[0-9]{2}-[0-9]{2}” matches any two numbers followed by a hyphen followed by any two numbers.
We encourage users who know they will be doing intense string matching to learn regular expressions syntax, which is beyond the scope of this seminar.
R
string functions support regular expression pattern matching. All pattern matching function mentioned so far, including grep()
and sub()
support regular expressions.
# vector of dates formatted in two different ways
four_dates <- c("09-12-2018", "2015-11-13",
"03-24-2016", "2017-05-12")
# the regular expression means,
# 2 numbers, hyphen, 2 numbers, hyphen, 4 numbers
grep("[0-9]{2}-[0-9]{2}-[0-9]{4}",
four_dates, value=TRUE)
## [1] "09-12-2018" "03-24-2016"
Substitution with regular expressions is immensely powerful – here we take the four-digit years that appear at the end of the two dates and move them to the beginning of each date string, and add a hyphen afterward.
# the replacement expression \\3-\\1 means:
# put the substring captured in the 3rd set of parentheses first,
# then a hyphen, then the substring captured in
# the first set of parenetheses
sub("([0-9]{2}-[0-9]{2})(-)([0-9]{4})", "\\3-\\1",
four_dates)
## [1] "2018-09-12" "2015-11-13" "2016-03-24" "2017-05-12"
Several stringr
functions also support regular expressions, but we do not discuss them here (see str_detect()
and str_replace()
as alternatives to grep()
and sub()
, respectively).
We will practice using date and string functions, as well as looping in this exercise.
We will practice on a dataset of admission, treatment, and death dates. We want to convert all of these variables to R class Date
and POSIXct
as appropriate.
We will want to combine Deathday
and TimeofDeath
into a single variable that codes for date-time of death, before converting to POSIXct
.
ex2 <- read_csv("http://stats.idre.ucla.edu/stat/data/rdm/exercise_2.csv")
ex2
## # A tibble: 3 x 5
## Admit Treat1 Treat2 DeathDay TimeofDeath
## <chr> <chr> <chr> <chr> <time>
## 1 May 24, 2013 03-15-2014 04-17-2114 January 2, 2015 11:15:50
## 2 August 23, 2012 07-05-2013 10-11-2113 July 19, 2016 18:59:15
## 3 December 1, 2013 07-25-2014 11-18-2114 May 25, 2017 02:04:16
2.1. First, let’s convert all of the column names to lower case so that they’re easier to work with.
What function do we want to use with names()
to do this?
# how to change these to lower case?
names(ex2)
## [1] "Admit" "Treat1" "Treat2" "DeathDay" "TimeofDeath"
tolower()
:# how to change these to lower case
names(ex2) <- tolower(names(ex2))
names(ex2)
## [1] "admit" "treat1" "treat2" "deathday" "timeofdeath"
2.2. Currently, all the date variables are character. We want them converted to Date
(or POSIXct
for the time variable). First, there are some data entry issues. We see that the year in “Treat2” that years are in the 2100s rather than the 2000s, so first let’s fix that.
# fix these
ex2$treat2
## [1] "04-17-2114" "10-11-2113" "11-18-2114"
How can we use sub()
to match a pattern and fix the error?
211
and substitute with replacement 201
:# don't match to "21", you will match dates where day is the 21st
# (e.g. 07-21-2113)
ex2$treat2 <- sub("211", "201", ex2$treat2)
ex2$treat2
## [1] "04-17-2014" "10-11-2013" "11-18-2014"
2.3. Now let’s concatenate deathday and timeofdeath together so we can create a POSIXct
date-time variables, but let’s keep the old variables.
# concatenate these together
select(ex2, deathday, timeofdeath)
## # A tibble: 3 x 2
## deathday timeofdeath
## <chr> <time>
## 1 January 2, 2015 11:15:50
## 2 July 19, 2016 18:59:15
## 3 May 25, 2017 02:04:16
How can we do that?
paste()
or str_c()
:# concatenate these together
ex2$deathtime <- paste(ex2$deathday, ex2$timeofdeath)
ex2$deathtime
## [1] "January 2, 2015 11:15:50" "July 19, 2016 18:59:15"
## [3] "May 25, 2017 02:04:16"
2.4. Now let’s change “admit”, “treat1”, “treat2”, and “deathday” to Date
and deathtime to POSIXct
.
select(ex2, admit, treat1, treat2, deathday, deathtime)
## # A tibble: 3 x 5
## admit treat1 treat2 deathday deathtime
## <chr> <chr> <chr> <chr> <chr>
## 1 May 24, 2013 03-15-2014 04-17-2014 January 2, 2015 January 2, 2015 11:15:~
## 2 August 23, 2012 07-05-2013 10-11-2013 July 19, 2016 July 19, 2016 18:59:15
## 3 December 1, 2013 07-25-2014 11-18-2014 May 25, 2017 May 25, 2017 02:04:16
What function(s) do we want to conver to Date
? And what to convert to POSIXct
?
mdy()
to convert admit, treat1, treat2, and deathday to Date
. We want mdy_hms()
to convert deathtime to POSIXct
:ex2 <- ex2 %>%
mutate(admit=mdy(admit),
treat1=mdy(treat1),
treat2=mdy(treat2),
deathday=mdy(deathday),
deathtime=mdy_hms(deathtime))
# now everything has the correct class
str(ex2)
## tibble [3 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ admit : Date[1:3], format: "2013-05-24" "2012-08-23" ...
## $ treat1 : Date[1:3], format: "2014-03-15" "2013-07-05" ...
## $ treat2 : Date[1:3], format: "2014-04-17" "2013-10-11" ...
## $ deathday : Date[1:3], format: "2015-01-02" "2016-07-19" ...
## $ timeofdeath: 'hms' num [1:3] 11:15:50 18:59:15 02:04:16
## ..- attr(*, "units")= chr "secs"
## $ deathtime : POSIXct[1:3], format: "2015-01-02 11:15:50" "2016-07-19 18:59:15" ...
Data management often involves performing the same task over and over, whether processing a set of variables in the same way, or appending many datasets together. With repetitive tasks, we can easily copy-and-paste code with slight modifications for each specific task. However, copying and pasting code has at least 2 drawbacks:
In this section we introduce two tools that can eliminate the need to copy-and-paste code for repetitive tasks:
For this section, we will load a dataset of 10 students, with variables describing the student’s gender and scores on 4 academic tests.
## gender and 4 test scores
hsb <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/hsb_small.csv")
hsb
## # A tibble: 10 x 5
## female read write math science
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 male 63 57 55 58
## 2 female 50 49 56 47
## 3 female 50 52 45 58
## 4 male 34 49 39 42
## 5 male 50 59 42 53
## 6 female 76 63 60 67
## 7 female 52 67 57 63
## 8 female 71 65 72 66
## 9 male 60 65 58 61
## 10 male 50 40 39 49
Imagine your task is to calculate a “trimmed mean” for each of the test scores. For this trimmed mean, we want to take the mean of a column after removing the minimum and maximum score.
We could write code to calculate the trimmed mean for one column, say “read”, and then copy-and-paste that code for the other 3 variables, changing “read” to the new variable name.
# first do it for read
hsb$read
## [1] 63 50 50 34 50 76 52 71 60 50
# put the min and max in a vector
min_max_read <- c(min(hsb$read), max(hsb$read))
# we don't want the min and max (the ! means not)
!(hsb$read %in% min_max_read)
## [1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
# take all but the min and mix
trim_read <- hsb$read[!(hsb$read %in% min_max_read)]
trim_read
## [1] 63 50 50 50 52 71 60 50
# mean of trim_read
trmean_read <- mean(trim_read)
trmean_read
## [1] 55.75
We could then copy-and-paste that code three more times for each of the other variables:
# repeat for write
min_max_write <- c(min(hsb$write), max(hsb$write))
trim_write <- hsb$write[!(hsb$write %in% min_max_write)]
trmean_write <- mean(trim_write)
# then repeat for math and science...
# 6 more lines of code
While that would work fine, it is inefficient coding. First, there are numerous places for coding errors. Second, imagine we wanted to change how the trimmed means was calculated – if we wanted to trim only the minimum score before trimming the mean, we have to be careful to change the code in all the correct places.
Functions allow us to code tasks in a general way, allowing the inputs to vary. Whenever you have written the same code multiple times, consider using a function instead. We can write a function that calculates the trimmed mean for any input.
The two main decisions for writing a function are:
For our trimmed mean function the input is a numeric vector, and the output is the trimmed mean, a single number (numeric vector of length one).
Function definitions consist of the following elements:
{}
The first line of a function definition consists of its name, <-
, the keyword function
, and the arguments. For example:
my_fun <- function(arg1, arg2=0)
In that syntax, my_fun
is the name of the function, arg1
is the first argument name, arg2
is the second argument name, and its default value is set to zero. The keyword function
tells R
that my_fun
is a function (often called type closure
in R
errors/messages/warnings).
The remaining syntax is the body of code, encased in {}
. The opening {
should be on the same line as the function header.
The value returned by the function is the last line in the body (the object output_object
would be returned by the function below). You can also use return()
to return the object you want.
# not run, just a template
# my_fun is the name
# 2 arguments, arg1 has no default, arg2 has default value zero
my_fun <- function(arg1, arg2=0) {
do something to arg1 and arg2
output_object # or return(output_object)
}
Let’s create our function to compute a trimmed mean. We only need one argument, the numeric vector:
# function_name <- function(argument)
trim_mean <- function(x) {
min_max <- c(min(x), max(x)) # x is the input numeric vector
x <- x[! x %in% min_max]
mean(x) ## last line, so returned as output
}
# now use it!
c(trim_mean(hsb$read), trim_mean(hsb$write),
trim_mean(hsb$math), trim_mean(hsb$science))
## [1] 55.75000 57.37500 53.28571 56.87500
Now if we want to change how we calculate the trimmed mean, we can just incorporate those changes in the function.
Let’s change the function to just trim off the minimum before calculating the mean.
# you only have to change the function once to reestimate 4 new trimmed means
trim_mean_min <- function(x) {
x <- x[x != min(x)] # now just trimming the min
mean(x)
}
c(trim_mean_min(hsb$read), trim_mean_min(hsb$write),
trim_mean_min(hsb$math), trim_mean_min(hsb$science))
## [1] 58.00000 58.44444 55.62500 58.00000
Even better, we can add a second argument that allows you to choose whether to trim just the minimum, or both the min and max. Here the second argument is called trim=
, and it is set by default to “both”, which causes the function to trim both the min and max. If the user passes “min” to trim=
instead, it will trim just the min.
# be default, we will trim both the min and max
trim_mean_both <- function(x, trim="both") {
min_max <- c(min(x), max(x))
if (trim == "min") { # if trim="min", trim just the min
min_max <- min_max[1] # subset min_max to its first element, the min
}
x <- x[! x %in% min_max]
mean(x)
}
#defaults to trimming both
trim_mean_both(hsb$read)
## [1] 55.75
# just trim the min
trim_mean_both(hsb$read, trim="min")
## [1] 58
for
loopsLooping structures provide another tool to code repetitive tasks efficiently. Most (probably all) programming languages support looping structures, including R
. Loops in R
are useful for iterating across the columns of a dataset or even across different datasets.
The ubiquitous for
loop is a popular choice to iterate over a repetitive task because it allows the programmer to control how many times the loop iterates. The for
loop executes the same block of code for each value in a vector. R
also has while
loops that terminate when a condition is FALSE
, but we do not discuss these in this seminar.
The basic syntax of a for
loop is:
for
, and then encased in ()
, a loop control variable and a vector of values to loop over{}
For example, this loop prints “Hello world!” 5 times, once for each value in the vector 1:5:
# keyword for, then loop control variable, i, and vector 1:5 inside ()
for (i in 1:5) {
print("Hello world!")
}
## [1] "Hello world!"
## [1] "Hello world!"
## [1] "Hello world!"
## [1] "Hello world!"
## [1] "Hello world!"
The process of the loop above can be described:
i
, takes on the value 1i
is in the vector (1,2,3,4,5). If so, the code block is executed. If not, the loop ends.i
increments by 1.i
increments to 6.We often want to use the value of the loop control variable (i
above) in the code block.
Loops will generally run faster if you pre-allocate space for any output generated by the loop. Growing the output object inside a loop will be slower.
# create a vector to hold values first
result <- rep(0, 5)
# i is used as a row number and in the calculation
for (i in 1:5) {
result[i] <- i*5
}
result
## [1] 5 10 15 20 25
Now we have the tools to loop over a set of variables in a data frame, because we can access the columns by number.
# we want the test score columns
names(hsb)
## [1] "female" "read" "write" "math" "science"
# loop over columns 2 through 4
for (col in 2:4) {
print(mean(hsb[[col]])) ## use [[]] when extracting by column number
}
## [1] 55.6
## [1] 56.6
## [1] 52.3
The vector of values over which the for
loop iterates does not need to be numeric. We can repeat the loop above using the string names of the columns rather than their position:
# we want the test score columns, so drop the first
mycols <- names(hsb)[-1]
mycols
## [1] "read" "write" "math" "science"
# loop over columns 2 through 4
for (col in mycols) {
print(mean(hsb[[col]])) ## use [[]] to extract a column vector by name
}
## [1] 55.6
## [1] 56.6
## [1] 52.3
## [1] 56.4
Here is a slightly more sophisticated example of what loops can do. In this loop, we add the trimmed means for each of the 4 test scores to the dataset.
# let's add trim means to the dataset
for (col in mycols) {
hsb$tm <- trim_mean(hsb[[col]]) # add new trim mean to dataset, call it "tm" temporarily
newname = paste("tm", col, sep="_") # create new variable name like tm_read
names(hsb)[ncol(hsb)] <- newname # replace last variable name, currently tm, with newname
}
# the result
hsb
## # A tibble: 10 x 9
## female read write math science tm_read tm_write tm_math tm_science
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 male 63 57 55 58 55.8 57.4 53.3 56.9
## 2 female 50 49 56 47 55.8 57.4 53.3 56.9
## 3 female 50 52 45 58 55.8 57.4 53.3 56.9
## 4 male 34 49 39 42 55.8 57.4 53.3 56.9
## 5 male 50 59 42 53 55.8 57.4 53.3 56.9
## 6 female 76 63 60 67 55.8 57.4 53.3 56.9
## 7 female 52 67 57 63 55.8 57.4 53.3 56.9
## 8 female 71 65 72 66 55.8 57.4 53.3 56.9
## 9 male 60 65 58 61 55.8 57.4 53.3 56.9
## 10 male 50 40 39 49 55.8 57.4 53.3 56.9
In this final exercise, we will use functions and loops to convert a set of continuous variables into categorical binary variables. The binary variables will take the value 0 if the original value is less than the mean of that variable and 1 if is greater. Each of the variables may have missing data code “-99”, which should be converted to missing value NA
before calculating the mean. We will be adding the binary variables to the data set, not replacing the continuous variables.
We will practice on a dataset in which subjects take 3 surveys (q1
, q2
, q3
) at 3 different time points .
## gender and 4 test scores
ex3 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/exercise_3.csv")
# notice the missing data codes
ex3
## # A tibble: 10 x 7
## id q1 adm1 q2 adm2 q3 adm3
## <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 1 4 AL 6 SD 5 CW
## 2 2 2 SJ 2 CW 4 JL
## 3 3 4 CW 5 JL 5 SJ
## 4 4 3 JL -99 AL -99 SD
## 5 5 -99 SD 1 SJ 2 AL
## 6 6 1 AL 4 SD 6 CW
## 7 7 7 SJ 1 CW -99 JL
## 8 8 5 CW 3 JL 1 SJ
## 9 9 2 JL 2 AL 1 SD
## 10 10 1 SD 5 SJ 4 AL
3.1 First, let’s write a function that accepts a vector and performs 3 operations on that vector:
NA
NA
should be left NA
)The function will return a vector of binary values (and NA
)
Before we write the function, let’s make sure we know how to do each of the individual operations. We will practice on the first q1
variable.
# make a copy of ex3$q1 to practice
x <- ex3$q1
x
## [1] 4 2 4 3 -99 1 7 5 2 1
# this returns a logical vector
x == -99
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# we can use that to logical vector to find the -99 and convert to NA
x[x == -99] <- NA
x
## [1] 4 2 4 3 NA 1 7 5 2 1
# use mean() with na.rm=TRUE to take mean, ignoring missing values
meanx <- mean(x, na.rm=TRUE)
meanx
## [1] 3.222222
# this expression will assign TRUE to x_bin if x >= meanx, FALSE otherwise, unless x==NA
x_bin <- (x > meanx)
x_bin
## [1] TRUE FALSE TRUE FALSE NA FALSE TRUE TRUE FALSE FALSE
# using as.numeric() on a TRUE/FALSE vector will convert it to 1/0
as.numeric(x_bin)
## [1] 1 0 1 0 NA 0 1 1 0 0
3.2 Now let’s write a function to perform these 3 operations on any vector.
What elements do we need in the function header (the first line)?
dichotmoize
<-
function
x
And inside what do we enclose the body of the function?
{}
# function header
dichotomize <- function(x) {
}
3.3 Now let’s fill in the body of the function with the operations. We can use the code we wrote before when practicing on vector x
:
# function body filled in
dichotomize <- function(x) {
x[x == -99] <- NA # convert -99 to NA
meanx <- mean(x, na.rm=TRUE) # take mean, ignoring NA
x_bin <- (x > meanx) # convert to TRUE/FALSE
as.numeric(x_bin) # return TRUE/FALSE converted to 1/0
}
# check if it works
dichotomize(ex3$q1)
## [1] 1 0 1 0 NA 0 1 1 0 0
3.4 Finally, we will write for
loop to use this function on the 3 variables in our dataset, adding the 3 new variables to the dataset as new columns.
What are the elements of a for
loop declaration?
for
()
col
in
c("q1", "q2", "q3")
The code block again is encased in {}
dichotomize <- function(x) {
x[x == -99] <- NA
meanx <- mean(x, na.rm=TRUE)
x_bin <- (x > meanx)
as.numeric(x_bin)
}
# for loop, col iterates over q1, q2, q3
for (col in c("q1", "q2", "q3")) {
newcolname <- paste(col, "_bin", sep="") # creates a new column name that appends "_bin" to old column name
ex3$newvar <- dichotomize(ex3[[col]]) # call dichotomize() on current col, store in new variable temporarily named newvar
names(ex3)[ncol(ex3)] <- newcolname # rename the last variable (newvar) to newcolname
}
# view the results
ex3
## # A tibble: 10 x 10
## id q1 adm1 q2 adm2 q3 adm3 q1_bin q2_bin q3_bin
## <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 4 AL 6 SD 5 CW 1 1 1
## 2 2 2 SJ 2 CW 4 JL 0 0 1
## 3 3 4 CW 5 JL 5 SJ 1 1 1
## 4 4 3 JL -99 AL -99 SD 0 NA NA
## 5 5 -99 SD 1 SJ 2 AL NA 0 0
## 6 6 1 AL 4 SD 6 CW 0 1 1
## 7 7 7 SJ 1 CW -99 JL 1 0 NA
## 8 8 5 CW 3 JL 1 SJ 1 0 0
## 9 9 2 JL 2 AL 1 SD 0 0 0
## 10 10 1 SD 5 SJ 4 AL 0 1 1
We did it!
R
tidyverse
vignettes: vignette("dplyr")
, vignette("tidy-data")
, vignette("lubridate")
, vignette("stringr")