This seminar introduces tools in R
useful for data management.
The seminar is focused on the tools found in the tidyverse
collection of packages, but also includes various tools pulled from other packages that complement tidyverse
tools.
tidyverse: a collection of packages with tools for most aspects of data analysis, particularly strong in data import, management, and visualization. We will focus on the following packages within tidyverse:
- readr - updated data import functions like read_csv()
- dplyr - subsetting, sorting, transforming variables, grouping
- tidyr - restructuring rows and columns
- magrittr - piping a chain of commands
- lubridate - date and time variable processing
- stringr - string variable manipulation

Hmisc: another large collection of tools for most aspects of data analysis, but we use it here for describe(), a dataset summary function.
install.packages("tidyverse", dependencies=TRUE)
install.packages("Hmisc", dependencies=TRUE)
library(tidyverse)
library(Hmisc)
Datasets for statistical analysis are typically stored in data frames in R
.
Data frames are rectangular, where the columns are variables and the rows are observations of those variables. Columns can be of different types (double, character, etc.) but must be of equal length.
When we import data from a file, such as with read.csv()
, the resulting structure is typically a data frame.
Most of the tools discussed in this seminar work on data frames.
We begin our data management work on a dataset of 120 observations of 25 variables describing simulated cancer patient data. Each patient is also linked to a doctor in a separate dataset, which we will merge into this dataset later in the seminar.
We will be using the function read_csv()
, from the tidyverse
package readr
to import our data for this seminar. Although very similar to the base R function read.csv()
, read_csv()
has slightly different defaults and options, but the two can be used interchangeably for the most part. One of the primary differences is that read_csv()
will import the data as a tibble
, an updated class of data frames, which we will introduce shortly.
d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt1_dm.csv")
View()
Use View()
on a dataset to open a spreadsheet-style view of a dataset. In RStudio, clicking on a dataset in the Environment pane will View()
it.
View(d)
Throughout this seminar, we will be using tidyverse
functions for data management. These functions typically accept data frames as inputs, and output tibbles
, a special class of data frames.
Tibbles print to screen a bit differently from regular data frames – usually up to 10 rows and only as many columns as can fit on the screen. We also get each column’s “type”:
d
## # A tibble: 120 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCLA 1 1-1 6-Sep-09 male 65.0 3.70 8.09 4 68.0
## 2 UCLA 1 1-1 7-Jan-11 fema~ 53.9 2.63 0.803 2 64.7
## 3 UCLA 1 1-1 4-Sep-10 male 41.4 -99 2.13 3 86.4
## 4 UCLA 1 1-1 25-Jun-~ male 46.8 3.89 1.35 3 53.4
## 5 UCLA 1 1-1 1-Jul-09 male 51.9 1.42 2.19 4 51.7
## 6 UCLA 1 1-1 6-Mar-09 fema~ 53.8 2.29 8.61 3 78.9
## 7 UCLA 1 1-1 15-Apr-~ male 54.4 8.03 7.23 4 62.9
## 8 UCLA 1 1-11 12-Jul-~ fema~ 47.1 0.810 2.60 5 81.2
## 9 UCLA 1 1-11 25-Jul-~ fema~ 59.3 -99 5.18 4 73.2
## 10 UCLA 1 1-11 12-Jul-~ fema~ 47.1 0.810 2.60 5 81.2
## # ... with 110 more rows, and 14 more variables: co2 <dbl>, wound <dbl>,
## # mobility <dbl>, ntumors <dbl>, remission <dbl>, lungcapacity <dbl>,
## # married <dbl>, familyhx <chr>, smokinghx <chr>, cancerstage <chr>,
## # lengthofstay <dbl>, wbc <chr>, rbc <dbl>, bmi <dbl>
Tibbles are still data frames, and will work in nearly all functions that require data frames as inputs. For this seminar, anything we say about data frames will also apply to tibbles.
To convert a tibble to a regular data frame, use as.data.frame()
.
# d is of class tibble (tbl_df), class table (tbl) and class data.frame
class(d)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# now just a data.frame
class(as.data.frame(d))
## [1] "data.frame"
Read more about tibbles.
We can select a particular row and column of data frame x
, with the syntax x[rows, columns]
, where rows
and columns
are one of the following: a number, a vector of numbers, the name of a variable that stores numbers, or omitted.
Omitting the row value x[, columns]
selects all rows, and omitting the column value, x[rows, ]
, selects all columns. (Using this syntax to extract a row or column will result in a data.frame, not a vector)
The syntax x$colname
extracts the column with name colname
as a vector. We recommend this syntax for extracting columns.
The syntax x[["colname"]] and x[[colnum]] can be used to extract column vectors by name or by number, respectively. (Using this syntax to extract a column will result in a vector)
# row 3 column 2
d[3,2]
## # A tibble: 1 x 1
## hospid
## <dbl>
## 1 1
# rows 2 and 4 of column "age"
d[c(2,4), "age"]
## # A tibble: 2 x 1
## age
## <dbl>
## 1 53.9
## 2 46.8
# all columns for row 10
d[10,]
## # A tibble: 1 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCLA 1 1-11 12-Jul-~ fema~ 47.1 0.810 2.60 5 81.2 1.55
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
# all rows of column "pain"
d[,"pain"]
## # A tibble: 120 x 1
## pain
## <dbl>
## 1 4
## 2 2
## 3 3
## 4 3
## 5 4
## 6 3
## 7 4
## 8 5
## 9 4
## 10 5
## # ... with 110 more rows
# subsetting with $ creates a numeric vector, which itself can be subset
d$age[2:3]
## [1] 53.91714 41.36804
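The double-bracket syntax can be sketched on a small stand-in data frame (the values here are made up, not from d):

```r
# a small stand-in data frame (hypothetical values, not from d)
df <- data.frame(age = c(65.0, 53.9, 41.4), pain = c(4, 2, 3))

# [["colname"]] and [[colnum]] both extract a column as a plain vector
df[["age"]]   # by name
df[[2]]       # by number: the pain column

# because the result is a vector, it can be indexed directly, just like df$age
df[["age"]][2:3]
```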
We can also use logical expressions to subset. Logical comparisons such as x > 0
return TRUE/FALSE vectors.
We can place such an expression inside of []
and values with TRUE will be selected.
# logical comparison results in a logical vector
d$age > 60
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# when placed in [] before the comma, rows with TRUE are selected
# this returns columns age and pain for observations where age > 60
d[d$age>60, c("age", "pain")]
## # A tibble: 5 x 2
## age pain
## <dbl> <dbl>
## 1 65.0 4
## 2 65.8 4
## 3 358. 6
## 4 63.9 8
## 5 64.2 7
Functions perform most of the work on data in R
.
Functions in R
are much the same as they are in math – they perform some operation on an input and return some output. For example, the mathematical function \(f(x) = x^2\), takes an input \(x\), and returns its square. Similarly, the mean()
function in R
takes a vector of numbers and returns its mean.
When we load a new package into R
with library()
, the set of functions associated with that library are loaded into the R
session.
In the help file for a function (?function_name
), we will find a list of Arguments to the function, in a specific order. Arguments are the inputs to the function. Values for arguments to functions can be specified either by name or position.
In the Usage section, a value specified after an argument is its default value. Arguments without values have no defaults and usually need to be supplied by the user.
The Value section specifies what is returned. Usually there are Examples at the bottom.
Input arguments to functions can be specified by either name or position.
# specifying arguments by name
seq(from=1, to=5, by=1)
## [1] 1 2 3 4 5
# specifying arguments by position
seq(10, 0, -2)
## [1] 10 8 6 4 2 0
We highly recommend familiarizing yourself with a quick summary of the distribution of each of your variables, to check that all values are plausible and to understand your data before analysis.
The base R
function summary()
provides summary statistics for numeric variables and frequencies for factor variables (we have no factors yet). Character variables are left unsummarized.
summary(d)
## hospital hospid docid dis_date
## Length:120 Min. :1.000 Length:120 Length:120
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :1.000 Mode :character Mode :character
## Mean :1.483
## 3rd Qu.:2.000
## Max. :2.000
## sex age test1 test2
## Length:120 Min. : 34.19 Min. :-99.000 Min. :-99.000
## Class :character 1st Qu.: 47.75 1st Qu.: 1.560 1st Qu.: 2.249
## Mode :character Median : 51.83 Median : 3.107 Median : 4.162
## Mean : 53.59 Mean : -1.989 Mean : -1.226
## 3rd Qu.: 55.01 3rd Qu.: 5.707 3rd Qu.: 6.166
## Max. :357.89 Max. : 12.416 Max. : 17.228
## pain tumorsize co2 wound
## Min. :1.000 Min. : 49.10 Min. :-98.0000 Min. :2.000
## 1st Qu.:4.000 1st Qu.: 62.18 1st Qu.: 1.4954 1st Qu.:5.000
## Median :5.000 Median : 68.23 Median : 1.5859 Median :6.000
## Mean :5.325 Mean : 70.08 Mean : -0.8901 Mean :5.592
## 3rd Qu.:6.000 3rd Qu.: 77.48 3rd Qu.: 1.6860 3rd Qu.:7.000
## Max. :9.000 Max. :109.01 Max. : 1.9424 Max. :9.000
## mobility ntumors remission lungcapacity
## Min. :2.000 Min. :0.00 Min. :0.00 Min. :-99.0000
## 1st Qu.:5.000 1st Qu.:1.00 1st Qu.:0.00 1st Qu.: 0.5141
## Median :6.000 Median :2.00 Median :0.00 Median : 0.7457
## Mean :6.033 Mean :3.15 Mean :0.35 Mean :-18.3280
## 3rd Qu.:7.000 3rd Qu.:5.00 3rd Qu.:1.00 3rd Qu.: 0.8721
## Max. :9.000 Max. :9.00 Max. :1.00 Max. : 0.9982
## married familyhx smokinghx cancerstage
## Min. :0.0000 Length:120 Length:120 Length:120
## 1st Qu.:0.0000 Class :character Class :character Class :character
## Median :1.0000 Mode :character Mode :character Mode :character
## Mean :0.6333
## 3rd Qu.:1.0000
## Max. :1.0000
## lengthofstay wbc rbc bmi
## Min. :3.000 Length:120 Min. :4.360 Min. :18.45
## 1st Qu.:4.000 Class :character 1st Qu.:4.825 1st Qu.:24.51
## Median :5.000 Mode :character Median :4.978 Median :27.82
## Mean :5.308 Mean :4.970 Mean :29.38
## 3rd Qu.:6.000 3rd Qu.:5.150 3rd Qu.:34.31
## Max. :8.000 Max. :5.535 Max. :58.00
describe() from package Hmisc for more detailed summaries
The describe() function from package Hmisc will give different summary statistics for all variables in a data frame. The information provided depends on the variable's type and number of distinct values.
Important features include counts of missing values (NA) for all variables.
The results of describe() can be used to determine quickly whether variables have plausible values and distributions:
library(Hmisc)
# detailed summaries of variables
describe(d)
## d
##
## 24 Variables 120 Observations
## --------------------------------------------------------------------------------
## hospital
## n missing distinct
## 120 0 2
##
## Value UCLA UCSF
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## hospid
## n missing distinct Info Mean Gmd
## 120 0 2 0.749 1.483 0.5036
##
## Value 1 2
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## docid
## n missing distinct
## 120 0 22
##
## lowest : 1-1 1-100 1-11 1-21 1-22 , highest: 2-177 2-178 2-188 2-201 2-216
## --------------------------------------------------------------------------------
## dis_date
## n missing distinct
## 120 0 104
##
## lowest : 1-Jul-09 10-Jun-09 10-Jun-10 11-Apr-10 11-Dec-09
## highest: 9-Apr-10 9-Feb-09 9-Feb-10 9-Jun-10 9-May-10
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 120 0 3
##
## Value 12.2 female male
## Frequency 1 74 45
## Proportion 0.008 0.617 0.375
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 53.59 11.92 40.86 41.52
## .25 .50 .75 .90 .95
## 47.75 51.83 55.01 58.61 59.65
##
## lowest : 34.19229 35.31930 37.25225 39.61641 40.03724
## highest: 63.93238 64.16432 64.96824 65.80417 357.89001
##
## Value 35 40 45 50 55 60 65 360
## Frequency 3 10 16 38 35 13 4 1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
##
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## test1
## n missing distinct Info Mean Gmd .05 .10
## 120 0 111 1 -1.989 14.11 -99.0000 0.5809
## .25 .50 .75 .90 .95
## 1.5597 3.1065 5.7067 7.9931 9.5583
##
## lowest : -99.0000000 0.1048958 0.1927608 0.4293420 0.5155185
## highest: 9.7981329 10.2903990 10.4685400 11.0714440 12.4163920
##
## Value -99 0 1 2 3 4 5 6 7 8 9
## Frequency 7 3 20 17 21 12 8 11 6 7 1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##
## Value 10 11 12
## Frequency 5 1 1
## Proportion 0.042 0.008 0.008
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## test2
## n missing distinct Info Mean Gmd .05 .10
## 120 0 111 1 -1.226 14.59 -99.0000 0.7881
## .25 .50 .75 .90 .95
## 2.2494 4.1620 6.1657 8.9973 10.9861
##
## lowest : -99.0000000 0.5807591 0.6179262 0.6571499 0.7485080
## highest: 11.4315750 12.4493730 14.2252270 14.5365420 17.2275810
##
## Value -99 1 2 3 4 5 6 7 8 9 10
## Frequency 7 15 12 19 16 15 9 6 6 6 1
## Proportion 0.058 0.125 0.100 0.158 0.133 0.125 0.075 0.050 0.050 0.050 0.008
##
## Value 11 12 14 15 17
## Frequency 4 1 1 1 1
## Proportion 0.033 0.008 0.008 0.008 0.008
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## pain
## n missing distinct Info Mean Gmd
## 120 0 9 0.961 5.325 1.742
##
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##
## Value 1 2 3 4 5 6 7 8 9
## Frequency 1 2 11 21 33 25 18 5 4
## Proportion 0.008 0.017 0.092 0.175 0.275 0.208 0.150 0.042 0.033
## --------------------------------------------------------------------------------
## tumorsize
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 70.08 12.92 53.48 55.97
## .25 .50 .75 .90 .95
## 62.18 68.23 77.48 84.06 92.54
##
## lowest : 49.09861 50.28009 50.50217 51.65727 53.19937
## highest: 98.05510 98.32850 98.70570 102.69671 109.00956
## --------------------------------------------------------------------------------
## co2
## n missing distinct Info Mean Gmd .05 .10
## 120 0 115 1 -0.8901 5.03 1.390 1.440
## .25 .50 .75 .90 .95
## 1.495 1.586 1.686 1.756 1.793
##
## lowest : -98.000000 1.327440 1.362927 1.372113 1.391310
## highest: 1.810847 1.820266 1.905606 1.920025 1.942401
##
## Value -98 1 2
## Frequency 3 29 88
## Proportion 0.025 0.242 0.733
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## wound
## n missing distinct Info Mean Gmd
## 120 0 8 0.958 5.592 1.767
##
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##
## Value 2 3 4 5 6 7 8 9
## Frequency 5 10 12 23 36 22 10 2
## Proportion 0.042 0.083 0.100 0.192 0.300 0.183 0.083 0.017
## --------------------------------------------------------------------------------
## mobility
## n missing distinct Info Mean Gmd
## 120 0 8 0.969 6.033 2.264
##
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##
## Value 2 3 4 5 6 7 8 9
## Frequency 6 4 19 19 28 15 5 24
## Proportion 0.050 0.033 0.158 0.158 0.233 0.125 0.042 0.200
## --------------------------------------------------------------------------------
## ntumors
## n missing distinct Info Mean Gmd .05 .10
## 120 0 10 0.981 3.15 3.101 0 0
## .25 .50 .75 .90 .95
## 1 2 5 8 9
##
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 23 21 19 9 12 11 7 5 5 8
## Proportion 0.192 0.175 0.158 0.075 0.100 0.092 0.058 0.042 0.042 0.067
## --------------------------------------------------------------------------------
## remission
## n missing distinct Info Sum Mean Gmd
## 120 0 2 0.683 42 0.35 0.4588
##
## --------------------------------------------------------------------------------
## lungcapacity
## n missing distinct Info Mean Gmd .05 .10
## 120 0 96 0.995 -18.33 31.26 -99.0000 -99.0000
## .25 .50 .75 .90 .95
## 0.5142 0.7457 0.8721 0.9573 0.9790
##
## lowest : -99.0000000 -98.0000000 0.2949074 0.3264440 0.3450253
## highest: 0.9856416 0.9864109 0.9924814 0.9940955 0.9982018
##
## Value -99.0 -98.0 0.2 0.4 0.6 0.8 1.0
## Frequency 20 3 1 6 21 46 23
## Proportion 0.167 0.025 0.008 0.050 0.175 0.383 0.192
##
## For the frequency table, variable is rounded to the nearest 0.2
## --------------------------------------------------------------------------------
## married
## n missing distinct Info Sum Mean Gmd
## 120 0 2 0.697 76 0.6333 0.4683
##
## --------------------------------------------------------------------------------
## familyhx
## n missing distinct
## 120 0 3
##
## Value -99 no yes
## Frequency 6 97 17
## Proportion 0.050 0.808 0.142
## --------------------------------------------------------------------------------
## smokinghx
## n missing distinct
## 120 0 4
##
## Value -99 current former never
## Frequency 6 26 22 66
## Proportion 0.050 0.217 0.183 0.550
## --------------------------------------------------------------------------------
## cancerstage
## n missing distinct
## 120 0 4
##
## Value I II III IV
## Frequency 40 54 17 9
## Proportion 0.333 0.450 0.142 0.075
## --------------------------------------------------------------------------------
## lengthofstay
## n missing distinct Info Mean Gmd
## 120 0 6 0.933 5.308 1.271
##
## lowest : 3 4 5 6 7, highest: 4 5 6 7 8
##
## Value 3 4 5 6 7 8
## Frequency 5 29 29 40 15 2
## Proportion 0.042 0.242 0.242 0.333 0.125 0.017
## --------------------------------------------------------------------------------
## wbc
## n missing distinct
## 120 0 116
##
## lowest : 3671.880371 4176.054199 4201.741211 4238.355957 4331.902344
## highest: 7999.091309 8340.71582 8415.605469 8567.246094 not assessed
## --------------------------------------------------------------------------------
## rbc
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 4.97 0.2885 4.526 4.606
## .25 .50 .75 .90 .95
## 4.825 4.978 5.150 5.285 5.353
##
## lowest : 4.359662 4.436482 4.456108 4.465468 4.470100
## highest: 5.441604 5.442614 5.459067 5.502604 5.535052
## --------------------------------------------------------------------------------
## bmi
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 29.38 7.278 20.71 22.07
## .25 .50 .75 .90 .95
## 24.51 27.82 34.31 37.75 40.71
##
## lowest : 18.44992 18.68505 20.07485 20.17994 20.49195
## highest: 42.84858 44.04223 46.50746 52.30723 58.00000
## --------------------------------------------------------------------------------
Here are some variables with values that we are not sure are correct:
describe(d[,c("age", "sex", "test1")])
## d[, c("age", "sex", "test1")]
##
## 3 Variables 120 Observations
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 120 0 117 1 53.59 11.92 40.86 41.52
## .25 .50 .75 .90 .95
## 47.75 51.83 55.01 58.61 59.65
##
## lowest : 34.19229 35.31930 37.25225 39.61641 40.03724
## highest: 63.93238 64.16432 64.96824 65.80417 357.89001
##
## Value 35 40 45 50 55 60 65 360
## Frequency 3 10 16 38 35 13 4 1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
##
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 120 0 3
##
## Value 12.2 female male
## Frequency 1 74 45
## Proportion 0.008 0.617 0.375
## --------------------------------------------------------------------------------
## test1
## n missing distinct Info Mean Gmd .05 .10
## 120 0 111 1 -1.989 14.11 -99.0000 0.5809
## .25 .50 .75 .90 .95
## 1.5597 3.1065 5.7067 7.9931 9.5583
##
## lowest : -99.0000000 0.1048958 0.1927608 0.4293420 0.5155185
## highest: 9.7981329 10.2903990 10.4685400 11.0714440 12.4163920
##
## Value -99 0 1 2 3 4 5 6 7 8 9
## Frequency 7 3 20 17 21 12 8 11 6 7 1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##
## Value 10 11 12
## Frequency 5 1 1
## Proportion 0.042 0.008 0.008
##
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
Plots of describe() results
The results of describe() can be plotted:
These plots are also useful to inspect distributions of variables and to look for suspicious values.
In the dot plots, look for suspicious labels for categorical variables (see variables sex, familyhx and smokinghx below).
In the histogram plots, each histogram has its own scale, so bunched-up histograms often denote the presence of extreme values (see variables age, test1, test2, co2 and lungcapacity).
plot(describe(d))
## $Categorical
##
## $Continuous
A look at the output of our dataset summaries, as well as the plot of describe(), identified several suspicious values:
- age has value 357.89001, likely a data entry error
- sex has value "12.2", also a data entry error
- test1, test2, lungcapacity, smokinghx, familyhx have value -99, a missing value code
- co2 and lungcapacity have value -98, another missing value code
- wbc seems to be a numeric variable, but has the value "not assessed", which causes R to read it in as character

Because we don't know the correct replacement values, we will change suspicious values to NA, the missing value indicator in R.
Note: Many times, datasets differentiate between different types of missing. For example, -99=“absent”, while -98=“refused to answer”. For analysis purposes, using NA
for both will usually suffice. For users that need to distinguish between different kinds of missing, we suggest looking into the as.item()
function from package memisc
.
We can use logical subsetting to assign NA
to the data entry errors:
# change all impossible age values to NA
# assume all adults in this dataset
d$age[d$age<18|d$age>120] <- NA
# remember to use quotes for character variables
d$sex[d$sex == "12.2"] <- NA
Now we want to translate all -99, -98, and “not assessed” values to NA
. The following syntax can be used to translate all values of a
to b
in dataset df
:
df[df==a] <- b
d[d==-99] <- NA
d[d==-98] <- NA
d[d=="not assessed"] <- NA
complete.cases(x)
returns a logical (TRUE/FALSE) vector of whether each row of x
has no missing. Since TRUE=1 and FALSE=0, a sum of this vector is a count of complete cases.
# which cases are complete
complete.cases(d)
## [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [13] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [25] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [37] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [49] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## [73] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [85] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [97] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE
## [109] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
# number of complete cases
sum(complete.cases(d))
## [1] 82
# create a dataset of complete cases
d_comp <- d[complete.cases(d),]
describe() plot after cleaning data
The plot of describe() looks better now – no suspicious labels in the dot plots and nicely spread out histograms.
Color indicates the number of missing values for that variable.
plot(describe(d))
## $Categorical
##
## $Continuous
plot(describe(d), which="continuous")
The dplyr package
The dplyr package provides tools for some of the most common data management tasks. Its primary functions are "verbs" to help you think about what you need to do to your dataset:
- filter(): select rows according to conditions
- select(): select columns (you can rename as you select)
- arrange(): sort rows
- mutate(): add new columns
- summarize(): calculate statistics across groups

A great convenience of dplyr functions: the first argument is a data frame, and all variables supplied as arguments are assumed to be columns of that data frame, so you only need to specify the dataset name once.
package is automatically loaded with library(tidyverse)
.
filter()
We have already seen how to subset using logical vectors and logical subsetting:
# all variables for patients with pain>=9
pain9 <- d[d$pain>=9,]
pain9
## # A tibble: 4 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-177 9-Jun-10 male 58.7 7.62 9.31 9 63.5 1.48
## 2 UCSF 2 2-188 15-Jul-~ fema~ 41.5 4.00 7.57 9 98.3 1.82
## 3 UCSF 2 2-216 11-Apr-~ fema~ 52.0 6.70 9.37 9 78.5 1.70
## 4 UCSF 2 2-177 9-Jun-10 male 58.7 7.62 9.31 9 63.5 1.48
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
While that works well enough, the code can get unwieldy if there are many conditions to evaluate, especially with a long dataset name.
The dplyr
function filter()
provides a cleaner syntax for subsetting datasets. Conditions separated by ,
are joined by &
(logical AND).
# subset to females with pain >= 9
dfhp <- filter(d, sex=="female", pain>=9)
dfhp
## # A tibble: 2 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-188 15-Jul-~ fema~ 41.5 4.00 7.57 9 98.3 1.82
## 2 UCSF 2 2-216 11-Apr-~ fema~ 52.0 6.70 9.37 9 78.5 1.70
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
R logical operators and functions
Here are some operators and functions to help with selection:
- ==: equality
- >, >=: greater than, greater than or equal to
- !: not
- &: AND
- |: OR
- %in%: matches any of (2 %in% c(1,2,3) = TRUE)
- is.na(): equality to NA
- near(): checks equality for floating point (decimal) numbers, with a built-in tolerance

# select those not in UCLA, who are either younger than 40 or older than 60
filter(d, hospital!="UCLA", age<40 | age>60)
## # A tibble: 3 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-113 4-Feb-10 male 63.9 4.77 1.50 8 75.1 1.56
## 2 UCSF 2 2-178 19-Apr-~ male 34.2 4.63 3.26 3 65.5 1.64
## 3 UCSF 2 2-201 21-Nov-~ fema~ 64.2 2.03 6.20 7 79.7 1.72
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## # remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## # smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## # rbc <dbl>, bmi <dbl>
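As a sketch of %in% and near() in action (using a small made-up data frame rather than d):

```r
library(dplyr)

# small made-up data frame (not part of the seminar dataset)
pts <- data.frame(
  hospital = c("UCLA", "UCSF", "Mayo", "UCSF"),
  stage    = c("I", "IV", "II", "III"),
  ratio    = c(0.1 + 0.2, 0.5, 0.3, 0.25)
)

# %in% matches any value in a set
filter(pts, stage %in% c("III", "IV"))

# near() compares decimals with a built-in tolerance:
# 0.1 + 0.2 == 0.3 is FALSE due to floating point representation,
# but near(0.1 + 0.2, 0.3) is TRUE
filter(pts, near(ratio, 0.3))
```

Note that the near() filter keeps both the 0.1 + 0.2 row and the 0.3 row, while ratio == 0.3 would silently drop the first.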
select()
Use dplyr
function select()
to keep only the variables you need.
# select 4 variables
d_small <- select(d, hospid, docid, age, cancerstage)
head(d_small, n=3)
## # A tibble: 3 x 4
## hospid docid age cancerstage
## <dbl> <chr> <dbl> <chr>
## 1 1 1-1 65.0 II
## 2 1 1-1 53.9 II
## 3 1 1-1 41.4 I
Often we want to select a group of related variables with similar names. dplyr
supplies helper functions to find columns whose names match a specified pattern:
- starts_with(x): matches names that begin with the string x
- ends_with(x): matches names that end with the string x
- contains(x): matches names that contain the string x
- matches(re): matches names that match the regular expression re
- num_range(prefix, range): matches names that contain prefix followed by one element of range
# get hospital variables
dhosp <- select(d, starts_with("hosp"))
names(dhosp)
## [1] "hospital" "hospid"
# get "test1" and "test2" variables using num_range
tests <- select(d, num_range("test", 1:2))
names(tests)
## [1] "test1" "test2"
select() to reorder columns
You can specify a particular ordering of columns with select(). To place the remaining columns in their current ordering, use everything(). You can also rename variables with select().
# put the two test vars first while renaming them,
# then everything else
tests_first <- select(d, t1=test1, t2=test2, everything())
names(tests_first)
## [1] "t1" "t2" "hospital" "hospid" "docid"
## [6] "dis_date" "sex" "age" "pain" "tumorsize"
## [11] "co2" "wound" "mobility" "ntumors" "remission"
## [16] "lungcapacity" "married" "familyhx" "smokinghx" "cancerstage"
## [21] "lengthofstay" "wbc" "rbc" "bmi"
arrange()
Sort the order of rows by variable values using arrange()
from dplyr
.
By default, rows are sorted in ascending order. Surround a sorting variable with desc() to sort in descending order instead.
# sort, with males before females, then by age, youngest first
arrange(d, desc(sex), age)
## # A tibble: 120 x 24
## hospital hospid docid dis_date sex age test1 test2 pain tumorsize co2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UCSF 2 2-178 19-Apr-~ male 34.2 4.63 3.26 3 65.5 1.64
## 2 UCLA 1 1-58 9-Feb-10 male 35.3 2.14 12.4 6 60.5 1.56
## 3 UCSF 2 2-121 20-Apr-~ male 40.1 NA NA 5 73.5 1.65
## 4 UCLA 1 1-1 4-Sep-10 male 41.4 NA 2.13 3 86.4 1.45
## 5 UCSF 2 2-121 22-Aug-~ male 41.5 2.88 6.80 5 60.2 1.50
## 6 UCSF 2 2-163 19-Apr-~ male 42.5 1.63 0.749 4 54.0 1.72
## 7 UCLA 1 1-11 19-Aug-~ male 43.7 10.5 5.89 7 61.3 1.49
## 8 UCLA 1 1-1 25-Jun-~ male 46.8 3.89 1.35 3 53.4 1.57
## 9 UCLA 1 1-21 22-Apr-~ male 48.0 3.39 NA 8 79.6 1.55
## 10 UCSF 2 2-178 12-Mar-~ male 48.9 4.15 3.04 7 71.8 1.47
## # ... with 110 more rows, and 13 more variables: wound <dbl>, mobility <dbl>,
## # ntumors <dbl>, remission <dbl>, lungcapacity <dbl>, married <dbl>,
## # familyhx <chr>, smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>,
## # wbc <chr>, rbc <dbl>, bmi <dbl>
mutate()
The function mutate() allows us to transform many variables in one step without having to respecify the data frame name over and over.
Useful R functions for transforming:
- log(): logarithm
- min_rank(): rank values
- cut(): cut a continuous variable into intervals, with a new integer value signifying into which interval the original value falls
- scale(): standardizes a variable (subtracts the mean and divides by the standard deviation)
- lag(), lead(): lag and lead a variable (from dplyr)
- cumsum(): cumulative sum
- rowMeans(), rowSums(): means and sums of several columns
- recode(): recode values (from dplyr)

# create age category variable, and highpain binary variable
d <- mutate(d,
agecat = cut(age, breaks=c(30,40,50,60,70,120)),
highpain = pain > mean(pain))
table(d$agecat, d$highpain)
##
## FALSE TRUE
## (30,40] 2 2
## (40,50] 27 15
## (50,60] 37 32
## (60,70] 2 2
## (70,120] 0 0
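A few of the transformation functions listed above can be illustrated in base R on a toy vector (hypothetical values, not from the patient data):

```r
# toy vector (hypothetical values, not from the patient data)
x <- c(2, 4, 8)

cumsum(x)                     # running total: 2, 6, 14
as.vector(scale(x))           # standardized to mean 0, sd 1
cut(x, breaks = c(0, 3, 10))  # intervals: (0,3], (3,10], (3,10]
```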
For many datasets we want to calculate statistics by group, rather than over the whole dataset. The dplyr functions group_by() and summarise() together greatly simplify group operations.
First, we create a grouped data frame (tibble) with group_by(). We will group our patient dataset by doctor id.
# group_by creates a grouped_df (grouped data frame) class structure
by_doc <- group_by(d, docid)
class(by_doc)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Next we use summarise() and statistical or numerical functions to summarize variables by group.
Examples of useful functions for summarise():

min(), max(), mean(), sum(), var(), sd()
n(): number of observations in the group
n_distinct(x): number of distinct values in variable x
# Create summaries of patients by doctor
pat_summ <- summarise(by_doc,
n_pat = n(), # number of patients
longest_los = max(lengthofstay), # longest length of stay
avg_age = mean(age), # average age
n_stage_four = sum(cancerstage=="IV") # number of stage IV patients
)
pat_summ
## # A tibble: 22 x 5
## docid n_pat longest_los avg_age n_stage_four
## <chr> <int> <dbl> <dbl> <int>
## 1 1-1 7 6 52.5 0
## 2 1-100 9 7 NA 0
## 3 1-11 8 7 51.7 1
## 4 1-21 1 4 48.0 0
## 5 1-22 4 6 48.8 0
## 6 1-33 6 7 49.0 0
## 7 1-48 7 7 51.3 1
## 8 1-57 3 5 45.6 0
## 9 1-58 3 7 51.1 0
## 10 1-72 8 8 50.3 0
## # ... with 12 more rows
The dplyr package provides summarize() for those who prefer the American spelling of summarise(). However, the package Hmisc also contains a summarize() function, and if both packages are loaded into the same session, a conflict arises between the two summarize() functions. The function from the package that was loaded first will be masked when the second package is loaded (with an accompanying red warning message). The function loaded second will be the default function used when summarize() is specified.
So, if Hmisc is loaded first and dplyr loaded second, specifying summarize(x) will call summarize() from dplyr.
One can still access the masked function from the first package in a couple of ways:

use the package::function() syntax, for example Hmisc::summarize(x)
detach the second package with detach(package:name), for example detach(package:dplyr)
%>%
A data management task may involve many steps to reach the final desired dataset. Often, during intermediate steps, datasets are generated that we don’t care about or plan on keeping. For these multi-step tasks, the pipe operator provides a useful, time-saving and code-saving shorthand.
Thinking of names for intermediate datasets takes time, and the names clutter code. Piping makes your code more readable by focusing on the functions used rather than the names of datasets.
The pipe operator is %>%, which can be typed with CTRL-SHIFT-M (CMD-SHIFT-M on a Mac). The pipe operator can be expressed as "then" when reading code. The operator is loaded with package magrittr (automatically loaded with library(tidyverse)).
The pipe operator “pipes” the dataset on the left of the %>%
operator to the function on the right of the operator.
The code x %>% f(y)
translates to f(x,y)
, that is, x
is treated by default as the first argument of f()
. If the function returns a data frame, we can then pipe this data frame into another function. Thus x %>% f(y) %>% g(z)
translates to g(f(x,y), z)
.
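A minimal sketch of this translation on a toy numeric vector (the %>% operator comes from magrittr, which is attached automatically by library(tidyverse)):

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9)

# x %>% sqrt() %>% sum() translates to sum(sqrt(x))
x %>% sqrt() %>% sum()  # 6
sum(sqrt(x))            # 6
```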
As a first example, perhaps we want to create a dataset of just females under 40, with only the age and pain variables selected. We could do this in 2 steps, like so:
# a dataset of age and pain for females younger than 40
f40 <- filter(d, sex=="female" & age<40)
f40_small <- select(f40, age, pain)
f40_small
## # A tibble: 2 x 2
## age pain
## <dbl> <dbl>
## 1 37.3 8
## 2 39.6 5
While that works fine, the intermediate dataset f40 is not of interest and clutters up memory and the workspace unnecessarily.
We could use %>%
instead:
# start with d, then filter rows, then select variables
f40_small <- d %>%
filter(sex=="female" & age<40) %>%
select(age, pain)
f40_small
## # A tibble: 2 x 2
## age pain
## <dbl> <dbl>
## 1 37.3 8
## 2 39.6 5
No intermediate dataset was created, and dataset names are rarely needed in intermediate steps. The dataset transformations, filtering and selecting, are easy to see.
Piping really helps make a long chain of commands more immediately understandable.
# create a plot of average age vs tumor size
# by doctors, for doctors with more than 5 patients
g1 <- d %>%
group_by(docid) %>%
summarise(n_pat=n(),
avg_age=mean(age),
avg_tumor=mean(tumorsize)) %>%
filter(n_pat > 5) %>%
ggplot(aes(x=avg_age, y=avg_tumor)) + geom_point() +
geom_smooth()
g1
. in piped commands
If the dataset is not to be specified as the first argument of the function on the right of the %>% operator, use . to specify where the dataset should appear as the argument.
x %>% f(y, .) translates to f(y, x).
For example, for modeling functions such as lm(), the model formula is typically the first argument rather than the dataset, but we can use . to stand in for the dataset in the piping chain:
d %>%
filter(age < 40) %>%
lm(tumorsize ~ age, data=.) # the . is the filtered dataset
##
## Call:
## lm(formula = tumorsize ~ age, data = .)
##
## Coefficients:
## (Intercept) age
## -78.219 4.116
Often datasets are split into multiple files, perhaps because data are collected in several waves or by different researchers. When files share the same variables (hopefully!), we can append the datasets, or bind their rows together.
rbind() and bind_rows()
We can use either the base R function rbind() or the dplyr function bind_rows() to append. They differ in how they handle the situation where the datasets being appended have non-matching columns:

rbind() will produce an error
bind_rows() will append the datasets, filling in NA values for missing values in unmatched columns

They also differ in how they handle the situation where the same column in the two datasets has two different types (e.g. one character, the other numeric):

rbind() will coerce one type to the other (direction of coercion is logical -> integer -> double -> character)
bind_rows() will produce an error

Let's load a second dataset that contains the same variables as the original version of our first dataset but with 2 variables omitted.
# new data set that contains the same variables as d, except is missing 2 of them
d2 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt2_dm.csv")
# rows and columns of d2 and d
dim(d2)
## [1] 111 24
dim(d)
## [1] 120 26
Because our first data set d has a few extra variables we created earlier with mutate(), the two datasets do not have completely matching columns:

rbind() on the datasets as they are will produce an error
bind_rows() will produce NA values for the observations from the second dataset on the variables that appear only in the first dataset

The best solution would be to calculate the new variables for the second dataset and then use either function to append. However, to demonstrate the behavior of bind_rows(), we append the datasets as they are.
Use the .id= argument in bind_rows() to create a variable that identifies the source dataset.
# a new variable called source is added to the beginning of the dataset
d3 <- bind_rows(d, d2, .id="source")
# these are the rows where the datasets meet
# hospital is found in both datasets, agecat and highpain are not
select(d3, source, hospital, agecat, highpain)[118:123,]
## # A tibble: 6 x 4
## source hospital agecat highpain
## <chr> <chr> <fct> <lgl>
## 1 1 UCSF (50,60] TRUE
## 2 1 UCSF (50,60] TRUE
## 3 1 UCSF (50,60] TRUE
## 4 2 Cedars-Sinai <NA> NA
## 5 2 Cedars-Sinai <NA> NA
## 6 2 Cedars-Sinai <NA> NA
# this will work because we restrict d to only variables common to both
drbind <- rbind(d[,1:24], d2)
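The coercion behavior of rbind() described above can be sketched with two hypothetical toy data frames whose score column differs in type:

```r
# score is numeric in a but character in b (toy data, not the patient files)
a <- data.frame(id = 1:2, score = c(10, 20))
b <- data.frame(id = 3:4, score = c("high", "low"))

ab <- rbind(a, b)
class(ab$score)  # "character": the numeric scores were coerced to character
```

By contrast, bind_rows(a, b) would stop with an error because the two score columns have incompatible types.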
dplyr joins
Appending adds more rows of observations, whereas merging adds more columns of variables. Datasets to be merged should be matched on some id variable(s).
The dplyr "join" functions perform such merges and will use any same-named variables between the datasets as the id variables by default. Use the by= argument to specify specific matching id variables.
These joins all return a table with all columns from x and y, but differ in how they deal with mismatched rows:

inner_join(x, y): returns all rows from x where there is a matching value in y (returns only matching rows)
left_join(x, y): returns all rows from x; unmatched rows in x will have NA in the columns from y, and unmatched rows in y are not returned
full_join(x, y): returns all rows from x and from y; unmatched rows in either will have NA in the new columns

Each of these joins is possible with the base R function merge(), but merge() is many times slower than the joins.
inner_join()
We begin our demonstration of merging by loading a dataset of doctor variables, describing each doctor's years of experience, schooling, number of lawsuits, etc.
d_doc <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/doctor_dm.csv")
It is easiest to show how the joins differ if we work on smaller, more manageable datasets. We will be merging a subset of the d_doc dataset we just imported into a subset of the d3 dataset we created earlier (the two patient datasets appended together). We will select two doctors from d3, doctors "1-21" and "2-178", and one matching doctor and one non-matching doctor from d_doc, "2-178" and "3-407".
# select one non-matching and one matching doctor from each to demo joins
# just a few variables from d3
d3.1 <- select(filter(d3, docid == "1-21" | docid == "2-178"),
docid, sex, age, test1, test2)
d_doc.1 <- filter(d_doc, docid == "3-407" | docid == "2-178")
Let's take a look at our small datasets:
# first a look at the two datasets to be merged
# 2 rows for docid 2-178 in d3.1
d3.1
## # A tibble: 3 x 5
## docid sex age test1 test2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 1-21 male 48.0 3.39 NA
## 2 2-178 male 34.2 4.63 3.26
## 3 2-178 male 48.9 4.15 3.04
d_doc.1
## # A tibble: 2 x 5
## docid experience school lawsuits medicaid
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 2-178 15 average 4 0.817
## 2 3-407 23 average 3 0.343
The merging id variable "docid" appears in both datasets, so we do not need to specify the merging variable.
Notice how inner_join(), left_join(), and full_join() differ in their behavior:
# only matching rows returned
# 2-178 from d_doc.1 matched twice to 2-178 in d3.1
inner_join(d3.1, d_doc.1)
## # A tibble: 2 x 9
## docid sex age test1 test2 experience school lawsuits medicaid
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2-178 male 34.2 4.63 3.26 15 average 4 0.817
## 2 2-178 male 48.9 4.15 3.04 15 average 4 0.817
# all rows from d3.1 returned
left_join(d3.1, d_doc.1)
## # A tibble: 3 x 9
## docid sex age test1 test2 experience school lawsuits medicaid
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1-21 male 48.0 3.39 NA NA <NA> NA NA
## 2 2-178 male 34.2 4.63 3.26 15 average 4 0.817
## 3 2-178 male 48.9 4.15 3.04 15 average 4 0.817
# all rows from both returned
full_join(d3.1, d_doc.1)
## # A tibble: 4 x 9
## docid sex age test1 test2 experience school lawsuits medicaid
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1-21 male 48.0 3.39 NA NA <NA> NA NA
## 2 2-178 male 34.2 4.63 3.26 15 average 4 0.817
## 3 2-178 male 48.9 4.15 3.04 15 average 4 0.817
## 4 3-407 <NA> NA NA NA 23 average 3 0.343
We now explore the tools in the tidyverse package tidyr (loaded with library(tidyverse)), which make datasets "tidy".
Tidy datasets adhere to a simple set of structural rules:

each variable forms a column
each observation forms a row
each value has its own cell

Tidy data conventions provide a standard organization that prevents having to "rethink the wheel" each time data need to be organized. Doing a little more work upfront to make data "tidy" will pay off with time saved later on.
Each row of the dataset should represent one unit of analysis. Units may be subjects, or trials within subjects, or whole groups of subjects.
Some analyses may involve both variables describing individual units and aggregated units. For example, a student-level analysis may involve school-level variables. Longitudinal studies may involve within-subject and between-subject variables. For multilevel analyses, the unit of analysis is generally the lowest level.
# data at multiple levels - patients nested in doctors
d %>%
inner_join(pat_summ) %>%
select(sex, age, lengthofstay, docid, avg_age, longest_los) %>%
head(n=3)
## # A tibble: 3 x 6
## sex age lengthofstay docid avg_age longest_los
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 male 65.0 6 1-1 52.5 6
## 2 female 53.9 6 1-1 52.5 6
## 3 male 41.4 5 1-1 52.5 6
The use of data not collected by the analyst makes encountering untidy data more common than desired.
Data can be messy in many ways, but we focus on two ways that can be tidied with tidyr tools:

column headings are values of a variable rather than variable names
values of multiple variables are stored in a single column

Let's take a look at some untidy datasets and see how the tidyr functions pivot_longer() and pivot_wider() will help us to tidy them.
We first load a dataset of the number of graduates from 3 academic departments across 3 years. The planned analysis will examine whether graduation has grown over the years.
The unit of analysis here is the graduation rate in a particular department in a particular year. Each row should thus represent one year in one department.
dept <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dept1.csv")
dept
## # A tibble: 3 x 4
## id `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl>
## 1 biology 207 211 259
## 2 math 96 75 99
## 3 physics 112 126 125
Year is a predictor in our analysis, so should be a column variable. Each department has 3 years of data so each requires 3 rows of data. Additionally, the entire table of values measures the same thing, graduation numbers in one department in one year, so we can put all values in one column.
pivot_longer() to create a variable out of column headings and restructure the dataset
To use pivot_longer(), we select a set of columns for reshaping:

the selected columns' headings are stacked into a new column variable
the selected columns' values are stacked into another new column variable

This process is also known as "reshaping long".
Arguments to pivot_longer():

names_to=: the name of the new column variable that will hold the stacked selected column headings
values_to=: the name of the new column variable that will hold the stacked values of the selected columns

By default, pivot_longer() will select all columns for stacking. However, columns that do not vary with values in the new names_to= column should not be stacked.
Here again is the data set we want to reshape:
dept
## # A tibble: 3 x 4
## id `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl>
## 1 biology 207 211 259
## 2 math 96 75 99
## 3 physics 112 126 125
In the newly reshaped dataset, we want to put years and graduation rates into 2 columns. Since years are the column headings, we will name the names_to= variable "year". The graduation rate values in the columns 2015, 2016, and 2017 will be stacked into another new variable whose name we specify in values_to= as "grad".
Department name (id) does not vary with year, so we do not need to stack it. We can specify which columns are to be reshaped either by listing only the columns to be collapsed, as c(`2015`, `2016`, `2017`)*, or by specifying the columns not to be collapsed after the symbol -, as in -id.
*Note that the backticks ` are needed to specify these column names because they do not start with a letter.
# the new column "year" uses the column headings as values,
# the new column "graduates" will be the collapsed values
# we do not want to collapse id
dept_by_year <- dept %>%
pivot_longer(names_to="year", values_to="grad", -id)
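To confirm the expected shape of the result, the same reshaping can be sketched in base R with a small copy of the dept table: the long format should have one row per department-year, 9 rows in total.

```r
# toy copy of the dept table shown above
dept0 <- data.frame(id = c("biology", "math", "physics"),
                    `2015` = c(207, 96, 112),
                    `2016` = c(211, 75, 126),
                    `2017` = c(259, 99, 125),
                    check.names = FALSE)

# base R equivalent of the pivot_longer() call
long <- reshape(dept0, direction = "long",
                varying = c("2015", "2016", "2017"),
                v.names = "grad",
                times = c("2015", "2016", "2017"),
                timevar = "year", idvar = "id")
nrow(long)  # 9: 3 departments x 3 years
```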