This workshop introduces R tools for data management. We focus on functions found in the tidyverse collection of packages, but also include tools found in base R and other packages.

We will cover the following topics:
- Review of data frames
- Review of functions
- Data cleaning and missing values
- Data management with dplyr
- Combining datasets
- Tidy data with tidyr
- Managing dates with lubridate
- Managing strings with stringr
- Factors with forcats
tidyverse: a collection of packages with tools for most aspects of data analysis, particularly strong in data import, management, and visualization. We will focus on the following packages within tidyverse:

- readr - updated data import functions like read_csv()
- dplyr - subsetting, sorting, transforming variables, grouping
- tidyr - restructuring rows and columns
- lubridate - date and time variable processing
- stringr - string variable manipulation
- forcats - factor variable manipulation

Hmisc: another large collection of tools for most aspects of data analysis, but we use it here for describe(), a dataset summary function
When tidyverse is installed with install.packages(), all of the tidyverse packages will be installed. When loading tidyverse packages into an R session with library(), only the core packages are loaded.
# load packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Non-core tidyverse packages must be loaded individually.
# tidyverse package readxl is used to import Excel files
library(readxl)
# tidyverse package haven is used to import Stata, SAS, and SPSS files
library(haven)
Another set of tools comes with the Hmisc package.
Data sets for statistical analysis are typically stored in data
frames in R
.
Data import functions like read.csv()
typically output a
data frame.
# read in data over internet
dex1 <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/excercise_1.csv")
# class of dex1 object
class(dex1)
## [1] "data.frame"
# first few rows
head(dex1)
## id sex ses school_prog read write math science social_st
## 1 1 female low public/vocation -99 44 45 39 41
## 2 2 female middle public/general 39 41 33 42 41
## 3 3 male low private/academic 63 65 48 63 56
## 4 4 female low public/academic 44 50 41 39 51
## 5 5 male low public/academic 47 40 43 45 31
## 6 6 female low public/academic 47 41 46 40 41
## honors awards
## 1 not enrolled 0
## 2 enrolled 1
## 3 enrolled 5
## 4 not enrolled 1
## 5 not enrolled 0
## 6 not enrolled 0
We can also create data frames manually using
data.frame()
:
# create data frame manually
dt <- data.frame(
x = runif(5), # 5 random numbers from uniform(0,1)
y = rnorm(5) # 5 random numbers from normal(0,1)
)
dt
## x y
## 1 0.847499404 1.2599723
## 2 0.555330436 -1.3685829
## 3 0.619306767 0.3407421
## 4 0.008869905 -1.1807196
## 5 0.350041412 0.4532787
Most of the tools discussed in this seminar work on data frames.
Tibbles are a special class of data frames used in
tidyverse
but function as regular data frames for most
purposes.
If the input is a data frame, a few tidyverse
functions
will output a tibble, but most will output a regular data frame.
# dplyr function group_by() takes a data frame and outputs a grouped tibble,
# which allows grouped operations
dex1_by_ses <- group_by(dex1, ses)
# tbl_df refers to class tibble
class(dex1_by_ses)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
This page explains how tibbles differ from data frames.
Most users will not notice any functional differences between tibbles and regular data frames.
One noticeable difference is how they print to screen: regular data frames print everything, up to the R option max.print (1000 values by default), while tibbles print only the rows and columns that fit on the screen.

# prints only rows and columns that fit to screen
dex1_by_ses
## # A tibble: 200 × 11
## # Groups: ses [4]
## id sex ses school_prog read write math science social_st honors
## <int> <chr> <chr> <chr> <int> <int> <int> <int> <int> <chr>
## 1 1 female low public/vocati… -99 44 45 39 41 not e…
## 2 2 female middle public/general 39 41 33 42 41 enrol…
## 3 3 male low private/acade… 63 65 48 63 56 enrol…
## 4 4 female low public/academ… 44 50 41 39 51 not e…
## 5 5 male low public/academ… 47 40 43 45 31 not e…
## 6 6 female low public/academ… 47 41 46 40 41 not e…
## 7 7 male middle public/academ… 57 54 59 47 51 not e…
## 8 8 female low public/academ… 39 44 52 44 48 not e…
## 9 9 male middle public/vocati… 48 49 52 44 51 not e…
## 10 10 female middle public/general 47 54 49 53 61 not e…
## # ℹ 190 more rows
## # ℹ 1 more variable: awards <int>
To convert a tibble to a regular data frame, use
as.data.frame()
.
The function tibble()
can be used to create a tibble
manually.
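For instance, a minimal sketch (the columns here are made up):

```r
library(tibble)

# create a tibble manually
tb <- tibble(x = 1:3, y = c("a", "b", "c"))
class(tb)   # a tibble is still a data frame underneath

# convert the tibble back to a regular data frame
df <- as.data.frame(tb)
class(df)
```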
We begin our data management work on a dataset of 120 observations of 14 variables describing simulated university student data, including student demographics (sex, age, class, transfer status) and course scores (homework, midterm, final).
We will be using the function read_csv() to import the data, which:

- comes from the tidyverse package readr
- works like the base R function read.csv(), but with slightly different defaults and options
- is generally faster than read.csv()
# First workshop dataset
dat <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/student_data.csv")
dat
## # A tibble: 120 × 14
## university universityid student sex age class transfer teaching_assistant
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 UCLA 1 1 male 35.0 soph… no T1
## 2 UCLA 1 2 fema… 23.9 soph… no T1
## 3 UCLA 1 3 male 21.4 soph… no T1
## 4 UCLA 1 4 male 26.8 seni… no T1
## 5 UCLA 1 5 male 21.9 seni… no T1
## 6 UCLA 1 6 fema… 23.8 juni… no T1
## 7 UCLA 1 7 male 24.4 soph… yes T1
## 8 UCLA 1 8 fema… 27.1 juni… no T3
## 9 UCLA 1 9 fema… 29.3 seni… yes T3
## 10 UCLA 1 10 fema… 27.1 juni… no T3
## # ℹ 110 more rows
## # ℹ 6 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## # final <dbl>, total_points <dbl>, total_proportion <dbl>
View() data as a spreadsheet

View() opens a spreadsheet-style view of a dataset. In RStudio, clicking on a dataset in the Environment pane will View() it.
We can select a particular row and column of data frame my_data with the syntax my_data[rows, columns], where rows and columns are one of the following:

- a number: my_data[1,1]
- a vector of numbers: my_data[c(1,3,5), 1:10]
- the name of a column: my_data[, "colname"]
- or omitted

Omitting the row value, my_data[, columns], selects all rows. Omitting the column value, my_data[rows, ], selects all columns.
# row 3 column 2
dat[3,2]
## # A tibble: 1 × 1
## universityid
## <dbl>
## 1 1
# rows 2 and 4 of column age
dat[c(2,4), "age"]
## # A tibble: 2 × 1
## age
## <dbl>
## 1 23.9
## 2 26.8
# all columns for row 10
dat[10,]
## # A tibble: 1 × 14
## university universityid student sex age class transfer teaching_assistant
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 UCLA 1 10 female 27.1 juni… no T3
## # ℹ 6 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## # final <dbl>, total_points <dbl>, total_proportion <dbl>
# all rows of column "sex"
dat[, "sex"]
## # A tibble: 120 × 1
## sex
## <chr>
## 1 male
## 2 female
## 3 male
## 4 male
## 5 male
## 6 female
## 7 male
## 8 female
## 9 female
## 10 female
## # ℹ 110 more rows
Using $ to extract a column vector

To extract a column as a vector, use the following syntax:

- my_data$colname extracts the column with name colname as a vector. We recommend this syntax for extracting columns.
- my_data[["colname"]] and my_data[[colnum]] can also be used to extract column vectors by name or by number, respectively.
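A small sketch on a made-up data frame:

```r
# toy data frame (hypothetical values)
my_data <- data.frame(age = c(21, 25, 30),
                      sex = c("male", "female", "female"))

my_data$age          # numeric vector of ages
my_data[["sex"]]     # extraction by name
my_data[[1]]         # extraction by position; same as my_data$age
```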
We can also use logical expressions to subset. Logical comparisons
such as x > 0
return TRUE/FALSE vectors.
We can place such an expression inside of [], and the values where it is TRUE will be selected.
# logical comparison results in a logical vector
dat$age > 25
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE
## [13] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [25] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## [37] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [61] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## [73] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [97] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# when placed in [] before the comma, rows with TRUE are selected
# this returns columns age and sex for observations where age > 25
dat[dat$age > 25, c("age", "sex")]
## # A tibble: 41 × 2
## age sex
## <dbl> <chr>
## 1 35.0 male
## 2 26.8 male
## 3 27.1 female
## 4 29.3 female
## 5 27.1 female
## 6 26.2 male
## 7 25.8 male
## 8 29.8 female
## 9 25.5 female
## 10 25.2 female
## # ℹ 31 more rows
Functions perform most of the work on data in R
.
Functions in R
perform some operation on an input
(usually) and return some output. For example, the mean()
function in R
takes a vector of numbers and returns its
mean.
Using library()
loads a package’s functions into the
R
session.
In the help file for a function (?function_name), we will find the function's description, usage, and arguments; arguments shown with a value after = have default values.

Input arguments to functions can be specified by either name or position.
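For example, these three calls to round() are equivalent:

```r
# arguments by position
round(3.14159, 2)
# arguments by name
round(3.14159, digits = 2)
# named arguments can be given in any order
round(digits = 2, x = 3.14159)
```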
After importing your data, a good first step is to understand how the variables are stored, since statistical analysis typically requires continuous variables to be numeric and categorical variables to be character or factor.

The tidyverse function glimpse() displays each variable's name, first several values, and type, shown in <> (e.g., <dbl> is double).

# tidyverse function
glimpse(dat)
## Rows: 120
## Columns: 14
## $ university <chr> "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "UC…
## $ universityid <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ student <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ sex <chr> "male", "female", "male", "male", "male", "female",…
## $ age <dbl> 34.97, 23.92, 21.37, 26.80, 21.93, 23.83, 24.39, 27…
## $ class <chr> "sophomore", "sophomore", "sophomore", "senior", "s…
## $ transfer <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "y…
## $ teaching_assistant <chr> "T1", "T1", "T1", "T1", "T1", "T1", "T1", "T3", "T3…
## $ homework1 <dbl> 4, 3, 5, 4, 1, 2, 8, 1, 5, 1, 10, -99, 3, 5, 5, 3, …
## $ homework2 <dbl> 8, 1, 2, 1, 2, 9, 7, 3, 5, 3, 6, 2, 10, 5, 3, 6, 4,…
## $ midterm <dbl> 60, 57, 78, 45, 44, 71, 55, 73, 65, 73, 70, 49, 47,…
## $ final <dbl> 68, 67, 96, 40, 33, 98, 52, 80, 68, 87, 42, 45, 56,…
## $ total_points <dbl> 140, 128, 176, 90, 80, 180, 122, 157, 138, 164, 58,…
## $ total_proportion <dbl> 0.64, 0.58, 0.80, 0.41, 0.36, 0.82, 0.55, 0.71, 0.6…
The base R
function str()
has similar usage
when used on a data frame, but prints less data.
# base R function
str(dat)
## spc_tbl_ [120 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ university : chr [1:120] "UCLA" "UCLA" "UCLA" "UCLA" ...
## $ universityid : num [1:120] 1 1 1 1 1 1 1 1 1 1 ...
## $ student : num [1:120] 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : chr [1:120] "male" "female" "male" "male" ...
## $ age : num [1:120] 35 23.9 21.4 26.8 21.9 ...
## $ class : chr [1:120] "sophomore" "sophomore" "sophomore" "senior" ...
## $ transfer : chr [1:120] "no" "no" "no" "no" ...
## $ teaching_assistant: chr [1:120] "T1" "T1" "T1" "T1" ...
## $ homework1 : num [1:120] 4 3 5 4 1 2 8 1 5 1 ...
## $ homework2 : num [1:120] 8 1 2 1 2 9 7 3 5 3 ...
## $ midterm : num [1:120] 60 57 78 45 44 71 55 73 65 73 ...
## $ final : num [1:120] 68 67 96 40 33 98 52 80 68 87 ...
## $ total_points : num [1:120] 140 128 176 90 80 180 122 157 138 164 ...
## $ total_proportion : num [1:120] 0.64 0.58 0.8 0.41 0.36 0.82 0.55 0.71 0.63 0.75 ...
## - attr(*, "spec")=
## .. cols(
## .. university = col_character(),
## .. universityid = col_double(),
## .. student = col_double(),
## .. sex = col_character(),
## .. age = col_double(),
## .. class = col_character(),
## .. transfer = col_character(),
## .. teaching_assistant = col_character(),
## .. homework1 = col_double(),
## .. homework2 = col_double(),
## .. midterm = col_double(),
## .. final = col_double(),
## .. total_points = col_double(),
## .. total_proportion = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
describe()
We highly recommend familiarizing yourself with a quick summary of the distribution of each of your variables, to understand how they are coded and to spot suspicious values early.

The describe() function from package Hmisc will give summary statistics for all variables in a data frame. The information provided depends on the variable's type and number of distinct values, and includes a count of missing values (NA) for all variables.

The results of describe() can be used to assess each variable's distribution and to identify suspicious values:
# load Hmisc
library(Hmisc)
# detailed summaries of variables
describe(dat)
## dat
##
## 14 Variables 120 Observations
## --------------------------------------------------------------------------------
## university
## n missing distinct
## 120 0 2
##
## Value UCLA UCSF
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## universityid
## n missing distinct Info Mean Gmd
## 120 0 2 0.749 1.483 0.5036
##
## Value 1 2
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## student
## n missing distinct Info Mean Gmd .05 .10
## 120 0 120 1 60.5 40.33 6.95 12.90
## .25 .50 .75 .90 .95
## 30.75 60.50 90.25 108.10 114.05
##
## lowest : 1 2 3 4 5, highest: 116 117 118 119 120
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 120 0 3
##
## Value 12.2 female male
## Frequency 1 74 45
## Proportion 0.008 0.617 0.375
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 120 0 113 1 26.27 9.051 18.96 19.39
## .25 .50 .75 .90 .95
## 21.21 23.16 26.10 28.61 29.65
##
## lowest : 18.01 18.19 18.33 18.45 18.9 , highest: 33.93 34.16 34.97 35.8 327.89
## --------------------------------------------------------------------------------
## class
## n missing distinct
## 120 0 4
##
## Value junior not assessed senior sophomore
## Frequency 27 5 66 22
## Proportion 0.225 0.042 0.550 0.183
## --------------------------------------------------------------------------------
## transfer
## n missing distinct
## 120 0 3
##
## Value -99 no yes
## Frequency 6 97 17
## Proportion 0.050 0.808 0.142
## --------------------------------------------------------------------------------
## teaching_assistant
## n missing distinct
## 120 0 5
##
## Value T1 T2 T3 T4 T5
## Frequency 7 4 45 28 36
## Proportion 0.058 0.033 0.375 0.233 0.300
## --------------------------------------------------------------------------------
## homework1
## n missing distinct Info Mean Gmd .05 .10
## 120 0 11 0.984 -0.1833 10.96 1 1
## .25 .50 .75 .90 .95
## 2 3 6 8 10
##
## Value -99.00 0.19 1.28 2.37 3.46 4.55 5.64 6.73 7.82
## Frequency 5 20 17 21 12 13 11 6 7
## Proportion 0.042 0.167 0.142 0.175 0.100 0.108 0.092 0.050 0.058
##
## Value 8.91 10.00
## Frequency 1 7
## Proportion 0.008 0.058
##
## For the frequency table, variable is rounded to the nearest 1.09
## --------------------------------------------------------------------------------
## homework2
## n missing distinct Info Mean Gmd .05 .10
## 120 0 11 0.987 -0.5417 12.7 -4 1
## .25 .50 .75 .90 .95
## 2 4 6 9 10
##
## Value -99.00 0.19 1.28 2.37 3.46 4.55 5.64 6.73 7.82
## Frequency 6 15 12 19 16 15 10 6 6
## Proportion 0.050 0.125 0.100 0.158 0.133 0.125 0.083 0.050 0.050
##
## Value 8.91 10.00
## Frequency 6 9
## Proportion 0.050 0.075
##
## For the frequency table, variable is rounded to the nearest 1.09
## --------------------------------------------------------------------------------
## midterm
## n missing distinct Info Mean Gmd .05 .10
## 120 0 42 0.999 53.61 26.45 34.00 45.00
## .25 .50 .75 .90 .95
## 52.75 60.00 68.00 74.10 78.30
##
## lowest : -99 41 42 43 44, highest: 78 84 87 90 91
## --------------------------------------------------------------------------------
## final
## n missing distinct Info Mean Gmd .05 .10
## 120 0 56 0.999 58.29 30.84 19.95 37.50
## .25 .50 .75 .90 .95
## 53.00 67.00 76.25 83.10 93.05
##
## lowest : -99 19 20 22 28, highest: 93 94 96 98 100
## --------------------------------------------------------------------------------
## total_points
## n missing distinct Info Mean Gmd .05 .10
## 120 0 69 0.999 128 37.18 54.85 80.00
## .25 .50 .75 .90 .95
## 116.75 134.50 148.00 164.20 176.00
##
## lowest : 7 24 31 35 40, highest: 177 179 180 193 202
## --------------------------------------------------------------------------------
## total_proportion
## n missing distinct Info Mean Gmd .05 .10
## 120 0 49 0.998 0.5816 0.1689 0.2495 0.3600
## .25 .50 .75 .90 .95
## 0.5300 0.6100 0.6700 0.7500 0.8000
##
## lowest : 0.03 0.11 0.14 0.16 0.18, highest: 0.8 0.81 0.82 0.88 0.92
## --------------------------------------------------------------------------------
Here are some variables with values that we are not sure are correct.
Can you spot the suspicious values?
# inspect sex, age and homework1 variables
describe(dat[,c( "sex", "age", "homework1")])
## dat[, c("sex", "age", "homework1")]
##
## 3 Variables 120 Observations
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 120 0 3
##
## Value 12.2 female male
## Frequency 1 74 45
## Proportion 0.008 0.617 0.375
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 120 0 113 1 26.27 9.051 18.96 19.39
## .25 .50 .75 .90 .95
## 21.21 23.16 26.10 28.61 29.65
##
## lowest : 18.01 18.19 18.33 18.45 18.9 , highest: 33.93 34.16 34.97 35.8 327.89
## --------------------------------------------------------------------------------
## homework1
## n missing distinct Info Mean Gmd .05 .10
## 120 0 11 0.984 -0.1833 10.96 1 1
## .25 .50 .75 .90 .95
## 2 3 6 8 10
##
## Value -99.00 0.19 1.28 2.37 3.46 4.55 5.64 6.73 7.82
## Frequency 5 20 17 21 12 13 11 6 7
## Proportion 0.042 0.167 0.142 0.175 0.100 0.108 0.092 0.050 0.058
##
## Value 8.91 10.00
## Frequency 1 7
## Proportion 0.008 0.058
##
## For the frequency table, variable is rounded to the nearest 1.09
## --------------------------------------------------------------------------------
Plotting describe()

The results of describe() can be plotted:
These plots are also useful to inspect distributions of variables and to look for suspicious values.
In the dot plots, look for suspicious labels for categorical
variables (see variables sex
, class
and
transfer
below).
Continuous describe() plot

In the histogram plots, each histogram has its own scale, so lone spikes and bunched-up histograms often denote extreme values (see variables age, homework1, homework2, midterm, final).
Individual histograms with hist()
allow closer
inspection of actual values.
A look at the output of our dataset summaries, as well as the plots of describe(), identified several suspicious values:

- age has value 327.89, likely a data entry error
- sex has value "12.2", also a data entry error
- homework1, homework2, midterm, final have value -99, a missing value code
- midterm has value -98, another missing value code
- class has a value "not assessed"
- transfer seems to be a character variable but has a value -99, which causes R to read -99 in as character

Because we don't know any replacement values, we will change suspicious values to NA, the missing value indicator in R.
Note: Many times, datasets differentiate between different types of missing. For example, -99 = "absent", while -98 = "refused to answer". For analysis purposes, using NA for both will usually suffice. For users who need to distinguish between different kinds of missing, we suggest looking into the as.item() function from package memisc.
We can use logical subsetting to assign NA
to the data
entry errors:
# change all impossible age values to NA
# assume all adults in this dataset
dat$age[dat$age<18 | dat$age>120] <- NA
# remember to use quotes for character variables
dat$sex[dat$sex == "12.2"] <- NA
Now we want to translate all -99, -98, and "not assessed" values to NA. The following syntax can be used to translate all values of a to b in dataset df:

df[df==a] <- b
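As a sketch on a small made-up data frame:

```r
# toy data where -99 codes for missing
df <- data.frame(score1 = c(10, -99, 8),
                 score2 = c(-99, 5, 7))

# replace every -99 anywhere in df with NA
df[df == -99] <- NA
df
```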
Categorical describe() plot after cleaning data

The describe() plot for categorical variables looks better now – no suspicious labels. Now that we have missing NA values, the dots are colored by the number of missing values for that variable.
Continuous describe() plot after cleaning data

The describe() plot for continuous variables also looks better – the dot plots and histograms are nicely spread out. We don't see any obvious suspicious values.
# inspect variables after cleaning
describe(dat)
## dat
##
## 14 Variables 120 Observations
## --------------------------------------------------------------------------------
## university
## n missing distinct
## 120 0 2
##
## Value UCLA UCSF
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## universityid
## n missing distinct Info Mean Gmd
## 120 0 2 0.749 1.483 0.5036
##
## Value 1 2
## Frequency 62 58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## student
## n missing distinct Info Mean Gmd .05 .10
## 120 0 120 1 60.5 40.33 6.95 12.90
## .25 .50 .75 .90 .95
## 30.75 60.50 90.25 108.10 114.05
##
## lowest : 1 2 3 4 5, highest: 116 117 118 119 120
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 119 1 2
##
## Value female male
## Frequency 74 45
## Proportion 0.622 0.378
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 119 1 112 1 23.74 4.049 18.95 19.38
## .25 .50 .75 .90 .95
## 21.19 23.16 26.01 28.46 29.50
##
## lowest : 18.01 18.19 18.33 18.45 18.9 , highest: 29.84 33.93 34.16 34.97 35.8
## --------------------------------------------------------------------------------
## class
## n missing distinct
## 115 5 3
##
## Value junior senior sophomore
## Frequency 27 66 22
## Proportion 0.235 0.574 0.191
## --------------------------------------------------------------------------------
## transfer
## n missing distinct
## 114 6 2
##
## Value no yes
## Frequency 97 17
## Proportion 0.851 0.149
## --------------------------------------------------------------------------------
## teaching_assistant
## n missing distinct
## 120 0 5
##
## Value T1 T2 T3 T4 T5
## Frequency 7 4 45 28 36
## Proportion 0.058 0.033 0.375 0.233 0.300
## --------------------------------------------------------------------------------
## homework1
## n missing distinct Info Mean Gmd .05 .10
## 115 5 10 0.981 4.113 2.896 1 1
## .25 .50 .75 .90 .95
## 2 3 6 8 10
##
## Value 1.00 1.99 2.98 3.97 4.96 5.95 6.94 7.93 8.92 10.00
## Frequency 20 17 21 12 13 11 6 7 1 7
## Proportion 0.174 0.148 0.183 0.104 0.113 0.096 0.052 0.061 0.009 0.061
##
## For the frequency table, variable is rounded to the nearest 0.09
## --------------------------------------------------------------------------------
## homework2
## n missing distinct Info Mean Gmd .05 .10
## 114 6 10 0.985 4.64 3.069 1 1
## .25 .50 .75 .90 .95
## 3 4 6 9 10
##
## Value 1.00 1.99 2.98 3.97 4.96 5.95 6.94 7.93 8.92 10.00
## Frequency 15 12 19 16 15 10 6 6 6 9
## Proportion 0.132 0.105 0.167 0.140 0.132 0.088 0.053 0.053 0.053 0.079
##
## For the frequency table, variable is rounded to the nearest 0.09
## --------------------------------------------------------------------------------
## midterm
## n missing distinct Info Mean Gmd .05 .10
## 114 6 41 0.998 61.64 12.26 45.00 48.00
## .25 .50 .75 .90 .95
## 53.25 61.50 68.75 74.70 80.10
##
## lowest : 41 42 43 44 45, highest: 78 84 87 90 91
## --------------------------------------------------------------------------------
## final
## n missing distinct Info Mean Gmd .05 .10
## 115 5 55 0.999 65.13 19.2 33.0 39.4
## .25 .50 .75 .90 .95
## 55.5 67.0 77.0 83.6 93.3
##
## lowest : 19 20 22 28 31, highest: 93 94 96 98 100
## --------------------------------------------------------------------------------
## total_points
## n missing distinct Info Mean Gmd .05 .10
## 120 0 69 0.999 128 37.18 54.85 80.00
## .25 .50 .75 .90 .95
## 116.75 134.50 148.00 164.20 176.00
##
## lowest : 7 24 31 35 40, highest: 177 179 180 193 202
## --------------------------------------------------------------------------------
## total_proportion
## n missing distinct Info Mean Gmd .05 .10
## 120 0 49 0.998 0.5816 0.1689 0.2495 0.3600
## .25 .50 .75 .90 .95
## 0.5300 0.6100 0.6700 0.7500 0.8000
##
## lowest : 0.03 0.11 0.14 0.16 0.18, highest: 0.8 0.81 0.82 0.88 0.92
## --------------------------------------------------------------------------------
We will be using data collected on 200 high school students.
# Import a dataset for exercise 1
dex1 <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/excercise_1.csv")
head(dex1, n=3)
## id sex ses school_prog read write math science social_st
## 1 1 female low public/vocation -99 44 45 39 41
## 2 2 female middle public/general 39 41 33 42 41
## 3 3 male low private/academic 63 65 48 63 56
## honors awards
## 1 not enrolled 0
## 2 enrolled 1
## 3 enrolled 5
The dataset contains student demographics (sex, ses, school_prog), standardized test scores (read, write, math, science, social_st), honors program enrollment, and number of awards.
1.1 Start data cleaning by investigating your dataset and identifying suspicious values.
1.2 If some string values look obviously misspelled, correct the spelling. (Note: the function unique() can be helpful here; it returns a vector of the unique values of a variable.) Otherwise, change all of the suspicious values to NA.
1.3 Plot your data after cleaning.
# Exercise 1. 1.1 Identifying suspicious categorical values
plot(describe(dex1), which = "categorical")
Variables sex and ses have suspicious values.
# Exercise 1. 1.1 Identifying suspicious continuous values
plot(describe(dex1), which = "continuous")
Variables read and write have suspicious values. Check what those suspicious values are.
# Exercise 1. 1.1 Checking suspicious values
describe(dex1[,c( "read", "write")])
## dex1[, c("read", "write")]
##
## 2 Variables 200 Observations
## --------------------------------------------------------------------------------
## read
## n missing distinct Info Mean Gmd .05 .10
## 200 0 31 0.995 50.15 15.78 34.95 39.00
## .25 .50 .75 .90 .95
## 44.00 50.00 60.00 66.20 68.00
##
## lowest : -99 28 31 34 35, highest: 66 68 71 73 76
## --------------------------------------------------------------------------------
## write
## n missing distinct Info Mean Gmd .05 .10
## 200 0 30 0.995 51.39 13.5 34.90 39.00
## .25 .50 .75 .90 .95
## 45.75 54.00 60.00 65.00 65.00
##
## lowest : -99 31 33 35 36, highest: 61 62 63 65 67
## --------------------------------------------------------------------------------
Some string values look obviously misspelled.
# Exercise 1. 1.2 Changing suspicious values
# there are "ffemale" and "female " (with an extra space) values
unique(dex1$sex)
## [1] "female" "male" "ffemale" "female "
# correcting misspelling
dex1$sex[dex1$sex=="ffemale"] <- "female"
dex1$sex[dex1$sex=="female "] <- "female"
# changing ses "not assessed" to NA
dex1$ses[dex1$ses=="not assessed"] <- NA
# changing read and write to NA
dex1[dex1==-99] <- NA
complete.cases(x) returns a logical vector indicating whether each row of x has no missing values (TRUE) or at least one NA (FALSE).
# which cases are complete
complete.cases(dat)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [49] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [61] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [73] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## [97] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [109] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
We can use logical subsetting to extract a data set of complete cases.
Since TRUE=1 and FALSE=0, a sum of this vector is a count of complete cases.
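For instance, on a toy data frame:

```r
# toy data frame with some missing values
df <- data.frame(x = c(1, NA, 3),
                 y = c("a", "b", NA))

complete.cases(df)        # TRUE FALSE FALSE
sum(complete.cases(df))   # 1 complete case

# keep only the complete cases
df_complete <- df[complete.cases(df), ]
```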
The pipe operator |>
A data management task may involve many steps to reach the final result. Piping makes your code more readable and compact by focusing on the functions used rather than the name of datasets.
The pipe operator “pipes” the object (e.g. a data
frame) on the left of the |>
operator to the
function on the right of the operator.
The code x |> f(y)
translates to f(x,y)
,
that is, x
is treated by default as the first argument of
f()
, while y
is another argument of
f()
.
The object returned by the function can then be piped into another
function. Thus x |> f(y) |> g(z)
translates to
g(f(x,y), z)
.
As a first example, perhaps we want to create a data set of just females under 25, with only the age and transfer variables selected. We could do this in 2 steps, like so:
# a dataset of age and transfer status for females only
f25 <- filter(d, sex == "female" & age<25)
f25_small <- select(f25, age, transfer)
f25_small
## # A tibble: 41 × 2
## age transfer
## <dbl> <chr>
## 1 23.9 no
## 2 23.8 no
## 3 24.5 no
## 4 18.4 no
## 5 22.7 no
## 6 18.4 no
## 7 23.2 yes
## 8 22.7 no
## 9 24.4 no
## 10 21.0 no
## # ℹ 31 more rows
While that works fine, the intermediate dataset f25 is not of interest and is cluttering up memory and the workspace unnecessarily.
We could use |>
instead:
# start with d, then filter rows, then select variables
f25_small <- d |>
filter(sex == "female" & age<25) |>
select(age, transfer)
f25_small
## # A tibble: 41 × 2
## age transfer
## <dbl> <chr>
## 1 23.9 no
## 2 23.8 no
## 3 24.5 no
## 4 18.4 no
## 5 22.7 no
## 6 18.4 no
## 7 23.2 yes
## 8 22.7 no
## 9 24.4 no
## 10 21.0 no
## # ℹ 31 more rows
No intermediate dataset was created, and dataset transformations, filtering and selecting, are easy to see.
Piping really helps make a long chain of commands more immediately understandable.
Using _ in piped commands

If the dataset is not to be specified as the first argument of the function on the right of the |> operator, use the placeholder _ to specify which argument the dataset should be. With the base R pipe, _ must be supplied to a named argument: x |> f(y, arg = _) translates to f(y, arg = x).
For example, for modeling functions such as lm()
, the
model formula is typically the first argument rather than the dataset,
but we can use _
to stand in for the dataset in the piping
chain:
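A sketch using the built-in mtcars data (the variables chosen here are arbitrary; the _ placeholder requires R 4.2+ and must be passed to a named argument):

```r
# pipe mtcars into lm(); the formula is the first argument of lm(),
# so _ marks the named data argument instead
fit <- mtcars |> lm(mpg ~ wt, data = _)
coef(fit)
```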
The tidyverse pipe operator %>%

The predecessor of the base R pipe operator |> is the magrittr package's pipe operator %>% (automatically loaded with library(tidyverse)). We will use the |> operator throughout this workshop, but everything shown here works with %>% as well.
The curators of R and tidyverse recommend using the new pipe operator |> for two reasons:

- It is a part of base R, so you can use it even if you are not using tidyverse.
- It was designed to be simpler and more efficient; some less important features of the original %>% were removed.
The pipe operator can be typed with CMD-SHIFT-M on a Mac (CTRL-SHIFT-M on a PC). By default, this inserts the %>% operator. In RStudio, you can change this by going to Tools -> Global Options -> Code and selecting the option "Use native pipe operator, |> (requires R 4.1+)".
Data management with dplyr

The dplyr package

The dplyr package is a core tidyverse package, loaded with library(tidyverse).
Conveniently, when using dplyr functions:

- the data frame to operate on is always the first argument, which makes piping easy
- variable names can be specified without quotes or the $ operator
dplyr functions for columns

First, we explore dplyr functions that work with columns:
select()
: keep and drop columns
rename()
, rename_with()
: rename
columns
relocate()
: change column order
mutate()
: create or modify existing columns
select()
# select 4 variables
d_small <- d |>
select(universityid, student, age, class)
head(d_small, n = 3)
## # A tibble: 3 × 4
## universityid student age class
## <dbl> <dbl> <dbl> <chr>
## 1 1 1 35.0 sophomore
## 2 1 2 23.9 sophomore
## 3 1 3 21.4 sophomore
Often we want to select a group of related variables with similar
names. dplyr
supplies helper functions to find columns
whose names match a specified pattern:
starts_with(x): matches names that begin with the string x
ends_with(x): matches names that end with the string x
contains(x): matches names that contain the string x
matches(re): matches the regular expression re
num_range(prefix, range): matches names made of prefix followed by one element of range
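A quick sketch of these helpers in action, using a toy data frame whose column names mirror the workshop dataset:

```r
library(dplyr)

# toy data frame with grade-style column names (hypothetical values)
g <- tibble::tibble(homework1 = 10, homework2 = 9, midterm = 60, final = 68)

g |> select(starts_with("homework")) |> names()    # "homework1" "homework2"
g |> select(contains("term")) |> names()           # "midterm"
g |> select(num_range("homework", 1:2)) |> names() # "homework1" "homework2"
```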
rename() and rename_with()
Use the dplyr function rename() to change column names.
List the new variable name on the left-hand side and the old name on the right-hand side.
# original names
names(d)
## [1] "university" "universityid" "student"
## [4] "sex" "age" "class"
## [7] "transfer" "teaching_assistant" "homework1"
## [10] "homework2" "midterm" "final"
## [13] "total_points" "total_proportion"
# rename "teaching assistant" variable as "TA"
ta <- d |>
rename(TA = teaching_assistant)
names(ta)
## [1] "university" "universityid" "student" "sex"
## [5] "age" "class" "transfer" "TA"
## [9] "homework1" "homework2" "midterm" "final"
## [13] "total_points" "total_proportion"
Sometimes we want to rename many columns using a function that will perform some operation on those names.
We can specify a function inside of dplyr
’s
rename_with()
to change column names using that function’s
operations.
For rename_with(), the renaming function is passed as the second argument; when it requires additional arguments, write it as a formula beginning with ~, using .x as a placeholder for the column names.
For example, we may want to capitalize all of the column names using
function toupper()
.
# capitalize all column names using toupper() inside of rename_with()
cap <- d |>
rename_with(toupper)
names(cap)
## [1] "UNIVERSITY" "UNIVERSITYID" "STUDENT"
## [4] "SEX" "AGE" "CLASS"
## [7] "TRANSFER" "TEACHING_ASSISTANT" "HOMEWORK1"
## [10] "HOMEWORK2" "MIDTERM" "FINAL"
## [13] "TOTAL_POINTS" "TOTAL_PROPORTION"
Or, we may want to add a prefix to all column names using
paste0()
, a string concatenation function. Because we
specify additional arguments to paste0()
, we precede it with ~ inside of rename_with():
# add a prefix to each column
pre <- d |>
rename_with(~paste0("d1_", .x)) # .x place holder for all columns
names(pre)
## [1] "d1_university" "d1_universityid" "d1_student"
## [4] "d1_sex" "d1_age" "d1_class"
## [7] "d1_transfer" "d1_teaching_assistant" "d1_homework1"
## [10] "d1_homework2" "d1_midterm" "d1_final"
## [13] "d1_total_points" "d1_total_proportion"
We can use dplyr
’s column selector functions in the
third argument to rename_with()
to rename only a subset of
columns.
# add a suffix to only columns that start with "university"
p_sub <- d |>
rename_with(~paste0(.x, "_2023"), starts_with("university") )
names(p_sub)
## [1] "university_2023" "universityid_2023" "student"
## [4] "sex" "age" "class"
## [7] "transfer" "teaching_assistant" "homework1"
## [10] "homework2" "midterm" "final"
## [13] "total_points" "total_proportion"
relocate()
Specifying just column names inside relocate()
will move
them to the first columns of the data frame:
# move columns "total_points" and "total_proportion" to the front of the dataset
first <- d |>
relocate(total_points, total_proportion)
names(first)
## [1] "total_points" "total_proportion" "university"
## [4] "universityid" "student" "sex"
## [7] "age" "class" "transfer"
## [10] "teaching_assistant" "homework1" "homework2"
## [13] "midterm" "final"
# same as above
first <- d |>
relocate(starts_with("total"))
names(first)
## [1] "total_points" "total_proportion" "university"
## [4] "universityid" "student" "sex"
## [7] "age" "class" "transfer"
## [10] "teaching_assistant" "homework1" "homework2"
## [13] "midterm" "final"
Columns can be renamed as they are moved, using the same syntax as
rename()
.
# make homework1 first column and rename it to h1
h1 <- d |>
relocate(h1 = homework1)
names(h1)
## [1] "h1" "university" "universityid"
## [4] "student" "sex" "age"
## [7] "class" "transfer" "teaching_assistant"
## [10] "homework2" "midterm" "final"
## [13] "total_points" "total_proportion"
Use the arguments .before and .after to indicate exactly where the column should be relocated.
# relocate homework1 after homework2
h2 <- d |>
relocate(homework1, .after=homework2)
names(h2)
## [1] "university" "universityid" "student"
## [4] "sex" "age" "class"
## [7] "transfer" "teaching_assistant" "homework2"
## [10] "homework1" "midterm" "final"
## [13] "total_points" "total_proportion"
# relocate homework1 before last column
hl <- d |>
relocate(homework1, .before=last_col())
names(hl)
## [1] "university" "universityid" "student"
## [4] "sex" "age" "class"
## [7] "transfer" "teaching_assistant" "homework2"
## [10] "midterm" "final" "total_points"
## [13] "homework1" "total_proportion"
mutate()
The function mutate()
allows us to add new variables,
generally transformations of existing variables, all in one step without
having to re-specify the data frame name over and over.
Useful R functions for transforming variables:
log(): logarithm
min_rank(): rank values
cut(): cut a continuous variable into intervals, with a new integer value signifying into which interval each original value falls
scale(): standardizes a variable (subtracts the mean and divides by the standard deviation)
cumsum(): cumulative sum
rowMeans(), rowSums(): means and sums of several columns
# create age category variable, and highprop binary variable
d <- d |>
mutate(agecat = cut(age, breaks=c(18,20,24,28,32,36)),
highprop = total_proportion > mean(total_proportion))
table(d$agecat, d$highprop)
##
## FALSE TRUE
## (18,20] 10 4
## (20,24] 20 24
## (24,28] 16 11
## (28,32] 3 6
## (32,36] 0 4
Row functions in dplyr
Now we explore dplyr functions that operate on rows of data frames:
filter(): keep rows that match a condition
arrange(): order rows using columns
.by: per-operation grouping (see help topic dplyr_by)
summarise() / summarize(): summarise each group down to one row
filter()
We have already seen how to subset using logical vectors and logical subsetting:
# all variables for students with total points 180 or more
total_points180 <- d[d$total_points>=180,]
total_points180
## # A tibble: 2 × 16
## university universityid student sex age class transfer teaching_assistant
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 UCLA 1 6 female 23.8 juni… no T1
## 2 UCSF 2 105 female 21.5 juni… no T3
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## # final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## # highprop <lgl>
Complex conditions can require lots of code, especially with long data set names.
The dplyr
function filter()
provides a
cleaner syntax for subsetting rows of data frames. Conditions separated
by ,
are joined by &
(logical AND).
# subset to UCLA students with total_points >= 180
dfhp <- d |>
filter(university=="UCLA", total_points>=180)
dfhp
## # A tibble: 1 × 16
## university universityid student sex age class transfer teaching_assistant
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 UCLA 1 6 female 23.8 juni… no T1
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## # final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## # highprop <lgl>
Logical operators and functions in R
Here are some operators and functions to help with selection:
==: equality
>, >=: greater than, greater than or equal to
!: NOT
&: AND
|: OR
%in%: matches any of (2 %in% c(1,2,3) is TRUE)
is.na(): equality to NA
near(): checks equality for floating point (decimal) numbers, with a built-in tolerance
# select those not in UCLA, who are either younger than 25 or older than 30
d |>
filter(university!="UCLA", age<25 | age>30)
## # A tibble: 36 × 16
## university universityid student sex age class transfer teaching_assistant
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 UCSF 2 64 male 33.9 seni… no T3
## 2 UCSF 2 65 male 21.6 soph… yes T3
## 3 UCSF 2 66 fema… 23.6 soph… no T4
## 4 UCSF 2 71 male 21.5 seni… no T4
## 5 UCSF 2 72 male 22.6 seni… no T4
## 6 UCSF 2 73 fema… 24.9 juni… no T4
## 7 UCSF 2 79 male 23.1 seni… yes T4
## 8 UCSF 2 80 fema… 22.1 seni… no T4
## 9 UCSF 2 82 fema… 22.6 juni… no T5
## 10 UCSF 2 83 male 19.5 soph… yes T5
## # ℹ 26 more rows
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## # final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## # highprop <lgl>
arrange()
By default, ascending order will be used. Surround a sorting variable
with desc()
to sort by descending order instead.
# sort by sex where females are before males, then by age, oldest first
d_arrange <- d |>
arrange(sex, desc(age))
head(d_arrange, n=20)
## # A tibble: 20 × 16
## university universityid student sex age class transfer teaching_assistant
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 UCSF 2 112 fema… 34.2 seni… no T4
## 2 UCLA 1 14 fema… 29.8 seni… no T3
## 3 UCSF 2 67 fema… 29.6 juni… yes T4
## 4 UCSF 2 69 fema… 29.5 seni… yes T4
## 5 UCLA 1 9 fema… 29.3 seni… yes T3
## 6 UCLA 1 41 fema… 28.6 seni… no T3
## 7 UCSF 2 93 fema… 28.4 seni… no T5
## 8 UCLA 1 58 fema… 28.2 seni… no T5
## 9 UCLA 1 52 fema… 28.0 seni… no T3
## 10 UCLA 1 36 fema… 27.8 seni… no T5
## 11 UCSF 2 63 fema… 27.7 seni… no T3
## 12 UCLA 1 8 fema… 27.1 juni… no T3
## 13 UCLA 1 10 fema… 27.1 juni… no T3
## 14 UCLA 1 31 fema… 26.3 soph… yes T5
## 15 UCLA 1 30 fema… 26.1 seni… no T5
## 16 UCSF 2 89 fema… 25.9 seni… no T5
## 17 UCLA 1 19 fema… 25.5 juni… yes T2
## 18 UCSF 2 68 fema… 25.4 soph… no T4
## 19 UCSF 2 75 fema… 25.2 juni… no T4
## 20 UCLA 1 25 fema… 25.2 soph… yes T3
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## # final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## # highprop <lgl>
Per-operation grouping with .by
Often we want to create variables with values aggregated by group, rather than over the whole dataset (e.g. the mean final score by teaching assistant).
Many dplyr
functions allow a grouping variable (or
variables) to be specified using the argument .by
, which
specifies that the function should operate on groups rather than
individual observations.
The dplyr function summarise() creates a data frame resulting from functions operating on each group defined by the variable(s) specified in .by=. The output data frame will have one row per group in the .by variable(s).
Examples of useful functions for summarise():
min(), max(), mean(), sum(), var(), sd()
n(): number of observations in the group
n_distinct(x): number of distinct values in variable x
# Create summaries of students by teaching assistant using .by
student_summ2 <- d |>
summarise(n_students = n(), # number of students
max_pts = max(total_points), # max points
avg_age = mean(age), # average age
n_seniors = sum(class=="senior"), # number of senior students
.by = teaching_assistant)
student_summ2
## # A tibble: 5 × 5
## teaching_assistant n_students max_pts avg_age n_seniors
## <chr> <int> <dbl> <dbl> <int>
## 1 T1 7 180 25.3 2
## 2 T3 37 202 24.3 23
## 3 T2 4 148 21.3 3
## 4 T5 28 171 22.7 16
## 5 T4 22 179 24.1 13
The function mutate()
also accepts .by
variables, allowing easy generation of aggregated variables:
# add a variable that is the mean final score by teaching assistant
d_with_ta_summ <- d |>
mutate(avg_final = mean(final),
.by = teaching_assistant)
# first 20 rows of relevant columns
head(select(d_with_ta_summ, student, teaching_assistant, final, avg_final), n=10)
## # A tibble: 10 × 4
## student teaching_assistant final avg_final
## <dbl> <chr> <dbl> <dbl>
## 1 1 T1 68 64.9
## 2 2 T1 67 64.9
## 3 3 T1 96 64.9
## 4 4 T1 40 64.9
## 5 5 T1 33 64.9
## 6 6 T1 98 64.9
## 7 7 T1 52 64.9
## 8 8 T3 80 67.1
## 9 9 T3 68 67.1
## 10 10 T3 87 67.1
When two packages that contain a function with the same name are
loaded into the R
session, a conflict arises. For example,
the packages Hmisc
and dplyr
both contain a
summarize()
function.
The function from the package loaded most recently will mask
the function with the same name from the package loaded earlier and will
be called by default. If dplyr
is loaded after
Hmisc
, when summarize()
is specified,
dplyr
’s summarize()
will be called.
We can still access the masked function in a couple of ways:
Prefix it with the package name, package::function(), for example Hmisc::summarize(x)
Unload the masking package with detach(package:name), for example detach(package:dplyr)
Vector functions in dplyr
The following functions work on individual vectors, not data frames. We are going to focus on two functions that are useful for recoding variables:
if_else()
case_when()
if_else()
Use if_else()
to apply a condition, especially binary
conditions, to each element of a vector and return values based on the
condition.
The syntax is if_else(condition, true, false), where:
condition: a logical vector (or an expression producing one)
true: the value(s) returned where condition is TRUE
false: the value(s) returned where condition is FALSE
# create a pass/fail (0/1) variable
d <- d |>
mutate(pass_final = if_else(final > 60, true=1, false=0))
head(select(d, final, pass_final))
## # A tibble: 6 × 2
## final pass_final
## <dbl> <dbl>
## 1 68 1
## 2 67 1
## 3 96 1
## 4 40 0
## 5 33 0
## 6 98 1
if_else()
allows substitution of values from other
vectors of the same length.
# use highest score for best_test
d <- d |>
mutate(best_test = if_else(midterm>final, true=midterm, false=final))
head(select(d, midterm, final, best_test))
## # A tibble: 6 × 3
## midterm final best_test
## <dbl> <dbl> <dbl>
## 1 60 68 68
## 2 57 67 67
## 3 78 96 96
## 4 45 40 45
## 5 44 33 44
## 6 71 98 98
case_when()
Inside case_when()
, each condition to evaluate on an
existing variable is followed by ~
and the value assigned
to the new variable.
# Recode variable total_points
d <- d |>
mutate(letter_grade = case_when(
total_points < 120 ~ "F",
between(total_points, 120, 130) ~ "D", # between(x, a, b) returns TRUE if a<=x<=b, FALSE otherwise
between(total_points, 130.01, 140) ~ "C",
between(total_points, 140.01, 150) ~ "B",
total_points > 150 ~ "A"
)
)
head(select(d, total_points, letter_grade))
## # A tibble: 6 × 2
## total_points letter_grade
## <dbl> <chr>
## 1 140 C
## 2 128 D
## 3 176 A
## 4 90 F
## 5 80 F
## 6 180 A
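Because case_when() evaluates its conditions top to bottom and returns the value for the first TRUE condition, the small .01 gaps in the between() calls above can be avoided by ordering the cutoffs and using only upper bounds. A sketch with a few hypothetical point totals:

```r
library(dplyr)

pts <- c(90, 125, 135, 145, 176)

# conditions are checked in order; the first TRUE wins,
# so each line only needs an upper cutoff
case_when(
  pts < 120  ~ "F",
  pts <= 130 ~ "D",
  pts <= 140 ~ "C",
  pts <= 150 ~ "B",
  pts > 150  ~ "A"
)
# "F" "D" "C" "B" "A"
```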
Cases with values that do not satisfy any condition become missing values.
# create an "upperclassman" variable
d <- d |>
mutate(upper = case_when(
class %in% c("freshman", "sophomore") ~ 0,
class == "senior" ~ 1 # oops, forgot juniors
))
head(select(d, class, upper))
## # A tibble: 6 × 2
## class upper
## <chr> <dbl>
## 1 sophomore 0
## 2 sophomore 0
## 3 sophomore 0
## 4 senior 1
## 5 senior 1
## 6 junior NA
A .default argument specifies the value to return for values that match no condition.
Without an explicit is.na() condition, NA values will also be matched to .default.
Missing values can be matched using is.na(), and .default can be used to control the value returned for all remaining unmatched values.
# create a recommend variable based on grade
d <- d |>
mutate(recommend = case_when(
letter_grade == "A" ~ "yes",
letter_grade == "B" ~ "maybe",
is.na(letter_grade) ~ NA, # without this, NA will match to .default
.default = "no" # C, D, and F will be no
))
head(select(d, letter_grade, recommend), n=20)
## # A tibble: 20 × 2
## letter_grade recommend
## <chr> <chr>
## 1 C no
## 2 D no
## 3 A yes
## 4 F no
## 5 F no
## 6 A yes
## 7 D no
## 8 A yes
## 9 C no
## 10 A yes
## 11 F no
## 12 F no
## 13 D no
## 14 F no
## 15 F no
## 16 B maybe
## 17 C no
## 18 D no
## 19 C no
## 20 F no
Use the dataset dex1 from the first exercise.
2.1 Drop the columns read, write, awards, and school_prog.
2.2 Keep only the rows of students who received more than 50 points on the math test.
2.3 Calculate the mean math and mean science scores, aggregating by honors status.
Try to perform these steps in a series of commands using the pipe operator |>.
# Exercise 2. 2.1 Dropping columns
dex1 |>
select(-read, -write, -awards, -school_prog) |>
glimpse()
## Rows: 200
## Columns: 7
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ sex <chr> "female", "female", "male", "female", "male", "female", "mal…
## $ ses <chr> "low", "middle", "low", "low", "low", "low", "middle", "low"…
## $ math <int> 45, 33, 48, 41, 43, 46, 59, 52, 52, 49, 45, 45, 39, 54, 44, …
## $ science <int> 39, 42, 63, 39, 45, 40, 47, 44, 44, 53, 39, 39, 47, 42, 26, …
## $ social_st <int> 41, 41, 56, 51, 31, 41, 51, 48, 51, 61, 36, 46, 61, 56, 42, …
## $ honors <chr> "not enrolled", "enrolled", "enrolled", "not enrolled", "not…
Keep only the rows of students who received more than 50 points on the math test.
# Exercise 2. 2.2 Dropping rows
dex1 |>
select(-read, -write, -awards, -school_prog) |>
filter(math > 50) |>
distinct(math) |>
arrange(math)
## math
## 1 51
## 2 52
## 3 53
## 4 54
## 5 55
## 6 56
## 7 57
## 8 58
## 9 59
## 10 60
## 11 61
## 12 62
## 13 63
## 14 64
## 15 65
## 16 66
## 17 67
## 18 68
## 19 69
## 20 70
## 21 71
## 22 72
## 23 73
## 24 75
Calculate the mean math and the mean science scores aggregating by honors status.
# Exercise 2. 2.3 Summarising
dex1 |>
select(-read, -write, -awards, -school_prog) |> # dropping columns
filter(math > 50) |> # dropping rows
summarise(mean_math = mean(math), # summarizing by honors
mean_science = mean(science),
.by = honors)
## honors mean_math mean_science
## 1 not enrolled 56.93846 55.04615
## 2 enrolled 62.62500 59.31250
Often datasets are split into multiple files, perhaps because data are collected in several waves or by different researchers. When files share the same variables (hopefully!), we can append the datasets, or bind their rows together.
rbind() and bind_rows()
We can use either the base R function rbind() or the dplyr function bind_rows() to append. They differ in how they handle the situation where the datasets being appended have non-matching columns:
rbind() will produce an error
bind_rows() will append the datasets and fill in NA values in unmatched columns
They also differ in how they handle the situation where the same column in the two datasets has two different types (e.g. one character, the other numeric):
rbind() will coerce one type to the other (the direction of coercion is logical -> integer -> double -> character)
bind_rows() will produce an error
Let's load a second dataset that contains the same variables as the original version of our first dataset, but without any variables created during this workshop.
# new dataset that contains the same variables as d, but without those created during the workshop
d2 <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/student_data2.csv")
# rows and columns of d2 and d
dim(d)
## [1] 98 21
dim(d2)
## [1] 120 14
Because our first data set d
has several extra variables
we created earlier with mutate()
, the two datasets do not
have completely matching columns.
Appending these datasets with rbind() will produce an error because the columns do not match.
bind_rows() will fill in NA values for observations from the second dataset in the variables that appear only in the first dataset.
The best solution would be to calculate the new variables for the second dataset and then use either function to append…
However, to demonstrate the behavior of bind_rows()
, we
append the datasets as they are.
Use the .id= argument in bind_rows() to create a variable that identifies the source dataset.
# a new variable called source is added to the beginning of the dataset
d3 <- bind_rows(d, d2, .id="source")
# these are the rows where the datasets meet
# final and total_points are in both datasets, letter_grade and recommend only in d
select(d3[95:105,], final, total_points, letter_grade, recommend)
## # A tibble: 11 × 4
## final total_points letter_grade recommend
## <dbl> <dbl> <chr> <chr>
## 1 55 135 C no
## 2 67 146 B maybe
## 3 81 168 A yes
## 4 75 152 A yes
## 5 68 139 <NA> <NA>
## 6 67 127 <NA> <NA>
## 7 96 180 <NA> <NA>
## 8 40 89 <NA> <NA>
## 9 33 89 <NA> <NA>
## 10 98 179 <NA> <NA>
## 11 52 121 <NA> <NA>
# this will work because we restrict d to only variables common to both
drbind <- rbind(d[,1:14], d2)
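Earlier we noted that the two functions also differ when a shared column has two different types. A minimal sketch with two tiny hypothetical data frames:

```r
library(dplyr)

a <- data.frame(id = 1, score = "high")  # score is character
b <- data.frame(id = 2, score = 98.5)    # score is numeric

# base R coerces the numeric value to the string "98.5"
rbind(a, b)$score

# dplyr refuses to combine a character column with a numeric one
res <- try(bind_rows(a, b), silent = TRUE)
inherits(res, "try-error")   # TRUE
```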
Merging adds more columns of variables. Datasets to be merged should be matched on some id variable(s).
dplyr joins
The dplyr “join” functions perform such merges and will use any same-named variables between the datasets as the id variables by default. Use the by= argument to specify particular matching id variables.
These joins all return a table with all columns from x
and y
, but differ in how they deal with mismatched
rows:
inner_join(x, y)
: returns all rows from
x
where there is a matching value in y
(returns only matching rows).
left_join(x, y)
: returns all rows from
x
, unmatched rows in x
will have NA in the
columns from y
. Unmatched rows in y
not
returned.
full_join(x, y)
: returns all rows from
x
and from y
; unmatched rows in either will
have NA in new columns
Each of these joins is possible with base R
function
merge()
, but merge()
is many times slower than
the joins.
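When the id columns have different names in the two datasets, the by= argument can map one onto the other. A small sketch with hypothetical tables, where the key is called ta in one and teaching_assistant in the other:

```r
library(dplyr)

students <- tibble::tibble(ta = c("T2", "T7"), final = c(76, 64))
tas      <- tibble::tibble(teaching_assistant = c("T2", "T35"),
                           ta_age = c(28.1, 30.2))

# match students$ta to tas$teaching_assistant
left_join(students, tas, by = c("ta" = "teaching_assistant"))
# T2 gets ta_age 28.1; unmatched T7 gets NA
```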
Demonstrating dplyr joins
We begin our demonstration of merging by loading a dataset of teaching assistant (TA) variables that describes each TA’s age, how many years they have spent in their PhD program so far, etc.
# load data set for merging example
d_teaching_assistants <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/teaching_assistant_data.csv")
d_teaching_assistants
## # A tibble: 40 × 4
## teaching_assistant ta_age school year_in_program
## <chr> <dbl> <chr> <dbl>
## 1 T1 26.8 average 1
## 2 T2 28.1 top 3
## 3 T3 28.5 average 2
## 4 T4 33.9 top 6
## 5 T5 27.9 top 7
## 6 T6 24.6 average 7
## 7 T7 26 average 3
## 8 T8 29.9 average 2
## 9 T9 24.8 average 5
## 10 T10 24.2 average 2
## # ℹ 30 more rows
It is easiest to show how the joins differ if we work on smaller, more manageable datasets.
We will be merging a subset of the d_teaching_assistants dataset we just imported into a subset of the d3 dataset we created earlier through appending.
From d3, we select the students of two TAs, “T2” and “T7”, and just a few variables, creating d_st.
From d_teaching_assistants, we select one matching TA, “T2”, and one non-matching TA, “T35”, creating d_ta.
# select one non-matching and one matching teaching assistant from each to demo joins
# just a few variables from d3 and d_teaching_assistants
d_st <- d3 |>
filter(teaching_assistant == "T2" | teaching_assistant == "T7") |>
select(teaching_assistant, student, midterm, final)
d_ta <- d_teaching_assistants |>
filter(teaching_assistant == "T2" | teaching_assistant == "T35")
Let’s take a look at our two small datasets.
# first a look at the two datasets to be merged
# student dataset with information about their teaching assistants
d_st
## # A tibble: 8 × 4
## teaching_assistant student midterm final
## <chr> <dbl> <dbl> <dbl>
## 1 T2 17 48 53
## 2 T2 18 62 76
## 3 T2 19 68 64
## 4 T2 20 48 67
## 5 T7 137 48 53
## 6 T7 138 62 76
## 7 T7 139 68 64
## 8 T7 140 48 67
# teaching assistant dataset
d_ta
## # A tibble: 2 × 4
## teaching_assistant ta_age school year_in_program
## <chr> <dbl> <chr> <dbl>
## 1 T2 28.1 top 3
## 2 T35 30.2 average 1
inner_join()
inner_join(d_st, d_ta) returns all rows from d_st where there is a matching value in d_ta (i.e. teaching_assistant=="T2").
# only matching rows returned
# T2 from d_ta matched four times to T2 in d_st
inner_join(d_st, d_ta)
## # A tibble: 4 × 7
## teaching_assistant student midterm final ta_age school year_in_program
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 T2 17 48 53 28.1 top 3
## 2 T2 18 62 76 28.1 top 3
## 3 T2 19 68 64 28.1 top 3
## 4 T2 20 48 67 28.1 top 3
left_join()
left_join(d_st, d_ta) returns all rows from d_st and matched rows from d_ta; unmatched rows in d_st will have NA in the columns from d_ta, and unmatched rows in d_ta will not be returned.
# all rows from d_st returned
left_join(d_st, d_ta)
## # A tibble: 8 × 7
## teaching_assistant student midterm final ta_age school year_in_program
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 T2 17 48 53 28.1 top 3
## 2 T2 18 62 76 28.1 top 3
## 3 T2 19 68 64 28.1 top 3
## 4 T2 20 48 67 28.1 top 3
## 5 T7 137 48 53 NA <NA> NA
## 6 T7 138 62 76 NA <NA> NA
## 7 T7 139 68 64 NA <NA> NA
## 8 T7 140 48 67 NA <NA> NA
full_join()
full_join(d_st, d_ta) returns all rows from both d_st and d_ta.
# all rows from both returned
full_join(d_st, d_ta)
## # A tibble: 9 × 7
## teaching_assistant student midterm final ta_age school year_in_program
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 T2 17 48 53 28.1 top 3
## 2 T2 18 62 76 28.1 top 3
## 3 T2 19 68 64 28.1 top 3
## 4 T2 20 48 67 28.1 top 3
## 5 T7 137 48 53 NA <NA> NA
## 6 T7 138 62 76 NA <NA> NA
## 7 T7 139 68 64 NA <NA> NA
## 8 T7 140 48 67 NA <NA> NA
## 9 T35 NA NA NA 30.2 average 1
We are going to practice merging two datasets, and for this purpose, we will use the dataset dex1 along with a new dataset dex2.
# Exercise 3. Import dataset dex2
dex2 <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/excercise_2.csv")
dex2
## # A tibble: 5 × 4
## id school_prog awards city_state
## <dbl> <chr> <dbl> <chr>
## 1 1 public/vocation 0 NY New York
## 2 2 public/general 1 CA Los Angeles
## 3 3 private/academic 5 NY New York
## 4 5 private/academic 0 CA Los Angeles
## 5 100 public/academic 1 CA Los Angeles
3.1 In dataset dex1, keep the variables: id, ses, school_prog, math, science, and honors.
3.2 Subset dex1 to only students with ids 1 through 8.
3.3 Create a new dataset dex3 using dex1 and dex2, containing all rows from dex1 and only matched rows from dex2.
In dataset dex1, keep the variables id, ses, school_prog, math, science, and honors.
# Exercise 3. 3.1 Keeping a subset of columns
dex1 |>
select(id, ses, school_prog, math, science, honors)
## id ses school_prog math science honors
## 1 1 low public/vocation 45 39 not enrolled
## 2 2 middle public/general 33 42 enrolled
## 3 3 low private/academic 48 63 enrolled
## 4 4 low public/academic 41 39 not enrolled
## 5 5 low public/academic 43 45 not enrolled
## 6 6 low public/academic 46 40 not enrolled
## 7 7 middle public/academic 59 47 not enrolled
## 8 8 low public/academic 52 44 not enrolled
## 9 9 middle public/vocation 52 44 not enrolled
## 10 10 middle public/general 49 53 not enrolled
## 11 11 middle public/academic 45 39 not enrolled
## 12 12 middle public/vocation 45 39 not enrolled
## 13 13 middle public/vocation 39 47 not enrolled
## 14 14 high public/academic 54 42 not enrolled
## 15 15 high public/vocation 44 26 not enrolled
## 16 16 low public/vocation 44 36 not enrolled
## 17 17 middle public/academic 48 44 not enrolled
## 18 18 middle public/vocation 49 44 not enrolled
## 19 19 low public/general 43 44 not enrolled
## 20 20 high public/academic 57 61 not enrolled
## 21 21 middle public/general 61 50 not enrolled
## 22 22 middle public/vocation 39 56 not enrolled
## 23 23 low public/academic 64 58 enrolled
## 24 24 middle public/academic 66 47 enrolled
## 25 25 middle public/general 42 42 not enrolled
## 26 26 high public/academic 62 61 not enrolled
## 27 27 middle public/academic 61 57 enrolled
## 28 28 middle public/general 54 50 not enrolled
## 29 29 low public/general 49 55 not enrolled
## 30 30 high public/academic 42 34 not enrolled
## 31 31 middle private/general 52 42 not enrolled
## 32 32 high public/vocation 66 66 enrolled
## 33 33 low public/academic 72 54 enrolled
## 34 34 high private/academic 57 55 enrolled
## 35 35 low private/general 50 50 not enrolled
## 36 36 low public/general 44 35 not enrolled
## 37 37 low public/vocation 40 39 not enrolled
## 38 38 low public/academic 50 31 not enrolled
## 39 39 high public/academic 67 61 enrolled
## 40 40 low public/general 43 50 not enrolled
## 41 41 middle public/academic 45 55 not enrolled
## 42 42 middle public/vocation 55 44 not enrolled
## 43 43 low public/academic 43 42 not enrolled
## 44 44 low public/vocation 45 34 enrolled
## 45 45 low public/vocation 41 29 enrolled
## 46 46 low public/academic 44 34 not enrolled
## 47 47 low public/academic 49 33 not enrolled
## 48 48 middle public/academic 52 50 not enrolled
## 49 49 high public/vocation 39 49 not enrolled
## 50 50 middle public/general 42 53 not enrolled
## 51 51 high public/general 42 31 not enrolled
## 52 52 low public/academic 53 53 not enrolled
## 53 53 middle public/vocation 46 39 enrolled
## 54 54 low private/general 46 50 not enrolled
## 55 55 middle private/academic 49 44 not enrolled
## 56 56 middle public/vocation 46 58 not enrolled
## 57 57 middle public/academic 72 66 enrolled
## 58 58 middle public/vocation 40 44 not enrolled
## 59 59 middle public/academic 63 55 enrolled
## 60 60 middle public/academic 51 63 enrolled
## 61 61 high public/academic 60 67 enrolled
## 62 62 high public/general 48 63 enrolled
## 63 63 low public/general 60 56 enrolled
## 64 64 high public/vocation 45 58 not enrolled
## 65 65 middle public/academic 66 42 not enrolled
## 66 66 middle public/vocation 56 50 enrolled
## 67 67 low public/vocation 42 33 not enrolled
## 68 68 middle public/academic 71 63 enrolled
## 69 69 low public/vocation 40 40 not enrolled
## 70 70 low public/general 41 47 not enrolled
## 71 71 middle public/general 56 58 enrolled
## 72 72 middle public/vocation 47 47 not enrolled
## 73 73 middle public/academic 53 39 not enrolled
## 74 74 middle public/academic 50 51 not enrolled
## 75 75 middle public/vocation 51 53 not enrolled
## 76 76 high public/academic 51 50 not enrolled
## 77 77 low public/academic 49 44 not enrolled
## 78 78 middle public/academic 54 53 not enrolled
## 79 79 middle public/academic 49 50 enrolled
## 80 80 high public/academic 68 66 enrolled
## 81 81 low public/academic 59 65 not enrolled
## 82 82 high public/academic 65 69 enrolled
## 83 83 middle public/vocation 41 55 enrolled
## 84 84 middle public/general 54 58 not enrolled
## 85 85 middle public/general 57 53 not enrolled
## 86 86 high public/general 54 58 not enrolled
## 87 87 middle public/general 46 50 not enrolled
## 88 88 high public/academic 64 69 enrolled
## 89 89 low public/vocation 40 51 not enrolled
## 90 90 high public/academic 50 50 not enrolled
## 91 91 high public/vocation 56 47 not enrolled
## 92 92 high public/general 57 63 enrolled
## 93 93 high public/academic 62 58 enrolled
## 94 94 high public/academic 61 61 not enrolled
## 95 95 high public/academic 71 61 enrolled
## 96 96 high public/academic 61 58 not enrolled
## 97 97 high public/academic 58 58 not enrolled
## 98 98 low public/vocation 51 53 enrolled
## 99 99 high public/general 56 66 not enrolled
## 100 100 high public/academic 71 69 enrolled
## 101 101 high public/academic 67 50 enrolled
## 102 102 high public/academic 51 53 not enrolled
## 103 103 high public/academic 64 64 not enrolled
## 104 104 high public/academic 57 55 enrolled
## 105 105 middle public/academic 45 44 not enrolled
## 106 106 middle public/vocation 37 42 not enrolled
## 107 107 low public/vocation 47 42 not enrolled
## 108 108 middle public/general 41 36 not enrolled
## 109 109 middle public/general 42 42 not enrolled
## 110 110 middle public/vocation 50 54 not enrolled
## 111 111 low public/general 39 47 not enrolled
## 112 112 middle public/academic 48 55 not enrolled
## 113 113 middle public/academic 51 63 not enrolled
## 114 114 high public/academic 62 55 enrolled
## 115 115 low public/general 43 50 not enrolled
## 116 116 middle public/academic 54 50 not enrolled
## 117 117 high public/vocation 39 42 not enrolled
## 118 118 middle public/general 58 58 enrolled
## 119 119 low public/general 45 50 not enrolled
## 120 120 high public/academic 54 50 not enrolled
## 121 121 middle public/vocation 53 63 not enrolled
## 122 122 middle public/academic 58 53 not enrolled
## 123 123 high public/general 56 63 not enrolled
## 124 124 low public/vocation 41 42 not enrolled
## 125 125 low public/academic 58 59 enrolled
## 126 126 middle public/general 57 47 not enrolled
## 127 127 high public/academic 57 55 not enrolled
## 128 128 <NA> public/academic 38 47 not enrolled
## 129 129 low public/general 46 47 not enrolled
## 130 130 high public/general 55 55 not enrolled
## 131 131 high public/academic 57 46 not enrolled
## 132 132 middle public/academic 73 69 enrolled
## 133 133 middle public/vocation 40 34 enrolled
## 134 134 low public/general 39 34 not enrolled
## 135 135 low public/academic 65 54 enrolled
## 136 136 middle public/academic 70 63 not enrolled
## 137 137 high public/academic 65 53 enrolled
## 138 138 middle public/vocation 40 50 not enrolled
## 139 139 middle public/academic 61 55 not enrolled
## 140 140 middle public/vocation 40 50 not enrolled
## 141 141 high public/vocation 47 53 not enrolled
## 142 142 middle public/vocation 52 39 not enrolled
## 143 143 middle public/vocation 75 72 enrolled
## 144 144 high public/general 58 61 enrolled
## 145 145 middle public/vocation 38 36 not enrolled
## 146 146 high public/academic 64 63 enrolled
## 147 147 low public/academic 53 53 enrolled
## 148 148 middle public/vocation 51 47 not enrolled
## 149 149 low public/general 49 66 not enrolled
## 150 150 middle public/vocation 57 72 not enrolled
## 151 151 middle public/vocation 52 48 not enrolled
## 152 152 high public/academic 56 58 not enrolled
## 153 153 middle public/vocation 40 39 not enrolled
## 154 154 high public/academic 66 61 enrolled
## 155 155 middle public/general 46 39 not enrolled
## 156 156 middle public/academic 53 61 not enrolled
## 157 157 middle public/general 58 74 not enrolled
## 158 158 middle public/general 55 53 not enrolled
## 159 159 high public/academic 54 49 enrolled
## 160 160 middle public/academic 55 50 enrolled
## 161 161 low public/academic 72 61 enrolled
## 162 162 middle public/vocation 40 61 not enrolled
## 163 163 low public/academic 64 58 not enrolled
## 164 164 middle public/vocation 46 39 enrolled
## 165 165 low public/vocation 54 61 not enrolled
## 166 166 middle public/academic 53 61 not enrolled
## 167 167 middle public/general 35 66 not enrolled
## 168 168 middle public/academic 57 55 not enrolled
## 169 169 low public/general 63 69 not enrolled
## 170 170 high public/academic 61 69 enrolled
## 171 171 middle public/academic 60 55 not enrolled
## 172 172 middle public/academic 57 53 not enrolled
## 173 173 low public/general 61 63 enrolled
## 174 174 middle private/academic 71 66 not enrolled
## 175 175 high private/general 42 50 not enrolled
## 176 176 middle private/academic 41 42 not enrolled
## 177 177 middle private/academic 62 58 not enrolled
## 178 178 middle private/vocation 57 58 not enrolled
## 179 179 middle private/academic 60 50 enrolled
## 180 180 high private/academic 69 58 enrolled
## 181 181 middle private/academic 45 58 not enrolled
## 182 182 middle private/academic 43 44 not enrolled
## 183 183 middle private/academic 49 55 not enrolled
## 184 184 middle private/vocation 53 55 not enrolled
## 185 185 middle private/academic 55 58 not enrolled
## 186 186 middle private/academic 63 55 enrolled
## 187 187 middle private/general 57 55 not enrolled
## 188 188 high private/academic 56 55 enrolled
## 189 189 middle private/academic 63 53 not enrolled
## 190 190 middle private/academic 54 58 not enrolled
## 191 191 high private/academic 43 48 not enrolled
## 192 192 high private/academic 63 66 enrolled
## 193 193 middle private/academic 48 39 not enrolled
## 194 194 high private/academic 69 61 enrolled
## 195 195 middle private/general 60 58 not enrolled
## 196 196 high private/academic 49 39 not enrolled
## 197 197 high private/academic 50 36 not enrolled
## 198 198 high private/academic 51 63 enrolled
## 199 199 high private/academic 50 61 not enrolled
## 200 200 middle private/academic 75 66 not enrolled
head(dex1)
## id sex ses school_prog read write math science social_st
## 1 1 female low public/vocation NA 44 45 39 41
## 2 2 female middle public/general 39 41 33 42 41
## 3 3 male low private/academic 63 65 48 63 56
## 4 4 female low public/academic 44 50 41 39 51
## 5 5 male low public/academic 47 40 43 45 31
## 6 6 female low public/academic 47 41 46 40 41
## honors awards
## 1 not enrolled 0
## 2 enrolled 1
## 3 enrolled 5
## 4 not enrolled 1
## 5 not enrolled 0
## 6 not enrolled 0
Subset dex1 to only students with ids 1 through 8.
# Exercise 3. 3.2 dropping rows
dex1 |>
select(id, ses, school_prog, math, science, honors) |>
filter(id %in% 1:8)
## id ses school_prog math science honors
## 1 1 low public/vocation 45 39 not enrolled
## 2 2 middle public/general 33 42 enrolled
## 3 3 low private/academic 48 63 enrolled
## 4 4 low public/academic 41 39 not enrolled
## 5 5 low public/academic 43 45 not enrolled
## 6 6 low public/academic 46 40 not enrolled
## 7 7 middle public/academic 59 47 not enrolled
## 8 8 low public/academic 52 44 not enrolled
Create a new dataset dex3 using dex1 and dex2, containing all rows from dex1 and only matched rows from dex2.
# Exercise 3. 3.3 creating a new dataset dex3
dex3 <- dex1 |>
select(id, ses, school_prog, math, science, honors) |> # dropping columns
filter(id %in% c(1:8)) |> # dropping rows
left_join(dex2) # merging two datasets
dex3
## id ses school_prog math science honors awards city_state
## 1 1 low public/vocation 45 39 not enrolled 0 NY New York
## 2 2 middle public/general 33 42 enrolled 1 CA Los Angeles
## 3 3 low private/academic 48 63 enrolled 5 NY New York
## 4 4 low public/academic 41 39 not enrolled NA <NA>
## 5 5 low public/academic 43 45 not enrolled NA <NA>
## 6 6 low public/academic 46 40 not enrolled NA <NA>
## 7 7 middle public/academic 59 47 not enrolled NA <NA>
## 8 8 low public/academic 52 44 not enrolled NA <NA>
We now explore the tools in the tidyverse package tidyr (loaded with library(tidyverse)), which make datasets “tidy”.
Tidy datasets adhere to a simple set of structural rules:
each variable forms a column
each observation forms a row
each value occupies a single cell
Doing a little more work upfront to make data “tidy” will pay off with time saved later on.
Data can be messy in many ways, but we focus on two ways that can be tidied with tidyr tools:
values of one variable spread across several column headings (fixed with pivot_longer()).
values belonging to one observation spread across several rows (fixed with pivot_wider()).
We first load a dataset of the number of graduates from 3 academic departments across 3 years.
The planned analysis will examine how graduation has changed differently over the years by department. For such an analysis, we will need variables that represent the department, the year, and the number of graduates.
# import department admissions data for tidying up
dept <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dept1.csv")
dept
## # A tibble: 3 × 4
## id `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl>
## 1 biology 207 211 259
## 2 math 96 75 99
## 3 physics 112 126 125
However, we see in the data that year is represented as column headings, and graduation numbers are spread over 3 columns. We need to stack the graduation numbers into one column and create a variable representing year.
Use pivot_longer() to create a variable out of column headings and restructure the dataset
With pivot_longer(), we lengthen the data by stacking columns so that the final dataset has more rows and fewer columns:
This process is also known as “reshaping long”.
Arguments to pivot_longer():
cols= : the column variables to stack into a single column
names_to= : name of the new variable that will hold the repeated column headings
values_to= : name of the new variable that will hold the stacked values of the selected columns
Here again is the data set we want to reshape:
# inspect department admissions data
dept
## # A tibble: 3 × 4
## id `2015` `2016` `2017`
## <chr> <dbl> <dbl> <dbl>
## 1 biology 207 211 259
## 2 math 96 75 99
## 3 physics 112 126 125
In the newly reshaped dataset, we want to put years and graduation rates into 2 columns:
cols=-id : stack every column except id*
names_to="year" : the column headings become values of a new variable year
values_to="grad" : the stacked counts become values of a new variable grad
*We could also have specified each column with c(`2015`, `2016`, `2017`); the backticks are needed to specify the column names here because they don’t start with a letter.
# the new column "year" uses the column headings as values,
# the new column "graduates" will be the collapsed values
# we do not want to collapse id
dept_by_year <- dept |>
pivot_longer(cols=-id, names_to="year", values_to="grad")
dept_by_year
## # A tibble: 9 × 3
## id year grad
## <chr> <chr> <dbl>
## 1 biology 2015 207
## 2 biology 2016 211
## 3 biology 2017 259
## 4 math 2015 96
## 5 math 2016 75
## 6 math 2017 99
## 7 physics 2015 112
## 8 physics 2016 126
## 9 physics 2017 125
Equivalently, the stacked columns can be supplied as an unnamed argument, which is matched to cols= by position:
pivot_longer(names_to="year", values_to="grad", -id)
Columns should contain values that represent one variable, but we often encounter datasets where multiple variables are stored in the same column.
Let’s take a look at a dataset of worms, who have had their age, length, and weight measured, but all stored in one column.
# import data on worms for tidying up
worms <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/worms.csv")
worms
## # A tibble: 9 × 3
## worm feature measure
## <dbl> <chr> <dbl>
## 1 1 age 5
## 2 1 length 3.2
## 3 1 weight 4.1
## 4 2 age 4
## 5 2 length 2.6
## 6 2 weight 3.5
## 7 3 age 5
## 8 3 length 3.6
## 9 3 weight 5.5
We are interested in whether age predicts length and weight, which would make the unit of analysis the worm. So each row should be one worm.
We want the feature values to be spread across three columns.
Use pivot_wider() to spread a single column into multiple columns
Here, we use pivot_wider() to “widen” the data by dividing single columns into multiple columns, which reverses the operations of pivot_longer().
This process is sometimes known as “reshaping wide”.
Now, we want “age”, “length”, and “weight” to be their own columns, and the values in the “measure” column to be spread across these three new columns.
Arguments to pivot_wider():
names_from= : name of the existing column variable that holds the names of the new columns
values_from= : name of the existing column variable that holds the values to be spread across the new columns
In dataset “worms”, the names of the new columns are found in “feature” and the values to spread across columns are found in “measure.”
pivot_wider(names_from=feature, values_from=measure)
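Putting the call together with the pipe (a sketch; the wide layout follows directly from the nine rows shown above):

```r
# one row per worm, one column per feature
worms |>
  pivot_wider(names_from=feature, values_from=measure)
## # A tibble: 3 × 4
##    worm   age length weight
##   <dbl> <dbl>  <dbl>  <dbl>
## 1     1     5    3.2    4.1
## 2     2     4    2.6    3.5
## 3     3     5    3.6    5.5
```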
Two other issues that make datasets untidy are columns where multiple variables are stored in the same cell, or where one variable is stored across multiple cells.
The table5
data that is loaded with the
tidyr
package exhibits both of these problems.
# table5 for tidying up
table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
Use separate() and unite() to split and concatenate columns, respectively
separate() and unite() are complements of each other.
For separate():
col= : the name of the column to split
into= : the names of the new character variable columns
sep= : the separator character between columns
remove= : whether to remove the original column from the data frame; the default is TRUE
For unite():
col= : the name of the new, concatenated column
sep= : the separator, “_” by default
# remember that new variable names must be quoted, while old variables do not need to be
table5 |>
separate(col=rate, into=c("cases", "population"), sep="/") |>
unite(col="year", century, year, sep="")
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
tidyverse cheatsheets
In the Help menu for RStudio, there is a submenu for “Cheatsheets”,
which are documents that summarize the usage and syntax of various
widely-used R
tools. You’ll find a cheatsheet for
dplyr
linked in this menu.
Before dates in a dataset are loaded into R
, typically
they are stored as a column of character (string) values. However, dates
are inherently numeric and we lose this information when they are stored
as strings.
# string dates have no numeric value, so this errors
"2018-03-05" + 1
## Error in "2018-03-05" + 1: non-numeric argument to binary operator
In R, we want to convert our string dates to R class Date.
Values of class Date are numbers representing the number of days since January 1, 1970 (1970-01-01).
We can then do math with Date variables, such as calculating the number of days between 2 dates.
Converting strings to Date class variables with as.Date()
Base R
provides as.Date()
to convert
strings to dates, but it can be intimidating and unwieldy to use,
particularly if your dates are not stored in one of 2 default formats
(see ?strptime
for more about date formats).
# as.Date only accepts a couple of formats by default
# good
as.Date("2015-02-14")
## [1] "2015-02-14"
# bad
as.Date("02/14/2014")
## Error in charToDate(x): character string is not in a standard unambiguous format
# specify a format to fix
as.Date("02/14/2014", format="%m/%d/%Y")
## [1] "2014-02-14"
Once our dates are class Date
, we can perform date
arithmetic.
# a Date class variable
a <- as.Date("1971-01-01")
class(a)
## [1] "Date"
# days since 1970-01-01
as.numeric(a)
## [1] 365
# date arithmetic
a - as.Date("1971-12-31")
## Time difference of -364 days
a + 2
## [1] "1971-01-03"
We just need something easier to use to convert strings of various
formats to class Date
!
lubridate
The amusingly-named tidyverse
package
lubridate
provides easy-to-use functions to convert string
dates to R
Date
format and then to process
those dates.
The Date
conversion functions in lubridate
accept a wide variety of string date formats.
In older versions of tidyverse, lubridate was installed but not loaded by library(tidyverse); as of tidyverse 2.0.0 (shown in the loading message above), it is a core package and is loaded automatically.
lubridate
We’ll first load a dataset with various date formats to demonstrate
the flexibility of lubridate
functions.
# import dates datset
d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dates.csv")
d
## # A tibble: 3 × 5
## fmt1 fmt2 fmt3 fmt4 decision_time
## <chr> <chr> <date> <dbl> <chr>
## 1 01/15/89 December 8, 2015 2015-02-27 20090101 Approved 10-10-15 07:15:55
## 2 02/13/92 January 22, 2012 2016-11-15 20080819 Denied 09-27-11 14:57:23
## 3 03/15/84 March 3, 2010 2017-12-25 20071011 Approved 04-24-15 02:03:03
The columns were imported as classes chr, chr, date, and dbl.
The third column was converted to class Date by read_csv() because its particular formatting signals a date to read_csv().
To identify which string-to-Date conversion lubridate function you’ll need, note the order in which the year, month, and day appear in the string, then arrange the letters y, m, and d in that order to form the name of the function that converts strings to Date (e.g. ymd(), mdy(), dmy()).
The first pair of columns use the order month-day-year, while the second pair of columns use year-month-day.
So, we want mdy() and ymd().
Marvel at the flexibility!
# no format specifications needed
# just ordering y,m, and d
dates <- data.frame(f1=mdy(d$fmt1), f2=mdy(d$fmt2),
f3=ymd(d$fmt3), f4=ymd(d$fmt4))
dates
## f1 f2 f3 f4
## 1 1989-01-15 2015-12-08 2015-02-27 2009-01-01
## 2 1992-02-13 2012-01-22 2016-11-15 2008-08-19
## 3 1984-03-15 2010-03-03 2017-12-25 2007-10-11
Date-time variables include a time in addition to a date.
In R
, date-time variables can be stored as class
POSIXct
, representing the number of seconds since midnight
(00:00:00), January 1, 1970.
lubridate
provides functions to convert string date-time
variables to class POSIXct
. Simply add one or more of h, m, and s (for hours, minutes, and seconds, respectively) to y, m, and d to form the function name.
Examples:
# month, day, year, hour and minute with PM
mdy_hm("7/19/23 11:30PM")
## [1] "2023-07-19 23:30:00 UTC"
# year, month, day, hour, minute, second
ymd_hms("2023-4-19 08:25:30")
## [1] "2023-04-19 08:25:30 UTC"
# time difference between those 2 date-times
mdy_hm("7/19/23 1:30PM") - ymd_hms("2023-4-19 21:25:30")
## Time difference of 90.66979 days
# class POSIXct is a standard way of representing calendar time
class(mdy_hm("7/19/23 11:30PM"))
## [1] "POSIXct" "POSIXt"
The fifth column of our dates
dataset,
decision_time
, is a date-time variable recorded as month,
day, year, hour, minute, second, so we want mdy_hms()
.
(Notice how the “Approved/Denied” string at the beginning is
ignored!).
# month day year hour minute second
mdy_hms(d$decision_time)
## [1] "2015-10-10 07:15:55 UTC" "2011-09-27 14:57:23 UTC"
## [3] "2015-04-24 02:03:03 UTC"
The “time zone” standard UTC
(Coordinated Universal
Time) is used as the default zone. Use the argument, tz=
to
assign a zone. To see a list of valid time zone specifications, run
OlsonNames()
.
# Assign a time zone to a time variable
dates$decision_time <- mdy_hms(d$decision_time, tz="US/Pacific")
dates$decision_time
## [1] "2015-10-10 07:15:55 PDT" "2011-09-27 14:57:23 PDT"
## [3] "2015-04-24 02:03:03 PDT"
# first 20 valid time zones
head(OlsonNames(), n=20)
## [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
## [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"
## [7] "Africa/Bamako" "Africa/Bangui" "Africa/Banjul"
## [10] "Africa/Bissau" "Africa/Blantyre" "Africa/Brazzaville"
## [13] "Africa/Bujumbura" "Africa/Cairo" "Africa/Casablanca"
## [16] "Africa/Ceuta" "Africa/Conakry" "Africa/Dakar"
## [19] "Africa/Dar_es_Salaam" "Africa/Djibouti"
Extracting information from Date variables
lubridate provides several functions to extract specific information from Date variables, including:
day() : day of the month
wday() : weekday
yday() : day of the year
month() : month of the year
year() : year
Some examples of extracting information from Date variables.
# we'll use the first column of our dates dataset
dates$f1
## [1] "1989-01-15" "1992-02-13" "1984-03-15"
# day of the month
day(dates$f1)
## [1] 15 13 15
# day of the year
yday(dates$f1)
## [1] 15 44 75
Some more date information extraction:
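For example, applying month(), year(), and wday() to the same column:

```r
# month of the year
month(dates$f1)
## [1] 1 2 3
# year
year(dates$f1)
## [1] 1989 1992 1984
# weekday as a labelled factor
wday(dates$f1, label=TRUE)
```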
Extracting time from POSIXct variables
lubridate also has functions to extract time information from POSIXct date-time variables:
hour()
minute()
second()
# break up the time variable decision_time
# display as a data.frame with 4 columns
with(dates, # with() tells R to look for variables in object "dates"
data.frame(time=decision_time, h=hour(decision_time),
m=minute(decision_time), s=second(decision_time)))
## time h m s
## 1 2015-10-10 07:15:55 7 15 55
## 2 2011-09-27 14:57:23 14 57 23
## 3 2015-04-24 02:03:03 2 3 3
If you need to add or subtract time to your date variables,
lubridate
provides 2 sets of functions.
One set gives “intuitive” results, and will ignore conventions like
leap year. These include seconds()
, minutes()
,
hours()
, days()
, weeks()
,
years()
.
The other set will adhere to those conventions, and are named by
adding d
to the names of the previous functions,
dseconds()
, dminutes()
, dhours()
,
ddays()
, dweeks()
, dyears()
.
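A quick sketch of the difference between the two sets, using illustrative dates:

```r
# periods respect the calendar: "the same date next year"
ymd("2020-01-01") + years(1)
## [1] "2021-01-01"
# adding a period that lands on a nonexistent date gives NA
ymd("2020-02-29") + years(1)
## [1] NA
# durations add a fixed amount of time (365.25 days' worth here)
ymd("2020-01-01") + dyears(1)
```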
Strings are character values.
Columns in external datasets that contain any non-number value will
generally be read into R
as string variables.
When specifying a specific string value, the value should be
surrounded by double quotes, ""
, or single quotes,
''
.
In this section we will cover functions that:
concatenate strings together
extract and replace substrings
change the case of strings
find and replace text patterns in strings
stringr
The tidyverse
package stringr
provides many
functions that have nearly the same functionality as similar base
R
functions.
stringr
functions tend to have a more consistent
syntaxR
string functions are still used widely, so we
cover them as wellThe package stringr
is automatically loaded with
library(tidyverse)
.
We will use a dataset of locations of partially fictitious statistical consulting centers.
# import centers data
centers <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/stats_centers.csv")
centers
## # A tibble: 4 × 6
## NAME ROOM BUILDING UNIVERSITY CITY STATE
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 IDRE STAT CONSULT X4919 MATH SCIENCE UCLA LOS ANGELES CA
## 2 OIT CONSULTING Y521 HOUSTON HALL LONE STAR UNIV. DALLAS TX
## 3 ATS STAT HELP M234 ORANGE BUILDING SUNSHINE COLLEGE MIAMI FL
## 4 OAC STAT CENTER B2431 SNOW HALL GRIZZLY UNIVERSITY JUNEAU AK
At least 2 functions loaded in the current session (base R and tidyverse) can concatenate strings together. We highlight what makes each unique:
paste() from base R
str_c() from stringr (tidyverse)
Using paste() to concatenate strings
When used on vectors, base R function paste() concatenates element-wise. By default, joined elements are separated by a space.
# element-by-element concatenation
paste(centers$ROOM, centers$BUILDING)
## [1] "X4919 MATH SCIENCE" "Y521 HOUSTON HALL" "M234 ORANGE BUILDING"
## [4] "B2431 SNOW HALL"
You can change the separating character with sep=
.
# separate with comma and space
paste(centers$CITY, centers$STATE, sep=", ")
## [1] "LOS ANGELES, CA" "DALLAS, TX" "MIAMI, FL" "JUNEAU, AK"
Vectors of numbers used in paste()
are converted to
character before pasting.
# separate with comma and space
paste(1:4, centers$UNIVERSITY)
## [1] "1 UCLA" "2 LONE STAR UNIV." "3 SUNSHINE COLLEGE"
## [4] "4 GRIZZLY UNIVERSITY"
Interestingly, paste()
will coerce the missing value
NA
to the string "NA"
. This usually is not
desirable.
# separate with comma and space
paste(centers$CITY, centers$STATE, c("USA", "USA","USA", NA), sep=", ")
## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, USA" "MIAMI, FL, USA"
## [4] "JUNEAU, AK, NA"
If you need to paste the elements within a vector together into a
single string, specify a separating character with
collapse=
instead of sep=
.
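For example, collapsing the state abbreviations into one string:

```r
# collapse all elements of the vector into a single string
paste(centers$STATE, collapse=", ")
## [1] "CA, TX, FL, AK"
```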
Using str_c() from stringr to concatenate strings
str_c() from the package stringr is quite similar to paste(). It has both sep= and collapse= arguments that function the same as in paste(). It differs in a couple of ways.
First, its default separator is the empty string "":
# default is no separating character
str_c(centers$CITY, centers$STATE)
## [1] "LOS ANGELESCA" "DALLASTX" "MIAMIFL" "JUNEAUAK"
Second, concatenating a NA
into a string will result in
NA
:
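For example, in contrast to the paste() behavior above:

```r
# NA propagates instead of becoming the string "NA"
str_c("JUNEAU", NA, sep=", ")
## [1] NA
```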
To extract characters, or a “substring”, from a string by position, we can use either of two similar functions:
substr() from base R
str_sub() from stringr
For both functions, we specify a string variable, a start position, and an end position.
# start at first, end at third character
substr(centers$BUILDING, 1, 3)
## [1] "MAT" "HOU" "ORA" "SNO"
# specifying an end longer than the entirety of the string
# results in truncation
str_sub(centers$UNIVERSITY, 2, 6)
## [1] "CLA" "ONE S" "UNSHI" "RIZZL"
One difference between substr()
and
str_sub()
is that str_sub()
accepts negative
numbers, which count from the end of the string.
Using the assignment operator <-
with
substr()
and str_sub()
allows us to replace a
substring in string variables by position.
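A short sketch of both features; x here is a throwaway variable:

```r
# negative positions count from the end of the string
str_sub(centers$CITY, -2, -1)
## [1] "ES" "AS" "MI" "AU"
# replacement by position via assignment
x <- "MATH SCIENCE"
substr(x, 1, 4) <- "STAT"
x
## [1] "STAT SCIENCE"
```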
Use base R
functions tolower()
and
toupper()
to change the case of string variables.
# lower case city names
tolower(centers$CITY)
## [1] "los angeles" "dallas" "miami" "juneau"
# change variable names to lower case
names(centers) <- tolower(names(centers))
names(centers)
## [1] "name" "room" "building" "university" "city"
## [6] "state"
The stringr
functions str_to_lower()
and
str_to_upper()
work very similarly.
Finding pattern matches with grep()
Often, we need to scan through a vector of strings and find values that match a text pattern (e.g. finding all phone numbers beginning with area code “310”).
The primary base R
function for pattern matching is
grep()
. Several other related functions also match
patterns, but return slightly different output
(e.g. grepl()
and regexpr()
).
The basic syntax is grep(pattern, x)
, where
pattern
is the text pattern to match in character vector
x
.
By default, grep()
returns the index (row) number of the
matches, but we can get the strings themselves returned with
value=TRUE
.
# stat center names
centers$name
## [1] "IDRE STAT CONSULT" "OIT CONSULTING" "ATS STAT HELP"
## [4] "OAC STAT CENTER"
# row numbers of names with "STAT"
grep(pattern="STAT", centers$name)
## [1] 1 3 4
# we can use those indices to subset
centers$name[grep(pattern="STAT", centers$name)]
## [1] "IDRE STAT CONSULT" "ATS STAT HELP" "OAC STAT CENTER"
# or just use value=TRUE to get the words directly
grep(pattern="STAT", centers$name, value=TRUE)
## [1] "IDRE STAT CONSULT" "ATS STAT HELP" "OAC STAT CENTER"
stringr
has a very similar function called
str_detect()
, which returns TRUE/FALSE for string matching
(equivalent to base R grepl()
).
One advantage of the stringr
functions is that the
vector to use for matching is the first argument, which will make piping
simpler.
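For example:

```r
# logical vector of matches
str_detect(centers$name, "CONSULT")
## [1]  TRUE  TRUE FALSE FALSE
# convenient inside a pipe
centers |>
  filter(str_detect(name, "CONSULT"))
```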
Substituting based on pattern matches with sub()
We can also substitute based on pattern matches with base
R
function sub()
. Now the syntax is
sub(pattern, replacement, x)
, where pattern
is
the pattern to be matched, replacement
is the replacement
text, and x
is the string vector.
# change STAT to STATISTICAL
centers$name <- sub("STAT", "STATISTICAL", centers$name)
centers$name
## [1] "IDRE STATISTICAL CONSULT" "OIT CONSULTING"
## [3] "ATS STATISTICAL HELP" "OAC STATISTICAL CENTER"
The stringr
function for pattern substitution is
str_replace()
.
# stringr substitution; note centers$name already contains "STATISTICAL"
# from the sub() call above, so "STAT" now matches inside it
str_replace(centers$name, "STAT", "STATISTICAL")
## [1] "IDRE STATISTICALISTICAL CONSULT" "OIT CONSULTING"
## [3] "ATS STATISTICALISTICAL HELP" "OAC STATISTICALISTICAL CENTER"
Note: All of grep()
,
str_detect()
, sub()
, and
str_replace()
accept regular expressions, a very powerful
coding system for string matching that is beyond the scope of this
workshop.
Use the dataset dex3 that we created and create a new dataset dex4.
4.1 Keep rows where student id is 1, 2 and 3.
4.2 Generate a new variable total_pts that contains a sum of math and science scores for each student.
4.3 Recode the total_pts variable into “high” if a student scored 100 or higher, “medium” if their score falls between 80 and 99, and “low” if they got less than 80.
4.4 Create a variable state to display the state in which each student’s school is located.
4.5 Finally, notice that school_prog variable has two pieces of information: type of school (public or private) and program (academic, general, vocational). Split this variable into two separate columns school and program.
Keep rows where student id is 1, 2 and 3.
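The later solutions pipe from a three-row dex3, so presumably the result of 4.1 was saved back into dex3; a sketch:

```r
# Exercise 4. 4.1 keeping rows with ids 1 through 3
dex3 <- dex3 |>
  filter(id %in% 1:3)
```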
Generate a new variable total_pts that contains a sum of math and science scores for each student.
# Exercise 4. 4.2 creating total_pts variable
dex3 |>
mutate( total_pts = sum(math, science), .by=id) |>
relocate(total_pts, .after = id)
## id total_pts ses school_prog math science honors awards
## 1 1 84 low public/vocation 45 39 not enrolled 0
## 2 2 75 middle public/general 33 42 enrolled 1
## 3 3 111 low private/academic 48 63 enrolled 5
## city_state
## 1 NY New York
## 2 CA Los Angeles
## 3 NY New York
Recode the total_pts variable into “high” if a student scored 100 or higher, “medium” if their score falls between 80 and 99, and “low” if they got less than 80.
# Exercise 4. 4.3 recoding total_pts
dex3 |>
mutate( total_pts = sum(math, science), .by=id) |>
mutate( total_pts_recoded = case_when(
total_pts >= 100 ~ "high",
total_pts >= 80 & total_pts < 100 ~ "medium",
total_pts < 80 ~ "low" ) ) |>
relocate(total_pts, total_pts_recoded, .after = id)
## id total_pts total_pts_recoded ses school_prog math science
## 1 1 84 medium low public/vocation 45 39
## 2 2 75 low middle public/general 33 42
## 3 3 111 high low private/academic 48 63
## honors awards city_state
## 1 not enrolled 0 NY New York
## 2 enrolled 1 CA Los Angeles
## 3 enrolled 5 NY New York
Create a variable state to display the state in which each student’s school is located.
# Exercise 4. 4.4 creating the state variable
dex3 |>
mutate( total_pts = sum(math,science), .by = id) |>
mutate(total_pts_recoded = case_when(
total_pts >= 100 ~ "high",
total_pts >= 80 & total_pts < 100 ~ "medium",
total_pts < 80 ~ "low" ),
state = str_sub(city_state, 1, 2) ) |>
relocate(city_state, state, .after = id)
## id city_state state ses school_prog math science honors
## 1 1 NY New York NY low public/vocation 45 39 not enrolled
## 2 2 CA Los Angeles CA middle public/general 33 42 enrolled
## 3 3 NY New York NY low private/academic 48 63 enrolled
## awards total_pts total_pts_recoded
## 1 0 84 medium
## 2 1 75 low
## 3 5 111 high
school_prog variable has two pieces of information: type of school (public or private) and program (academic, general, vocational). Split this variable into two separate columns school and program.
# Exercise 4. 4.5 splitting school_prog into school and program
dex4 <- dex3 |>
mutate( total_pts = sum(math,science), .by = id) |> # creating a variable total_pts
mutate(total_pts_recoded = case_when( # recoding total_pts variable
total_pts >= 100 ~ "high",
total_pts >= 80 & total_pts < 100 ~ "medium",
total_pts < 80 ~ "low" ),
state = str_sub(city_state, 1, 2) ) |> # extracting characters from the city_state variable
separate(school_prog, into=c("school", "program"), sep="/") |> # separating school_prog variable
relocate(school, program, state, total_pts, total_pts_recoded, .after = id)
dex4
## id school program state total_pts total_pts_recoded ses math science
## 1 1 public vocation NY 84 medium low 45 39
## 2 2 public general CA 75 low middle 33 42
## 3 3 private academic NY 111 high low 48 63
## honors awards city_state
## 1 not enrolled 0 NY New York
## 2 enrolled 1 CA Los Angeles
## 3 enrolled 5 NY New York
Factors can be used to represent categorical variables in R and are particularly useful in statistical modeling.
Factors allow reordering character (or integer) vectors when needed to improve display.
Both base R and the forcats package in tidyverse handle factors.
We use the factor() function to convert a character (or integer) variable to a factor.
By default, factor() orders the levels alphabetically.
# character vector
dex4$program
## [1] "vocation" "general" "academic"
# convert to factor
dex4$program_fct <- factor(dex4$program)
# categories in alphabetical order
dex4$program_fct
## [1] vocation general academic
## Levels: academic general vocation
We can specify the order using the levels= argument in factor().
# specifying the order of levels
dex4$program_fct <- factor(dex4$program_fct, levels = c("vocation","academic", "general"))
dex4$program_fct
## [1] vocation general academic
## Levels: vocation academic general
We can also see the underlying integer values by converting the
factors to a numeric
vector. Essentially, factors are
integer variables with labels for each integer value.
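For example, with the releveled program_fct (levels vocation, academic, general):

```r
# the integer codes follow the level order
as.numeric(dex4$program_fct)
## [1] 1 3 2
```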
A vector of integers can be converted to a factor using the labels= argument in factor(). The text labels then become the factor levels.
# honors: 0="not enrolled", 1="enrolled"
dex4$honors_int <- if_else(dex4$honors == "not enrolled", 0, 1)
dex4$honors_int
## [1] 0 1 1
# convert integer vector into a factor and assign levels
# labels become the levels
factor(dex4$honors_int, labels=c("not enrolled","enrolled"))
## [1] not enrolled enrolled enrolled
## Levels: not enrolled enrolled
In regression models, R
will automatically create dummy
variables for each level of factor except for the first, which will thus
be the reference group.
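One way to see the dummy coding is model.matrix(), which builds the design matrix a regression formula would use; the first level (here vocation) is the reference and gets no column:

```r
# dummy columns for academic and general; vocation is the reference
model.matrix(~ program_fct, data=dex4)
```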
forcats
The tidyverse
package forcats
provides many
functions that help with dealing with categorical variables.
The package is automatically loaded with
library(tidyverse)
.
We will discuss a handful of forcats functions.
functions.
fct_recode()
Sometimes you need to change factor levels, e.g. collapse a couple of categories into one. Use the fct_recode() function to do this by hand.
The order in which new and old levels are specified is new level name = "old level name".
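For example, collapsing general and vocation into one level (the new name non-academic is just for illustration):

```r
# new level name on the left, old level name on the right
fct_recode(dex4$program_fct,
           "non-academic" = "general",
           "non-academic" = "vocation")
## [1] non-academic non-academic academic
## Levels: non-academic academic
```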
fct_relevel()
The fct_relevel() function allows you to move any number of factor levels to any location. It is useful when you want to change the reference group in a regression model.
# factor variable
dex4$program_fct
## [1] vocation general academic
## Levels: vocation academic general
# put academic at the first position
fct_relevel(dex4$program_fct, "academic")
## [1] vocation general academic
## Levels: academic vocation general
# put "academic" at the second position
fct_relevel(dex4$program_fct, "academic", after=1)
## [1] vocation general academic
## Levels: vocation academic general
# reverse
fct_relevel(dex4$program_fct, rev)
## [1] vocation general academic
## Levels: general academic vocation
The fct_expand() function allows for adding new levels to a factor variable.
# create a factor vector
dex4$state_fct <- factor( dex4$state )
dex4$state_fct
## [1] NY CA NY
## Levels: CA NY
# add level to a factor
fct_expand(dex4$state_fct, c("TX", "IL"))
## [1] NY CA NY
## Levels: CA NY TX IL
# specify where new factors should be added
dex4$state_fct <- fct_expand(dex4$state_fct, c("TX","IL"), after=1)
dex4$state_fct
## [1] NY CA NY
## Levels: CA TX IL NY
When we want to drop unused factor levels, we can use the fct_drop() function.
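For example, state_fct still carries the unused TX and IL levels added above:

```r
# drop levels with no observations
fct_drop(dex4$state_fct)
## [1] NY CA NY
## Levels: CA NY
```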
Wickham, Hadley (2014). Tidy Data. Journal of Statistical Software. 59(10).
R for Data Science, an online book written by the people who make RStudio and the tidyverse, that discusses how to use R for all steps of data analysis, with special emphasis on using the tidyverse
R for Data Science (2e), the 2nd edition of “R for Data Science”
Handling
and Processing Strings in R by Gaston Sanchez, a comprehensive
and free guide to working with strings in R
tidyverse vignettes: vignette("dplyr"), vignette("tidy-data"), vignette("lubridate"), vignette("stringr"), vignette("forcats")