R Data Management
UCLA Office of Advanced Research Computing
Statistical Methods and Data Analytics

Introduction

Purpose

This workshop introduces R tools for data management.

We focus on functions found in the tidyverse collection of packages, but also include tools found in base R and other packages.

We will cover the following topics:

Data management packages used in this seminar

tidyverse: a collection of packages with tools for most aspects of data analysis, particularly strong in data import, management, and visualization. We will focus on the following packages within tidyverse:

Hmisc: another large collection of tools for most aspects of data analysis, but we use it here for describe(), a dataset summary function

Installing packages in R

When tidyverse is installed with install.packages(), all of the tidyverse packages will be installed.

# Install packages
install.packages("tidyverse", dependencies=TRUE)
install.packages("Hmisc", dependencies=TRUE)

Loading packages

When loading tidyverse packages into an R session with library(), only the core packages are loaded.

# load packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Non-core tidyverse packages must be loaded individually.

# tidyverse package readxl is used to import Excel files
library(readxl)

# tidyverse package haven is used to import Stata, SAS, and SPSS files
library(haven)


Another set of tools comes with the Hmisc package.

# load Hmisc package for describe() function
library(Hmisc)

Review of data frames

Data frames

Data sets for statistical analysis are typically stored in data frames in R.


Generating data frames

Data import functions like read.csv() typically output a data frame.

# read in data over internet
dex1 <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/excercise_1.csv")

# class of dex1 object
class(dex1) 
## [1] "data.frame"

# first few rows
head(dex1)
##   id    sex    ses      school_prog read write math science social_st
## 1  1 female    low  public/vocation  -99    44   45      39        41
## 2  2 female middle   public/general   39    41   33      42        41
## 3  3   male    low private/academic   63    65   48      63        56
## 4  4 female    low  public/academic   44    50   41      39        51
## 5  5   male    low  public/academic   47    40   43      45        31
## 6  6 female    low  public/academic   47    41   46      40        41
##         honors awards
## 1 not enrolled      0
## 2     enrolled      1
## 3     enrolled      5
## 4 not enrolled      1
## 5 not enrolled      0
## 6 not enrolled      0

We can also create data frames manually using data.frame():

# create data frame manually
dt <- data.frame(
  x = runif(5), # 5 random numbers from uniform(0,1)
  y = rnorm(5)  # 5 random numbers from normal(0,1)
)

dt
##             x          y
## 1 0.847499404  1.2599723
## 2 0.555330436 -1.3685829
## 3 0.619306767  0.3407421
## 4 0.008869905 -1.1807196
## 5 0.350041412  0.4532787

Most of the tools discussed in this seminar work on data frames.

Tibbles

Tibbles are a special class of data frames used in tidyverse but function as regular data frames for most purposes.

If the input is a data frame, a few tidyverse functions will output a tibble, but most will output a regular data frame.

# dplyr function group_by() takes a data frame and outputs a grouped tibble,
#  which allows grouped operations 
dex1_by_ses <- group_by(dex1, ses)

# tbl_df refers to class tibble
class(dex1_by_ses)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

The tibble package documentation explains how tibbles differ from data frames.

Most users will not notice any functional differences between tibbles and regular data frames.

One noticeable difference is how they print to screen:

# prints only rows and columns that fit to screen
dex1_by_ses
## # A tibble: 200 × 11
## # Groups:   ses [4]
##       id sex    ses    school_prog     read write  math science social_st honors
##    <int> <chr>  <chr>  <chr>          <int> <int> <int>   <int>     <int> <chr> 
##  1     1 female low    public/vocati…   -99    44    45      39        41 not e…
##  2     2 female middle public/general    39    41    33      42        41 enrol…
##  3     3 male   low    private/acade…    63    65    48      63        56 enrol…
##  4     4 female low    public/academ…    44    50    41      39        51 not e…
##  5     5 male   low    public/academ…    47    40    43      45        31 not e…
##  6     6 female low    public/academ…    47    41    46      40        41 not e…
##  7     7 male   middle public/academ…    57    54    59      47        51 not e…
##  8     8 female low    public/academ…    39    44    52      44        48 not e…
##  9     9 male   middle public/vocati…    48    49    52      44        51 not e…
## 10    10 female middle public/general    47    54    49      53        61 not e…
## # ℹ 190 more rows
## # ℹ 1 more variable: awards <int>

To convert a tibble to a regular data frame, use as.data.frame().

# now just a data.frame
class(as.data.frame(dex1_by_ses))
## [1] "data.frame"

The function tibble() can be used to create a tibble manually.
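
For example, a tibble can be built much like the data frame above (a minimal sketch; it assumes the tidyverse is loaded):

# create a tibble manually
tb <- tibble(
  x = runif(5), # 5 random numbers from uniform(0,1)
  y = rnorm(5)  # 5 random numbers from normal(0,1)
)

# tibble classes
class(tb)
## [1] "tbl_df"     "tbl"        "data.frame"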

First workshop dataset

We begin our data management work on a dataset of 120 observations of 14 variables describing simulated university student data. Some of the variables include student sex, age, and class standing; transfer status; teaching assistant; and homework, midterm, and final exam scores.


We will be using the tidyverse (readr) function read_csv() to import the data, which outputs a tibble:

# First workshop dataset
dat <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/student_data.csv")
dat
## # A tibble: 120 × 14
##    university universityid student sex     age class transfer teaching_assistant
##    <chr>             <dbl>   <dbl> <chr> <dbl> <chr> <chr>    <chr>             
##  1 UCLA                  1       1 male   35.0 soph… no       T1                
##  2 UCLA                  1       2 fema…  23.9 soph… no       T1                
##  3 UCLA                  1       3 male   21.4 soph… no       T1                
##  4 UCLA                  1       4 male   26.8 seni… no       T1                
##  5 UCLA                  1       5 male   21.9 seni… no       T1                
##  6 UCLA                  1       6 fema…  23.8 juni… no       T1                
##  7 UCLA                  1       7 male   24.4 soph… yes      T1                
##  8 UCLA                  1       8 fema…  27.1 juni… no       T3                
##  9 UCLA                  1       9 fema…  29.3 seni… yes      T3                
## 10 UCLA                  1      10 fema…  27.1 juni… no       T3                
## # ℹ 110 more rows
## # ℹ 6 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## #   final <dbl>, total_points <dbl>, total_proportion <dbl>

View() data as a spreadsheet

View() opens a spreadsheet-style view of a dataset.

In RStudio, clicking on a dataset in the Environment pane will View() it.

View(dat)

Subsetting data frames

We can select particular rows and columns of data frame my_data with the syntax my_data[rows, columns], where rows and columns can be single numbers, vectors of numbers, character vectors of column names, or logical vectors (leaving either empty selects all rows or all columns):

# row 3 column 2
dat[3,2]
## # A tibble: 1 × 1
##   universityid
##          <dbl>
## 1            1

# rows 2 and 4 of column "age"
dat[c(2,4), "age"]
## # A tibble: 2 × 1
##     age
##   <dbl>
## 1  23.9
## 2  26.8

# all columns for row 10
dat[10,]
## # A tibble: 1 × 14
##   university universityid student sex      age class transfer teaching_assistant
##   <chr>             <dbl>   <dbl> <chr>  <dbl> <chr> <chr>    <chr>             
## 1 UCLA                  1      10 female  27.1 juni… no       T3                
## # ℹ 6 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## #   final <dbl>, total_points <dbl>, total_proportion <dbl>

# all rows of column "sex"
dat[, "sex"]
## # A tibble: 120 × 1
##    sex   
##    <chr> 
##  1 male  
##  2 female
##  3 male  
##  4 male  
##  5 male  
##  6 female
##  7 male  
##  8 female
##  9 female
## 10 female
## # ℹ 110 more rows

Use $ to extract a column vector

To extract a column as a vector use the following syntax:

The syntax my_data$colname extracts the column with name colname as a vector. We recommend this syntax for extracting columns.

The syntax my_data[["colname"]] and my_data[[colnum]] can also be used to extract column vectors by name or by number, respectively.

# subsetting with $ creates a numeric vector, which itself can be subset
dat$age[2:3]
## [1] 23.92 21.37

# same as above
dat[["age"]][2:3]
## [1] 23.92 21.37

Logical subsetting

We can also use logical expressions to subset. Logical comparisons such as x > 0 return TRUE/FALSE vectors.

We can place such an expression inside of [] and values with TRUE will be selected.

# logical comparison results in a logical vector 
dat$age > 25
##   [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
##  [13]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
##  [25]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
##  [37] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
##  [61]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
##  [73] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
##  [97] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

# when placed in [] before the comma, rows with TRUE are selected
#  this returns columns age and sex for observations where age > 25
dat[dat$age > 25, c("age", "sex")]
## # A tibble: 41 × 2
##      age sex   
##    <dbl> <chr> 
##  1  35.0 male  
##  2  26.8 male  
##  3  27.1 female
##  4  29.3 female
##  5  27.1 female
##  6  26.2 male  
##  7  25.8 male  
##  8  29.8 female
##  9  25.5 female
## 10  25.2 female
## # ℹ 31 more rows

Review of functions

Functions

Functions perform most of the work on data in R.

Functions in R (usually) take some input, perform an operation, and return some output. For example, the mean() function takes a vector of numbers and returns its mean, as in the example below.
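
A quick example:

# mean() takes a numeric vector and returns its mean
mean(c(2, 4, 6, 8))
## [1] 5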

Using library() loads a package’s functions into the R session.

Help files for functions

In the help file for a function (opened with ?function_name), we will find a description of what the function does, its arguments and their default values, the value it returns, and usage examples.
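
For example, either of these commands opens the help file for seq():

# open the help file for seq()
?seq

# equivalent
help(seq)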

Specifying arguments by name and position

Input arguments to functions can be specified by either name or position.

# specifying arguments by name
seq(from=1, to=5, by=1)
## [1] 1 2 3 4 5

# specifying arguments by position
seq(10, 0, -2)
## [1] 10  8  6  4  2  0

Data cleaning and missing values

See how variables are stored with glimpse() or str()

After importing your data, a good first step is to understand how the variables are stored.

For statistical analysis, categorical variables are typically stored as character or factor, while continuous variables are stored as numeric (double or integer).

The tidyverse function glimpse() displays the number of rows and columns and each variable's name, type, and first several values:

# tidyverse function
glimpse(dat)
## Rows: 120
## Columns: 14
## $ university         <chr> "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "UC…
## $ universityid       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ student            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ sex                <chr> "male", "female", "male", "male", "male", "female",…
## $ age                <dbl> 34.97, 23.92, 21.37, 26.80, 21.93, 23.83, 24.39, 27…
## $ class              <chr> "sophomore", "sophomore", "sophomore", "senior", "s…
## $ transfer           <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "y…
## $ teaching_assistant <chr> "T1", "T1", "T1", "T1", "T1", "T1", "T1", "T3", "T3…
## $ homework1          <dbl> 4, 3, 5, 4, 1, 2, 8, 1, 5, 1, 10, -99, 3, 5, 5, 3, …
## $ homework2          <dbl> 8, 1, 2, 1, 2, 9, 7, 3, 5, 3, 6, 2, 10, 5, 3, 6, 4,…
## $ midterm            <dbl> 60, 57, 78, 45, 44, 71, 55, 73, 65, 73, 70, 49, 47,…
## $ final              <dbl> 68, 67, 96, 40, 33, 98, 52, 80, 68, 87, 42, 45, 56,…
## $ total_points       <dbl> 140, 128, 176, 90, 80, 180, 122, 157, 138, 164, 58,…
## $ total_proportion   <dbl> 0.64, 0.58, 0.80, 0.41, 0.36, 0.82, 0.55, 0.71, 0.6…

The base R function str() has similar usage when used on a data frame, but prints less data.

# base R function
str(dat)
## spc_tbl_ [120 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ university        : chr [1:120] "UCLA" "UCLA" "UCLA" "UCLA" ...
##  $ universityid      : num [1:120] 1 1 1 1 1 1 1 1 1 1 ...
##  $ student           : num [1:120] 1 2 3 4 5 6 7 8 9 10 ...
##  $ sex               : chr [1:120] "male" "female" "male" "male" ...
##  $ age               : num [1:120] 35 23.9 21.4 26.8 21.9 ...
##  $ class             : chr [1:120] "sophomore" "sophomore" "sophomore" "senior" ...
##  $ transfer          : chr [1:120] "no" "no" "no" "no" ...
##  $ teaching_assistant: chr [1:120] "T1" "T1" "T1" "T1" ...
##  $ homework1         : num [1:120] 4 3 5 4 1 2 8 1 5 1 ...
##  $ homework2         : num [1:120] 8 1 2 1 2 9 7 3 5 3 ...
##  $ midterm           : num [1:120] 60 57 78 45 44 71 55 73 65 73 ...
##  $ final             : num [1:120] 68 67 96 40 33 98 52 80 68 87 ...
##  $ total_points      : num [1:120] 140 128 176 90 80 180 122 157 138 164 ...
##  $ total_proportion  : num [1:120] 0.64 0.58 0.8 0.41 0.36 0.82 0.55 0.71 0.63 0.75 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   university = col_character(),
##   ..   universityid = col_double(),
##   ..   student = col_double(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   class = col_character(),
##   ..   transfer = col_character(),
##   ..   teaching_assistant = col_character(),
##   ..   homework1 = col_double(),
##   ..   homework2 = col_double(),
##   ..   midterm = col_double(),
##   ..   final = col_double(),
##   ..   total_points = col_double(),
##   ..   total_proportion = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Quick summaries of variables with describe()

We highly recommend familiarizing yourself with a quick summary of the distribution of each of your variables, both to catch data errors and suspicious values early and to get to know your data.

The describe() function from package Hmisc will give summary statistics for all variables in a data frame.

The information provided depends on the variable’s type and number of distinct values:

The results of describe() can be used to assess each variable’s distribution and to identify suspicious values:

# load Hmisc
library(Hmisc)

# detailed summaries of variables
describe(dat)
## dat 
## 
##  14  Variables      120  Observations
## --------------------------------------------------------------------------------
## university 
##        n  missing distinct 
##      120        0        2 
##                       
## Value       UCLA  UCSF
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## universityid 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        2    0.749    1.483   0.5036 
##                       
## Value          1     2
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## student 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      120        1     60.5    40.33     6.95    12.90 
##      .25      .50      .75      .90      .95 
##    30.75    60.50    90.25   108.10   114.05 
## 
## lowest :   1   2   3   4   5, highest: 116 117 118 119 120
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      120        0        3 
##                                
## Value        12.2 female   male
## Frequency       1     74     45
## Proportion  0.008  0.617  0.375
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      113        1    26.27    9.051    18.96    19.39 
##      .25      .50      .75      .90      .95 
##    21.21    23.16    26.10    28.61    29.65 
## 
## lowest : 18.01  18.19  18.33  18.45  18.9  , highest: 33.93  34.16  34.97  35.8   327.89
## --------------------------------------------------------------------------------
## class 
##        n  missing distinct 
##      120        0        4 
##                                                               
## Value            junior not assessed       senior    sophomore
## Frequency            27            5           66           22
## Proportion        0.225        0.042        0.550        0.183
## --------------------------------------------------------------------------------
## transfer 
##        n  missing distinct 
##      120        0        3 
##                             
## Value        -99    no   yes
## Frequency      6    97    17
## Proportion 0.050 0.808 0.142
## --------------------------------------------------------------------------------
## teaching_assistant 
##        n  missing distinct 
##      120        0        5 
##                                         
## Value         T1    T2    T3    T4    T5
## Frequency      7     4    45    28    36
## Proportion 0.058 0.033 0.375 0.233 0.300
## --------------------------------------------------------------------------------
## homework1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       11    0.984  -0.1833    10.96        1        1 
##      .25      .50      .75      .90      .95 
##        2        3        6        8       10 
##                                                                          
## Value      -99.00   0.19   1.28   2.37   3.46   4.55   5.64   6.73   7.82
## Frequency       5     20     17     21     12     13     11      6      7
## Proportion  0.042  0.167  0.142  0.175  0.100  0.108  0.092  0.050  0.058
##                         
## Value        8.91  10.00
## Frequency       1      7
## Proportion  0.008  0.058
## 
## For the frequency table, variable is rounded to the nearest 1.09
## --------------------------------------------------------------------------------
## homework2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       11    0.987  -0.5417     12.7       -4        1 
##      .25      .50      .75      .90      .95 
##        2        4        6        9       10 
##                                                                          
## Value      -99.00   0.19   1.28   2.37   3.46   4.55   5.64   6.73   7.82
## Frequency       6     15     12     19     16     15     10      6      6
## Proportion  0.050  0.125  0.100  0.158  0.133  0.125  0.083  0.050  0.050
##                         
## Value        8.91  10.00
## Frequency       6      9
## Proportion  0.050  0.075
## 
## For the frequency table, variable is rounded to the nearest 1.09
## --------------------------------------------------------------------------------
## midterm 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       42    0.999    53.61    26.45    34.00    45.00 
##      .25      .50      .75      .90      .95 
##    52.75    60.00    68.00    74.10    78.30 
## 
## lowest : -99  41  42  43  44, highest:  78  84  87  90  91
## --------------------------------------------------------------------------------
## final 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       56    0.999    58.29    30.84    19.95    37.50 
##      .25      .50      .75      .90      .95 
##    53.00    67.00    76.25    83.10    93.05 
## 
## lowest : -99  19  20  22  28, highest:  93  94  96  98 100
## --------------------------------------------------------------------------------
## total_points 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       69    0.999      128    37.18    54.85    80.00 
##      .25      .50      .75      .90      .95 
##   116.75   134.50   148.00   164.20   176.00 
## 
## lowest :   7  24  31  35  40, highest: 177 179 180 193 202
## --------------------------------------------------------------------------------
## total_proportion 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       49    0.998   0.5816   0.1689   0.2495   0.3600 
##      .25      .50      .75      .90      .95 
##   0.5300   0.6100   0.6700   0.7500   0.8000 
## 
## lowest : 0.03 0.11 0.14 0.16 0.18, highest: 0.8  0.81 0.82 0.88 0.92
## --------------------------------------------------------------------------------

Suspicious values

Here are some variables with values that we are not sure are correct.

Can you spot the suspicious values?

# inspect sex, age and homework1 variables
describe(dat[,c( "sex", "age", "homework1")])
## dat[, c("sex", "age", "homework1")] 
## 
##  3  Variables      120  Observations
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      120        0        3 
##                                
## Value        12.2 female   male
## Frequency       1     74     45
## Proportion  0.008  0.617  0.375
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      113        1    26.27    9.051    18.96    19.39 
##      .25      .50      .75      .90      .95 
##    21.21    23.16    26.10    28.61    29.65 
## 
## lowest : 18.01  18.19  18.33  18.45  18.9  , highest: 33.93  34.16  34.97  35.8   327.89
## --------------------------------------------------------------------------------
## homework1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       11    0.984  -0.1833    10.96        1        1 
##      .25      .50      .75      .90      .95 
##        2        3        6        8       10 
##                                                                          
## Value      -99.00   0.19   1.28   2.37   3.46   4.55   5.64   6.73   7.82
## Frequency       5     20     17     21     12     13     11      6      7
## Proportion  0.042  0.167  0.142  0.175  0.100  0.108  0.092  0.050  0.058
##                         
## Value        8.91  10.00
## Frequency       1      7
## Proportion  0.008  0.058
## 
## For the frequency table, variable is rounded to the nearest 1.09
## --------------------------------------------------------------------------------

Plot of describe()

The results of describe() can be plotted:


These plots are also useful to inspect distributions of variables and to look for suspicious values.

In the dot plots, look for suspicious labels for categorical variables (see variables sex, class and transfer below).

# describe plot for categorical variables
plot(describe(dat), which = "categorical")

Plot of describe()

In the histogram plots, each histogram has its own scale, so lone spikes and bunched-up histograms often denote extreme values (see variables age, homework1, homework2, midterm, final).

# describe plot for continuous variables
plot(describe(dat), which = "continuous")

Visualizing one variable

Individual histograms with hist() allow closer inspection of actual values.

# histogram of age with a suspicious extreme value
hist(dat$age, breaks = 50)


# histogram of homework1 with a suspicious extreme value
hist(dat$homework1, breaks = 50)

Converting suspicious values to missing

A look at the output of our dataset summaries, as well as the plots of describe(), identified several suspicious values: an age of 327.89, a sex value of "12.2", the code -99 in transfer, homework1, homework2, midterm, and final, and the label "not assessed" in class.

Because we don’t know any replacement values, we will change suspicious values to NA, the missing value indicator in R.

Note: Datasets often differentiate between different types of missingness, for example -99 = "absent" and -98 = "refused to answer". For analysis purposes, using NA for both will usually suffice. Users who need to distinguish between different kinds of missing values may want to look into the as.item() function from the memisc package.

Converting suspicious values to missing

We can use logical subsetting to assign NA to the data entry errors:

# change all impossible age values to NA 
# assume all adults in this dataset
dat$age[dat$age<18 | dat$age>120] <- NA

# remember to use quotes for character variables
dat$sex[dat$sex == "12.2"] <- NA

Now we want to translate all -99, -98, and “not assessed” values to NA. The following syntax can be used to translate all values of a to b in dataset df: df[df==a] <- b

# change all of the following values in the data set to NA 
dat[dat == -99] <- NA
dat[dat == -98] <- NA
dat[dat == "not assessed"] <- NA
dat[dat == "-99"] <- NA

describe() plot after cleaning data

The describe() plot for categorical variables looks better now – no suspicious labels.

Now that we have missing NA values, the dots are colored by the number of missing values for that variable.

# describe plot for categorical variables after cleaning
# now colored by missing values
plot(describe(dat), which = "categorical")

describe() plot after cleaning data

The describe() plot for continuous variables also looks better – the histograms are nicely spread out, with no lone spikes.

# describe plot for continuous variables after cleaning
# scale_color_gradient() changes the colors used to denote missing
plot(describe(dat), which = "continuous") + scale_color_gradient(low="blue", high="red")

Inspect your data after cleaning

We don’t see any obvious suspicious values.

# inspect variables after cleaning
describe(dat)
## dat 
## 
##  14  Variables      120  Observations
## --------------------------------------------------------------------------------
## university 
##        n  missing distinct 
##      120        0        2 
##                       
## Value       UCLA  UCSF
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## universityid 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        2    0.749    1.483   0.5036 
##                       
## Value          1     2
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## student 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      120        1     60.5    40.33     6.95    12.90 
##      .25      .50      .75      .90      .95 
##    30.75    60.50    90.25   108.10   114.05 
## 
## lowest :   1   2   3   4   5, highest: 116 117 118 119 120
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      119        1        2 
##                         
## Value      female   male
## Frequency      74     45
## Proportion  0.622  0.378
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      119        1      112        1    23.74    4.049    18.95    19.38 
##      .25      .50      .75      .90      .95 
##    21.19    23.16    26.01    28.46    29.50 
## 
## lowest : 18.01 18.19 18.33 18.45 18.9 , highest: 29.84 33.93 34.16 34.97 35.8 
## --------------------------------------------------------------------------------
## class 
##        n  missing distinct 
##      115        5        3 
##                                         
## Value         junior    senior sophomore
## Frequency         27        66        22
## Proportion     0.235     0.574     0.191
## --------------------------------------------------------------------------------
## transfer 
##        n  missing distinct 
##      114        6        2 
##                       
## Value         no   yes
## Frequency     97    17
## Proportion 0.851 0.149
## --------------------------------------------------------------------------------
## teaching_assistant 
##        n  missing distinct 
##      120        0        5 
##                                         
## Value         T1    T2    T3    T4    T5
## Frequency      7     4    45    28    36
## Proportion 0.058 0.033 0.375 0.233 0.300
## --------------------------------------------------------------------------------
## homework1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      115        5       10    0.981    4.113    2.896        1        1 
##      .25      .50      .75      .90      .95 
##        2        3        6        8       10 
##                                                                       
## Value       1.00  1.99  2.98  3.97  4.96  5.95  6.94  7.93  8.92 10.00
## Frequency     20    17    21    12    13    11     6     7     1     7
## Proportion 0.174 0.148 0.183 0.104 0.113 0.096 0.052 0.061 0.009 0.061
## 
## For the frequency table, variable is rounded to the nearest 0.09
## --------------------------------------------------------------------------------
## homework2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      114        6       10    0.985     4.64    3.069        1        1 
##      .25      .50      .75      .90      .95 
##        3        4        6        9       10 
##                                                                       
## Value       1.00  1.99  2.98  3.97  4.96  5.95  6.94  7.93  8.92 10.00
## Frequency     15    12    19    16    15    10     6     6     6     9
## Proportion 0.132 0.105 0.167 0.140 0.132 0.088 0.053 0.053 0.053 0.079
## 
## For the frequency table, variable is rounded to the nearest 0.09
## --------------------------------------------------------------------------------
## midterm 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      114        6       41    0.998    61.64    12.26    45.00    48.00 
##      .25      .50      .75      .90      .95 
##    53.25    61.50    68.75    74.70    80.10 
## 
## lowest : 41 42 43 44 45, highest: 78 84 87 90 91
## --------------------------------------------------------------------------------
## final 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      115        5       55    0.999    65.13     19.2     33.0     39.4 
##      .25      .50      .75      .90      .95 
##     55.5     67.0     77.0     83.6     93.3 
## 
## lowest :  19  20  22  28  31, highest:  93  94  96  98 100
## --------------------------------------------------------------------------------
## total_points 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       69    0.999      128    37.18    54.85    80.00 
##      .25      .50      .75      .90      .95 
##   116.75   134.50   148.00   164.20   176.00 
## 
## lowest :   7  24  31  35  40, highest: 177 179 180 193 202
## --------------------------------------------------------------------------------
## total_proportion 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       49    0.998   0.5816   0.1689   0.2495   0.3600 
##      .25      .50      .75      .90      .95 
##   0.5300   0.6100   0.6700   0.7500   0.8000 
## 
## lowest : 0.03 0.11 0.14 0.16 0.18, highest: 0.8  0.81 0.82 0.88 0.92
## --------------------------------------------------------------------------------

Visualize your data after cleaning


# histogram of "age" variable after cleaning
hist(dat$age, breaks = 50)


# histogram of "age" variable after cleaning
hist(dat$homework1, breaks = 50)

In-class exercise 1

We will be using data that was collected on 200 high school students.

# Import a dataset for exercise 1
dex1 <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/excercise_1.csv")

head(dex1, n=3)
##   id    sex    ses      school_prog read write math science social_st
## 1  1 female    low  public/vocation  -99    44   45      39        41
## 2  2 female middle   public/general   39    41   33      42        41
## 3  3   male    low private/academic   63    65   48      63        56
##         honors awards
## 1 not enrolled      0
## 2     enrolled      1
## 3     enrolled      5

The dataset contains the following information: a student id; sex; socioeconomic status (ses); school program (school_prog); test scores in reading, writing, math, science, and social studies; honors enrollment (honors); and number of awards (awards).


1.1 Start data cleaning by investigating your dataset and identifying suspicious values.

1.2 If some string values look obviously misspelled, correct the spelling. (Note: the function unique() can be helpful here; it returns a vector of the unique values of a variable.) Otherwise, change all of the suspicious values to NA.

1.3 Plot your data after cleaning.

1.1 Suspicious categorical values

# Exercise 1. 1.1 Identifying suspicious categorical values
plot(describe(dex1), which = "categorical")

Variables sex and ses have suspicious values.

1.1 Suspicious continuous variables

# Exercise 1. 1.1 Identifying suspicious continuous values
plot(describe(dex1), which = "continuous")

Variables read and write have suspicious values. Check what those suspicious values are.

# Exercise 1. 1.1 Checking suspicious values
describe(dex1[,c( "read", "write")])
## dex1[, c("read", "write")] 
## 
##  2  Variables      200  Observations
## --------------------------------------------------------------------------------
## read 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      200        0       31    0.995    50.15    15.78    34.95    39.00 
##      .25      .50      .75      .90      .95 
##    44.00    50.00    60.00    66.20    68.00 
## 
## lowest : -99  28  31  34  35, highest:  66  68  71  73  76
## --------------------------------------------------------------------------------
## write 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      200        0       30    0.995    51.39     13.5    34.90    39.00 
##      .25      .50      .75      .90      .95 
##    45.75    54.00    60.00    65.00    65.00 
## 
## lowest : -99  31  33  35  36, highest:  61  62  63  65  67
## --------------------------------------------------------------------------------

1.2 Change suspicious values

Some string values look obviously misspelled.

# Exercise 1. 1.2 Changing suspicious values

# there is "ffemale value and "female " (an extra space)
unique(dex1$sex)
## [1] "female"  "male"    "ffemale" "female "

# correcting misspelling
dex1$sex[dex1$sex=="ffemale"] <- "female"
dex1$sex[dex1$sex=="female "] <- "female"

# changing ses "not assessed" to NA
dex1$ses[dex1$ses=="not assessed"] <- NA

# changing -99 values (in read and write) to NA
dex1[dex1==-99] <- NA

1.3 Plot your data after cleaning

# Exercise 1. 1.3 Plotting after cleaning
plot(describe(dex1))
## $Categorical

## 
## $Continuous

Complete cases

complete.cases(x) returns a logical vector of whether each row of x has no missing (TRUE) or at least one NA (FALSE).

# which cases are complete
complete.cases(dat)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [49] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
##  [61]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [73]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
##  [97] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [109] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

We can use logical subsetting to extract a data set of complete cases.

# create a dataset of complete cases
d <- dat[complete.cases(dat),]

Since TRUE=1 and FALSE=0, a sum of this vector is a count of complete cases.

# number of complete cases
sum(complete.cases(dat))
## [1] 98

Piping

Piping with pipe operator |>

A data management task may involve many steps to reach the final result. Piping makes your code more readable and compact by focusing on the functions used rather than the names of intermediate datasets.

The pipe operator “pipes” the object (e.g. a data frame) on the left of the |> operator to the function on the right of the operator.

The code x |> f(y) translates to f(x,y), that is, x is treated by default as the first argument of f(), while y is another argument of f().

The object returned by the function can then be piped into another function. Thus x |> f(y) |> g(z) translates to g(f(x,y), z).
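
As a quick illustration of this translation with base R functions (a small sketch, not using the workshop data):

# these two expressions are equivalent
sqrt(sum(c(1, 4, 9, 16)))
## [1] 5.477226

c(1, 4, 9, 16) |> sum() |> sqrt()
## [1] 5.477226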

Examples of using the pipe operator

As a first example, perhaps we want to create a data set of just females under 25, with only the age and transfer variables selected. We could do this in 2 steps, like so:

# a dataset of age and transfer status for females under 25
f25 <- filter(d, sex == "female" & age<25)
f25_small <- select(f25, age, transfer)
f25_small
## # A tibble: 41 × 2
##      age transfer
##    <dbl> <chr>   
##  1  23.9 no      
##  2  23.8 no      
##  3  24.5 no      
##  4  18.4 no      
##  5  22.7 no      
##  6  18.4 no      
##  7  23.2 yes     
##  8  22.7 no      
##  9  24.4 no      
## 10  21.0 no      
## # ℹ 31 more rows

While that works fine, the intermediate dataset f25 is not of interest and is cluttering up memory and the workspace unnecessarily.

We could use |> instead:

# start with d, then filter rows, then select variables
f25_small <- d |>   
  filter(sex == "female" & age<25) |>
  select(age, transfer)
f25_small
## # A tibble: 41 × 2
##      age transfer
##    <dbl> <chr>   
##  1  23.9 no      
##  2  23.8 no      
##  3  24.5 no      
##  4  18.4 no      
##  5  22.7 no      
##  6  18.4 no      
##  7  23.2 yes     
##  8  22.7 no      
##  9  24.4 no      
## 10  21.0 no      
## # ℹ 31 more rows

No intermediate dataset was created, and the dataset transformations (filtering and selecting) are easy to see.

Another pipe example

Piping really helps make a long chain of commands more immediately understandable.

# create a dataset of mean scores of midterm and final
# by teaching assistant, for teaching assistants with more than 15 students

g1 <- d |>
  summarise(n_students = n(),
            avg_midterm = mean(midterm),
            avg_final = mean(final),
            .by = teaching_assistant) |>
  filter(n_students > 15)
g1
## # A tibble: 3 × 4
##   teaching_assistant n_students avg_midterm avg_final
##   <chr>                   <int>       <dbl>     <dbl>
## 1 T3                         37        61.3      67.1
## 2 T5                         28        58.4      64.3
## 3 T4                         22        63.8      70.5

Using _ in piped commands

If the dataset is not to be specified as the first argument of the function on the right of the |> operator, use _ to specify where the dataset should appear as the argument.

x |> f(y, arg = _) translates to f(y, arg = x); note that _ must be supplied to a named argument.

For example, for modeling functions such as lm(), the model formula is typically the first argument rather than the dataset, but we can use _ to stand in for the dataset in the piping chain:

# in base R piping, "_" represents the input data set
d |>
  filter(age<30) |>
  lm(total_proportion ~ age, data = _)
## 
## Call:
## lm(formula = total_proportion ~ age, data = filter(d, age < 30))
## 
## Coefficients:
## (Intercept)          age  
##   0.6276767   -0.0006733

Piping with tidyverse operator %>%

The predecessor of the base R pipe operator |> is the magrittr package's pipe operator %>% (loaded automatically with library(tidyverse)).

We will use the |> operator throughout this workshop, but everything shown here also works with %>%.

The curators of R and tidyverse recommend using the new pipe operator |> for two reasons:

The pipe operator can be typed with CMD-SHIFT-M on a Mac (CTRL-SHIFT-M on a PC). By default, this inserts the %>% operator. In RStudio, you can change this by going to Tools -> Global Options -> Code and selecting the option "Use native pipe operator, |> (requires R 4.1+)".

Read more

# NOTE: with %>%, "." represents the input data set instead of "_"
d %>%
  filter(age<30) %>%
  lm(total_proportion ~ age, data=.) # the . is the filtered dataset
## 
## Call:
## lm(formula = total_proportion ~ age, data = .)
## 
## Coefficients:
## (Intercept)          age  
##   0.6276767   -0.0006733

Data management with dplyr

The dplyr package

The dplyr package provides a set of functions for the most common data manipulation tasks, such as selecting, renaming, and reordering columns, filtering and sorting rows, creating new variables, and summarizing data.


Conveniently, when using dplyr functions, the first argument is the input data frame, columns can be referred to by name without quotes or $, and the result is a new data frame (or tibble).

Column manipulation with dplyr

First, we explore dplyr functions that work with columns:

Keep only the variables you need with select()

# select 4 variables
d_small <- d |>
  select(universityid, student, age, class)
head(d_small, n = 3)
## # A tibble: 3 × 4
##   universityid student   age class    
##          <dbl>   <dbl> <dbl> <chr>    
## 1            1       1  35.0 sophomore
## 2            1       2  23.9 sophomore
## 3            1       3  21.4 sophomore

Often we want to select a group of related variables with similar names. dplyr supplies helper functions to find columns whose names match a specified pattern:

# get university variables
duniversity <- d |>
  select(starts_with("university"))
names(duniversity)
## [1] "university"   "universityid"

# get "homework1" and "homework2" variables using num_range()
homeworks <- d |>
  select(num_range("homework", 1:2))
names(homeworks)
## [1] "homework1" "homework2"

Rename columns with rename() and rename_with()

Use dplyr function rename() to change column names.

List the new variable name on the left hand side and the old name on the right hand side.

# original names
names(d)
##  [1] "university"         "universityid"       "student"           
##  [4] "sex"                "age"                "class"             
##  [7] "transfer"           "teaching_assistant" "homework1"         
## [10] "homework2"          "midterm"            "final"             
## [13] "total_points"       "total_proportion"

# rename "teaching assistant" variable as "TA"
ta <- d |>
  rename(TA = teaching_assistant) 
names(ta)
##  [1] "university"       "universityid"     "student"          "sex"             
##  [5] "age"              "class"            "transfer"         "TA"              
##  [9] "homework1"        "homework2"        "midterm"          "final"           
## [13] "total_points"     "total_proportion"

Sometimes we want to rename many columns using a function that will perform some operation on those names.

We can specify a function inside of dplyr’s rename_with() to change column names using that function’s operations.

For rename_with(), the first argument is the data frame, the second is the function to apply to the column names, and an optional third argument selects which columns to rename (all columns by default).

For example, we may want to capitalize all of the column names using function toupper().

# capitalize all column names using toupper() inside of rename_with()
cap <- d |>
  rename_with(toupper)
names(cap)
##  [1] "UNIVERSITY"         "UNIVERSITYID"       "STUDENT"           
##  [4] "SEX"                "AGE"                "CLASS"             
##  [7] "TRANSFER"           "TEACHING_ASSISTANT" "HOMEWORK1"         
## [10] "HOMEWORK2"          "MIDTERM"            "FINAL"             
## [13] "TOTAL_POINTS"       "TOTAL_PROPORTION"

Or, we may want to add a prefix to all column names using paste0(), a string concatenation function. Because we specify additional arguments to paste0(), we precede the call with ~ (formula shorthand for an anonymous function) inside of rename_with().

# add a prefix to each column
pre <- d |>
  rename_with(~paste0("d1_", .x)) # .x is a placeholder for each column name
names(pre)
##  [1] "d1_university"         "d1_universityid"       "d1_student"           
##  [4] "d1_sex"                "d1_age"                "d1_class"             
##  [7] "d1_transfer"           "d1_teaching_assistant" "d1_homework1"         
## [10] "d1_homework2"          "d1_midterm"            "d1_final"             
## [13] "d1_total_points"       "d1_total_proportion"

We can use dplyr’s column selector functions in the third argument to rename_with() to rename only a subset of columns.

# add a suffix to only columns that start with "university"
p_sub <- d |>
  rename_with(~paste0(.x, "_2023"), starts_with("university") )
names(p_sub)
##  [1] "university_2023"    "universityid_2023"  "student"           
##  [4] "sex"                "age"                "class"             
##  [7] "transfer"           "teaching_assistant" "homework1"         
## [10] "homework2"          "midterm"            "final"             
## [13] "total_points"       "total_proportion"

Reorder columns with relocate()

Specifying just column names inside relocate() will move them to the first columns of the data frame:

# make column "total_points" and "total_proportion" to first columns in the dataset
first <- d |>
  relocate(total_points, total_proportion)
names(first)
##  [1] "total_points"       "total_proportion"   "university"        
##  [4] "universityid"       "student"            "sex"               
##  [7] "age"                "class"              "transfer"          
## [10] "teaching_assistant" "homework1"          "homework2"         
## [13] "midterm"            "final"

# same as above
first <- d |>
  relocate(starts_with("total"))
names(first)
##  [1] "total_points"       "total_proportion"   "university"        
##  [4] "universityid"       "student"            "sex"               
##  [7] "age"                "class"              "transfer"          
## [10] "teaching_assistant" "homework1"          "homework2"         
## [13] "midterm"            "final"

Columns can be renamed as they are moved, using the same syntax as rename().

# make homework1 first column and rename it to h1
h1 <- d |>
  relocate(h1 = homework1)
names(h1)
##  [1] "h1"                 "university"         "universityid"      
##  [4] "student"            "sex"                "age"               
##  [7] "class"              "transfer"           "teaching_assistant"
## [10] "homework2"          "midterm"            "final"             
## [13] "total_points"       "total_proportion"

Use the arguments .before and .after to indicate specifically where the column should be relocated.

# relocate homework1 after homework2
h2 <- d |>
  relocate(homework1, .after=homework2)
names(h2)
##  [1] "university"         "universityid"       "student"           
##  [4] "sex"                "age"                "class"             
##  [7] "transfer"           "teaching_assistant" "homework2"         
## [10] "homework1"          "midterm"            "final"             
## [13] "total_points"       "total_proportion"

# relocate homework1 before last column
hl <- d |>
  relocate(homework1, .before=last_col())
names(hl)
##  [1] "university"         "universityid"       "student"           
##  [4] "sex"                "age"                "class"             
##  [7] "transfer"           "teaching_assistant" "homework2"         
## [10] "midterm"            "final"              "total_points"      
## [13] "homework1"          "total_proportion"

Create, modify and delete columns with mutate()

The function mutate() allows us to add new variables, generally transformations of existing variables, all in one step without having to re-specify the data frame name over and over.

Useful R functions for transforming variables include cut() (which bins a continuous variable into intervals), arithmetic operators, and logical comparisons.

# create age category variable, and highprop binary variable
d <- d |>
  mutate(agecat = cut(age, breaks=c(18,20,24,28,32,36)),
         highprop = total_proportion > mean(total_proportion))
table(d$agecat, d$highprop)
##          
##           FALSE TRUE
##   (18,20]    10    4
##   (20,24]    20   24
##   (24,28]    16   11
##   (28,32]     3    6
##   (32,36]     0    4

Row manipulation with dplyr

Now we explore dplyr functions that operate on rows of data frames:

Drop rows with filter()

We have already seen how to subset using logical vectors and logical subsetting:

# all variables for students with total points 180 or more
total_points180 <- d[d$total_points>=180,]
total_points180
## # A tibble: 2 × 16
##   university universityid student sex      age class transfer teaching_assistant
##   <chr>             <dbl>   <dbl> <chr>  <dbl> <chr> <chr>    <chr>             
## 1 UCLA                  1       6 female  23.8 juni… no       T1                
## 2 UCSF                  2     105 female  21.5 juni… no       T3                
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## #   final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## #   highprop <lgl>

Complex conditions can require lots of code, especially with long data set names.

The dplyr function filter() provides a cleaner syntax for subsetting rows of data frames. Conditions separated by , are joined by & (logical AND).

# subset to UCLA students with total_points >= 180
dfhp <- d |>
  filter(university=="UCLA", total_points>=180)
dfhp
## # A tibble: 1 × 16
##   university universityid student sex      age class transfer teaching_assistant
##   <chr>             <dbl>   <dbl> <chr>  <dbl> <chr> <chr>    <chr>             
## 1 UCLA                  1       6 female  23.8 juni… no       T1                
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## #   final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## #   highprop <lgl>

R Logical operators and functions

Here are some operators and functions to help with selection: == (equal), != (not equal), <, <=, >, >= (comparisons), & (and), | (or), ! (not), %in% (matches any value of a set), is.na() (is missing), and between() (falls within a range).

# select those not in UCLA, who are either younger than 25 or older than 30
d |>
  filter(university!="UCLA", age<25 | age>30)
## # A tibble: 36 × 16
##    university universityid student sex     age class transfer teaching_assistant
##    <chr>             <dbl>   <dbl> <chr> <dbl> <chr> <chr>    <chr>             
##  1 UCSF                  2      64 male   33.9 seni… no       T3                
##  2 UCSF                  2      65 male   21.6 soph… yes      T3                
##  3 UCSF                  2      66 fema…  23.6 soph… no       T4                
##  4 UCSF                  2      71 male   21.5 seni… no       T4                
##  5 UCSF                  2      72 male   22.6 seni… no       T4                
##  6 UCSF                  2      73 fema…  24.9 juni… no       T4                
##  7 UCSF                  2      79 male   23.1 seni… yes      T4                
##  8 UCSF                  2      80 fema…  22.1 seni… no       T4                
##  9 UCSF                  2      82 fema…  22.6 juni… no       T5                
## 10 UCSF                  2      83 male   19.5 soph… yes      T5                
## # ℹ 26 more rows
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## #   final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## #   highprop <lgl>

Sort the order of rows by variable values with arrange()

By default, ascending order will be used. Surround a sorting variable with desc() to sort by descending order instead.

# sort by sex where females are before males, then by age, oldest first
d_arrange <- d |>
  arrange(sex, desc(age)) 
head(d_arrange, n=20)
## # A tibble: 20 × 16
##    university universityid student sex     age class transfer teaching_assistant
##    <chr>             <dbl>   <dbl> <chr> <dbl> <chr> <chr>    <chr>             
##  1 UCSF                  2     112 fema…  34.2 seni… no       T4                
##  2 UCLA                  1      14 fema…  29.8 seni… no       T3                
##  3 UCSF                  2      67 fema…  29.6 juni… yes      T4                
##  4 UCSF                  2      69 fema…  29.5 seni… yes      T4                
##  5 UCLA                  1       9 fema…  29.3 seni… yes      T3                
##  6 UCLA                  1      41 fema…  28.6 seni… no       T3                
##  7 UCSF                  2      93 fema…  28.4 seni… no       T5                
##  8 UCLA                  1      58 fema…  28.2 seni… no       T5                
##  9 UCLA                  1      52 fema…  28.0 seni… no       T3                
## 10 UCLA                  1      36 fema…  27.8 seni… no       T5                
## 11 UCSF                  2      63 fema…  27.7 seni… no       T3                
## 12 UCLA                  1       8 fema…  27.1 juni… no       T3                
## 13 UCLA                  1      10 fema…  27.1 juni… no       T3                
## 14 UCLA                  1      31 fema…  26.3 soph… yes      T5                
## 15 UCLA                  1      30 fema…  26.1 seni… no       T5                
## 16 UCSF                  2      89 fema…  25.9 seni… no       T5                
## 17 UCLA                  1      19 fema…  25.5 juni… yes      T2                
## 18 UCSF                  2      68 fema…  25.4 soph… no       T4                
## 19 UCSF                  2      75 fema…  25.2 juni… no       T4                
## 20 UCLA                  1      25 fema…  25.2 soph… yes      T3                
## # ℹ 8 more variables: homework1 <dbl>, homework2 <dbl>, midterm <dbl>,
## #   final <dbl>, total_points <dbl>, total_proportion <dbl>, agecat <fct>,
## #   highprop <lgl>

Grouped data operations with .by/by

Often we want to create variables with values aggregated by group, rather than over the whole dataset (e.g. the mean final score by teaching assistant).

Many dplyr functions allow a grouping variable (or variables) to be specified using the argument .by, which specifies that the function should operate on groups rather than individual observations.

The dplyr function summarise() creates a data frame of summaries produced by functions operating on each group denoted by the variable(s) specified in .by. The output data frame will have one row per group of the .by variable(s).

Examples of useful functions for summarise() include n() (number of rows in the group), mean(), sum(), min(), max(), and sd().

# Create summaries of students by teaching assistant using .by
student_summ2 <- d |>
  summarise(n_students = n(), # number of students
            max_pts = max(total_points), # max points 
            avg_age = mean(age), # average age
            n_seniors = sum(class=="senior"), # number of senior students 
            .by = teaching_assistant) 
student_summ2
## # A tibble: 5 × 5
##   teaching_assistant n_students max_pts avg_age n_seniors
##   <chr>                   <int>   <dbl>   <dbl>     <int>
## 1 T1                          7     180    25.3         2
## 2 T3                         37     202    24.3        23
## 3 T2                          4     148    21.3         3
## 4 T5                         28     171    22.7        16
## 5 T4                         22     179    24.1        13

The function mutate() also accepts .by variables, allowing easy generation of aggregated variables:

# add a variable that is the mean final score by teaching assistant
d_with_ta_summ <- d |>
  mutate(avg_final = mean(final),
         .by = teaching_assistant)
# first 20 rows of relevant columns
head(select(d_with_ta_summ, student, teaching_assistant, final, avg_final), n=10)
## # A tibble: 10 × 4
##    student teaching_assistant final avg_final
##      <dbl> <chr>              <dbl>     <dbl>
##  1       1 T1                    68      64.9
##  2       2 T1                    67      64.9
##  3       3 T1                    96      64.9
##  4       4 T1                    40      64.9
##  5       5 T1                    33      64.9
##  6       6 T1                    98      64.9
##  7       7 T1                    52      64.9
##  8       8 T3                    80      67.1
##  9       9 T3                    68      67.1
## 10      10 T3                    87      67.1

Package conflicts and masking

When two packages that contain a function with the same name are loaded into the R session, a conflict arises. For example, the packages Hmisc and dplyr both contain a summarize() function.

The function from the package loaded most recently will mask the function with the same name from the package loaded earlier and will be called by default. If dplyr is loaded after Hmisc, when summarize() is specified, dplyr’s summarize() will be called.

We can still access the masked function from the first package in a couple of ways: by prefixing the function name with the package name and :: (as shown below), or by managing conflicts explicitly with the conflicted package.
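
A minimal sketch of the :: syntax (assuming both packages are loaded and using the d dataset from above):

# :: calls a specific package's function regardless of masking order
dplyr::summarize(d, avg_final = mean(final))

# Hmisc's version can still be reached with Hmisc::summarize()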

Vector functions in dplyr

The following functions work on individual vectors, not data frames.

We are going to focus on two functions that are useful for recoding variables.

Recode variables using binary conditions with if_else()

Use if_else() to apply a binary condition to each element of a vector and return one of two values depending on whether the condition is TRUE or FALSE.

The syntax is if_else(condition, true, false), where condition is a logical test applied to each element, true is the value returned where the condition is TRUE, and false is the value returned where it is FALSE.

# create a pass/fail (0/1) variable
d <- d |>
  mutate(pass_final = if_else(final > 60, true=1, false=0))

head(select(d, final, pass_final))
## # A tibble: 6 × 2
##   final pass_final
##   <dbl>      <dbl>
## 1    68          1
## 2    67          1
## 3    96          1
## 4    40          0
## 5    33          0
## 6    98          1

if_else() allows substitution of values from other vectors of the same length.

# use highest score for best_test
d <- d |>
  mutate(best_test = if_else(midterm>final, true=midterm, false=final))

head(select(d, midterm, final, best_test))
## # A tibble: 6 × 3
##   midterm final best_test
##     <dbl> <dbl>     <dbl>
## 1      60    68        68
## 2      57    67        67
## 3      78    96        96
## 4      45    40        45
## 5      44    33        44
## 6      71    98        98
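
As an aside (not in the original), the same result can be obtained with base R's element-wise maximum function pmax():

# pmax() returns the element-wise maximum of two vectors
d |>
  mutate(best_test2 = pmax(midterm, final))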

Recode variables under multiple conditions with case_when()

Inside case_when(), each condition to evaluate on an existing variable is followed by ~ and the value assigned to the new variable.

# Recode variable total_points
d <- d |>
  mutate(letter_grade = case_when(
    total_points < 120 ~ "F",
    between(total_points, 120, 130) ~ "D", # between(x, a, b) returns TRUE if a<=x<=b, FALSE otherwise
    between(total_points, 130.01, 140) ~ "C",
    between(total_points, 140.01, 150) ~ "B",
    total_points > 150 ~ "A"
  )
  )

head(select(d, total_points, letter_grade))
## # A tibble: 6 × 2
##   total_points letter_grade
##          <dbl> <chr>       
## 1          140 C           
## 2          128 D           
## 3          176 A           
## 4           90 F           
## 5           80 F           
## 6          180 A

Cases with values that do not satisfy any condition become missing values.

# create an "upperclassman" variable
d <- d |>
  mutate(upper = case_when(
    class %in% c("freshman", "sophomore") ~ 0,
    class == "senior" ~ 1 # oops, forgot juniors
  ))

head(select(d, class, upper))
## # A tibble: 6 × 2
##   class     upper
##   <chr>     <dbl>
## 1 sophomore     0
## 2 sophomore     0
## 3 sophomore     0
## 4 senior        1
## 5 senior        1
## 6 junior       NA

The .default argument specifies the value to return for cases that do not satisfy any condition.

Missing values can be matched explicitly with is.na(); without such a condition, NA values will also fall through to the .default value.

# create a recommend variable based on grade
d <- d |>
  mutate(recommend = case_when(
    letter_grade == "A" ~ "yes",
    letter_grade == "B" ~ "maybe",
    is.na(letter_grade) ~ NA, # without this, NA will match to .default
    .default = "no"  # C, D, and F will be no
  ))

head(select(d, letter_grade, recommend), n=20)
## # A tibble: 20 × 2
##    letter_grade recommend
##    <chr>        <chr>    
##  1 C            no       
##  2 D            no       
##  3 A            yes      
##  4 F            no       
##  5 F            no       
##  6 A            yes      
##  7 D            no       
##  8 A            yes      
##  9 C            no       
## 10 A            yes      
## 11 F            no       
## 12 F            no       
## 13 D            no       
## 14 F            no       
## 15 F            no       
## 16 B            maybe    
## 17 C            no       
## 18 D            no       
## 19 C            no       
## 20 F            no

In-class exercise 2

Use the dataset dex1 from the first exercise.

2.1 Drop the columns read, write, awards, and school_prog.

2.2 Keep only the rows of students who received more than 50 points on the math test.

2.3 Calculate the mean math and the mean science scores aggregating by honors status.


Try to perform these steps as a series of commands chained with the pipe operator |>.

2.1 Drop columns

# Exercise 2. 2.1 Dropping columns
dex1 |>
  select(-read, -write, -awards, -school_prog) |>
  glimpse() 
## Rows: 200
## Columns: 7
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ sex       <chr> "female", "female", "male", "female", "male", "female", "mal…
## $ ses       <chr> "low", "middle", "low", "low", "low", "low", "middle", "low"…
## $ math      <int> 45, 33, 48, 41, 43, 46, 59, 52, 52, 49, 45, 45, 39, 54, 44, …
## $ science   <int> 39, 42, 63, 39, 45, 40, 47, 44, 44, 53, 39, 39, 47, 42, 26, …
## $ social_st <int> 41, 41, 56, 51, 31, 41, 51, 48, 51, 61, 36, 46, 61, 56, 42, …
## $ honors    <chr> "not enrolled", "enrolled", "enrolled", "not enrolled", "not…

2.2 Drop rows

Keep only the rows of students who received more than 50 points on the math test.

# Exercise 2. 2.2 Dropping rows
dex1 |>
  select(-read, -write, -awards, -school_prog) |>
  filter(math > 50) |>
  distinct(math) |>
  arrange(math)
##    math
## 1    51
## 2    52
## 3    53
## 4    54
## 5    55
## 6    56
## 7    57
## 8    58
## 9    59
## 10   60
## 11   61
## 12   62
## 13   63
## 14   64
## 15   65
## 16   66
## 17   67
## 18   68
## 19   69
## 20   70
## 21   71
## 22   72
## 23   73
## 24   75

2.3 Summarise

Calculate the mean math and the mean science scores aggregating by honors status.

# Exercise 2. 2.3 Summarising
dex1 |>
  select(-read, -write, -awards, -school_prog) |> # dropping columns
  filter(math > 50) |> # dropping rows
  summarise(mean_math = mean(math), # summarizing by honors
            mean_science = mean(science),
            .by = honors) 
##         honors mean_math mean_science
## 1 not enrolled  56.93846     55.04615
## 2     enrolled  62.62500     59.31250

Combining data sets

Appending

Often datasets are split into multiple files, perhaps because data are collected in several waves or by different researchers. When files share the same variables (hopefully!), we can append the datasets, or bind their rows together.

Two choices for appending, rbind() and bind_rows()

We can use either the base R function rbind() or the dplyr function bind_rows() to append. They differ in how they handle the situation where the datasets being appended have non-matching columns.

They also differ in how they handle the situation where the same column in the two datasets has two different types (e.g. one character the other numeric).
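
A toy sketch of the first difference, using two small hypothetical data frames a and b (not part of the workshop data): rbind() refuses to combine data frames whose columns do not match, while bind_rows() keeps all columns and fills the gaps with NA.

# two small data frames with non-matching columns
a <- data.frame(x = 1:2, y = c("a", "b"))
b <- data.frame(x = 3:4)   # no column y

# rbind(a, b)    # errors: the columns do not match
bind_rows(a, b)  # works: y is NA for rows from b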

Example of appending data sets with non-matching columns

Let’s load a second dataset that contains the same variables as the original version of our first dataset, but without any variables created during this workshop.

# new data set that contains the same variables as d, except is missing those created during the workshop
d2 <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/student_data2.csv") 

# rows and columns of d2 and d
dim(d)
## [1] 98 21
dim(d2)
## [1] 120  14

Because our first data set d has several extra variables we created earlier with mutate(), the two datasets do not have completely matching columns.

Examples of rbind() and bind_rows()

The best solution would be to calculate the new variables for the second dataset and then use either function to append…

However, to demonstrate the behavior of bind_rows(), we append the datasets as they are.

Use the .id= argument in bind_rows() to create a variable that identifies the source dataset.

# a new variable called source is added to the beginning of the dataset
d3 <- bind_rows(d, d2, .id="source")

# these are the rows where the datasets meet
#  final and total_points are in both datasets, letter_grade and recommend only in d
select(d3[95:105,], final, total_points, letter_grade, recommend)
## # A tibble: 11 × 4
##    final total_points letter_grade recommend
##    <dbl>        <dbl> <chr>        <chr>    
##  1    55          135 C            no       
##  2    67          146 B            maybe    
##  3    81          168 A            yes      
##  4    75          152 A            yes      
##  5    68          139 <NA>         <NA>     
##  6    67          127 <NA>         <NA>     
##  7    96          180 <NA>         <NA>     
##  8    40           89 <NA>         <NA>     
##  9    33           89 <NA>         <NA>     
## 10    98          179 <NA>         <NA>     
## 11    52          121 <NA>         <NA>

# this will work because we restrict d to only variables common to both
drbind <- rbind(d[,1:14], d2)

Merging datasets

Merging adds more columns of variables. Datasets to be merged should be matched on some id variable(s).

Merging datasets with dplyr joins

The dplyr “join” functions perform such merges and, by default, use any same-named variables shared between the datasets as the id variables. Use the by= argument to specify the matching id variables explicitly.

These joins all return a table with all columns from x and y, but differ in how they deal with mismatched rows: inner_join(x, y) keeps only rows that match in both datasets, left_join(x, y) keeps all rows of x, right_join(x, y) keeps all rows of y, and full_join(x, y) keeps all rows from both.

Each of these joins is possible with base R function merge(), but merge() is many times slower than the joins.

Examples of merging with dplyr joins

We begin our demonstration of merging by loading a dataset of teaching assistant (TA) variables, which describes each TA’s age, how many years they have spent in their PhD program so far, etc.

# load data set for merging example
d_teaching_assistants <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/teaching_assistant_data.csv")

d_teaching_assistants
## # A tibble: 40 × 4
##    teaching_assistant ta_age school  year_in_program
##    <chr>               <dbl> <chr>             <dbl>
##  1 T1                   26.8 average               1
##  2 T2                   28.1 top                   3
##  3 T3                   28.5 average               2
##  4 T4                   33.9 top                   6
##  5 T5                   27.9 top                   7
##  6 T6                   24.6 average               7
##  7 T7                   26   average               3
##  8 T8                   29.9 average               2
##  9 T9                   24.8 average               5
## 10 T10                  24.2 average               2
## # ℹ 30 more rows

Subset datasets for joins

It is easiest to show how the joins differ if we work on smaller, more manageable datasets.

We will merge a subset of the d_teaching_assistants dataset we just imported with a subset of the d3 dataset we created earlier through appending.

# select one non-matching and one matching teaching assistant from each to demo joins
#   just a few variables from d3 and d_teaching_assistants
d_st <- d3 |>
  filter(teaching_assistant == "T2" | teaching_assistant == "T7") |>
  select(teaching_assistant, student, midterm, final)

d_ta <- d_teaching_assistants |>
  filter(teaching_assistant == "T2" | teaching_assistant == "T35")

Let’s take a look at our small datasets

# first a look at the two datasets to be merged
# student dataset with information about their teaching assistants
d_st 
## # A tibble: 8 × 4
##   teaching_assistant student midterm final
##   <chr>                <dbl>   <dbl> <dbl>
## 1 T2                      17      48    53
## 2 T2                      18      62    76
## 3 T2                      19      68    64
## 4 T2                      20      48    67
## 5 T7                     137      48    53
## 6 T7                     138      62    76
## 7 T7                     139      68    64
## 8 T7                     140      48    67

# teaching assistant dataset
d_ta
## # A tibble: 2 × 4
##   teaching_assistant ta_age school  year_in_program
##   <chr>               <dbl> <chr>             <dbl>
## 1 T2                   28.1 top                   3
## 2 T35                  30.2 average               1

Example of merging with inner_join()

inner_join(d_st, d_ta)

# only matching rows returned
#  T2 from d_ta matched four times to T2 in d_st
inner_join(d_st, d_ta)
## # A tibble: 4 × 7
##   teaching_assistant student midterm final ta_age school year_in_program
##   <chr>                <dbl>   <dbl> <dbl>  <dbl> <chr>            <dbl>
## 1 T2                      17      48    53   28.1 top                  3
## 2 T2                      18      62    76   28.1 top                  3
## 3 T2                      19      68    64   28.1 top                  3
## 4 T2                      20      48    67   28.1 top                  3

Example of merging with left_join()

left_join(d_st, d_ta)

# all rows from d_st returned
left_join(d_st, d_ta)
## # A tibble: 8 × 7
##   teaching_assistant student midterm final ta_age school year_in_program
##   <chr>                <dbl>   <dbl> <dbl>  <dbl> <chr>            <dbl>
## 1 T2                      17      48    53   28.1 top                  3
## 2 T2                      18      62    76   28.1 top                  3
## 3 T2                      19      68    64   28.1 top                  3
## 4 T2                      20      48    67   28.1 top                  3
## 5 T7                     137      48    53   NA   <NA>                NA
## 6 T7                     138      62    76   NA   <NA>                NA
## 7 T7                     139      68    64   NA   <NA>                NA
## 8 T7                     140      48    67   NA   <NA>                NA

Example of merging with full_join()

full_join(d_st, d_ta)

# all rows from both returned
full_join(d_st, d_ta)
## # A tibble: 9 × 7
##   teaching_assistant student midterm final ta_age school  year_in_program
##   <chr>                <dbl>   <dbl> <dbl>  <dbl> <chr>             <dbl>
## 1 T2                      17      48    53   28.1 top                   3
## 2 T2                      18      62    76   28.1 top                   3
## 3 T2                      19      68    64   28.1 top                   3
## 4 T2                      20      48    67   28.1 top                   3
## 5 T7                     137      48    53   NA   <NA>                 NA
## 6 T7                     138      62    76   NA   <NA>                 NA
## 7 T7                     139      68    64   NA   <NA>                 NA
## 8 T7                     140      48    67   NA   <NA>                 NA
## 9 T35                     NA      NA    NA   30.2 average               1
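
For reference, a sketch of rough base R merge() equivalents of the joins above (merge() also matches on shared column names by default), along with a join call that names the id variable explicitly with by=:

# base R merge() equivalents
merge(d_st, d_ta)                # ~ inner_join(d_st, d_ta)
merge(d_st, d_ta, all.x = TRUE)  # ~ left_join(d_st, d_ta)
merge(d_st, d_ta, all = TRUE)    # ~ full_join(d_st, d_ta)

# naming the matching variable explicitly in a dplyr join
left_join(d_st, d_ta, by = "teaching_assistant")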

In-class exercise 3

We are going to practice merging two datasets, and for this purpose, we will use the dataset dex1 along with a new dataset dex2.

# Exercise 3. Import dataset dex2
dex2 <- read_csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/08/excercise_2.csv")
dex2
## # A tibble: 5 × 4
##      id school_prog      awards city_state    
##   <dbl> <chr>             <dbl> <chr>         
## 1     1 public/vocation       0 NY New York   
## 2     2 public/general        1 CA Los Angeles
## 3     3 private/academic      5 NY New York   
## 4     5 private/academic      0 CA Los Angeles
## 5   100 public/academic       1 CA Los Angeles

3.1 In dataset dex1, keep the variables id, ses, school_prog, math, science, and honors.

3.2 Subset dex1 to only students with ids 1 through 8.

3.3 Create a new dataset dex3 using dex1 and dex2, containing all rows from dex1 and only matched rows from dex2.

3.1 Keep a subset of columns

In dataset dex1, keep the variables id, ses, school_prog, math, science, and honors.

# Exercise 3. 3.1 Keeping a subset of columns
dex1 |>
  select(id, ses, school_prog, math, science, honors)
##      id    ses      school_prog math science       honors
## 1     1    low  public/vocation   45      39 not enrolled
## 2     2 middle   public/general   33      42     enrolled
## 3     3    low private/academic   48      63     enrolled
## 4     4    low  public/academic   41      39 not enrolled
## 5     5    low  public/academic   43      45 not enrolled
## 6     6    low  public/academic   46      40 not enrolled
## 7     7 middle  public/academic   59      47 not enrolled
## 8     8    low  public/academic   52      44 not enrolled
## 9     9 middle  public/vocation   52      44 not enrolled
## 10   10 middle   public/general   49      53 not enrolled
## 11   11 middle  public/academic   45      39 not enrolled
## 12   12 middle  public/vocation   45      39 not enrolled
## 13   13 middle  public/vocation   39      47 not enrolled
## 14   14   high  public/academic   54      42 not enrolled
## 15   15   high  public/vocation   44      26 not enrolled
## 16   16    low  public/vocation   44      36 not enrolled
## 17   17 middle  public/academic   48      44 not enrolled
## 18   18 middle  public/vocation   49      44 not enrolled
## 19   19    low   public/general   43      44 not enrolled
## 20   20   high  public/academic   57      61 not enrolled
## 21   21 middle   public/general   61      50 not enrolled
## 22   22 middle  public/vocation   39      56 not enrolled
## 23   23    low  public/academic   64      58     enrolled
## 24   24 middle  public/academic   66      47     enrolled
## 25   25 middle   public/general   42      42 not enrolled
## 26   26   high  public/academic   62      61 not enrolled
## 27   27 middle  public/academic   61      57     enrolled
## 28   28 middle   public/general   54      50 not enrolled
## 29   29    low   public/general   49      55 not enrolled
## 30   30   high  public/academic   42      34 not enrolled
## 31   31 middle  private/general   52      42 not enrolled
## 32   32   high  public/vocation   66      66     enrolled
## 33   33    low  public/academic   72      54     enrolled
## 34   34   high private/academic   57      55     enrolled
## 35   35    low  private/general   50      50 not enrolled
## 36   36    low   public/general   44      35 not enrolled
## 37   37    low  public/vocation   40      39 not enrolled
## 38   38    low  public/academic   50      31 not enrolled
## 39   39   high  public/academic   67      61     enrolled
## 40   40    low   public/general   43      50 not enrolled
## 41   41 middle  public/academic   45      55 not enrolled
## 42   42 middle  public/vocation   55      44 not enrolled
## 43   43    low  public/academic   43      42 not enrolled
## 44   44    low  public/vocation   45      34     enrolled
## 45   45    low  public/vocation   41      29     enrolled
## 46   46    low  public/academic   44      34 not enrolled
## 47   47    low  public/academic   49      33 not enrolled
## 48   48 middle  public/academic   52      50 not enrolled
## 49   49   high  public/vocation   39      49 not enrolled
## 50   50 middle   public/general   42      53 not enrolled
## 51   51   high   public/general   42      31 not enrolled
## 52   52    low  public/academic   53      53 not enrolled
## 53   53 middle  public/vocation   46      39     enrolled
## 54   54    low  private/general   46      50 not enrolled
## 55   55 middle private/academic   49      44 not enrolled
## 56   56 middle  public/vocation   46      58 not enrolled
## 57   57 middle  public/academic   72      66     enrolled
## 58   58 middle  public/vocation   40      44 not enrolled
## 59   59 middle  public/academic   63      55     enrolled
## 60   60 middle  public/academic   51      63     enrolled
## 61   61   high  public/academic   60      67     enrolled
## 62   62   high   public/general   48      63     enrolled
## 63   63    low   public/general   60      56     enrolled
## 64   64   high  public/vocation   45      58 not enrolled
## 65   65 middle  public/academic   66      42 not enrolled
## 66   66 middle  public/vocation   56      50     enrolled
## 67   67    low  public/vocation   42      33 not enrolled
## 68   68 middle  public/academic   71      63     enrolled
## 69   69    low  public/vocation   40      40 not enrolled
## 70   70    low   public/general   41      47 not enrolled
## 71   71 middle   public/general   56      58     enrolled
## 72   72 middle  public/vocation   47      47 not enrolled
## 73   73 middle  public/academic   53      39 not enrolled
## 74   74 middle  public/academic   50      51 not enrolled
## 75   75 middle  public/vocation   51      53 not enrolled
## 76   76   high  public/academic   51      50 not enrolled
## 77   77    low  public/academic   49      44 not enrolled
## 78   78 middle  public/academic   54      53 not enrolled
## 79   79 middle  public/academic   49      50     enrolled
## 80   80   high  public/academic   68      66     enrolled
## 81   81    low  public/academic   59      65 not enrolled
## 82   82   high  public/academic   65      69     enrolled
## 83   83 middle  public/vocation   41      55     enrolled
## 84   84 middle   public/general   54      58 not enrolled
## 85   85 middle   public/general   57      53 not enrolled
## 86   86   high   public/general   54      58 not enrolled
## 87   87 middle   public/general   46      50 not enrolled
## 88   88   high  public/academic   64      69     enrolled
## 89   89    low  public/vocation   40      51 not enrolled
## 90   90   high  public/academic   50      50 not enrolled
## 91   91   high  public/vocation   56      47 not enrolled
## 92   92   high   public/general   57      63     enrolled
## 93   93   high  public/academic   62      58     enrolled
## 94   94   high  public/academic   61      61 not enrolled
## 95   95   high  public/academic   71      61     enrolled
## 96   96   high  public/academic   61      58 not enrolled
## 97   97   high  public/academic   58      58 not enrolled
## 98   98    low  public/vocation   51      53     enrolled
## 99   99   high   public/general   56      66 not enrolled
## 100 100   high  public/academic   71      69     enrolled
## 101 101   high  public/academic   67      50     enrolled
## 102 102   high  public/academic   51      53 not enrolled
## 103 103   high  public/academic   64      64 not enrolled
## 104 104   high  public/academic   57      55     enrolled
## 105 105 middle  public/academic   45      44 not enrolled
## 106 106 middle  public/vocation   37      42 not enrolled
## 107 107    low  public/vocation   47      42 not enrolled
## 108 108 middle   public/general   41      36 not enrolled
## 109 109 middle   public/general   42      42 not enrolled
## 110 110 middle  public/vocation   50      54 not enrolled
## 111 111    low   public/general   39      47 not enrolled
## 112 112 middle  public/academic   48      55 not enrolled
## 113 113 middle  public/academic   51      63 not enrolled
## 114 114   high  public/academic   62      55     enrolled
## 115 115    low   public/general   43      50 not enrolled
## 116 116 middle  public/academic   54      50 not enrolled
## 117 117   high  public/vocation   39      42 not enrolled
## 118 118 middle   public/general   58      58     enrolled
## 119 119    low   public/general   45      50 not enrolled
## 120 120   high  public/academic   54      50 not enrolled
## 121 121 middle  public/vocation   53      63 not enrolled
## 122 122 middle  public/academic   58      53 not enrolled
## 123 123   high   public/general   56      63 not enrolled
## 124 124    low  public/vocation   41      42 not enrolled
## 125 125    low  public/academic   58      59     enrolled
## 126 126 middle   public/general   57      47 not enrolled
## 127 127   high  public/academic   57      55 not enrolled
## 128 128   <NA>  public/academic   38      47 not enrolled
## 129 129    low   public/general   46      47 not enrolled
## 130 130   high   public/general   55      55 not enrolled
## 131 131   high  public/academic   57      46 not enrolled
## 132 132 middle  public/academic   73      69     enrolled
## 133 133 middle  public/vocation   40      34     enrolled
## 134 134    low   public/general   39      34 not enrolled
## 135 135    low  public/academic   65      54     enrolled
## 136 136 middle  public/academic   70      63 not enrolled
## 137 137   high  public/academic   65      53     enrolled
## 138 138 middle  public/vocation   40      50 not enrolled
## 139 139 middle  public/academic   61      55 not enrolled
## 140 140 middle  public/vocation   40      50 not enrolled
## 141 141   high  public/vocation   47      53 not enrolled
## 142 142 middle  public/vocation   52      39 not enrolled
## 143 143 middle  public/vocation   75      72     enrolled
## 144 144   high   public/general   58      61     enrolled
## 145 145 middle  public/vocation   38      36 not enrolled
## 146 146   high  public/academic   64      63     enrolled
## 147 147    low  public/academic   53      53     enrolled
## 148 148 middle  public/vocation   51      47 not enrolled
## 149 149    low   public/general   49      66 not enrolled
## 150 150 middle  public/vocation   57      72 not enrolled
## 151 151 middle  public/vocation   52      48 not enrolled
## 152 152   high  public/academic   56      58 not enrolled
## 153 153 middle  public/vocation   40      39 not enrolled
## 154 154   high  public/academic   66      61     enrolled
## 155 155 middle   public/general   46      39 not enrolled
## 156 156 middle  public/academic   53      61 not enrolled
## 157 157 middle   public/general   58      74 not enrolled
## 158 158 middle   public/general   55      53 not enrolled
## 159 159   high  public/academic   54      49     enrolled
## 160 160 middle  public/academic   55      50     enrolled
## 161 161    low  public/academic   72      61     enrolled
## 162 162 middle  public/vocation   40      61 not enrolled
## 163 163    low  public/academic   64      58 not enrolled
## 164 164 middle  public/vocation   46      39     enrolled
## 165 165    low  public/vocation   54      61 not enrolled
## 166 166 middle  public/academic   53      61 not enrolled
## 167 167 middle   public/general   35      66 not enrolled
## 168 168 middle  public/academic   57      55 not enrolled
## 169 169    low   public/general   63      69 not enrolled
## 170 170   high  public/academic   61      69     enrolled
## 171 171 middle  public/academic   60      55 not enrolled
## 172 172 middle  public/academic   57      53 not enrolled
## 173 173    low   public/general   61      63     enrolled
## 174 174 middle private/academic   71      66 not enrolled
## 175 175   high  private/general   42      50 not enrolled
## 176 176 middle private/academic   41      42 not enrolled
## 177 177 middle private/academic   62      58 not enrolled
## 178 178 middle private/vocation   57      58 not enrolled
## 179 179 middle private/academic   60      50     enrolled
## 180 180   high private/academic   69      58     enrolled
## 181 181 middle private/academic   45      58 not enrolled
## 182 182 middle private/academic   43      44 not enrolled
## 183 183 middle private/academic   49      55 not enrolled
## 184 184 middle private/vocation   53      55 not enrolled
## 185 185 middle private/academic   55      58 not enrolled
## 186 186 middle private/academic   63      55     enrolled
## 187 187 middle  private/general   57      55 not enrolled
## 188 188   high private/academic   56      55     enrolled
## 189 189 middle private/academic   63      53 not enrolled
## 190 190 middle private/academic   54      58 not enrolled
## 191 191   high private/academic   43      48 not enrolled
## 192 192   high private/academic   63      66     enrolled
## 193 193 middle private/academic   48      39 not enrolled
## 194 194   high private/academic   69      61     enrolled
## 195 195 middle  private/general   60      58 not enrolled
## 196 196   high private/academic   49      39 not enrolled
## 197 197   high private/academic   50      36 not enrolled
## 198 198   high private/academic   51      63     enrolled
## 199 199   high private/academic   50      61 not enrolled
## 200 200 middle private/academic   75      66 not enrolled
head(dex1)
##   id    sex    ses      school_prog read write math science social_st
## 1  1 female    low  public/vocation   NA    44   45      39        41
## 2  2 female middle   public/general   39    41   33      42        41
## 3  3   male    low private/academic   63    65   48      63        56
## 4  4 female    low  public/academic   44    50   41      39        51
## 5  5   male    low  public/academic   47    40   43      45        31
## 6  6 female    low  public/academic   47    41   46      40        41
##         honors awards
## 1 not enrolled      0
## 2     enrolled      1
## 3     enrolled      5
## 4 not enrolled      1
## 5 not enrolled      0
## 6 not enrolled      0

3.2 Drop rows by id

Subset dex1 to only students with ids 1 through 8.

# Exercise 3. 3.2 Dropping rows
dex1 |>
  select(id, ses, school_prog, math, science, honors) |>
  filter(id %in% 1:8)
##   id    ses      school_prog math science       honors
## 1  1    low  public/vocation   45      39 not enrolled
## 2  2 middle   public/general   33      42     enrolled
## 3  3    low private/academic   48      63     enrolled
## 4  4    low  public/academic   41      39 not enrolled
## 5  5    low  public/academic   43      45 not enrolled
## 6  6    low  public/academic   46      40 not enrolled
## 7  7 middle  public/academic   59      47 not enrolled
## 8  8    low  public/academic   52      44 not enrolled

3.3 Create a new dataset dex3

Create a new dataset dex3 using dex1 and dex2, containing all rows from dex1 and only matched rows from dex2.

# Exercise 3. 3.3 creating a new dataset dex3
dex3 <- dex1 |>
  select(id, ses, school_prog, math, science, honors) |> # dropping columns
  filter(id %in% c(1:8)) |> # dropping rows
  left_join(dex2) # merging two datasets

dex3
##   id    ses      school_prog math science       honors awards     city_state
## 1  1    low  public/vocation   45      39 not enrolled      0    NY New York
## 2  2 middle   public/general   33      42     enrolled      1 CA Los Angeles
## 3  3    low private/academic   48      63     enrolled      5    NY New York
## 4  4    low  public/academic   41      39 not enrolled     NA           <NA>
## 5  5    low  public/academic   43      45 not enrolled     NA           <NA>
## 6  6    low  public/academic   46      40 not enrolled     NA           <NA>
## 7  7 middle  public/academic   59      47 not enrolled     NA           <NA>
## 8  8    low  public/academic   52      44 not enrolled     NA           <NA>

Tidy data

Tidy data

We now explore the tools in the tidyverse package tidyr (loaded with library(tidyverse)), which make datasets “tidy”.

Tidy datasets adhere to a simple set of structural rules: each variable is a column, each observation is a row, and each value is a cell.

Figure from Wickham and Grolemund, “R for Data Science”

Doing a little more work upfront to make data “tidy” will pay off with time saved later on.

Data can be messy in many ways, but we focus on two problems that can be tidied with tidyr tools: column headers that are values rather than variable names, and multiple variables stored in one column.

A more detailed explanation of tidy data.

Column headers are values, not variable names

We first load a dataset of the number of graduates from 3 academic departments across 3 years.

The planned analysis will examine how graduation has changed over the years by department. For such an analysis, we will need variables that represent the department, the year, and the number of graduates.

# import department admissions data for tidying up
dept <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dept1.csv")
dept
## # A tibble: 3 × 4
##   id      `2015` `2016` `2017`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 biology    207    211    259
## 2 math        96     75     99
## 3 physics    112    126    125

However, we see in the data that year is represented as column headings, and graduation numbers are spread over 3 columns. We need to stack the graduation numbers into one column and create a variable representing year.

Use pivot_longer() to create a variable out of column headings and restructure the dataset

With pivot_longer(), we lengthen the data by stacking columns so that the final dataset has more rows and fewer columns.

This process is also known as “reshaping long”.

Arguments to pivot_longer(): cols= selects the columns to stack, names_to= names the new variable that will hold the old column headings, and values_to= names the new variable that will hold the stacked values.

Example of using pivot_longer()

Here again is the data set we want to reshape:

# inspect department admissions data
dept
## # A tibble: 3 × 4
##   id      `2015` `2016` `2017`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 biology    207    211    259
## 2 math        96     75     99
## 3 physics    112    126    125

In the newly reshaped dataset, we want to put the years and the graduation numbers into 2 columns.

We could also have specified each column with c(`2015`, `2016`, `2017`); the backticks are needed to specify the column names here because they do not start with a letter.

# the new column "year" uses the column headings as values,
#  the new column "graduates" will be the collapsed values
#  we do not want to collapse id
dept_by_year <- dept |>
  pivot_longer(cols=-id, names_to="year", values_to="grad")

dept_by_year
## # A tibble: 9 × 3
##   id      year   grad
##   <chr>   <chr> <dbl>
## 1 biology 2015    207
## 2 biology 2016    211
## 3 biology 2017    259
## 4 math    2015     96
## 5 math    2016     75
## 6 math    2017     99
## 7 physics 2015    112
## 8 physics 2016    126
## 9 physics 2017    125

In the call pivot_longer(cols=-id, names_to="year", values_to="grad"):

names_to="year" puts the old column headings (2015, 2016, 2017) into a new variable called year.

values_to="grad" puts the stacked cell values into a new variable called grad.

cols=-id stacks every column except id.
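
Note that the new year column is stored as character. A small follow-up step (an addition, not in the original) converts it to a number so it can be used in analysis:

# convert year from character to numeric
dept_by_year <- dept_by_year |>
  mutate(year = as.numeric(year))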

Multiple variables in one column

Columns should contain values that represent one variable, but we often encounter datasets where multiple variables are stored in the same column.

Let’s take a look at a dataset of worms, who have had their age, length, and weight measured, but all stored in one column.

# import data on worms for tidying up
worms <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/worms.csv")

worms
## # A tibble: 9 × 3
##    worm feature measure
##   <dbl> <chr>     <dbl>
## 1     1 age         5  
## 2     1 length      3.2
## 3     1 weight      4.1
## 4     2 age         4  
## 5     2 length      2.6
## 6     2 weight      3.5
## 7     3 age         5  
## 8     3 length      3.6
## 9     3 weight      5.5

We are interested in whether age predicts length and weight, which would make the unit of analysis the worm. So each row should be one worm.

We want the feature values to be spread across three columns.

Use pivot_wider() to spread a single column into multiple columns

Here, we use pivot_wider() to “widen” the data by dividing single columns into multiple columns, which reverses the operations of pivot_longer().

This process is sometimes known as “reshaping wide”.

Now, we want “age”, “length”, and “weight” to be their own columns, and the values in the “measure” column to be spread across these three new columns.

Arguments to pivot_wider(): names_from= specifies the column whose values become the new column names, and values_from= specifies the column whose values are spread across those new columns.

In dataset “worms”, the names of the new columns are found in “feature” and the values to spread across columns are found in “measure.”

# create new columns with names from feature variable and values from measure variable
by_worm <- worms |>
  pivot_wider(names_from=feature, values_from=measure)

by_worm
## # A tibble: 3 × 4
##    worm   age length weight
##   <dbl> <dbl>  <dbl>  <dbl>
## 1     1     5    3.2    4.1
## 2     2     4    2.6    3.5
## 3     3     5    3.6    5.5

In the call pivot_wider(names_from=feature, values_from=measure), names_from=feature supplies the new column names and values_from=measure supplies the values spread across them.
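
As a quick check (an aside, not in the original) that pivot_wider() reverses pivot_longer(), reshaping by_worm long again recovers the original layout of worms:

# lengthen by_worm back to one row per worm-feature combination
by_worm |>
  pivot_longer(cols = -worm, names_to = "feature", values_to = "measure")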

Multiple variables stored in the same cell, or one variable split into multiple cells

Two other issues that make datasets untidy are columns where multiple variables are stored in the same cell, or where one variable is stored across multiple cells.

The table5 data that is loaded with the tidyr package exhibits both of these problems.

# table5 for tidying up
table5
## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Use separate() and unite() to split and concatenate columns, respectively

separate() and unite() are complements of each other.

For separate(): col= is the column to split, into= gives the (quoted) names of the new columns, and sep= is the separator to split on.

For unite(): col= is the (quoted) name of the new combined column, the unquoted columns to combine follow, and sep= is the separator placed between their values.

# remember that new variable names must be quoted, while old variables do not need to be
table5 |>
  separate(col=rate, into=c("cases", "population"), sep="/") |> 
  unite(col="year", century, year, sep="")
## # A tibble: 6 × 4
##   country     year  cases  population
##   <chr>       <chr> <chr>  <chr>     
## 1 Afghanistan 1999  745    19987071  
## 2 Afghanistan 2000  2666   20595360  
## 3 Brazil      1999  37737  172006362 
## 4 Brazil      2000  80488  174504898 
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583
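
One optional refinement (an aside, not in the original): separate() accepts convert=TRUE, which converts the new columns to their natural type, so cases and population become integers rather than character:

# convert=TRUE turns the split columns into integers
table5 |>
  separate(col=rate, into=c("cases", "population"), sep="/", convert=TRUE) |> 
  unite(col="year", century, year, sep="")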

tidyverse cheatsheets

In the Help menu for RStudio, there is a submenu for “Cheatsheets”, which are documents that summarize the usage and syntax of various widely-used R tools. You’ll find a cheatsheet for dplyr linked in this menu.

Clicking on “Browse Cheat Sheets” brings you to this page with more cheatsheets, including those for most of the packages in tidyverse.

Dates

Dates in R

When a dataset is read into R, dates are typically stored as a column of character (string) values. However, dates are inherently numeric, and we lose that information when they are stored as strings.

# string dates have no numeric value, so this errors
"2018-03-05" + 1
## Error in "2018-03-05" + 1: non-numeric argument to binary operator

In R, we want to convert our string dates to R class Date.

as.Date()

Base R provides as.Date() to convert strings to dates, but it can be intimidating and unwieldy to use, particularly if your dates are not stored in one of 2 default formats (see ?strptime for more about date formats).

# as.Date only accepts a couple of formats by default
# good
as.Date("2015-02-14")
## [1] "2015-02-14"

# bad
as.Date("02/14/2014")
## Error in charToDate(x): character string is not in a standard unambiguous format

# specify a format to fix
as.Date("02/14/2014", format="%m/%d/%Y")
## [1] "2014-02-14"

Once our dates are class Date, we can perform date arithmetic.

# a Date class variable
a <- as.Date("1971-01-01")
class(a)
## [1] "Date"

# days since 1970-01-01
as.numeric(a)
## [1] 365

# date arithmetic
a - as.Date("1971-12-31")
## Time difference of -364 days

a + 2
## [1] "1971-01-03"

We just need something easier to use to convert strings of various formats to class Date!

Package lubridate

The amusingly-named tidyverse package lubridate provides easy-to-use functions to convert string dates to R Date format and then to process those dates.

The Date conversion functions in lubridate accept a wide variety of string date formats.

lubridate is installed with tidyverse. In tidyverse 2.0.0 and later it is also loaded as a core package by library(tidyverse); in older versions it must be loaded separately, so we load it explicitly here.

# load package lubridate
library(lubridate)

Using lubridate

We’ll first load a dataset with various date formats to demonstrate the flexibility of lubridate functions.

# import dates datset
d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dates.csv")
d
## # A tibble: 3 × 5
##   fmt1     fmt2             fmt3           fmt4 decision_time             
##   <chr>    <chr>            <date>        <dbl> <chr>                     
## 1 01/15/89 December 8, 2015 2015-02-27 20090101 Approved 10-10-15 07:15:55
## 2 02/13/92 January 22, 2012 2016-11-15 20080819 Denied 09-27-11 14:57:23  
## 3 03/15/84 March 3, 2010    2017-12-25 20071011 Approved 04-24-15 02:03:03


To identify which string-to-Date conversion lubridate function you’ll need:

  1. Take the letters y, m, and d and place them in the order of the year, month and day, respectively, as they are stored in the date column
  2. That ordering produces the name of the function to convert that column to Date (e.g. ymd(), mdy(), dmy()).


The first pair of columns use the order month-day-year, while the second pair of columns use year-month-day.

So, we want mdy() and ymd().

Marvel at the flexibility!

# no format specifications needed
# just ordering y,m, and d
dates <- data.frame(f1=mdy(d$fmt1), f2=mdy(d$fmt2),
                    f3=ymd(d$fmt3), f4=ymd(d$fmt4))
dates
##           f1         f2         f3         f4
## 1 1989-01-15 2015-12-08 2015-02-27 2009-01-01
## 2 1992-02-13 2012-01-22 2016-11-15 2008-08-19
## 3 1984-03-15 2010-03-03 2017-12-25 2007-10-11

Date-time variables

Date-time variables include a time in addition to a date.

In R, date-time variables can be stored as class POSIXct, representing the number of seconds since midnight (00:00:00), January 1, 1970.

lubridate provides functions to convert string date-time variables to class POSIXct. Simply add one or more of h, m, and s (for hours, minutes, and seconds, respectively) to y, m, and d to form the function name.

Examples:

# month, day, year, hour and minute with PM
mdy_hm("7/19/23 11:30PM")
## [1] "2023-07-19 23:30:00 UTC"

# year, month, day, hour, minute, second
ymd_hms("2023-4-19 08:25:30")
## [1] "2023-04-19 08:25:30 UTC"

# time difference between those 2 date-times
mdy_hm("7/19/23 1:30PM") - ymd_hms("2023-4-19 21:25:30") 
## Time difference of 90.66979 days

# class POSIXct is a standard way of representing calendar time
class(mdy_hm("7/19/23 11:30PM"))
## [1] "POSIXct" "POSIXt"

The fifth column of our dates dataset, decision_time, is a date-time variable recorded as month, day, year, hour, minute, second, so we want mdy_hms(). (Notice how the “Approved/Denied” string at the beginning is ignored!).

# month day year hour minute second
mdy_hms(d$decision_time)
## [1] "2015-10-10 07:15:55 UTC" "2011-09-27 14:57:23 UTC"
## [3] "2015-04-24 02:03:03 UTC"

The “time zone” standard UTC (Coordinated Universal Time) is used as the default zone. Use the tz= argument to assign a zone. To see a list of valid time zone specifications, run OlsonNames().

# Assign a time zone to a time variable
dates$decision_time <- mdy_hms(d$decision_time, tz="US/Pacific")
dates$decision_time
## [1] "2015-10-10 07:15:55 PDT" "2011-09-27 14:57:23 PDT"
## [3] "2015-04-24 02:03:03 PDT"

# first 20 valid time zones
head(OlsonNames(),  n=20)
##  [1] "Africa/Abidjan"       "Africa/Accra"         "Africa/Addis_Ababa"  
##  [4] "Africa/Algiers"       "Africa/Asmara"        "Africa/Asmera"       
##  [7] "Africa/Bamako"        "Africa/Bangui"        "Africa/Banjul"       
## [10] "Africa/Bissau"        "Africa/Blantyre"      "Africa/Brazzaville"  
## [13] "Africa/Bujumbura"     "Africa/Cairo"         "Africa/Casablanca"   
## [16] "Africa/Ceuta"         "Africa/Conakry"       "Africa/Dakar"        
## [19] "Africa/Dar_es_Salaam" "Africa/Djibouti"

Extracting information from Date variables

lubridate provides several functions to extract specific information from Date variables, including day() (day of the month), yday() (day of the year), wday() (day of the week), month(), and year().

Some examples of extracting information from Date variables.

# we'll use the first column of our dates dataset
dates$f1
## [1] "1989-01-15" "1992-02-13" "1984-03-15"

# day of the month
day(dates$f1)
## [1] 15 13 15

# day of the year
yday(dates$f1)
## [1] 15 44 75

Some more date information extraction:

# weekday as numbers
wday(dates$f1)
## [1] 1 5 5

# weekday with labels
wday(dates$f1, label=TRUE)
## [1] Sun Thu Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

# month of the year
month(dates$f1)
## [1] 1 2 3

Extracting information from POSIXct variables

lubridate also has functions to extract time information from POSIXct date-time variables.

#break up the time variable decision time
#display as a data.frame with 4 columns
with(dates,  # with() tells R to look for variables in object "dates"
     data.frame(time=decision_time, h=hour(decision_time),
           m=minute(decision_time), s=second(decision_time)))
##                  time  h  m  s
## 1 2015-10-10 07:15:55  7 15 55
## 2 2011-09-27 14:57:23 14 57 23
## 3 2015-04-24 02:03:03  2  3  3

Two sets of functions for date-time arithmetic

If you need to add or subtract time to your date variables, lubridate provides 2 sets of functions.

One set gives “intuitive” results, and will ignore conventions like leap year. These include seconds(), minutes(), hours(), days(), weeks(), years().

The other set will adhere to those conventions, and are named by adding d to the names of the previous functions, dseconds(), dminutes(), dhours(), ddays(), dweeks(), dyears().

#2016 is a leap year
# the intuitive result
ymd("2015-02-14") + years(2) 
## [1] "2017-02-14"

# the exact result
ymd("2015-02-14") + dyears(2)
## [1] "2017-02-13 12:00:00 UTC"

Managing strings

Strings

Strings are character values.

Columns in external datasets that contain any non-number value will generally be read into R as string variables.

When specifying a specific string value, the value should be surrounded by double quotes, "", or single quotes, ''.

In this section we will cover functions that concatenate strings, extract and replace substrings by position, change case, match text patterns, and substitute text based on pattern matches.

Package stringr

The tidyverse package stringr provides many functions that have nearly the same functionality as similar base R functions.

The package stringr is automatically loaded with library(tidyverse).

We will use a dataset of locations of partially fictitious statistical consulting centers.

# import centers data
centers <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/stats_centers.csv")
centers
## # A tibble: 4 × 6
##   NAME              ROOM  BUILDING        UNIVERSITY         CITY        STATE
##   <chr>             <chr> <chr>           <chr>              <chr>       <chr>
## 1 IDRE STAT CONSULT X4919 MATH SCIENCE    UCLA               LOS ANGELES CA   
## 2 OIT CONSULTING    Y521  HOUSTON HALL    LONE STAR UNIV.    DALLAS      TX   
## 3 ATS STAT HELP     M234  ORANGE BUILDING SUNSHINE COLLEGE   MIAMI       FL   
## 4 OAC STAT CENTER   B2431 SNOW HALL       GRIZZLY UNIVERSITY JUNEAU      AK

Concatenating strings

At least two functions loaded in the current session can concatenate strings: paste() from base R and str_c() from stringr (tidyverse). We highlight what makes each unique.

paste() to concatenate strings

When used on vectors, base R function paste() concatenates element-wise. By default, joined elements are separated by a space.

# element-by-element concatenation
paste(centers$ROOM, centers$BUILDING)
## [1] "X4919 MATH SCIENCE"   "Y521 HOUSTON HALL"    "M234 ORANGE BUILDING"
## [4] "B2431 SNOW HALL"

You can change the separating character with sep=.

# separate with comma and space
paste(centers$CITY, centers$STATE, sep=", ")
## [1] "LOS ANGELES, CA" "DALLAS, TX"      "MIAMI, FL"       "JUNEAU, AK"

Vectors of numbers used in paste() are converted to character before pasting.

# numbers are converted to character before pasting
paste(1:4, centers$UNIVERSITY)
## [1] "1 UCLA"               "2 LONE STAR UNIV."    "3 SUNSHINE COLLEGE"  
## [4] "4 GRIZZLY UNIVERSITY"

Interestingly, paste() will coerce the missing value NA to the string "NA". This usually is not desirable.

# NA is coerced to the string "NA"
paste(centers$CITY, centers$STATE, c("USA", "USA","USA", NA), sep=", ")
## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, USA"      "MIAMI, FL, USA"      
## [4] "JUNEAU, AK, NA"

If you need to paste the elements within a vector together into a single string, specify a separating character with collapse= instead of sep=.

# put everything into a single string with collapse
paste(1:4, centers$UNIVERSITY, collapse="; ")
## [1] "1 UCLA; 2 LONE STAR UNIV.; 3 SUNSHINE COLLEGE; 4 GRIZZLY UNIVERSITY"

str_c() from stringr to concatenate strings

str_c() from the package stringr is quite similar to paste(). It has both sep= and collapse= arguments that function the same as in paste(). It differs in a couple of ways.

First, its default separator is the empty string "":

# default is no separating character
str_c(centers$CITY, centers$STATE)
## [1] "LOS ANGELESCA" "DALLASTX"      "MIAMIFL"       "JUNEAUAK"

Second, concatenating a NA into a string will result in NA:

# NA propagates instead of being coerced to text
str_c(centers$CITY, centers$STATE, c("USA", "USA", "USA", NA), sep=", ")
## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, USA"      "MIAMI, FL, USA"      
## [4] NA

Extracting characters from strings by position

To extract characters, or a “substring”, from a string by position, we can use either of two similar functions: substr() from base R or str_sub() from stringr.

For both functions, we specify a string variable, a start position, and an end position.

# start at first, end at third character
substr(centers$BUILDING, 1, 3)
## [1] "MAT" "HOU" "ORA" "SNO"

# specifying an end longer than the entirety of the string
#   results in truncation
str_sub(centers$UNIVERSITY, 2, 6)
## [1] "CLA"   "ONE S" "UNSHI" "RIZZL"

One difference between substr() and str_sub() is that str_sub() accepts negative numbers, which count from the end of the string.

# start from third to last, end at last
str_sub(centers$UNIVERSITY, -3, -1)
## [1] "CLA" "IV." "EGE" "ITY"

Replacing characters in strings by position

Using the assignment operator <- with substr() and str_sub() allows us to replace a substring in string variables by position.

# change the first character of the ROOM variable to "R"
substr(centers$ROOM, 1, 1) <- "R"
centers$ROOM
## [1] "R4919" "R521"  "R234"  "R2431"

Changing case

Use base R functions tolower() and toupper() to change the case of string variables.

# lower case city names
tolower(centers$CITY)
## [1] "los angeles" "dallas"      "miami"       "juneau"

# change variable names to lower case
names(centers) <- tolower(names(centers))
names(centers)
## [1] "name"       "room"       "building"   "university" "city"      
## [6] "state"

The stringr functions str_to_lower() and str_to_upper() work very similarly.
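
A quick sketch of the stringr equivalents (using the now lower-cased column names):

# stringr versions of tolower()/toupper()
str_to_lower(centers$city)
str_to_upper(centers$city)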

Pattern Matching with grep()

Often, we need to scan through a vector of strings and find values that match a text pattern (e.g. finding all phone numbers beginning with area code “310”)

The primary base R function for pattern matching is grep(). Several other related functions also match patterns, but return slightly different output (e.g. grepl() and regexpr()).

The basic syntax is grep(pattern, x), where pattern is the text pattern to match in character vector x.

By default, grep() returns the index (row) number of the matches, but we can get the strings themselves returned with value=TRUE.

# stat center names
centers$name
## [1] "IDRE STAT CONSULT" "OIT CONSULTING"    "ATS STAT HELP"    
## [4] "OAC STAT CENTER"

# row numbers of names with "STAT"
grep(pattern="STAT", centers$name)
## [1] 1 3 4

# we can use those indices to subset
centers$name[grep(pattern="STAT", centers$name)]
## [1] "IDRE STAT CONSULT" "ATS STAT HELP"     "OAC STAT CENTER"

# or just use value=TRUE to get the words directly
grep(pattern="STAT", centers$name, value=TRUE)
## [1] "IDRE STAT CONSULT" "ATS STAT HELP"     "OAC STAT CENTER"

stringr has a very similar function called str_detect(), which returns TRUE/FALSE for string matching (equivalent to base R grepl()).

One advantage of the stringr functions is that the vector to use for matching is the first argument, which will make piping simpler.

# stringr version returns TRUE/FALSE, first argument is the string vector
str_detect(centers$name, "STAT")
## [1]  TRUE FALSE  TRUE  TRUE

# can be used to subset like grep
centers$name[str_detect(centers$name, "STAT")]
## [1] "IDRE STAT CONSULT" "ATS STAT HELP"     "OAC STAT CENTER"

Substitution with sub()

We can also substitute based on pattern matches with base R function sub(). Now the syntax is sub(pattern, replacement, x), where pattern is the pattern to be matched, replacement is the replacement text, and x is the string vector.

# change STAT to STATISTICAL
centers$name <- sub("STAT", "STATISTICAL", centers$name)
centers$name 
## [1] "IDRE STATISTICAL CONSULT" "OIT CONSULTING"          
## [3] "ATS STATISTICAL HELP"     "OAC STATISTICAL CENTER"

The stringr function for pattern substitution is str_replace(). (Note that centers$name was already modified by sub() above, so the pattern "STAT" now matches inside "STATISTICAL", producing the doubled text below.)

# stringr substitution
str_replace(centers$name, "STAT", "STATISTICAL")
## [1] "IDRE STATISTICALISTICAL CONSULT" "OIT CONSULTING"                 
## [3] "ATS STATISTICALISTICAL HELP"     "OAC STATISTICALISTICAL CENTER"

Note: All of grep(), str_detect(), sub(), and str_replace() accept regular expressions, a very powerful coding system for string matching that is beyond the scope of this workshop.

In-class exercise 4

Use the dataset dex3 that we created and create a new dataset dex4.

4.1 Keep rows where student id is 1, 2 and 3.

4.2 Generate a new variable total_pts that contains a sum of math and science scores for each student.

4.3 Recode the total_pts variable into “high” if a student scored 100 or higher, “medium” if their score falls between 80 and 99, and “low” if they got less than 80.

4.4 Create a variable state to display the state in which each student’s school is located.

4.5 Finally, notice that school_prog variable has two pieces of information: type of school (public or private) and program (academic, general, vocational). Split this variable into two separate columns school and program.

4.1 Drop rows

Keep rows where student id is 1, 2 and 3.

# Exercise 4. 4.1 keeping rows with student ids 1, 2 and 3. 
dex3 <- dex3 |>
  filter(id %in% 1:3)

4.2 Create a new variable total_pts

Generate a new variable total_pts that contains a sum of math and science scores for each student.

# Exercise 4. 4.2 creating total_pts variable
dex3 |>
  mutate( total_pts = sum(math, science), .by=id) |>
  relocate(total_pts, .after = id)
##   id total_pts    ses      school_prog math science       honors awards
## 1  1        84    low  public/vocation   45      39 not enrolled      0
## 2  2        75 middle   public/general   33      42     enrolled      1
## 3  3       111    low private/academic   48      63     enrolled      5
##       city_state
## 1    NY New York
## 2 CA Los Angeles
## 3    NY New York

4.3 Recode a variable total_pts

Recode the total_pts variable into “high” if a student scored 100 or higher, “medium” if their score falls between 80 and 99, and “low” if they got less than 80.

# Exercise 4. 4.3 recoding total_pts
dex3 |>
  mutate( total_pts = sum(math, science), .by=id) |>
  mutate( total_pts_recoded = case_when(
    total_pts >= 100 ~ "high",
    total_pts >= 80 & total_pts < 100 ~ "medium",
    total_pts < 80 ~ "low" ) ) |>
  relocate(total_pts, total_pts_recoded, .after = id) 
##   id total_pts total_pts_recoded    ses      school_prog math science
## 1  1        84            medium    low  public/vocation   45      39
## 2  2        75               low middle   public/general   33      42
## 3  3       111              high    low private/academic   48      63
##         honors awards     city_state
## 1 not enrolled      0    NY New York
## 2     enrolled      1 CA Los Angeles
## 3     enrolled      5    NY New York

4.4 Split a variable city_state

Create a variable state to display the state in which each student’s school is located.

# Exercise 4. 4.4 Extracting state from city_state
dex3 |>
  mutate( total_pts = sum(math,science), .by = id) |>
  mutate(total_pts_recoded = case_when(
    total_pts >= 100 ~ "high",
    total_pts >= 80 & total_pts < 100 ~ "medium",
    total_pts < 80 ~ "low" ),
    state = str_sub(city_state, 1, 2) ) |>
  relocate(city_state, state, .after = id)
##   id     city_state state    ses      school_prog math science       honors
## 1  1    NY New York    NY    low  public/vocation   45      39 not enrolled
## 2  2 CA Los Angeles    CA middle   public/general   33      42     enrolled
## 3  3    NY New York    NY    low private/academic   48      63     enrolled
##   awards total_pts total_pts_recoded
## 1      0        84            medium
## 2      1        75               low
## 3      5       111              high

4.5 Extract characters from school_prog variable

school_prog variable has two pieces of information: type of school (public or private) and program (academic, general, vocational). Split this variable into two separate columns school and program.

# Exercise 4. 4.5 Separating school_prog into school and program
dex4 <- dex3 |>
  mutate( total_pts = sum(math,science), .by = id) |> # creating a variable total_pts
  mutate(total_pts_recoded = case_when( # recoding total_pts variable
    total_pts >= 100 ~ "high",
    total_pts >= 80 & total_pts < 100 ~ "medium",
    total_pts < 80 ~ "low" ),
    state = str_sub(city_state, 1, 2) ) |> # extracting characters from the city_state variable
  separate(school_prog, into=c("school", "program"), sep="/") |> # separating school_prog variable
  relocate(school, program, state, total_pts, total_pts_recoded, .after = id)
dex4
##   id  school  program state total_pts total_pts_recoded    ses math science
## 1  1  public vocation    NY        84            medium    low   45      39
## 2  2  public  general    CA        75               low middle   33      42
## 3  3 private academic    NY       111              high    low   48      63
##         honors awards     city_state
## 1 not enrolled      0    NY New York
## 2     enrolled      1 CA Los Angeles
## 3     enrolled      5    NY New York

Working with factors

Factors

Factors can be used to represent categorical variables in R and are particularly useful in statistical modeling.

Factors allow reordering character (or integer) vectors when needed to improve display.

Both base R and and forcats package in tidyverse handle factors.
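
For a quick illustration of why level order matters, consider a toy vector (the sizes values below are made up for illustration and are not part of the workshop data): a character vector tabulates in alphabetical order, while a factor follows its level order.

# toy character vector (illustration only)
sizes <- c("small", "large", "medium", "small")

# character vector: categories appear alphabetically (large, medium, small)
table(sizes)

# factor with a specified level order: categories appear in that order
table(factor(sizes, levels = c("small", "medium", "large")))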

Example of factors

We use the factor() function to convert a character (or integer) variable to a factor.

By default, factor() orders the levels alphabetically.

# character vector
dex4$program
## [1] "vocation" "general"  "academic"

# convert to factor
dex4$program_fct <- factor(dex4$program)

# categories in alphabetical order
dex4$program_fct
## [1] vocation general  academic
## Levels: academic general vocation

We can specify the order of the levels with the levels= argument in factor().

# specifying the order of levels
dex4$program_fct <- factor(dex4$program_fct, levels = c("vocation","academic", "general"))
dex4$program_fct
## [1] vocation general  academic
## Levels: vocation academic general

We can see the underlying integer values by converting the factor to a numeric vector. Essentially, factors are integer vectors with a label attached to each integer value.

# underlying integer values for factors
as.numeric(dex4$program_fct)
## [1] 1 3 2

A vector of integers can be converted to a factor using the labels= argument in factor(). The text labels then become the factor levels.

# honors: 0="not enrolled", 1="enrolled"
dex4$honors_int <- if_else(dex4$honors == "not enrolled", 0, 1)
dex4$honors_int
## [1] 0 1 1

# convert integer vector into a factor and assign levels
# labels become the levels
factor(dex4$honors_int, labels=c("not enrolled","enrolled"))
## [1] not enrolled enrolled     enrolled    
## Levels: not enrolled enrolled

In regression models, R automatically creates dummy variables for each level of a factor except the first, which thus serves as the reference group.
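
To see this dummy coding directly, we can inspect the design matrix with model.matrix(); a minimal sketch using the program_fct variable created above (output not shown):

# the first level of program_fct is absorbed into the intercept (reference group);
# each remaining level gets its own 0/1 indicator column
model.matrix(~ program_fct, data = dex4)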

Package forcats

The tidyverse package forcats provides many functions for working with categorical variables.

The package is automatically loaded with library(tidyverse).

We will discuss a handful of forcats functions.

Change factor levels with fct_recode()

Sometimes you need to change factor levels, e.g., collapse a couple of categories into one. Use the fct_recode() function to do this by hand (a related helper for collapsing many levels at once, fct_collapse(), is sketched after the example below).

New and old levels are specified in the order new level name = "old level name".

# recode
fct_recode(dex4$program_fct, general = "general", other = "academic", other = "vocation")
## [1] other   general other  
## Levels: other general
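
If many levels need to be combined into one group, forcats also provides fct_collapse(), which takes a character vector of old levels for each new level; a minimal sketch (output not shown):

# collapse "academic" and "vocation" into a single "other" level
fct_collapse(dex4$program_fct, other = c("academic", "vocation"))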

Move factor levels with fct_relevel()

The fct_relevel() function allows you to move any number of factor levels to any position. This is useful when you want to change the reference group in a regression model (see the sketch after the examples below).

# factor variable
dex4$program_fct
## [1] vocation general  academic
## Levels: vocation academic general

# put academic at the first position
fct_relevel(dex4$program_fct, "academic")
## [1] vocation general  academic
## Levels: academic vocation general

# put "academic" at the second position
fct_relevel(dex4$program_fct, "academic", after=1)
## [1] vocation general  academic
## Levels: vocation academic general


# reverse
fct_relevel(dex4$program_fct, rev)
## [1] vocation general  academic
## Levels: general academic vocation
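
Putting this together with modeling: a minimal sketch of releveling before a (hypothetical) regression, where dex4_m is just an illustrative object name:

# make "academic" the reference group before modeling
dex4_m <- dex4 |>
  mutate(program_fct = fct_relevel(program_fct, "academic"))

# a hypothetical model of math scores; "academic" is now the reference level
# lm(math ~ program_fct, data = dex4_m)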

Adding and dropping factor levels

The fct_expand() function adds new levels to a factor variable.

# create a factor vector
dex4$state_fct <- factor( dex4$state )
dex4$state_fct
## [1] NY CA NY
## Levels: CA NY

# add level to a factor
fct_expand(dex4$state_fct, c("TX", "IL"))
## [1] NY CA NY
## Levels: CA NY TX IL


# specify where new factors should be added
dex4$state_fct <- fct_expand(dex4$state_fct, c("TX","IL"), after=1)
dex4$state_fct 
## [1] NY CA NY
## Levels: CA TX IL NY

When we want to drop unused factor levels, we can use the fct_drop() function. Unused levels commonly arise after filtering a data frame (see the sketch after the examples below).

# drop unused levels
fct_drop(dex4$state_fct)
## [1] NY CA NY
## Levels: CA NY


# drop only specified levels
fct_drop(dex4$state_fct, only="TX")
## [1] NY CA NY
## Levels: CA IL NY
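
A common situation where unused levels appear is after filtering rows; a minimal sketch using dex4 (TX and IL were added with fct_expand() above):

# keep only the NY rows, then drop the levels that are no longer used
dex4 |>
  filter(state_fct == "NY") |>
  pull(state_fct) |>
  fct_drop()   # result: only level NY remains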


Thank you!