R Data Management

Introduction

Purpose

This seminar introduces tools in R useful for data management.

The seminar is focused on the tools found in the tidyverse collection of packages, but also includes various tools pulled from other packages that complement tidyverse tools.

Data management packages used in this seminar

tidyverse: a collection of packages with tools for most aspects of data analysis, particularly strong in data import, management, and visualization. We will focus on the following packages within tidyverse:

Hmisc: another large collection of tools for most aspects of data analysis, but we use it here for describe(), a dataset summary function

install.packages("tidyverse", dependencies=TRUE)
install.packages("Hmisc", dependencies=TRUE)
library(tidyverse)
library(Hmisc)

Review of data frames and functions

Data frames

Datasets for statistical analysis are typically stored in data frames in R.

Data frames are rectangular, where the columns are variables and the rows are observations of those variables. Columns can be of different types (double, character, etc.) but must be of equal length.

When we import data from a file, such as with read.csv(), the resulting structure is typically a data frame.

Most of the tools discussed in this seminar work on data frames.

Example data frame

First dataset for this seminar

We begin our data mangement work on a dataset of 120 observations of 25 variables describing simulated cancer patient data. Each patient is also linked to a doctor in the dataset. We will merge this doctor dataset into this dataset later in the seminar.

We will be using the function read_csv(), from the tidyverse package readr to import our data for this seminar. Although very similar to the base R function read.csv(), read_csv() has slightly different defaults and options, but the two can be used interchangeably for the most part. One of the primary differences is that read_csv() will import the data as a tibble, an updated class of data frames, which we will introduce shortly.

d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt1_dm.csv")

Viewing data as a spreadsheet with View()

Use View() on a dataset to open a spreadsheet-style view of a dataset. In RStudio, clicking on a dataset in the Environment pane will View() it.

View(d)

Tibbles

Throughout this seminar, we will be using tidyverse functions for data management. These functions typically accept data frames as inputs, and output tibbles, a special class of data frames.

Tibbles print to screen a bit differently from regular data frames – usually up to 10 rows and only as many columns as can fit on the screen. We also get each column’s “type”:

d
## # A tibble: 120 x 24
##    hospital hospid docid dis_date sex     age   test1 test2  pain tumorsize
##    <chr>     <dbl> <chr> <chr>    <chr> <dbl>   <dbl> <dbl> <dbl>     <dbl>
##  1 UCLA          1 1-1   6-Sep-09 male   65.0   3.70  8.09      4      68.0
##  2 UCLA          1 1-1   7-Jan-11 fema~  53.9   2.63  0.803     2      64.7
##  3 UCLA          1 1-1   4-Sep-10 male   41.4 -99     2.13      3      86.4
##  4 UCLA          1 1-1   25-Jun-~ male   46.8   3.89  1.35      3      53.4
##  5 UCLA          1 1-1   1-Jul-09 male   51.9   1.42  2.19      4      51.7
##  6 UCLA          1 1-1   6-Mar-09 fema~  53.8   2.29  8.61      3      78.9
##  7 UCLA          1 1-1   15-Apr-~ male   54.4   8.03  7.23      4      62.9
##  8 UCLA          1 1-11  12-Jul-~ fema~  47.1   0.810 2.60      5      81.2
##  9 UCLA          1 1-11  25-Jul-~ fema~  59.3 -99     5.18      4      73.2
## 10 UCLA          1 1-11  12-Jul-~ fema~  47.1   0.810 2.60      5      81.2
## # ... with 110 more rows, and 14 more variables: co2 <dbl>, wound <dbl>,
## #   mobility <dbl>, ntumors <dbl>, remission <dbl>, lungcapacity <dbl>,
## #   married <dbl>, familyhx <chr>, smokinghx <chr>, cancerstage <chr>,
## #   lengthofstay <dbl>, wbc <chr>, rbc <dbl>, bmi <dbl>

Tibbles are still data frames, and will work in nearly all functions that require data frames as inputs. For this seminar, anything we say about data frames will also apply to tibbles.

To convert a tibble to a regular data frame, use as.data.frame().

# d is of class tibble (tbl_df), class table (tbl) and class data.frame
class(d)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

# now just a data.frame
class(as.data.frame(d))
## [1] "data.frame"

Read more about tibbles.

Subsetting data frames

We can select a particular row and column of data frame x, with the syntax x[rows, columns], where rows and columns are one of the following: a number, a vector of numbers, the name of a variable that stores numbers, or omitted.

Omitting the row value x[, columns] selects all rows, and omitting the column value, x[rows, ], selects all columns. (Using this syntax to extract a row or column will result in a data.frame, not a vector)

The syntax x$colname extracts the column with name colname as a vector. We recommend this syntax for extracting columns.

The syntax x[[“colname”]] and x[[colnum]] can be used to extract column vectors by name or by number, respectively. (Using this syntax to extract a column will result in a vector)

# row 3 column 2
d[3,2]
## # A tibble: 1 x 1
##   hospid
##    <dbl>
## 1      1

# rows 2 and 4 of columns age
d[c(2,4), "age"]
## # A tibble: 2 x 1
##     age
##   <dbl>
## 1  53.9
## 2  46.8

# all columns for row 10
d[10,]
## # A tibble: 1 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCLA          1 1-11  12-Jul-~ fema~  47.1 0.810  2.60     5      81.2  1.55
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>

# all rows of column "pain"
d[,"pain"]
## # A tibble: 120 x 1
##     pain
##    <dbl>
##  1     4
##  2     2
##  3     3
##  4     3
##  5     4
##  6     3
##  7     4
##  8     5
##  9     4
## 10     5
## # ... with 110 more rows

# subsetting with $ creates a numeric vector, which itself can be subset
d$age[2:3]
## [1] 53.91714 41.36804

Logical subsetting

We can also use logical expressions to subset. Logical comparisons such as x > 0 return TRUE/FALSE vectors.

We can place such an expression inside of [] and values with TRUE will be selected.

# logical comparison results in a logical vector
d$age > 60
##   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# when placed in [] before the comma, rows with TRUE are selected
#  this returns columns age and pain for observations where age > 60
d[d$age>60, c("age", "pain")]
## # A tibble: 5 x 2
##     age  pain
##   <dbl> <dbl>
## 1  65.0     4
## 2  65.8     4
## 3 358.      6
## 4  63.9     8
## 5  64.2     7

Functions

Functions perform most of the work on data in R.

Functions in R are much the same as they are in math – they perform some operation on an input and return some output. For example, the mathematical function \(f(x) = x^2\), takes an input \(x\), and returns its square. Similarly, the mean() function in R takes a vector of numbers and returns its mean.

When we load a new package into R with library(), the set of functions associated with that library are loaded into the R session.

Help files for functions

In the help file for a function (?function_name), we will find a list of Arguments to the function, in a specific order. Arguments are the inputs to the function. Values for arguments to functions can be specified either by name or position.

In the Usage section, a value specified after an argument is its default value. Arguments without values have no defaults and usually need to be supplied by the user.

The Value section specifies what is returned. Usually there are Examples at the bottom.

Specifying arguments by name and position

Input arguments to functions can be specified by either name or position.

# specifying arguments by name
seq(from=1, to=5, by=1)
## [1] 1 2 3 4 5

# specifying arguments by position
seq(10, 0, -2)
## [1] 10  8  6  4  2  0

Data cleaning and missing values

Quick summaries of variables

We highly recommend familiarizing yourself with a quick summary of the distribution of each of your variables to:

The base R function summary() provides summary statistics for numeric variables and frequencies for factor variables (we have no factors yet). Character variables are left unsummarized.

summary(d)
##    hospital             hospid         docid             dis_date        
##  Length:120         Min.   :1.000   Length:120         Length:120        
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :1.000   Mode  :character   Mode  :character  
##                     Mean   :1.483                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##      sex                 age             test1             test2        
##  Length:120         Min.   : 34.19   Min.   :-99.000   Min.   :-99.000  
##  Class :character   1st Qu.: 47.75   1st Qu.:  1.560   1st Qu.:  2.249  
##  Mode  :character   Median : 51.83   Median :  3.107   Median :  4.162  
##                     Mean   : 53.59   Mean   : -1.989   Mean   : -1.226  
##                     3rd Qu.: 55.01   3rd Qu.:  5.707   3rd Qu.:  6.166  
##                     Max.   :357.89   Max.   : 12.416   Max.   : 17.228  
##       pain         tumorsize           co2               wound      
##  Min.   :1.000   Min.   : 49.10   Min.   :-98.0000   Min.   :2.000  
##  1st Qu.:4.000   1st Qu.: 62.18   1st Qu.:  1.4954   1st Qu.:5.000  
##  Median :5.000   Median : 68.23   Median :  1.5859   Median :6.000  
##  Mean   :5.325   Mean   : 70.08   Mean   : -0.8901   Mean   :5.592  
##  3rd Qu.:6.000   3rd Qu.: 77.48   3rd Qu.:  1.6860   3rd Qu.:7.000  
##  Max.   :9.000   Max.   :109.01   Max.   :  1.9424   Max.   :9.000  
##     mobility        ntumors       remission     lungcapacity     
##  Min.   :2.000   Min.   :0.00   Min.   :0.00   Min.   :-99.0000  
##  1st Qu.:5.000   1st Qu.:1.00   1st Qu.:0.00   1st Qu.:  0.5141  
##  Median :6.000   Median :2.00   Median :0.00   Median :  0.7457  
##  Mean   :6.033   Mean   :3.15   Mean   :0.35   Mean   :-18.3280  
##  3rd Qu.:7.000   3rd Qu.:5.00   3rd Qu.:1.00   3rd Qu.:  0.8721  
##  Max.   :9.000   Max.   :9.00   Max.   :1.00   Max.   :  0.9982  
##     married         familyhx          smokinghx         cancerstage       
##  Min.   :0.0000   Length:120         Length:120         Length:120        
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :1.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.6333                                                           
##  3rd Qu.:1.0000                                                           
##  Max.   :1.0000                                                           
##   lengthofstay       wbc                 rbc             bmi       
##  Min.   :3.000   Length:120         Min.   :4.360   Min.   :18.45  
##  1st Qu.:4.000   Class :character   1st Qu.:4.825   1st Qu.:24.51  
##  Median :5.000   Mode  :character   Median :4.978   Median :27.82  
##  Mean   :5.308                      Mean   :4.970   Mean   :29.38  
##  3rd Qu.:6.000                      3rd Qu.:5.150   3rd Qu.:34.31  
##  Max.   :8.000                      Max.   :5.535   Max.   :58.00

Use describe() from package Hmisc for more detailed summaries

The describe() function from package Hmisc will give differents summary statistics for all variables in a data frame. The information provided depends on the variable’s type and number of distinct values.

Important Features:

The results of describe() can be used to determine quickly whether variables have plausible values and distributions:

library(Hmisc)

# detailed summaries of variables
describe(d)
## d 
## 
##  24  Variables      120  Observations
## --------------------------------------------------------------------------------
## hospital 
##        n  missing distinct 
##      120        0        2 
##                       
## Value       UCLA  UCSF
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## hospid 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        2    0.749    1.483   0.5036 
##                       
## Value          1     2
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## docid 
##        n  missing distinct 
##      120        0       22 
## 
## lowest : 1-1   1-100 1-11  1-21  1-22 , highest: 2-177 2-178 2-188 2-201 2-216
## --------------------------------------------------------------------------------
## dis_date 
##        n  missing distinct 
##      120        0      104 
## 
## lowest : 1-Jul-09  10-Jun-09 10-Jun-10 11-Apr-10 11-Dec-09
## highest: 9-Apr-10  9-Feb-09  9-Feb-10  9-Jun-10  9-May-10 
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      120        0        3 
##                                
## Value        12.2 female   male
## Frequency       1     74     45
## Proportion  0.008  0.617  0.375
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    53.59    11.92    40.86    41.52 
##      .25      .50      .75      .90      .95 
##    47.75    51.83    55.01    58.61    59.65 
## 
## lowest :  34.19229  35.31930  37.25225  39.61641  40.03724
## highest:  63.93238  64.16432  64.96824  65.80417 357.89001
##                                                           
## Value         35    40    45    50    55    60    65   360
## Frequency      3    10    16    38    35    13     4     1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
## 
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## test1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      111        1   -1.989    14.11 -99.0000   0.5809 
##      .25      .50      .75      .90      .95 
##   1.5597   3.1065   5.7067   7.9931   9.5583 
## 
## lowest : -99.0000000   0.1048958   0.1927608   0.4293420   0.5155185
## highest:   9.7981329  10.2903990  10.4685400  11.0714440  12.4163920
##                                                                             
## Value        -99     0     1     2     3     4     5     6     7     8     9
## Frequency      7     3    20    17    21    12     8    11     6     7     1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##                             
## Value         10    11    12
## Frequency      5     1     1
## Proportion 0.042 0.008 0.008
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## test2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      111        1   -1.226    14.59 -99.0000   0.7881 
##      .25      .50      .75      .90      .95 
##   2.2494   4.1620   6.1657   8.9973  10.9861 
## 
## lowest : -99.0000000   0.5807591   0.6179262   0.6571499   0.7485080
## highest:  11.4315750  12.4493730  14.2252270  14.5365420  17.2275810
##                                                                             
## Value        -99     1     2     3     4     5     6     7     8     9    10
## Frequency      7    15    12    19    16    15     9     6     6     6     1
## Proportion 0.058 0.125 0.100 0.158 0.133 0.125 0.075 0.050 0.050 0.050 0.008
##                                         
## Value         11    12    14    15    17
## Frequency      4     1     1     1     1
## Proportion 0.033 0.008 0.008 0.008 0.008
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## pain 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        9    0.961    5.325    1.742 
## 
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##                                                                 
## Value          1     2     3     4     5     6     7     8     9
## Frequency      1     2    11    21    33    25    18     5     4
## Proportion 0.008 0.017 0.092 0.175 0.275 0.208 0.150 0.042 0.033
## --------------------------------------------------------------------------------
## tumorsize 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    70.08    12.92    53.48    55.97 
##      .25      .50      .75      .90      .95 
##    62.18    68.23    77.48    84.06    92.54 
## 
## lowest :  49.09861  50.28009  50.50217  51.65727  53.19937
## highest:  98.05510  98.32850  98.70570 102.69671 109.00956
## --------------------------------------------------------------------------------
## co2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      115        1  -0.8901     5.03    1.390    1.440 
##      .25      .50      .75      .90      .95 
##    1.495    1.586    1.686    1.756    1.793 
## 
## lowest : -98.000000   1.327440   1.362927   1.372113   1.391310
## highest:   1.810847   1.820266   1.905606   1.920025   1.942401
##                             
## Value        -98     1     2
## Frequency      3    29    88
## Proportion 0.025 0.242 0.733
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## wound 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        8    0.958    5.592    1.767 
## 
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##                                                           
## Value          2     3     4     5     6     7     8     9
## Frequency      5    10    12    23    36    22    10     2
## Proportion 0.042 0.083 0.100 0.192 0.300 0.183 0.083 0.017
## --------------------------------------------------------------------------------
## mobility 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        8    0.969    6.033    2.264 
## 
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##                                                           
## Value          2     3     4     5     6     7     8     9
## Frequency      6     4    19    19    28    15     5    24
## Proportion 0.050 0.033 0.158 0.158 0.233 0.125 0.042 0.200
## --------------------------------------------------------------------------------
## ntumors 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       10    0.981     3.15    3.101        0        0 
##      .25      .50      .75      .90      .95 
##        1        2        5        8        9 
## 
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##                                                                       
## Value          0     1     2     3     4     5     6     7     8     9
## Frequency     23    21    19     9    12    11     7     5     5     8
## Proportion 0.192 0.175 0.158 0.075 0.100 0.092 0.058 0.042 0.042 0.067
## --------------------------------------------------------------------------------
## remission 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      120        0        2    0.683       42     0.35   0.4588 
## 
## --------------------------------------------------------------------------------
## lungcapacity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       96    0.995   -18.33    31.26 -99.0000 -99.0000 
##      .25      .50      .75      .90      .95 
##   0.5142   0.7457   0.8721   0.9573   0.9790 
## 
## lowest : -99.0000000 -98.0000000   0.2949074   0.3264440   0.3450253
## highest:   0.9856416   0.9864109   0.9924814   0.9940955   0.9982018
##                                                     
## Value      -99.0 -98.0   0.2   0.4   0.6   0.8   1.0
## Frequency     20     3     1     6    21    46    23
## Proportion 0.167 0.025 0.008 0.050 0.175 0.383 0.192
## 
## For the frequency table, variable is rounded to the nearest 0.2
## --------------------------------------------------------------------------------
## married 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      120        0        2    0.697       76   0.6333   0.4683 
## 
## --------------------------------------------------------------------------------
## familyhx 
##        n  missing distinct 
##      120        0        3 
##                             
## Value        -99    no   yes
## Frequency      6    97    17
## Proportion 0.050 0.808 0.142
## --------------------------------------------------------------------------------
## smokinghx 
##        n  missing distinct 
##      120        0        4 
##                                           
## Value          -99 current  former   never
## Frequency        6      26      22      66
## Proportion   0.050   0.217   0.183   0.550
## --------------------------------------------------------------------------------
## cancerstage 
##        n  missing distinct 
##      120        0        4 
##                                   
## Value          I    II   III    IV
## Frequency     40    54    17     9
## Proportion 0.333 0.450 0.142 0.075
## --------------------------------------------------------------------------------
## lengthofstay 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        6    0.933    5.308    1.271 
## 
## lowest : 3 4 5 6 7, highest: 4 5 6 7 8
##                                               
## Value          3     4     5     6     7     8
## Frequency      5    29    29    40    15     2
## Proportion 0.042 0.242 0.242 0.333 0.125 0.017
## --------------------------------------------------------------------------------
## wbc 
##        n  missing distinct 
##      120        0      116 
## 
## lowest : 3671.880371  4176.054199  4201.741211  4238.355957  4331.902344 
## highest: 7999.091309  8340.71582   8415.605469  8567.246094  not assessed
## --------------------------------------------------------------------------------
## rbc 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1     4.97   0.2885    4.526    4.606 
##      .25      .50      .75      .90      .95 
##    4.825    4.978    5.150    5.285    5.353 
## 
## lowest : 4.359662 4.436482 4.456108 4.465468 4.470100
## highest: 5.441604 5.442614 5.459067 5.502604 5.535052
## --------------------------------------------------------------------------------
## bmi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    29.38    7.278    20.71    22.07 
##      .25      .50      .75      .90      .95 
##    24.51    27.82    34.31    37.75    40.71 
## 
## lowest : 18.44992 18.68505 20.07485 20.17994 20.49195
## highest: 42.84858 44.04223 46.50746 52.30723 58.00000
## --------------------------------------------------------------------------------

Suspicious values

Here are some variables with values that we are not sure are correct:

describe(d[,c("age", "sex", "test1")])
## d[, c("age", "sex", "test1")] 
## 
##  3  Variables      120  Observations
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    53.59    11.92    40.86    41.52 
##      .25      .50      .75      .90      .95 
##    47.75    51.83    55.01    58.61    59.65 
## 
## lowest :  34.19229  35.31930  37.25225  39.61641  40.03724
## highest:  63.93238  64.16432  64.96824  65.80417 357.89001
##                                                           
## Value         35    40    45    50    55    60    65   360
## Frequency      3    10    16    38    35    13     4     1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
## 
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      120        0        3 
##                                
## Value        12.2 female   male
## Frequency       1     74     45
## Proportion  0.008  0.617  0.375
## --------------------------------------------------------------------------------
## test1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      111        1   -1.989    14.11 -99.0000   0.5809 
##      .25      .50      .75      .90      .95 
##   1.5597   3.1065   5.7067   7.9931   9.5583 
## 
## lowest : -99.0000000   0.1048958   0.1927608   0.4293420   0.5155185
## highest:   9.7981329  10.2903990  10.4685400  11.0714440  12.4163920
##                                                                             
## Value        -99     0     1     2     3     4     5     6     7     8     9
## Frequency      7     3    20    17    21    12     8    11     6     7     1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##                             
## Value         10    11    12
## Frequency      5     1     1
## Proportion 0.042 0.008 0.008
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------

plot of describe()

The results of describe() can be plotted:

These plots are also useful to inspect distributions of variables and to look for suspicious values.

In the dot plots, look for suspicious labels for categorical variables (see variables sex, familyhx and smokinghx below).

In the histogram plots, each histogram has its own scale, so bunched-up histograms often denote the presence of extreme values (see variables age, test1, test2, co2 and lungcapacity).

plot(describe(d))
## $Categorical

## 
## $Continuous

Converting suspicious values to missing

A look at the output of our dataset summaries, as well as the plot of describe() identified several suspicious values:

Becasue we don’t know any replacement values, we will change suspicous values to NA, the missing value indicator in R.

Note: Many times, datasets differentiate between different types of missing. For example, -99=“absent”, while -98=“refused to answer”. For analysis purposes, using NA for both will usually suffice. For users that need to distinguish between different kinds of missing, we suggest looking into the as.item() function from package memisc.

We can use logical subsetting to assign NA to the data entry errors:

# change all impossible age values to NA 
#  assume all adults in this dataset
d$age[d$age<18|d$age>120] <- NA

# remember to use quotes for character variables
d$sex[d$sex == "12.2"] <- NA

Now we want to translate all -99, -98, and “not assessed” values to NA. The following syntax can be used to translate all values of a to b in dataset df:

df[df==a] <- b

d[d==-99] <- NA
d[d==-98] <- NA
d[d=="not assessed"] <- NA

Complete cases

complete.cases(x) returns a logical (TRUE/FALSE) vector of whether each row of x has no missing. Since TRUE=1 and FALSE=0, a sum of this vector is a count of complete cases.

# which cases are complete
complete.cases(d)
##   [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
##  [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [25]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
##  [37]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
##  [49] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
##  [73]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
##  [85]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## [109] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# number of complete cases
sum(complete.cases(d))
## [1] 82

# create a dataset of complete cases
d_comp <- d[complete.cases(d),]

describe() plot after cleaning data

The plot of describe() looks better now – no suspcious labels in the dot plots and nicely spread out histograms.

Color indicates the number of missing values for that variable.

plot(describe(d))
## $Categorical

## 
## $Continuous

plot(describe(d), which="continuous")

Data management with dplyr

The dplyr package

The dplyr package provides tools for some of the most common data management tasks. Its primary functions are “verbs” to help you think about what you need to do to your dataset:

Two great conveniences of using dplyr functions:

The dplyr package is automatically loaded with library(tidyverse).

Selecting rows with filter()

We have already seen how to subset using logical vectors and logical subsetting:

# all variables for patients with pain>=9
pain9 <- d[d$pain>=9,]
pain9
## # A tibble: 4 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCSF          2 2-177 9-Jun-10 male   58.7  7.62  9.31     9      63.5  1.48
## 2 UCSF          2 2-188 15-Jul-~ fema~  41.5  4.00  7.57     9      98.3  1.82
## 3 UCSF          2 2-216 11-Apr-~ fema~  52.0  6.70  9.37     9      78.5  1.70
## 4 UCSF          2 2-177 9-Jun-10 male   58.7  7.62  9.31     9      63.5  1.48
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>

While that works fine enough, the code can get unwieldy if there are many conditions that need to be evaluated, especially witha long data set name.

The dplyr function filter() provides a cleaner syntax for subsetting datasets. Conditions separated by , are joined by & (logical AND).

# subset to females with pain >= 9
dfhp <- filter(d, sex=="female", pain>=9)
dfhp
## # A tibble: 2 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCSF          2 2-188 15-Jul-~ fema~  41.5  4.00  7.57     9      98.3  1.82
## 2 UCSF          2 2-216 11-Apr-~ fema~  52.0  6.70  9.37     9      78.5  1.70
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>

R Logical operators and functions

Here are some operators and functions to help with selection:

# select those not in UCLA, who are either younger than 40 or older than 60
filter(d, hospital!="UCLA", age<40 | age>60)
## # A tibble: 3 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCSF          2 2-113 4-Feb-10 male   63.9  4.77  1.50     8      75.1  1.56
## 2 UCSF          2 2-178 19-Apr-~ male   34.2  4.63  3.26     3      65.5  1.64
## 3 UCSF          2 2-201 21-Nov-~ fema~  64.2  2.03  6.20     7      79.7  1.72
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>

Selecting columns with select()

Use dplyr function select() to keep only the variables you need.

# select 4 variables
d_small <- select(d, hospid, docid, age, cancerstage)
head(d_small, n=3)
## # A tibble: 3 x 4
##   hospid docid   age cancerstage
##    <dbl> <chr> <dbl> <chr>      
## 1      1 1-1    65.0 II         
## 2      1 1-1    53.9 II         
## 3      1 1-1    41.4 I

Often we want to select a group of related variables with similar names. dplyr supplies helper functions to find columns whose names match a specified pattern:

# get hospital variables
dhosp <- select(d, starts_with("hosp"))
names(dhosp)
## [1] "hospital" "hospid"

# get "test1" and "test2" variables using num_range
tests <- select(d, num_range("test", 1:2))
names(tests)
## [1] "test1" "test2"

Use select() to reorder columns

You can specify a particular ordering of columns with select(). To place the remaining columns in their current ordering, use everything(). You can also rename variables with select.

# put the two test vars first while renaming them, 
#   then everything else 
tests_first <- select(d, t1=test1, t2=test2, everything())
names(tests_first)
##  [1] "t1"           "t2"           "hospital"     "hospid"       "docid"       
##  [6] "dis_date"     "sex"          "age"          "pain"         "tumorsize"   
## [11] "co2"          "wound"        "mobility"     "ntumors"      "remission"   
## [16] "lungcapacity" "married"      "familyhx"     "smokinghx"    "cancerstage" 
## [21] "lengthofstay" "wbc"          "rbc"          "bmi"

Sorting rows with arrange()

Sort the order of rows by variable values using arrange() from dplyr.

Be default, ascending order will be used. Surround a sorting variable with desc() to sort by descending order instead.

# sort, with males before females, then by age, youngest first
arrange(d, desc(sex), age) 
## # A tibble: 120 x 24
##    hospital hospid docid dis_date sex     age test1  test2  pain tumorsize   co2
##    <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl>  <dbl> <dbl>     <dbl> <dbl>
##  1 UCSF          2 2-178 19-Apr-~ male   34.2  4.63  3.26      3      65.5  1.64
##  2 UCLA          1 1-58  9-Feb-10 male   35.3  2.14 12.4       6      60.5  1.56
##  3 UCSF          2 2-121 20-Apr-~ male   40.1 NA    NA         5      73.5  1.65
##  4 UCLA          1 1-1   4-Sep-10 male   41.4 NA     2.13      3      86.4  1.45
##  5 UCSF          2 2-121 22-Aug-~ male   41.5  2.88  6.80      5      60.2  1.50
##  6 UCSF          2 2-163 19-Apr-~ male   42.5  1.63  0.749     4      54.0  1.72
##  7 UCLA          1 1-11  19-Aug-~ male   43.7 10.5   5.89      7      61.3  1.49
##  8 UCLA          1 1-1   25-Jun-~ male   46.8  3.89  1.35      3      53.4  1.57
##  9 UCLA          1 1-21  22-Apr-~ male   48.0  3.39 NA         8      79.6  1.55
## 10 UCSF          2 2-178 12-Mar-~ male   48.9  4.15  3.04      7      71.8  1.47
## # ... with 110 more rows, and 13 more variables: wound <dbl>, mobility <dbl>,
## #   ntumors <dbl>, remission <dbl>, lungcapacity <dbl>, married <dbl>,
## #   familyhx <chr>, smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>,
## #   wbc <chr>, rbc <dbl>, bmi <dbl>

*Transforming variables into new variables with mutate()*

The function mutate() allows us to transform many variables in one step without having to respecify the data frame name over and over.

Useful R functions for transforming:

# create age category variable, and highpain binary variable
d <- mutate(d,
            agecat = cut(age, breaks=c(30,40,50,60,70,120)),
            highpain = pain > mean(pain))
table(d$agecat, d$highpain)
##           
##            FALSE TRUE
##   (30,40]      2    2
##   (40,50]     27   15
##   (50,60]     37   32
##   (60,70]      2    2
##   (70,120]     0    0

Grouped data operations

For many datasets we want to calculate statistics by group, rather than over the whole dataset.

The dplyr functions group_by() and summarise() together greatly simplify group operations.

First, we create a grouped data frame (tibble) with group_by(). We will group our patient dataset by doctor id.

# group_by creates a grouped_df (grouped data frame) class structure
by_doc <- group_by(d, docid)
class(by_doc)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Next we use summarise() and statistical or numerical functions to summarize variables by group.

Example of useful functions for summarise():

# Create summaries of patients by doctor
pat_summ <- summarise(by_doc,
                      n_pat = n(), # number of patients
                      longest_los = max(lengthofstay), # longest length of stay
                      avg_age = mean(age), # average age
                      n_stage_four = sum(cancerstage=="IV") # number of stage IV patients 
                      )
pat_summ
## # A tibble: 22 x 5
##    docid n_pat longest_los avg_age n_stage_four
##    <chr> <int>       <dbl>   <dbl>        <int>
##  1 1-1       7           6    52.5            0
##  2 1-100     9           7    NA              0
##  3 1-11      8           7    51.7            1
##  4 1-21      1           4    48.0            0
##  5 1-22      4           6    48.8            0
##  6 1-33      6           7    49.0            0
##  7 1-48      7           7    51.3            1
##  8 1-57      3           5    45.6            0
##  9 1-58      3           7    51.1            0
## 10 1-72      8           8    50.3            0
## # ... with 12 more rows

A quick aside about package conflicts and masking

The dplyr package provides summarize() for those who prefer an American spelling to summarise(). However, the package Hmisc also contains a summarize() function, and if loaded into the same session as dplyr(), a conflict arises between the two summarize() functions. The function from the package that was loaded first will be masked when the second package is loaded into the session (with an accompanying red warning message). The function loaded second will be the default function used when summarize() is specified.

So, if Hmisc is loaded first and dplyr loaded second, specifying summarize(x) will call summarize() from dplyr.

One can still access the function loaded from the first package in a couple of ways.

Piping

The pipe operator %>%

A data management task may involve many steps to reach the final desired dataset. Often, during intermediate steps, datasets are generated that we don’t care about or plan on keeping. For these multi-step tasks, the pipe operator provides a useful, time-saving and code-saving shorthand.

Naming datasets takes time to think about and clutters code. Piping makes your code more readable by focusing on the functions used rather than the name of datasets.

The pipe operator is %>%, , which can be typed with CTRL-SHIFT-M (CMD-SHIFT-M on a Mac). The pipe operator can be expressed as “then” when reading code. The operator is loaded with pacakage magrittr (automatically loaded with library(tidyverse)).

The pipe operator “pipes” the dataset on the left of the %>% operator to the function on the right of the operator.

The code x %>% f(y) translates to f(x,y), that is, x is treated by default as the first argument of f(). If the function returns a data frame, we can then pipe this data frame into another function. Thus x %>% f(y) %>% g(z) translates to g(f(x,y), z).

Examples of using the pipe operator

As a first example, perhaps we want to create a dataset of just females under 40, with only the age and pain variables selected. We could do this in 2 steps, like so:

# a dataset of age and pain for females younger than 40
f40 <- filter(d, sex=="female" & age<40)
f40_small <- select(f40, age, pain)
f40_small
## # A tibble: 2 x 2
##     age  pain
##   <dbl> <dbl>
## 1  37.3     8
## 2  39.6     5

While that works fine, the intermediate dataset f40 is not of interest and is cluttering up memory and the workspace unnecessarily.

We could use %>% instead:

# start with d, then filter rows, then select variables
f40_small <- d %>%   
  filter(sex=="female" & age<40) %>% 
  select(age, pain)
f40_small
## # A tibble: 2 x 2
##     age  pain
##   <dbl> <dbl>
## 1  37.3     8
## 2  39.6     5

No intermediate dataset was created, and dataset names are rarely needed in intermediate steps. The dataset transformations, filtering and selecting, are easy to see.

Another pipe example

Piping really helps make a long chain of commands more immediately understandable.

# create a plot of average age vs tumor size 
#   by doctors, for doctors with more than 5 patients
g1 <- d %>%     
  group_by(docid) %>%  
  summarise(n_pat=n(),   
            avg_age=mean(age), 
            avg_tumor=mean(tumorsize)) %>%   
  filter(n_pat > 5) %>%  
  ggplot(aes(x=avg_age, y=avg_tumor)) + geom_point() +
  geom_smooth() 
g1

Using . in piped commands

If the dataset is not to be specified as the first argument of the function on the right of the %>% operator, use . to specify where the dataset should appear as the argument.

x %>% f(y,.) translates to f(y,x).

For example, for modeling functions such as lm(), the model formula is typically the first argument rather than the dataset, but we can use . to stand in for the dataset in the piping chain:

d %>% 
  filter(age < 40) %>% 
  lm(tumorsize ~ age, data=.) # the . is the filtered dataset
## 
## Call:
## lm(formula = tumorsize ~ age, data = .)
## 
## Coefficients:
## (Intercept)          age  
##     -78.219        4.116

Appending and merging

Appending data frames

Often datasets are split into multiple files, perhaps because data are collected in several waves or by different researchers. When files share the same variables (hopefully!), we can append the datasets, or bind their rows together.

Two choices for appending, rbind() and bind_rows()

We can use either the base R function rbind() or the dplyr function bind_rows() to append. They differ in how they handle the situation where the datasets being appended have non-matching columns.

They also differ in how they handle the situation where the same column in the two datasets has two different types (e.g. one character the other numeric).

Example of appending data sets with non-matching columns

Let’s load a second dataset that contains the same variables as the original version of our first dataset but with 2 variables omitted.

# new data set that contains the same variables as d, except is missing 2 of them
d2 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt2_dm.csv")
# rows and columns of d2 and d
dim(d2)
## [1] 111  24
dim(d)
## [1] 120  26

Because our first data set d has a few extra variables we created earlier with mutate(), the two datasets do not have completely matching columns.

The best solution would be to calculate the new variables for the second dataset and then use either function to append…

However, to demonstrate the behavior of bind_rows(), we append the datasets as they are.

Use the .id= argument in bind_rows to create a variable that identifies source dataset.

# a new variable called source is added to the beginning of the dataset
d3 <- bind_rows(d, d2, .id="source")

# these are the rows where the datasets meet
#  hospital is found in both datasets, agecat and highpain are not
select(d3, source, hospital, agecat, highpain)[118:123,]
## # A tibble: 6 x 4
##   source hospital     agecat  highpain
##   <chr>  <chr>        <fct>   <lgl>   
## 1 1      UCSF         (50,60] TRUE    
## 2 1      UCSF         (50,60] TRUE    
## 3 1      UCSF         (50,60] TRUE    
## 4 2      Cedars-Sinai <NA>    NA      
## 5 2      Cedars-Sinai <NA>    NA      
## 6 2      Cedars-Sinai <NA>    NA

# this will work because we restrict d to only variables common to both
drbind <- rbind(d[,1:24], d2)

Merging datasets with dplyr joins

Appending adds more rows of observations, whereas merging adds more columns of variables. Datasets to be merged should be matched on some id variable(s).

The dplyr “join” functions perform such merges and will use any same-named variables between the datasets as the id variables by default. Use the by= argument to specify specific matching id variables.

These joins all return a table with all columns from x and y, but differ in how they deal with mismatched rows:

Each of these joins is possible with base R function merge(), but merge() is many times slower than the joins.

Example of merging with inner_join()

We begin our demonstration of merging by loading a dataset of doctor variables, that describes the doctor’s years of experience, schooling, number of lawsuits, etc.

d_doc <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/doctor_dm.csv")

It is easiest to show how the joins differ if we work on smaller, more manageable datasets.

We will be merging a subset of the d_doc dataset we just imported into a subset the d3 dataset we created earlier (the two patient datasets appended together). We will select two doctors from d3, doctors “1-21” and “2-178”, and one matching doctor and one non-matching doctor from d_doc, “2-178” and “3-407”.

# select one non-matching and one matching doctor from each to demo joins
#   just a few variables from d3
d3.1 <- select(filter(d3, docid == "1-21" | docid == "2-178"),
               docid, sex, age, test1, test2)
d_doc.1 <- filter(d_doc, docid == "3-407" | docid == "2-178")

Let’s take a look at our small datasets

# first a look at the two datasets to be merged
#  2 rows for docid 2-178 in d3.1
d3.1
## # A tibble: 3 x 5
##   docid sex     age test1 test2
##   <chr> <chr> <dbl> <dbl> <dbl>
## 1 1-21  male   48.0  3.39 NA   
## 2 2-178 male   34.2  4.63  3.26
## 3 2-178 male   48.9  4.15  3.04
d_doc.1
## # A tibble: 2 x 5
##   docid experience school  lawsuits medicaid
##   <chr>      <dbl> <chr>      <dbl>    <dbl>
## 1 2-178         15 average        4    0.817
## 2 3-407         23 average        3    0.343

The merging id variable “docid” appears in both datasets, so we do not need to specify a merging variables.

Notice how inner_join(), left_join() and full_join() differ in their behavior:

# only matching rows returned
#   2-178 from d_doc.1 matched twice to 2-178 in d3.1
inner_join(d3.1, d_doc.1)
## # A tibble: 2 x 9
##   docid sex     age test1 test2 experience school  lawsuits medicaid
##   <chr> <chr> <dbl> <dbl> <dbl>      <dbl> <chr>      <dbl>    <dbl>
## 1 2-178 male   34.2  4.63  3.26         15 average        4    0.817
## 2 2-178 male   48.9  4.15  3.04         15 average        4    0.817
# all rows from d3.1 returned
left_join(d3.1, d_doc.1)
## # A tibble: 3 x 9
##   docid sex     age test1 test2 experience school  lawsuits medicaid
##   <chr> <chr> <dbl> <dbl> <dbl>      <dbl> <chr>      <dbl>    <dbl>
## 1 1-21  male   48.0  3.39 NA            NA <NA>          NA   NA    
## 2 2-178 male   34.2  4.63  3.26         15 average        4    0.817
## 3 2-178 male   48.9  4.15  3.04         15 average        4    0.817
# all rows from both returned
full_join(d3.1, d_doc.1)
## # A tibble: 4 x 9
##   docid sex     age test1 test2 experience school  lawsuits medicaid
##   <chr> <chr> <dbl> <dbl> <dbl>      <dbl> <chr>      <dbl>    <dbl>
## 1 1-21  male   48.0  3.39 NA            NA <NA>          NA   NA    
## 2 2-178 male   34.2  4.63  3.26         15 average        4    0.817
## 3 2-178 male   48.9  4.15  3.04         15 average        4    0.817
## 4 3-407 <NA>   NA   NA    NA            23 average        3    0.343

Tidy data

Tidy data philosophy

We now explore the tools in the tidyverse package tidyr (loaded with library(tidyverse)), which make datasets “tidy”.

Tidy datasets adhere to a simple set of structural rules:

Tidy data conventions provide a standard organization that prevents having to “rethink the wheel” each time data need to be organized.

Doing a little more work upfront to make data “tidy” will pay off with time saved later on.

Unit of Analysis

Each row of the dataset should represent one unit of analysis. Units may be subjects, or trials within subjects, or whole groups of subjects.

Some analyses may involve both variables describing individual units and aggregated units. For example, a student-level analysis may involve school-level variables. Longitudinal studies may involve within-subject and between-subject variables. For multilevel analyses, the unit of analysis is generally the lowest level.

# data at multiple levels - patients nested in doctors
d %>% 
  inner_join(pat_summ) %>% 
  select(sex, age, lengthofstay, docid, avg_age, longest_los) %>% 
  head(n=3)
## # A tibble: 3 x 6
##   sex      age lengthofstay docid avg_age longest_los
##   <chr>  <dbl>        <dbl> <chr>   <dbl>       <dbl>
## 1 male    65.0            6 1-1      52.5           6
## 2 female  53.9            6 1-1      52.5           6
## 3 male    41.4            5 1-1      52.5           6

Untidy data

The use of data not collected by the analyst makes encountering untidy data more common than desired.

Data can be messy in many ways, but we focus on two ways that can be tidied with tidyr tools:

Let’s take a look at some untidy datsets and see how the tidyr functions pivot_longer() and pivot_wider() will help us to tidy them.

Column headers are values, not variable names.

We first load a dataset of the number of graduates from 3 academic departments across 3 years. The planned analysis will examine whether graduation has grown over the years.

The unit of analysis here is the graduation rate in a particular department in a particular year. Each row should thus represents one year in one department.

dept <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dept1.csv")
dept
## # A tibble: 3 x 4
##   id      `2015` `2016` `2017`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 biology    207    211    259
## 2 math        96     75     99
## 3 physics    112    126    125

Year is a predictor in our analysis, so should be a column variable. Each department has 3 years of data so each requires 3 rows of data. Additionally, the entire table of values measures the same thing, graduation numbers in one department in one year, so we can put all values in one column.

Use pivot_longer() to create a variable out of column headings and restructure the dataset

To use pivot_longer(), we select a set of column variable for reshaping:

This process is also known as “reshaping long”.

Arguments to pivot_longer():

By default, pivot_longer will select all columns for stacking. However, columns that do not vary with values in the new names_to= column should not be stacked.

Example of using pivot_longer()

Here again is the data set we want to reshape:

dept
## # A tibble: 3 x 4
##   id      `2015` `2016` `2017`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 biology    207    211    259
## 2 math        96     75     99
## 3 physics    112    126    125

In the newly reshaped dataset, we want to put years and graudation rates into 2 columns. Since years are the column headings, we will name the names_to= variable “year”. The graudation rate values in the columns 2015, 2016, and 2017 will be stacked into another new variable whose name we specify in values_to= as “grad”.

Department name (id) does not vary with year, so we do not need to stack it. We can specify which columns are to be reshaped by specifying only the columns that need to be collapsed as c(`2015`, `2016`, `2017`)* or by specifying the columns not to be collapsed after the symbol -, as in -id.

*Note that the ` are needed to specify the column names here because they don’t start with a letter.

# the new column "year" uses the column headings as values,
#  the new column "graduates" will be the collapsed values
#  we do not want to collapse id
dept_by_year <- dept %>% 
  pivot_longer(names_to="year", values_to="grad", -id)

pivot_longer(names_to="year", values_to="grad", -id)

names_to="year"

values_to="grad"

-id

Multiple variables in one column

Columns should contain values that represent one variable, but we often encounter datasets where multiple variables are stored in the same column.

Let’s take a look at a dataset of worms, who have had their age, length, and weight measured, but all stored in one column.

worms <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/worms.csv")
worms
## # A tibble: 9 x 3
##    worm feature measure
##   <dbl> <chr>     <dbl>
## 1     1 age         5  
## 2     1 length      3.2
## 3     1 weight      4.1
## 4     2 age         4  
## 5     2 length      2.6
## 6     2 weight      3.5
## 7     3 age         5  
## 8     3 length      3.6
## 9     3 weight      5.5

We are interested in whether age predicts length and weight, which would make the unit of analysis the worm. So each row should be one worm.

We would prefer that the feature values be spread across three columns.

Use pivot_wider() to spread a single column into multiple columns

This problem reverses the previous problem solved by pivot_longer() where the column headings should actually be values of a variable. Here, we want “age”, “length”, and “weight” to be the column headings. In this reverse procedure, the names_from= column variable is converted to column headings, and the value_from= column is subsequently spread across the new columns.

This process is sometimes known as “reshaping wide”.

Here, we want to use the values in the feature variable as the new column headings, so we specify it in names_from=. We want the values in the measure column to fill the rows of data in the newly formed columns, so we specify it in values_from=.

by_worm <- worms %>% 
  pivot_wider(names_from=feature, values_from=measure)
by_worm
## # A tibble: 3 x 4
##    worm   age length weight
##   <dbl> <dbl>  <dbl>  <dbl>
## 1     1     5    3.2    4.1
## 2     2     4    2.6    3.5
## 3     3     5    3.6    5.5

pivot_wider(names_from=feature, values_from=measure)

Multiple variables stored in the same cell, or one variable split into multiple cells

Two other issues that make datasets untidy are columns where multiple variables are stored in the same cell, or where one variable is stored across multiple cells.

The table5 data that is loaded with the tidyr package exhibits both of these problems. The rate variable is actually two variables, “cases”, which is divided by “population”. The year variable is split across two columns, “century” and “year”.

table5
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Use separate() and unite() to split and concatenate columns, respectively

separate() and unite() are complements of each other.

For separate():

For unite():

table5 %>% 
  separate(col=rate, into=c("cases", "population"), sep="/") %>% 
  unite(col="year", century, year, sep="")
## # A tibble: 6 x 4
##   country     year  cases  population
##   <chr>       <chr> <chr>  <chr>     
## 1 Afghanistan 1999  745    19987071  
## 2 Afghanistan 2000  2666   20595360  
## 3 Brazil      1999  37737  172006362 
## 4 Brazil      2000  80488  174504898 
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

tidyverse cheatsheets

In the Help menu for RStudio, there is a submenu for “Cheatsheets”, which are documents that summarize the usage and syntax of the various tools in tidyverse. You’ll find sheets for dplyr and tidyr, and a separate cheatsheet for ggplot2.

In-class exercise 1

The job

We will begin with dataset that contains test scores for two classes that each contain one female and one male student from each of three socioeconomic (ses) statuses, low, middle and high (12 students total). We are interested in whether scores are related to ses, but only comparing the mean test scores of female students of low and middle ses. The data are untidy.

From this dataset, we want to create a new dataset that

Then we will create a dataset that contains the mean scores of the 2 students in each ses category.

We will try to make as much use of %>% as possible.

1.1 Load the dataset. How are the data untidy?

ex1 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/exercise_1.csv")
ex1
## # A tibble: 2 x 7
##   class f_low f_mid f_high m_low m_mid m_high
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1     1    52    48     52    50    58     56
## 2     2    56    47     42    50    46     49

1.2 How do we make the column headings a variable?

We will put the column headings into a variable called group and the test scores in a variable called score. Remember that because these are new variable names, they must go inside quotes.

ex1 %>% 
  pivot_longer(names_to="group", values_to="score", -class)
## # A tibble: 12 x 3
##    class group  score
##    <dbl> <chr>  <dbl>
##  1     1 f_low     52
##  2     1 f_mid     48
##  3     1 f_high    52
##  4     1 m_low     50
##  5     1 m_mid     58
##  6     1 m_high    56
##  7     2 f_low     56
##  8     2 f_mid     47
##  9     2 f_high    42
## 10     2 m_low     50
## 11     2 m_mid     46
## 12     2 m_high    49

1.3 Now we see that there might be two variables in group. How do we fix that?

ex1 %>% 
  pivot_longer(names_to="group", values_to="score", -class) %>% 
  separate(group, into=c("gender", "ses"), sep="_")
## # A tibble: 12 x 4
##    class gender ses   score
##    <dbl> <chr>  <chr> <dbl>
##  1     1 f      low      52
##  2     1 f      mid      48
##  3     1 f      high     52
##  4     1 m      low      50
##  5     1 m      mid      58
##  6     1 m      high     56
##  7     2 f      low      56
##  8     2 f      mid      47
##  9     2 f      high     42
## 10     2 m      low      50
## 11     2 m      mid      46
## 12     2 m      high     49

1.4 Doing great so far! Now, how can we select only female students from either the “low” or “mid” ses?

ex1 %>% 
  pivot_longer(names_to="group", values_to="score", -class) %>% 
  separate(group, into=c("gender", "ses"), sep="_") %>% 
  filter(gender=="f", ses=="low"|ses=="mid")
## # A tibble: 4 x 4
##   class gender ses   score
##   <dbl> <chr>  <chr> <dbl>
## 1     1 f      low      52
## 2     1 f      mid      48
## 3     2 f      low      56
## 4     2 f      mid      47

1.5 Ok, finally we need to get the means of score by ses. How can we get the means by ses?

ex1 %>% 
  pivot_longer(names_to="group", values_to="score", -class) %>% 
  separate(group, into=c("gender", "ses"), sep="_") %>% 
  filter(gender=="f", ses=="low"|ses=="mid") %>% 
  group_by(ses) %>% 
  summarise(meanscore=mean(score))
## # A tibble: 2 x 2
##   ses   meanscore
##   <chr>     <dbl>
## 1 low        54  
## 2 mid        47.5

Dates

Dates in R

Before dates in a dataset are loaded into R, typically they are stored as a column of character (string) values. However, dates are inherently numeric and we lose this information when they are stored as strings.

# string dates have no numeric value, so this errors
"2018-03-05" + 1
## Error in "2018-03-05" + 1: non-numeric argument to binary operator

In R, we want to convert our string dates to R class Date, which preserves the dates’ numeric values and allows us to take advantage of the many date-related functions in R.

Once converted to class Date, the numeric value for the date represents the number of days since January 1, 1970 (1970-01-01).

as.Date()

Base R provides as.Date() to convert strings to dates, but it can be intimidating and unwieldy to use, particularly if your dates are not stored in one of 2 default formats (see ?strptime for more about date formats).

# as.Date only accepts a couple of formats by default
# good
as.Date("2015-02-14")
## [1] "2015-02-14"

# bad
as.Date("02/14/2014")
## Error in charToDate(x): character string is not in a standard unambiguous format

# specify a format to fix
as.Date("02/14/2014", format="%m/%d/%Y")
## [1] "2014-02-14"

Once our dates are class Date, we can perform date arithmetic.

a <- as.Date("1971-01-01")
class(a)
## [1] "Date"

# days since 1970-01-01
as.numeric(a)
## [1] 365

# date arithmetic
a - as.Date("1970/12/31")
## Time difference of 1 days

a + 2
## [1] "1971-01-03"

We just need something easier to use to convert strings of various formats to class Date!

Package lubridate

tidyverse provides the amusingly named package lubridate to help R users convert their string dates to R Date format more easily as well as functions to process those dates.

The Date conversion functions in lubridate accept a wide variety of date formats, removing the need for us to remember all of those format specifications. Instead, just take the letters y, m, and d and place them in the order of the year, month and day, respectively, as they are stored in the date column. That ordering produces the name of the function to convert that column to Date (e.g. ymd(), mdy(), dmy()).

lubridate is not automatically loaded with library(tidyverse), though is installed with tidyverse, so we load it now.

library(lubridate)

Using lubridate

We’ll first load a dataset with various date formats to demonstrate the flexibility of lubridate functions.

d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dates.csv")
d
## # A tibble: 3 x 5
##   fmt1     fmt2             fmt3           fmt4 decision_time             
##   <chr>    <chr>            <date>        <dbl> <chr>                     
## 1 01/15/89 December 8, 2015 2015-02-27 20090101 Approved 10-10-15 07:15:55
## 2 02/13/92 January 22, 2012 2016-11-15 20080819 Denied 09-27-11 14:57:23  
## 3 03/15/84 March 3, 2010    2017-12-25 20071011 Approved 04-24-15 02:03:03

Currently, the first 4 columns are stored as chr, chr, date, and int. The third column is actually read in as Date by read_csv() because of its particular formatting signals a date to read_csv(). (We’ll look at the fifth column later)

The first pair of columns use the order month-day-year, while the second pair of columns use year-month-day.

So, we want mdy() and ymd().

Marvel at the flexibility!

# no format specifications needed
# just ordering y,m, and d
dates <- data.frame(f1=mdy(d$fmt1), f2=mdy(d$fmt2),
                    f3=ymd(d$fmt3), f4=ymd(d$fmt4))
dates
##           f1         f2         f3         f4
## 1 1989-01-15 2015-12-08 2015-02-27 2009-01-01
## 2 1992-02-13 2012-01-22 2016-11-15 2008-08-19
## 3 1984-03-15 2010-03-03 2017-12-25 2007-10-11

Date-time variables

If your date column additionally contains time information (i.e. is date-time), you can add one or more of h, m, and s to y, m, and d to form the function name to convert the string to class POSIXct. Class POSIXct stores date-time variables as a number representing the number of seconds since the beginning of 1970.

The fifth column of our dates dataset, decision_time, is a date-time variable. (It also contains an “Approved/Denied” string at the beginning that the lubridate() function will ignore!). Specifically, the date-time is recorded as month, day, year, hour, minute, second, so we want mdy_hms():

# month day year hour minute second
mdy_hms(d$decision_time)
## [1] "2015-10-10 07:15:55 UTC" "2011-09-27 14:57:23 UTC"
## [3] "2015-04-24 02:03:03 UTC"

# POSIXct is a standard way of representing calendar time
class(mdy_hms(d$decision_time))
## [1] "POSIXct" "POSIXt"

The “time zone” standard UTC (Coordinated Universal Time) is used as the default zone. Use the argument, tz= to assign a zone. To see a list of valid time zone specifications, run OlsonNames().

# we'll use this for our dates variable
dates$decision_time <- mdy_hms(d$decision_time, tz="US/Pacific")
dates$decision_time
## [1] "2015-10-10 07:15:55 PDT" "2011-09-27 14:57:23 PDT"
## [3] "2015-04-24 02:03:03 PDT"

# first 20 valid time zones
head(OlsonNames(),  n=20)
##  [1] "Africa/Abidjan"       "Africa/Accra"         "Africa/Addis_Ababa"  
##  [4] "Africa/Algiers"       "Africa/Asmara"        "Africa/Asmera"       
##  [7] "Africa/Bamako"        "Africa/Bangui"        "Africa/Banjul"       
## [10] "Africa/Bissau"        "Africa/Blantyre"      "Africa/Brazzaville"  
## [13] "Africa/Bujumbura"     "Africa/Cairo"         "Africa/Casablanca"   
## [16] "Africa/Ceuta"         "Africa/Conakry"       "Africa/Dakar"        
## [19] "Africa/Dar_es_Salaam" "Africa/Djibouti"

Extracting information from Date variables

lubridate provides several functions to extract specific information from Date variables including:

Some examples of extracting information from Date variables.

# we'll use the first column of our dates dataset
dates$f1
## [1] "1989-01-15" "1992-02-13" "1984-03-15"

# day of the month
day(dates$f1)
## [1] 15 13 15

# day of the year
yday(dates$f1)
## [1] 15 44 75

Some more date information extraction:

# weekday as numbers
wday(dates$f1)
## [1] 1 5 5

# weekday with labels
wday(dates$f1, label=TRUE)
## [1] Sun Thu Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

# month of the year
month(dates$f1)
## [1] 1 2 3

Extracting information from POSIXct variables

lubridate also has functions to extract time information from POSIXct date-time variables.

#break up the time variable decision time
#display as a data.frame with 4 columns
with(dates,  ## with() tells R to look for variables in object "dates"
     data.frame(time=decision_time, h=hour(decision_time),
           m=minute(decision_time), s=second(decision_time)))
##                  time  h  m  s
## 1 2015-10-10 07:15:55  7 15 55
## 2 2011-09-27 14:57:23 14 57 23
## 3 2015-04-24 02:03:03  2  3  3

Two sets of functions for date-time arithmetic

If you need to add or subtract time to your date variables, lubridate provides 2 sets of functions.

One set gives “intuitive” results, and will ignore conventions like leap year. These include seconds(), minutes(), hours(), days(), weeks(), years().

The other set will adhere to those conventions, and are named by adding d to the names of the previous functions, dseconds(), dminutes(), dhours(), ddays(), dweeks(), dyears().

#2016 is a leap year
# the intuitive result
ymd("2015-02-14") + years(2) 
## [1] "2017-02-14"

# the exact result
ymd("2015-02-14") + dyears(2)
## [1] "2017-02-13 12:00:00 UTC"

Managing strings

Strings

Strings are character values. Columns in external datasets that contain any non-number value will generally be read in as strings. You can generate string variables manually with either double quotes, "", or single quotes, ''.

Base R provides a number of functions to process string variables. We will cover functions that:

Package stringr

The tidyverse package stringr provides many functions that have nearly the same functionality as base R string functions.

Like the rest of tidyverse, the stringr functions have consistent names (they all begin with str_) and consistent syntax, unlike the base R functions. However, because the base R string functions are so widely used we will cover them primarily and point out tidyverse alternatives.

The package stringr is automatically loaded with library(tidyverse).

We will use a dataset of locations of partially fictitious statistical consulting centers.

centers <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/stats_centers.csv")
centers
## # A tibble: 4 x 6
##   NAME              ROOM  BUILDING        UNIVERSITY         CITY        STATE
##   <chr>             <chr> <chr>           <chr>              <chr>       <chr>
## 1 IDRE STAT CONSULT X4919 MATH SCIENCE    UCLA               LOS ANGELES CA   
## 2 OIT CONSULTING    Y521  HOUSTON HALL    LONE STAR UNIV.    DALLAS      TX   
## 3 ATS STAT HELP     M234  ORANGE BUILDING SUNSHINE COLLEGE   MIAMI       FL   
## 4 OAC STAT CENTER   B2431 SNOW HALL       GRIZZLY UNIVERSITY JUNEAU      AK

Concatenating strings

At least 3 functions loaded in the current session (base R and tidyverse) can concatenate strings together. We highlight what makes each unique.

paste() to concatenate string

When used on vectors, paste() concatenates element-wise. By default, joined elements are separated by a space.

# element-by-element concatenation
paste(centers$ROOM, centers$BUILDING)
## [1] "X4919 MATH SCIENCE"   "Y521 HOUSTON HALL"    "M234 ORANGE BUILDING"
## [4] "B2431 SNOW HALL"

You can change the separating character with sep=.

# separate with comma and space
paste(centers$CITY, centers$STATE, sep=", ")
## [1] "LOS ANGELES, CA" "DALLAS, TX"      "MIAMI, FL"       "JUNEAU, AK"

Vectors of numbers used in paste() are converted to character before pasting.

# separate with comma and space
paste(1:4, centers$UNIVERSITY)
## [1] "1 UCLA"               "2 LONE STAR UNIV."    "3 SUNSHINE COLLEGE"  
## [4] "4 GRIZZLY UNIVERSITY"

Interestingly, paste() will coerce the missing value NA to the string "NA". This usually is not desirable.

# separate with comma and space
paste(centers$CITY, centers$STATE, c("USA", NA), sep=", ")
## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, NA"       "MIAMI, FL, USA"      
## [4] "JUNEAU, AK, NA"

If you need to paste the elements within a vector together into a single string, specify a separating character with collapse= instead of sep=.

# put everything into a single string with collapse
paste(1:4, centers$UNIVERSITY, collapse="; ")
## [1] "1 UCLA; 2 LONE STAR UNIV.; 3 SUNSHINE COLLEGE; 4 GRIZZLY UNIVERSITY"

str_c() from stringr to concatenate strings

str_c() from the package stringr is quite similar to paste(). It has both sep= and collapse= arguments that function that same as in paste(). It differs in a couple of ways.

First, its default separator is the empty character "":

# default is no separating character
str_c(centers$CITY, centers$STATE)
## [1] "LOS ANGELESCA" "DALLASTX"      "MIAMIFL"       "JUNEAUAK"

Second, concatenating a NA into a string will result in NA:

# separate with comma and space
str_c(centers$CITY, centers$STATE, c("USA", "USA", "USA", NA), sep=", ")
## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, USA"      "MIAMI, FL, USA"      
## [4] NA

unite() from dplyr to concatenate strings

We previously encountered unite() when tidying data, and its important distinction from paste() and str_c() is that it is strictly designed to concatenate columns stored in the same dataset. It will not accept a vector from outside the dataset. Indeed, the dataset is the first argument to unite(). It is best used to add a column of concatenated existing variables to a dataset (and remove the original columns).

# concatenate CITY and STATE as citystate, then remove them
centers <- unite(centers, col="citystate", CITY, STATE, sep=", ")
centers
## # A tibble: 4 x 5
##   NAME              ROOM  BUILDING        UNIVERSITY         citystate      
##   <chr>             <chr> <chr>           <chr>              <chr>          
## 1 IDRE STAT CONSULT X4919 MATH SCIENCE    UCLA               LOS ANGELES, CA
## 2 OIT CONSULTING    Y521  HOUSTON HALL    LONE STAR UNIV.    DALLAS, TX     
## 3 ATS STAT HELP     M234  ORANGE BUILDING SUNSHINE COLLEGE   MIAMI, FL      
## 4 OAC STAT CENTER   B2431 SNOW HALL       GRIZZLY UNIVERSITY JUNEAU, AK

Extracting characters from strings by position

To extract characters, or a “substring”, from a string by position, we can use either of two similar functions:

For both functions, we specify a string variable, a start position, and an end position.

# start at first, end at third character
substr(centers$BUILDING, 1, 3)
## [1] "MAT" "HOU" "ORA" "SNO"

# specifying an end longer than the entirety of the string
#   results in truncation
str_sub(centers$UNIVERSITY, 2, 6)
## [1] "CLA"   "ONE S" "UNSHI" "RIZZL"

One difference between substr() and str_sub() is that str_sub() accepts negative numbers, which count from the end of the string.

# start from third to last, end at last
str_sub(centers$UNIVERSITY, -3, -1)
## [1] "CLA" "IV." "EGE" "ITY"

Replacing characters strings by position

Using the assignment operator <- with substr() and str_sub() allows us to replace a substring in string variables by position.

# change the first character of the ROOM variable to "R"
substr(centers$ROOM, 1, 1) <- "R"
centers$ROOM
## [1] "R4919" "R521"  "R234"  "R2431"

# change the last 2 characters of citystate to US
str_sub(centers$citystate, -2, -1) <- "US"
centers$citystate
## [1] "LOS ANGELES, US" "DALLAS, US"      "MIAMI, US"       "JUNEAU, US"

Changing case

Use base R functions tolower() and toupper() to change the case of string variables.

# lower case city names
tolower(centers$citystate)
## [1] "los angeles, us" "dallas, us"      "miami, us"       "juneau, us"

# change variable names to lower case
names(centers) <- tolower(names(centers))
names(centers)
## [1] "name"       "room"       "building"   "university" "citystate"

The stringr functions str_to_lower() and str_to_upper() work very similarly.

Pattern Matching with grep()

Often, we need to scan through a vector of strings and find values that match a text pattern (e.g. finding all phone numbers beginning with area code “310”)

Base R and stringr provide several functions for powerful and flexible pattern matching.

The primary base R function for pattern matching is grep(). Several other related functions also match patterns, but return slightly different output (e.g. grepl() and regexpr()).

The basic syntax is grep(pattern, x), where pattern is the text pattern to match in character vector x.

By default, grep() returns the index (row) number of the matches, but we can get the strings themselves returned with value=TRUE.

# stat center names
centers$name
## [1] "IDRE STAT CONSULT" "OIT CONSULTING"    "ATS STAT HELP"    
## [4] "OAC STAT CENTER"

# row numbers of names with "STAT"
grep(pattern="STAT", centers$name)
## [1] 1 3 4

# we can use those indices to subset
centers$name[grep(pattern="STAT", centers$name)]
## [1] "IDRE STAT CONSULT" "ATS STAT HELP"     "OAC STAT CENTER"

# or just use value=TRUE to get the words directly
grep(pattern="STAT", centers$name, value=TRUE)
## [1] "IDRE STAT CONSULT" "ATS STAT HELP"     "OAC STAT CENTER"

Substitution with sub()

We can also substitute based on pattern matches with base R function sub(). Now the syntax is sub(pattern, replacement, x), where pattern is the pattern to be matched, replacement is the replacement text, and x is the string vector.

# change STAT to STATISTICAL
centers$name <- sub("STAT", "STATISTICAL", centers$name)
centers$name 
## [1] "IDRE STATISTICAL CONSULT" "OIT CONSULTING"          
## [3] "ATS STATISTICAL HELP"     "OAC STATISTICAL CENTER"

*Regular Expressions*

Regular expressions are a “language” in which a sequence of character codes forms a search pattern for string matching and substitution. For example, the expression “[0-9]” matches any single number. The expression “[0-9]{2}-[0-9]{2}” matches any two numbers followed by a hyphen followed by any two numbers.

We encourage users who know they will be doing intense string matching to learn regular expressions syntax, which is beyond the scope of this seminar.

R string functions support regular expression pattern matching. All pattern matching function mentioned so far, including grep() and sub() support regular expressions.

# vector of dates formatted in two different ways
four_dates <- c("09-12-2018", "2015-11-13", 
                "03-24-2016", "2017-05-12")

# the regular expression means,
# 2 numbers, hyphen, 2 numbers, hyphen, 4 numbers
grep("[0-9]{2}-[0-9]{2}-[0-9]{4}", 
     four_dates, value=TRUE)
## [1] "09-12-2018" "03-24-2016"

Substitution with regular expressions is immensely powerful – here we take the four-digit years that appear at the end of the two dates and move them to the beginning of each date string, and add a hyphen afterward.

# the replacement expression \\3-\\1 means:
# put the substring captured in the 3rd set of parentheses first,
#  then a hyphen, then the substring captured in
#  the first set of parenetheses
sub("([0-9]{2}-[0-9]{2})(-)([0-9]{4})", "\\3-\\1",
     four_dates)
## [1] "2018-09-12" "2015-11-13" "2016-03-24" "2017-05-12"

Several stringr functions also support regular expressions, but we do not discuss them here (see str_detect() and str_replace() as alternatives to grep() and sub(), respectively).

In-class exercise 2

Excercise 2 dataset

We will practice using date and string functions, as well as looping in this exercise.

We will practice on a dataset of admission, treatment, and death dates. We want to convert all of these variables to R class Date and POSIXct as appropriate.

We will want to combine Deathday and TimeofDeath into a single variable that codes for date-time of death, before converting to POSIXct.

ex2 <- read_csv("http://stats.idre.ucla.edu/stat/data/rdm/exercise_2.csv")
ex2
## # A tibble: 3 x 5
##   Admit            Treat1     Treat2     DeathDay        TimeofDeath
##   <chr>            <chr>      <chr>      <chr>           <time>     
## 1 May 24, 2013     03-15-2014 04-17-2114 January 2, 2015 11:15:50   
## 2 August 23, 2012  07-05-2013 10-11-2113 July 19, 2016   18:59:15   
## 3 December 1, 2013 07-25-2014 11-18-2114 May 25, 2017    02:04:16

2.1. First, let’s convert all of the column names to lower case so that they’re easier to work with.

What function do we want to use with names() to do this?

# how to change these to lower case?
names(ex2)
## [1] "Admit"       "Treat1"      "Treat2"      "DeathDay"    "TimeofDeath"
# how to change these to lower case
names(ex2) <- tolower(names(ex2))
names(ex2)
## [1] "admit"       "treat1"      "treat2"      "deathday"    "timeofdeath"

2.2. Currently, all the date variables are character. We want them converted to Date (or POSIXct for the time variable). First, there are some data entry issues. We see that the year in “Treat2” that years are in the 2100s rather than the 2000s, so first let’s fix that.

# fix these
ex2$treat2
## [1] "04-17-2114" "10-11-2113" "11-18-2114"

How can we use sub() to match a pattern and fix the error?

# don't match to "21", you will match dates where day is the 21st
#  (e.g. 07-21-2113)
ex2$treat2 <- sub("211", "201", ex2$treat2)
ex2$treat2
## [1] "04-17-2014" "10-11-2013" "11-18-2014"

2.3. Now let’s concatenate deathday and timeofdeath together so we can create a POSIXct date-time variables, but let’s keep the old variables.

# concatenate these together
select(ex2, deathday, timeofdeath)
## # A tibble: 3 x 2
##   deathday        timeofdeath
##   <chr>           <time>     
## 1 January 2, 2015 11:15:50   
## 2 July 19, 2016   18:59:15   
## 3 May 25, 2017    02:04:16

How can we do that?

# concatenate these together
ex2$deathtime <- paste(ex2$deathday, ex2$timeofdeath)
ex2$deathtime
## [1] "January 2, 2015 11:15:50" "July 19, 2016 18:59:15"  
## [3] "May 25, 2017 02:04:16"

2.4. Now let’s change “admit”, “treat1”, “treat2”, and “deathday” to Date and deathtime to POSIXct.

select(ex2, admit, treat1, treat2, deathday, deathtime)
## # A tibble: 3 x 5
##   admit            treat1     treat2     deathday        deathtime              
##   <chr>            <chr>      <chr>      <chr>           <chr>                  
## 1 May 24, 2013     03-15-2014 04-17-2014 January 2, 2015 January 2, 2015 11:15:~
## 2 August 23, 2012  07-05-2013 10-11-2013 July 19, 2016   July 19, 2016 18:59:15 
## 3 December 1, 2013 07-25-2014 11-18-2014 May 25, 2017    May 25, 2017 02:04:16

What function(s) do we want to conver to Date? And what to convert to POSIXct?

ex2 <- ex2 %>% 
  mutate(admit=mdy(admit),
         treat1=mdy(treat1),
         treat2=mdy(treat2),
         deathday=mdy(deathday),
         deathtime=mdy_hms(deathtime))

# now everything has the correct class
str(ex2)
## tibble [3 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ admit      : Date[1:3], format: "2013-05-24" "2012-08-23" ...
##  $ treat1     : Date[1:3], format: "2014-03-15" "2013-07-05" ...
##  $ treat2     : Date[1:3], format: "2014-04-17" "2013-10-11" ...
##  $ deathday   : Date[1:3], format: "2015-01-02" "2016-07-19" ...
##  $ timeofdeath: 'hms' num [1:3] 11:15:50 18:59:15 02:04:16
##   ..- attr(*, "units")= chr "secs"
##  $ deathtime  : POSIXct[1:3], format: "2015-01-02 11:15:50" "2016-07-19 18:59:15" ...

Efficient coding with functions and looping

Repetitive tasks

Data management often involves performing the same task over and over, whether processing a set of variables in the same way, or appending many datasets together. With repetitive tasks, we can easily copy-and-paste code with slight modifications for each specific task. However, copying and pasting code has at least 2 drawbacks:

In this section we introduce two tools that can eliminate the need to copy-and-paste code for repetitive tasks:

Repetitive tasks dataset

For this section, we will load a dataset of 10 students, with variables describing the student’s gender and scores on 4 academic tests.

## gender and 4 test scores
hsb <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/hsb_small.csv")
hsb
## # A tibble: 10 x 5
##    female  read write  math science
##    <chr>  <dbl> <dbl> <dbl>   <dbl>
##  1 male      63    57    55      58
##  2 female    50    49    56      47
##  3 female    50    52    45      58
##  4 male      34    49    39      42
##  5 male      50    59    42      53
##  6 female    76    63    60      67
##  7 female    52    67    57      63
##  8 female    71    65    72      66
##  9 male      60    65    58      61
## 10 male      50    40    39      49

Imagine your task is to calculate a “trimmed mean” for each of the test scores. For this trimmed mean, we want to take the mean of a column after removing the minimum and maximum score.

We could write code to calculate the trimmed mean for one column, say “read”, and then copy-and-paste that code for the other 3 variables, changing “read” to the new variable name.

# first do it for read
hsb$read
##  [1] 63 50 50 34 50 76 52 71 60 50

# put the min and max in a vector
min_max_read <- c(min(hsb$read), max(hsb$read))

# we don't want the min and max (the ! means not)
!(hsb$read %in% min_max_read)
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

# take all but the min and mix
trim_read <- hsb$read[!(hsb$read %in% min_max_read)]
trim_read
## [1] 63 50 50 50 52 71 60 50

# mean of trim_read
trmean_read <- mean(trim_read)
trmean_read
## [1] 55.75

We could then copy-and-paste that code three more times for each of the other variables:

# repeat for write
min_max_write <- c(min(hsb$write), max(hsb$write))
trim_write <- hsb$write[!(hsb$write %in% min_max_write)]
trmean_write <- mean(trim_write)

# then repeat for math and science...
# 6 more lines of code

While that would work fine, it is inefficient coding. First, there are numerous places for coding errors. Second, imagine we wanted to change how the trimmed means was calculated – if we wanted to trim only the minimum score before trimming the mean, we have to be careful to change the code in all the correct places.

Functions to the rescue!

Functions allow us to code tasks in a general way, allowing the inputs to vary. Whenever you have written the same code multiple times, consider using a function instead. We can write a function that calculates the trimmed mean for any input.

The two main decisions for writing a function are:

For our trimmed mean function the input is a numeric vector, and the output is the trimmed mean, a single number (numeric vector of length one).

Function syntax

Function definitions consist of the following elements:

The first line of a function definition consists of its name, <-, the keyword function, and the arguments. For example:

my_fun <- function(arg1, arg2=0)

In that syntax, my_fun is the name of the function, arg1 is the first argument name, arg2 is the second argument name, and its default value is set to zero. The keyword function tells R that my_fun is a function (often called type closure in R errors/messages/warnings).

The remaining syntax is the body of code, encased in {}. The opening { should be on the same line as the function header.

The value returned by the function is the last line in the body (the object output_object would be returned by the function below). You can also use return() to return the object you want.

# not run, just a template
# my_fun is the name
# 2 arguments, arg1 has no default, arg2 has default value zero

my_fun <- function(arg1, arg2=0) {
   do something to arg1 and arg2
   output_object  # or return(output_object)
}

Creating our first function

Let’s create our function to compute a trimmed mean. We only need one argument, the numeric vector:

# function_name <- function(argument)
trim_mean <- function(x) {
  min_max <- c(min(x), max(x)) # x is the input numeric vector
  x <- x[! x %in% min_max]
  mean(x) ## last line, so returned as output
}

# now use it!
c(trim_mean(hsb$read), trim_mean(hsb$write),
  trim_mean(hsb$math), trim_mean(hsb$science))
## [1] 55.75000 57.37500 53.28571 56.87500

Modifying the function

Now if we want to change how we calculate the trimmed mean, we can just incorporate those changes in the function.

Let’s change the function to just trim off the minimum before calculating the mean.

# you only have to change the function once to reestimate 4 new trimmed means
trim_mean_min <- function(x) {
  x <- x[x != min(x)]  # now just trimming the min
  mean(x) 
}

c(trim_mean_min(hsb$read), trim_mean_min(hsb$write),
  trim_mean_min(hsb$math), trim_mean_min(hsb$science))
## [1] 58.00000 58.44444 55.62500 58.00000

Even better, we can add a second argument that allows you to choose whether to trim just the minimum, or both the min and max. Here the second argument is called trim=, and it is set by default to “both”, which causes the function to trim both the min and max. If the user passes “min” to trim= instead, it will trim just the min.

# be default, we will trim both the min and max
trim_mean_both <- function(x, trim="both") {
  min_max <- c(min(x), max(x))
  if (trim == "min") {     # if trim="min", trim just the min
    min_max <- min_max[1]  # subset min_max to its first element, the min
  }
  x <- x[! x %in% min_max]
  mean(x)
}

#defaults to trimming both
trim_mean_both(hsb$read)
## [1] 55.75

# just trim the min
trim_mean_both(hsb$read, trim="min")
## [1] 58

Looping - for loops

Looping structures provide another tool to code repetitive tasks efficiently. Most (probably all) programming languages support looping structures, including R. Loops in R are useful for iterating across the columns of a dataset or even across different datasets.

The ubiquitous for loop is a popular choice to iterate over a repetitive task because it allows the programmer to control how many times the loop iterates. The for loop executes the same block of code for each value in a vector. R also has while loops that terminate when a condition is FALSE, but we do not discuss these in this seminar.

The basic syntax of a for loop is:

For example, this loop prints “Hello world!” 5 times, once for each value in the vector 1:5:

# keyword for, then loop control variable, i, and vector 1:5 inside ()
for (i in 1:5) {  
  print("Hello world!") 
}
## [1] "Hello world!"
## [1] "Hello world!"
## [1] "Hello world!"
## [1] "Hello world!"
## [1] "Hello world!"

The process of the loop above can be described:

  1. The loop control variable, i, takes on the value 1
  2. The loop checks that the value of i is in the vector (1,2,3,4,5). If so, the code block is executed. If not, the loop ends.
  3. The code block executes (“Hello world!” is printed), and then i increments by 1.
  4. Repeat step 2 and 3 until i increments to 6.

We often want to use the value of the loop control variable (i above) in the code block.

Loops will generally run faster if you pre-allocate space for any output generated by the loop. Growing the output object inside a loop will be slower.

# create a vector to hold values first
result <- rep(0, 5)

# i is used as a row number and in the calculation
for (i in 1:5) {
  result[i] <- i*5
}
result
## [1]  5 10 15 20 25

Looping over variables

Now we have the tools to loop over a set of variables in a data frame, because we can access the columns by number.

# we want the test score columns
names(hsb)
## [1] "female"  "read"    "write"   "math"    "science"

# loop over columns 2 through 4
for (col in 2:4) {
  print(mean(hsb[[col]])) ## use [[]] when extracting by column number 
}
## [1] 55.6
## [1] 56.6
## [1] 52.3

The vector of values over which the for loop iterates does not need to be numeric. We can repeat the loop above using the string names of the columns rather than their position:

# we want the test score columns, so drop the first
mycols <- names(hsb)[-1]
mycols
## [1] "read"    "write"   "math"    "science"

# loop over columns 2 through 4
for (col in mycols) {
  print(mean(hsb[[col]])) ## use [[]] to extract a column vector by name
}
## [1] 55.6
## [1] 56.6
## [1] 52.3
## [1] 56.4

Here is a slightly more sophisticated example of what loops can do. In this loop, we add the trimmed means for each of the 4 test scores to the dataset.

# let's add trim means to the dataset
for (col in mycols) {
  hsb$tm <- trim_mean(hsb[[col]])  # add new trim mean to dataset, call it "tm" temporarily
  newname = paste("tm", col, sep="_") # create new variable name like tm_read
  names(hsb)[ncol(hsb)] <- newname # replace last variable name, currently tm, with newname
}

# the result
hsb
## # A tibble: 10 x 9
##    female  read write  math science tm_read tm_write tm_math tm_science
##    <chr>  <dbl> <dbl> <dbl>   <dbl>   <dbl>    <dbl>   <dbl>      <dbl>
##  1 male      63    57    55      58    55.8     57.4    53.3       56.9
##  2 female    50    49    56      47    55.8     57.4    53.3       56.9
##  3 female    50    52    45      58    55.8     57.4    53.3       56.9
##  4 male      34    49    39      42    55.8     57.4    53.3       56.9
##  5 male      50    59    42      53    55.8     57.4    53.3       56.9
##  6 female    76    63    60      67    55.8     57.4    53.3       56.9
##  7 female    52    67    57      63    55.8     57.4    53.3       56.9
##  8 female    71    65    72      66    55.8     57.4    53.3       56.9
##  9 male      60    65    58      61    55.8     57.4    53.3       56.9
## 10 male      50    40    39      49    55.8     57.4    53.3       56.9

In-class exercise 3

Excercise 3 dataset

In this final exercise, we will use functions and loops to convert a set of continuous variables into categorical binary variables. The binary variables will take the value 0 if the original value is less than the mean of that variable and 1 if is greater. Each of the variables may have missing data code “-99”, which should be converted to missing value NA before calculating the mean. We will be adding the binary variables to the data set, not replacing the continuous variables.

We will practice on a dataset in which subjects take 3 surveys (q1, q2, q3) at 3 different time points .

## gender and 4 test scores
ex3 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/exercise_3.csv")

# notice the missing data codes
ex3
## # A tibble: 10 x 7
##       id    q1 adm1     q2 adm2     q3 adm3 
##    <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr>
##  1     1     4 AL        6 SD        5 CW   
##  2     2     2 SJ        2 CW        4 JL