Introduction

Purpose

This seminar introduces tools in R useful for data management.

The seminar is focused on the tools found in the tidyverse collection of packages, but also includes various tools pulled from other packages that complement tidyverse tools.

Data management packages used in this seminar

tidyverse: a collection of packages with tools for most aspects of data analysis, particularly strong in data import, management, and visualization. We will focus on the following packages within tidyverse:

  • readr - updated data import functions like read_csv()
  • dplyr - subsetting, sorting, transforming variables, grouping
  • tidyr - restructuring rows and columns
  • magrittr - piping a chain of commands
  • lubridate - date and time variable processing
  • stringr - string variable manipulation

Hmisc: another large collection of tools for most aspects of data analysis, but we use it here for describe(), a dataset summary function

install.packages("tidyverse", dependencies=TRUE)
install.packages("Hmisc", dependencies=TRUE)
library(tidyverse)
library(Hmisc)

Review of data frames and functions

Data frames

Datasets for statistical analysis are typically stored in data frames in R.

Data frames are rectangular, where the columns are variables and the rows are observations of those variables. Columns can be of different types (double, character, etc.) but must be of equal length.
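For illustration, a small data frame can be created directly with data.frame() (hypothetical names and values, loosely echoing the patient data we import below):

# a tiny data frame: three observations of three variables of different types
df_example <- data.frame(id  = c(1, 2, 3),
                         sex = c("male", "female", "male"),
                         age = c(65.0, 53.9, 41.4))
df_example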

When we import data from a file, such as with read.csv(), the resulting structure is typically a data frame.

Most of the tools discussed in this seminar work on data frames.

Example data frame

First dataset for this seminar

We begin our data management work on a dataset of 120 observations of 24 variables describing simulated cancer patient data. Each patient is also linked to a doctor in a separate doctor dataset, which we will merge into this dataset later in the seminar.

We will be using the function read_csv(), from the tidyverse package readr, to import our data for this seminar. Although very similar to the base R function read.csv(), read_csv() has slightly different defaults and options; the two can be used interchangeably for the most part. One of the primary differences is that read_csv() will import the data as a tibble, an updated class of data frames, which we will introduce shortly.

d <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt1_dm.csv")

Viewing data as a spreadsheet with View()

Use View() on a dataset to open a spreadsheet-style view of a dataset. In RStudio, clicking on a dataset in the Environment pane will View() it.

View(d)

Tibbles

Throughout this seminar, we will be using tidyverse functions for data management. These functions typically accept data frames as inputs, and output tibbles, a special class of data frames.

Tibbles print to screen a bit differently from regular data frames – usually up to 10 rows and only as many columns as can fit on the screen. We also get each column’s “type”:

d
## # A tibble: 120 x 24
##    hospital hospid docid dis_date sex     age   test1 test2  pain tumorsize
##    <chr>     <dbl> <chr> <chr>    <chr> <dbl>   <dbl> <dbl> <dbl>     <dbl>
##  1 UCLA          1 1-1   6-Sep-09 male   65.0   3.70  8.09      4      68.0
##  2 UCLA          1 1-1   7-Jan-11 fema~  53.9   2.63  0.803     2      64.7
##  3 UCLA          1 1-1   4-Sep-10 male   41.4 -99     2.13      3      86.4
##  4 UCLA          1 1-1   25-Jun-~ male   46.8   3.89  1.35      3      53.4
##  5 UCLA          1 1-1   1-Jul-09 male   51.9   1.42  2.19      4      51.7
##  6 UCLA          1 1-1   6-Mar-09 fema~  53.8   2.29  8.61      3      78.9
##  7 UCLA          1 1-1   15-Apr-~ male   54.4   8.03  7.23      4      62.9
##  8 UCLA          1 1-11  12-Jul-~ fema~  47.1   0.810 2.60      5      81.2
##  9 UCLA          1 1-11  25-Jul-~ fema~  59.3 -99     5.18      4      73.2
## 10 UCLA          1 1-11  12-Jul-~ fema~  47.1   0.810 2.60      5      81.2
## # ... with 110 more rows, and 14 more variables: co2 <dbl>, wound <dbl>,
## #   mobility <dbl>, ntumors <dbl>, remission <dbl>, lungcapacity <dbl>,
## #   married <dbl>, familyhx <chr>, smokinghx <chr>, cancerstage <chr>,
## #   lengthofstay <dbl>, wbc <chr>, rbc <dbl>, bmi <dbl>

Tibbles are still data frames, and will work in nearly all functions that require data frames as inputs. For this seminar, anything we say about data frames will also apply to tibbles.

To convert a tibble to a regular data frame, use as.data.frame().

# d is of class tibble (tbl_df), class table (tbl) and class data.frame
class(d)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

# now just a data.frame
class(as.data.frame(d))
## [1] "data.frame"

Read more about tibbles.

Subsetting data frames

We can select a particular row and column of data frame x, with the syntax x[rows, columns], where rows and columns are one of the following: a number, a vector of numbers, the name of a variable that stores numbers, or omitted.

Omitting the row value x[, columns] selects all rows, and omitting the column value, x[rows, ], selects all columns. (Using this syntax to extract a row or column will result in a data.frame, not a vector)

The syntax x$colname extracts the column with name colname as a vector. We recommend this syntax for extracting columns.

The syntax x[["colname"]] and x[[colnum]] can be used to extract column vectors by name or by number, respectively. (Using this syntax to extract a column will result in a vector)
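For example, with the patient dataset d imported above (age is column 6):

# extract the age column as a vector, by name and by position
head(d$age, n=3)
head(d[["age"]], n=3)
head(d[[6]], n=3)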

# row 3 column 2
d[3,2]
## # A tibble: 1 x 1
##   hospid
##    <dbl>
## 1      1

# rows 2 and 4 of columns age
d[c(2,4), "age"]
## # A tibble: 2 x 1
##     age
##   <dbl>
## 1  53.9
## 2  46.8

# all columns for row 10
d[10,]
## # A tibble: 1 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCLA          1 1-11  12-Jul-~ fema~  47.1 0.810  2.60     5      81.2  1.55
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>

# all rows of column "pain"
d[,"pain"]
## # A tibble: 120 x 1
##     pain
##    <dbl>
##  1     4
##  2     2
##  3     3
##  4     3
##  5     4
##  6     3
##  7     4
##  8     5
##  9     4
## 10     5
## # ... with 110 more rows

# subsetting with $ creates a numeric vector, which itself can be subset
d$age[2:3]
## [1] 53.91714 41.36804

Logical subsetting

We can also use logical expressions to subset. Logical comparisons such as x > 0 return TRUE/FALSE vectors.

We can place such an expression inside of [] and values with TRUE will be selected.

# logical comparison results in a logical vector
d$age > 60
##   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# when placed in [] before the comma, rows with TRUE are selected
#  this returns columns age and pain for observations where age > 60
d[d$age>60, c("age", "pain")]
## # A tibble: 5 x 2
##     age  pain
##   <dbl> <dbl>
## 1  65.0     4
## 2  65.8     4
## 3 358.      6
## 4  63.9     8
## 5  64.2     7

Functions

Functions perform most of the work on data in R.

Functions in R are much the same as they are in math – they perform some operation on an input and return some output. For example, the mathematical function \(f(x) = x^2\), takes an input \(x\), and returns its square. Similarly, the mean() function in R takes a vector of numbers and returns its mean.
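For example:

# mean() takes a numeric vector and returns a single number
mean(c(2, 4, 6))  # returns 4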

When we load a new package into R with library(), the set of functions associated with that library are loaded into the R session.

Help files for functions

In the help file for a function (?function_name), we will find a list of Arguments to the function, in a specific order. Arguments are the inputs to the function. Values for arguments to functions can be specified either by name or position.

In the Usage section, a value specified after an argument is its default value. Arguments without values have no defaults and usually need to be supplied by the user.

The Value section specifies what is returned. Usually there are Examples at the bottom.
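For example, to open the help file for seq():

# open the help page for seq(); help("seq") is equivalent
?seq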

Specifying arguments by name and position

Input arguments to functions can be specified by either name or position.

# specifying arguments by name
seq(from=1, to=5, by=1)
## [1] 1 2 3 4 5

# specifying arguments by position
seq(10, 0, -2)
## [1] 10  8  6  4  2  0

Data cleaning and missing values

Quick summaries of variables

We highly recommend examining a quick summary of the distribution of each of your variables to:

  • learn the characteristics of the sample
  • identify suspicious values

The base R function summary() provides summary statistics for numeric variables and frequencies for factor variables (we have no factors yet). Character variables are left unsummarized.

summary(d)
##    hospital             hospid         docid             dis_date        
##  Length:120         Min.   :1.000   Length:120         Length:120        
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :1.000   Mode  :character   Mode  :character  
##                     Mean   :1.483                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##      sex                 age             test1             test2        
##  Length:120         Min.   : 34.19   Min.   :-99.000   Min.   :-99.000  
##  Class :character   1st Qu.: 47.75   1st Qu.:  1.560   1st Qu.:  2.249  
##  Mode  :character   Median : 51.83   Median :  3.107   Median :  4.162  
##                     Mean   : 53.59   Mean   : -1.989   Mean   : -1.226  
##                     3rd Qu.: 55.01   3rd Qu.:  5.707   3rd Qu.:  6.166  
##                     Max.   :357.89   Max.   : 12.416   Max.   : 17.228  
##       pain         tumorsize           co2               wound      
##  Min.   :1.000   Min.   : 49.10   Min.   :-98.0000   Min.   :2.000  
##  1st Qu.:4.000   1st Qu.: 62.18   1st Qu.:  1.4954   1st Qu.:5.000  
##  Median :5.000   Median : 68.23   Median :  1.5859   Median :6.000  
##  Mean   :5.325   Mean   : 70.08   Mean   : -0.8901   Mean   :5.592  
##  3rd Qu.:6.000   3rd Qu.: 77.48   3rd Qu.:  1.6860   3rd Qu.:7.000  
##  Max.   :9.000   Max.   :109.01   Max.   :  1.9424   Max.   :9.000  
##     mobility        ntumors       remission     lungcapacity     
##  Min.   :2.000   Min.   :0.00   Min.   :0.00   Min.   :-99.0000  
##  1st Qu.:5.000   1st Qu.:1.00   1st Qu.:0.00   1st Qu.:  0.5141  
##  Median :6.000   Median :2.00   Median :0.00   Median :  0.7457  
##  Mean   :6.033   Mean   :3.15   Mean   :0.35   Mean   :-18.3280  
##  3rd Qu.:7.000   3rd Qu.:5.00   3rd Qu.:1.00   3rd Qu.:  0.8721  
##  Max.   :9.000   Max.   :9.00   Max.   :1.00   Max.   :  0.9982  
##     married         familyhx          smokinghx         cancerstage       
##  Min.   :0.0000   Length:120         Length:120         Length:120        
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :1.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.6333                                                           
##  3rd Qu.:1.0000                                                           
##  Max.   :1.0000                                                           
##   lengthofstay       wbc                 rbc             bmi       
##  Min.   :3.000   Length:120         Min.   :4.360   Min.   :18.45  
##  1st Qu.:4.000   Class :character   1st Qu.:4.825   1st Qu.:24.51  
##  Median :5.000   Mode  :character   Median :4.978   Median :27.82  
##  Mean   :5.308                      Mean   :4.970   Mean   :29.38  
##  3rd Qu.:6.000                      3rd Qu.:5.150   3rd Qu.:34.31  
##  Max.   :8.000                      Max.   :5.535   Max.   :58.00

Use describe() from package Hmisc for more detailed summaries

The describe() function from package Hmisc gives detailed summary statistics for all variables in a data frame. The information provided depends on the variable’s type and number of distinct values.

Important Features:

  • frequency tables for variables with fewer than 20 distinct values
  • more detailed quantiles for continuous variables
  • counts and proportions for binary variables
  • counts of missing (NA) for all variables

The results of describe() can be used to determine quickly whether variables have plausible values and distributions:

library(Hmisc)

# detailed summaries of variables
describe(d)
## d 
## 
##  24  Variables      120  Observations
## --------------------------------------------------------------------------------
## hospital 
##        n  missing distinct 
##      120        0        2 
##                       
## Value       UCLA  UCSF
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## hospid 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        2    0.749    1.483   0.5036 
##                       
## Value          1     2
## Frequency     62    58
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## docid 
##        n  missing distinct 
##      120        0       22 
## 
## lowest : 1-1   1-100 1-11  1-21  1-22 , highest: 2-177 2-178 2-188 2-201 2-216
## --------------------------------------------------------------------------------
## dis_date 
##        n  missing distinct 
##      120        0      104 
## 
## lowest : 1-Jul-09  10-Jun-09 10-Jun-10 11-Apr-10 11-Dec-09
## highest: 9-Apr-10  9-Feb-09  9-Feb-10  9-Jun-10  9-May-10 
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      120        0        3 
##                                
## Value        12.2 female   male
## Frequency       1     74     45
## Proportion  0.008  0.617  0.375
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    53.59    11.92    40.86    41.52 
##      .25      .50      .75      .90      .95 
##    47.75    51.83    55.01    58.61    59.65 
## 
## lowest :  34.19229  35.31930  37.25225  39.61641  40.03724
## highest:  63.93238  64.16432  64.96824  65.80417 357.89001
##                                                           
## Value         35    40    45    50    55    60    65   360
## Frequency      3    10    16    38    35    13     4     1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
## 
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## test1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      111        1   -1.989    14.11 -99.0000   0.5809 
##      .25      .50      .75      .90      .95 
##   1.5597   3.1065   5.7067   7.9931   9.5583 
## 
## lowest : -99.0000000   0.1048958   0.1927608   0.4293420   0.5155185
## highest:   9.7981329  10.2903990  10.4685400  11.0714440  12.4163920
##                                                                             
## Value        -99     0     1     2     3     4     5     6     7     8     9
## Frequency      7     3    20    17    21    12     8    11     6     7     1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##                             
## Value         10    11    12
## Frequency      5     1     1
## Proportion 0.042 0.008 0.008
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## test2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      111        1   -1.226    14.59 -99.0000   0.7881 
##      .25      .50      .75      .90      .95 
##   2.2494   4.1620   6.1657   8.9973  10.9861 
## 
## lowest : -99.0000000   0.5807591   0.6179262   0.6571499   0.7485080
## highest:  11.4315750  12.4493730  14.2252270  14.5365420  17.2275810
##                                                                             
## Value        -99     1     2     3     4     5     6     7     8     9    10
## Frequency      7    15    12    19    16    15     9     6     6     6     1
## Proportion 0.058 0.125 0.100 0.158 0.133 0.125 0.075 0.050 0.050 0.050 0.008
##                                         
## Value         11    12    14    15    17
## Frequency      4     1     1     1     1
## Proportion 0.033 0.008 0.008 0.008 0.008
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## pain 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        9    0.961    5.325    1.742 
## 
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##                                                                 
## Value          1     2     3     4     5     6     7     8     9
## Frequency      1     2    11    21    33    25    18     5     4
## Proportion 0.008 0.017 0.092 0.175 0.275 0.208 0.150 0.042 0.033
## --------------------------------------------------------------------------------
## tumorsize 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    70.08    12.92    53.48    55.97 
##      .25      .50      .75      .90      .95 
##    62.18    68.23    77.48    84.06    92.54 
## 
## lowest :  49.09861  50.28009  50.50217  51.65727  53.19937
## highest:  98.05510  98.32850  98.70570 102.69671 109.00956
## --------------------------------------------------------------------------------
## co2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      115        1  -0.8901     5.03    1.390    1.440 
##      .25      .50      .75      .90      .95 
##    1.495    1.586    1.686    1.756    1.793 
## 
## lowest : -98.000000   1.327440   1.362927   1.372113   1.391310
## highest:   1.810847   1.820266   1.905606   1.920025   1.942401
##                             
## Value        -98     1     2
## Frequency      3    29    88
## Proportion 0.025 0.242 0.733
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------
## wound 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        8    0.958    5.592    1.767 
## 
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##                                                           
## Value          2     3     4     5     6     7     8     9
## Frequency      5    10    12    23    36    22    10     2
## Proportion 0.042 0.083 0.100 0.192 0.300 0.183 0.083 0.017
## --------------------------------------------------------------------------------
## mobility 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        8    0.969    6.033    2.264 
## 
## lowest : 2 3 4 5 6, highest: 5 6 7 8 9
##                                                           
## Value          2     3     4     5     6     7     8     9
## Frequency      6     4    19    19    28    15     5    24
## Proportion 0.050 0.033 0.158 0.158 0.233 0.125 0.042 0.200
## --------------------------------------------------------------------------------
## ntumors 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       10    0.981     3.15    3.101        0        0 
##      .25      .50      .75      .90      .95 
##        1        2        5        8        9 
## 
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##                                                                       
## Value          0     1     2     3     4     5     6     7     8     9
## Frequency     23    21    19     9    12    11     7     5     5     8
## Proportion 0.192 0.175 0.158 0.075 0.100 0.092 0.058 0.042 0.042 0.067
## --------------------------------------------------------------------------------
## remission 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      120        0        2    0.683       42     0.35   0.4588 
## 
## --------------------------------------------------------------------------------
## lungcapacity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0       96    0.995   -18.33    31.26 -99.0000 -99.0000 
##      .25      .50      .75      .90      .95 
##   0.5142   0.7457   0.8721   0.9573   0.9790 
## 
## lowest : -99.0000000 -98.0000000   0.2949074   0.3264440   0.3450253
## highest:   0.9856416   0.9864109   0.9924814   0.9940955   0.9982018
##                                                     
## Value      -99.0 -98.0   0.2   0.4   0.6   0.8   1.0
## Frequency     20     3     1     6    21    46    23
## Proportion 0.167 0.025 0.008 0.050 0.175 0.383 0.192
## 
## For the frequency table, variable is rounded to the nearest 0.2
## --------------------------------------------------------------------------------
## married 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      120        0        2    0.697       76   0.6333   0.4683 
## 
## --------------------------------------------------------------------------------
## familyhx 
##        n  missing distinct 
##      120        0        3 
##                             
## Value        -99    no   yes
## Frequency      6    97    17
## Proportion 0.050 0.808 0.142
## --------------------------------------------------------------------------------
## smokinghx 
##        n  missing distinct 
##      120        0        4 
##                                           
## Value          -99 current  former   never
## Frequency        6      26      22      66
## Proportion   0.050   0.217   0.183   0.550
## --------------------------------------------------------------------------------
## cancerstage 
##        n  missing distinct 
##      120        0        4 
##                                   
## Value          I    II   III    IV
## Frequency     40    54    17     9
## Proportion 0.333 0.450 0.142 0.075
## --------------------------------------------------------------------------------
## lengthofstay 
##        n  missing distinct     Info     Mean      Gmd 
##      120        0        6    0.933    5.308    1.271 
## 
## lowest : 3 4 5 6 7, highest: 4 5 6 7 8
##                                               
## Value          3     4     5     6     7     8
## Frequency      5    29    29    40    15     2
## Proportion 0.042 0.242 0.242 0.333 0.125 0.017
## --------------------------------------------------------------------------------
## wbc 
##        n  missing distinct 
##      120        0      116 
## 
## lowest : 3671.880371  4176.054199  4201.741211  4238.355957  4331.902344 
## highest: 7999.091309  8340.71582   8415.605469  8567.246094  not assessed
## --------------------------------------------------------------------------------
## rbc 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1     4.97   0.2885    4.526    4.606 
##      .25      .50      .75      .90      .95 
##    4.825    4.978    5.150    5.285    5.353 
## 
## lowest : 4.359662 4.436482 4.456108 4.465468 4.470100
## highest: 5.441604 5.442614 5.459067 5.502604 5.535052
## --------------------------------------------------------------------------------
## bmi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    29.38    7.278    20.71    22.07 
##      .25      .50      .75      .90      .95 
##    24.51    27.82    34.31    37.75    40.71 
## 
## lowest : 18.44992 18.68505 20.07485 20.17994 20.49195
## highest: 42.84858 44.04223 46.50746 52.30723 58.00000
## --------------------------------------------------------------------------------

Suspicious values

Here are some variables with values that we are not sure are correct:

describe(d[,c("age", "sex", "test1")])
## d[, c("age", "sex", "test1")] 
## 
##  3  Variables      120  Observations
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      117        1    53.59    11.92    40.86    41.52 
##      .25      .50      .75      .90      .95 
##    47.75    51.83    55.01    58.61    59.65 
## 
## lowest :  34.19229  35.31930  37.25225  39.61641  40.03724
## highest:  63.93238  64.16432  64.96824  65.80417 357.89001
##                                                           
## Value         35    40    45    50    55    60    65   360
## Frequency      3    10    16    38    35    13     4     1
## Proportion 0.025 0.083 0.133 0.317 0.292 0.108 0.033 0.008
## 
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      120        0        3 
##                                
## Value        12.2 female   male
## Frequency       1     74     45
## Proportion  0.008  0.617  0.375
## --------------------------------------------------------------------------------
## test1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      120        0      111        1   -1.989    14.11 -99.0000   0.5809 
##      .25      .50      .75      .90      .95 
##   1.5597   3.1065   5.7067   7.9931   9.5583 
## 
## lowest : -99.0000000   0.1048958   0.1927608   0.4293420   0.5155185
## highest:   9.7981329  10.2903990  10.4685400  11.0714440  12.4163920
##                                                                             
## Value        -99     0     1     2     3     4     5     6     7     8     9
## Frequency      7     3    20    17    21    12     8    11     6     7     1
## Proportion 0.058 0.025 0.167 0.142 0.175 0.100 0.067 0.092 0.050 0.058 0.008
##                             
## Value         10    11    12
## Frequency      5     1     1
## Proportion 0.042 0.008 0.008
## 
## For the frequency table, variable is rounded to the nearest 1
## --------------------------------------------------------------------------------

plot of describe()

The results of describe() can be plotted:

  • dot plots of frequencies of categorical variables
  • spike histograms of distributions of continuous variables

These plots are also useful to inspect distributions of variables and to look for suspicious values.

In the dot plots, look for suspicious labels for categorical variables (see variables sex, familyhx and smokinghx below).

In the histogram plots, each histogram has its own scale, so bunched-up histograms often denote the presence of extreme values (see variables age, test1, test2, co2 and lungcapacity).

plot(describe(d))
## $Categorical

## 
## $Continuous

Converting suspicious values to missing

A look at the output of our dataset summaries, as well as the plot of describe(), identified several suspicious values:

  • age has value 357.89001, likely a data entry error
  • sex has value “12.2”, also a data entry error
  • test1, test2, lungcapacity, smokinghx, familyhx have value -99, a missing value code
  • co2 and lungcapacity have value -98, another missing value code
  • wbc seems to be a numeric variable, but has the value “not assessed”, which causes R to read it in as character

Because we don’t know any replacement values, we will change suspicious values to NA, the missing value indicator in R.

Note: Datasets often differentiate between different types of missingness. For example, -99="absent", while -98="refused to answer". For analysis purposes, using NA for both will usually suffice. For users who need to distinguish between different kinds of missing, we suggest looking into the as.item() function from package memisc.

We can use logical subsetting to assign NA to the data entry errors:

# change all impossible age values to NA 
#  assume all adults in this dataset
d$age[d$age<18|d$age>120] <- NA

# remember to use quotes for character variables
d$sex[d$sex == "12.2"] <- NA

Now we want to translate all -99, -98, and “not assessed” values to NA. The following syntax can be used to translate all values of a to b in dataset df:

df[df==a] <- b

d[d==-99] <- NA
d[d==-98] <- NA
d[d=="not assessed"] <- NA

Complete cases

complete.cases(x) returns a logical (TRUE/FALSE) vector indicating whether each row of x has no missing values. Since TRUE=1 and FALSE=0, the sum of this vector is a count of complete cases.

# which cases are complete
complete.cases(d)
##   [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
##  [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [25]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
##  [37]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
##  [49] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
##  [73]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
##  [85]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## [109] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# number of complete cases
sum(complete.cases(d))
## [1] 82

# create a dataset of complete cases
d_comp <- d[complete.cases(d),]

describe() plot after cleaning data

The plot of describe() looks better now – no suspicious labels in the dot plots and nicely spread out histograms.

Color indicates the number of missing values for that variable.

plot(describe(d))
## $Categorical

## 
## $Continuous


plot(describe(d), which="continuous")

Data management with dplyr

The dplyr package

The dplyr package provides tools for some of the most common data management tasks. Its primary functions are “verbs” to help you think about what you need to do to your dataset:

  • filter(): select rows according to conditions
  • select(): select columns (you can rename as you select)
  • arrange(): sort rows
  • mutate(): add new columns
  • summarize(): calculate statistics across groups

Two great conveniences of using dplyr functions:

  • The first argument to all dplyr functions is a data frame, and all variables supplied as arguments are assumed to be columns of that data frame, so you only need to specify the dataset name once.
  • Existing variable names do not need to be quoted when used as arguments.

The dplyr package is automatically loaded with library(tidyverse).

Selecting rows with filter()

We have already seen how to subset using logical vectors and logical subsetting:

# all variables for patients with pain>=9
pain9 <- d[d$pain>=9,]
pain9
## # A tibble: 4 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCSF          2 2-177 9-Jun-10 male   58.7  7.62  9.31     9      63.5  1.48
## 2 UCSF          2 2-188 15-Jul-~ fema~  41.5  4.00  7.57     9      98.3  1.82
## 3 UCSF          2 2-216 11-Apr-~ fema~  52.0  6.70  9.37     9      78.5  1.70
## 4 UCSF          2 2-177 9-Jun-10 male   58.7  7.62  9.31     9      63.5  1.48
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>

While that works well enough, the code can get unwieldy if many conditions need to be evaluated, especially with a long dataset name.

The dplyr function filter() provides a cleaner syntax for subsetting datasets. Conditions separated by , are joined by & (logical AND).

# subset to females with pain >= 9
dfhp <- filter(d, sex=="female", pain>=9)
dfhp
## # A tibble: 2 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCSF          2 2-188 15-Jul-~ fema~  41.5  4.00  7.57     9      98.3  1.82
## 2 UCSF          2 2-216 11-Apr-~ fema~  52.0  6.70  9.37     9      78.5  1.70
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>

R Logical operators and functions

Here are some operators and functions to help with selection:

  • ==: equality
  • >, >=: greater than, greater than or equal to
  • !: not
  • &: AND
  • |: OR
  • %in%: matches any of (2 %in% c(1,2,3) = TRUE)
  • is.na(): equality to NA
  • near(): checking for equality for floating point (decimal) numbers, has a built-in tolerance
# select those not in UCLA, who are either younger than 40 or older than 60
filter(d, hospital!="UCLA", age<40 | age>60)
## # A tibble: 3 x 24
##   hospital hospid docid dis_date sex     age test1 test2  pain tumorsize   co2
##   <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 UCSF          2 2-113 4-Feb-10 male   63.9  4.77  1.50     8      75.1  1.56
## 2 UCSF          2 2-178 19-Apr-~ male   34.2  4.63  3.26     3      65.5  1.64
## 3 UCSF          2 2-201 21-Nov-~ fema~  64.2  2.03  6.20     7      79.7  1.72
## # ... with 13 more variables: wound <dbl>, mobility <dbl>, ntumors <dbl>,
## #   remission <dbl>, lungcapacity <dbl>, married <dbl>, familyhx <chr>,
## #   smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>, wbc <chr>,
## #   rbc <dbl>, bmi <dbl>
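Two more brief sketches, this time using %in% and is.na() (output omitted):

# patients with stage II or stage III cancer
filter(d, cancerstage %in% c("II", "III"))

# patients whose age is missing (set to NA during data cleaning)
filter(d, is.na(age))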

Selecting columns with select()

Use dplyr function select() to keep only the variables you need.

# select 4 variables
d_small <- select(d, hospid, docid, age, cancerstage)
head(d_small, n=3)
## # A tibble: 3 x 4
##   hospid docid   age cancerstage
##    <dbl> <chr> <dbl> <chr>      
## 1      1 1-1    65.0 II         
## 2      1 1-1    53.9 II         
## 3      1 1-1    41.4 I

Often we want to select a group of related variables with similar names. dplyr supplies helper functions to find columns whose names match a specified pattern:

  • starts_with(x): matches names that begin with the string x
  • ends_with(x): matches names that end with the string x
  • contains(x): matches names that contain the string x
  • matches(re): matches regular expression re
  • num_range(prefix, range) matches names that contain prefix and one element of range
# get hospital variables
dhosp <- select(d, starts_with("hosp"))
names(dhosp)
## [1] "hospital" "hospid"

# get "test1" and "test2" variables using num_range
tests <- select(d, num_range("test", 1:2))
names(tests)
## [1] "test1" "test2"

Use select() to reorder columns

You can specify a particular ordering of columns with select(). To place the remaining columns in their current order, use everything(). You can also rename variables with select().

# put the two test vars first while renaming them, 
#   then everything else 
tests_first <- select(d, t1=test1, t2=test2, everything())
names(tests_first)
##  [1] "t1"           "t2"           "hospital"     "hospid"       "docid"       
##  [6] "dis_date"     "sex"          "age"          "pain"         "tumorsize"   
## [11] "co2"          "wound"        "mobility"     "ntumors"      "remission"   
## [16] "lungcapacity" "married"      "familyhx"     "smokinghx"    "cancerstage" 
## [21] "lengthofstay" "wbc"          "rbc"          "bmi"

Sorting rows with arrange()

Sort the order of rows by variable values using arrange() from dplyr.

By default, ascending order will be used. Surround a sorting variable with desc() to sort in descending order instead.

# sort, with males before females, then by age, youngest first
arrange(d, desc(sex), age) 
## # A tibble: 120 x 24
##    hospital hospid docid dis_date sex     age test1  test2  pain tumorsize   co2
##    <chr>     <dbl> <chr> <chr>    <chr> <dbl> <dbl>  <dbl> <dbl>     <dbl> <dbl>
##  1 UCSF          2 2-178 19-Apr-~ male   34.2  4.63  3.26      3      65.5  1.64
##  2 UCLA          1 1-58  9-Feb-10 male   35.3  2.14 12.4       6      60.5  1.56
##  3 UCSF          2 2-121 20-Apr-~ male   40.1 NA    NA         5      73.5  1.65
##  4 UCLA          1 1-1   4-Sep-10 male   41.4 NA     2.13      3      86.4  1.45
##  5 UCSF          2 2-121 22-Aug-~ male   41.5  2.88  6.80      5      60.2  1.50
##  6 UCSF          2 2-163 19-Apr-~ male   42.5  1.63  0.749     4      54.0  1.72
##  7 UCLA          1 1-11  19-Aug-~ male   43.7 10.5   5.89      7      61.3  1.49
##  8 UCLA          1 1-1   25-Jun-~ male   46.8  3.89  1.35      3      53.4  1.57
##  9 UCLA          1 1-21  22-Apr-~ male   48.0  3.39 NA         8      79.6  1.55
## 10 UCSF          2 2-178 12-Mar-~ male   48.9  4.15  3.04      7      71.8  1.47
## # ... with 110 more rows, and 13 more variables: wound <dbl>, mobility <dbl>,
## #   ntumors <dbl>, remission <dbl>, lungcapacity <dbl>, married <dbl>,
## #   familyhx <chr>, smokinghx <chr>, cancerstage <chr>, lengthofstay <dbl>,
## #   wbc <chr>, rbc <dbl>, bmi <dbl>

Transforming variables into new variables with mutate()

The function mutate() allows us to transform many variables in one step without having to respecify the data frame name over and over.

Useful R functions for transforming:

  • log(): logarithm
  • min_rank(): rank values
  • cut(): cut a continuous variable into intervals with new integer value signifying into which interval original value falls
  • scale(): standardizes a variable (subtracts the mean and divides by the standard deviation)
  • lag(), lead(): lag and lead a variable (from dplyr)
  • cumsum(): cumulative sum
  • rowMeans(), rowSums(): means and sums of several columns
  • recode(): recode values (from dplyr)
# create age category variable, and highpain binary variable
d <- mutate(d,
            agecat = cut(age, breaks=c(30,40,50,60,70,120)),
            highpain = pain > mean(pain))
table(d$agecat, d$highpain)
##           
##            FALSE TRUE
##   (30,40]      2    2
##   (40,50]     27   15
##   (50,60]     37   32
##   (60,70]      2    2
##   (70,120]     0    0
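As another sketch (the new variable names log_tumorsize and age_std are just for illustration), we could log-transform tumorsize and standardize age:

# log of tumorsize and standardized age; NA values in age remain NA
d_tmp <- mutate(d,
                log_tumorsize = log(tumorsize),
                age_std = as.numeric(scale(age)))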

Grouped data operations

For many datasets we want to calculate statistics by group, rather than over the whole dataset.

The dplyr functions group_by() and summarise() together greatly simplify group operations.

First, we create a grouped data frame (tibble) with group_by(). We will group our patient dataset by doctor id.

# group_by creates a grouped_df (grouped data frame) class structure
by_doc <- group_by(d, docid)
class(by_doc)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Next we use summarise() and statistical or numerical functions to summarize variables by group.

Examples of useful functions for summarise():

  • min(), max(), mean(), sum(), var(), sd()
  • n(): number of observations in the group
  • n_distinct(x): number of distinct values in variable x
# Create summaries of patients by doctor
pat_summ <- summarise(by_doc,
                      n_pat = n(), # number of patients
                      longest_los = max(lengthofstay), # longest length of stay
                      avg_age = mean(age), # average age
                      n_stage_four = sum(cancerstage=="IV") # number of stage IV patients 
                      )
pat_summ
## # A tibble: 22 x 5
##    docid n_pat longest_los avg_age n_stage_four
##    <chr> <int>       <dbl>   <dbl>        <int>
##  1 1-1       7           6    52.5            0
##  2 1-100     9           7    NA              0
##  3 1-11      8           7    51.7            1
##  4 1-21      1           4    48.0            0
##  5 1-22      4           6    48.8            0
##  6 1-33      6           7    49.0            0
##  7 1-48      7           7    51.3            1
##  8 1-57      3           5    45.6            0
##  9 1-58      3           7    51.1            0
## 10 1-72      8           8    50.3            0
## # ... with 12 more rows

A quick aside about package conflicts and masking

The dplyr package provides summarize() for those who prefer an American spelling to summarise(). However, the package Hmisc also contains a summarize() function, and if it is loaded into the same session as dplyr, a conflict arises between the two summarize() functions. The function from the package that was loaded first will be masked when the second package is loaded into the session (with an accompanying red warning message). The function loaded second will be the default function used when summarize() is specified.

So, if Hmisc is loaded first and dplyr loaded second, specifying summarize(x) will call summarize() from dplyr.

One can still access the function loaded from the first package in a couple of ways, sketched after this list.

  • use the syntax package::function(), for example Hmisc::summarize(x)
  • remove a package from the session using detach(package:name), for example detach(package:dplyr)
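A brief sketch of both approaches, assuming Hmisc and dplyr are both loaded in the session:

# call Hmisc's summarize() explicitly: mean pain by hospital
Hmisc::summarize(d$pain, d$hospital, mean)

# or remove dplyr from the session entirely (rarely needed)
# detach(package:dplyr)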

Piping

The pipe operator %>%

A data management task may involve many steps to reach the final desired dataset. Often, during intermediate steps, datasets are generated that we don’t care about or plan on keeping. For these multi-step tasks, the pipe operator provides a useful, time-saving and code-saving shorthand.

Thinking up names for intermediate datasets takes time and clutters code. Piping makes your code more readable by focusing on the functions used rather than the names of datasets.

The pipe operator is %>%, which can be typed with CTRL-SHIFT-M (CMD-SHIFT-M on a Mac). The pipe operator can be expressed as "then" when reading code. The operator is loaded with package magrittr (automatically loaded with library(tidyverse)).

The pipe operator “pipes” the dataset on the left of the %>% operator to the function on the right of the operator.

The code x %>% f(y) translates to f(x,y), that is, x is treated by default as the first argument of f(). If the function returns a data frame, we can then pipe this data frame into another function. Thus x %>% f(y) %>% g(z) translates to g(f(x,y), z).

Examples of using the pipe operator

As a first example, perhaps we want to create a dataset of just females under 40, with only the age and pain variables selected. We could do this in 2 steps, like so:

# a dataset of age and pain for females younger than 40
f40 <- filter(d, sex=="female" & age<40)
f40_small <- select(f40, age, pain)
f40_small
## # A tibble: 2 x 2
##     age  pain
##   <dbl> <dbl>
## 1  37.3     8
## 2  39.6     5

While that works fine, the intermediate dataset f40 is not of interest and is cluttering up memory and the workspace unnecessarily.

We could use %>% instead:

# start with d, then filter rows, then select variables
f40_small <- d %>%   
  filter(sex=="female" & age<40) %>% 
  select(age, pain)
f40_small
## # A tibble: 2 x 2
##     age  pain
##   <dbl> <dbl>
## 1  37.3     8
## 2  39.6     5

No intermediate dataset was created, and dataset names are rarely needed in intermediate steps. The dataset transformations, filtering and selecting, are easy to see.

Another pipe example

Piping really helps make a long chain of commands more immediately understandable.

# create a plot of average age vs tumor size 
#   by doctors, for doctors with more than 5 patients
g1 <- d %>%     
  group_by(docid) %>%  
  summarise(n_pat=n(),   
            avg_age=mean(age), 
            avg_tumor=mean(tumorsize)) %>%   
  filter(n_pat > 5) %>%  
  ggplot(aes(x=avg_age, y=avg_tumor)) + geom_point() +
  geom_smooth() 
g1

Using . in piped commands

If the dataset is not to be specified as the first argument of the function on the right of the %>% operator, use . to specify where the dataset should appear as the argument.

x %>% f(y,.) translates to f(y,x).

For example, for modeling functions such as lm(), the model formula is typically the first argument rather than the dataset, but we can use . to stand in for the dataset in the piping chain:

d %>% 
  filter(age < 40) %>% 
  lm(tumorsize ~ age, data=.) # the . is the filtered dataset
## 
## Call:
## lm(formula = tumorsize ~ age, data = .)
## 
## Coefficients:
## (Intercept)          age  
##     -78.219        4.116

Appending and merging

Appending data frames

Often datasets are split into multiple files, perhaps because data are collected in several waves or by different researchers. When files share the same variables (hopefully!), we can append the datasets, or bind their rows together.

Two choices for appending, rbind() and bind_rows()

We can use either the base R function rbind() or the dplyr function bind_rows() to append. They differ in how they handle the situation where the datasets being appended have non-matching columns.

  • rbind() will produce an error
  • bind_rows() will append the datasets, filling in NA values in unmatched columns

They also differ in how they handle the situation where the same column in the two datasets has two different types (e.g. one character, the other numeric), as sketched after this list.

  • rbind() will coerce one type to the other (direction of coercion is logical -> integer -> double -> character)
  • bind_rows() will produce an error
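A toy sketch of that difference (df_num and df_chr are hypothetical data frames, not the seminar data):

# the column x is numeric in one data frame and character in the other
df_num <- data.frame(id = 1:2, x = c(1, 2))
df_chr <- data.frame(id = 3:4, x = c("a", "b"))

rbind(df_num, df_chr)        # works: x is coerced to character
# bind_rows(df_num, df_chr)  # error: can't combine <double> and <character>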

Example of appending data sets with non-matching columns

Let’s load a second dataset that contains the same variables as the original version of our first dataset, before we added 2 variables with mutate().

# new data set that contains the same variables as d, except is missing 2 of them
d2 <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/patient_pt2_dm.csv")
# rows and columns of d2 and d
dim(d2)
## [1] 111  24
dim(d)
## [1] 120  26

Because our first data set d has a few extra variables we created earlier with mutate(), the two datasets do not have completely matching columns.

  • Using rbind() on the datasets as they are will produce an error.
  • Using bind_rows() will fill in NA values for the observations from the second dataset on the variables that appear only in the first dataset.

The best solution would be to calculate the new variables for the second dataset and then use either function to append…

However, to demonstrate the behavior of bind_rows(), we append the datasets as they are.

Use the .id= argument in bind_rows() to create a variable that identifies the source dataset.

# a new variable called source is added to the beginning of the dataset
d3 <- bind_rows(d, d2, .id="source")

# these are the rows where the datasets meet
#  hospital is found in both datasets, agecat and highpain are not
select(d3, source, hospital, agecat, highpain)[118:123,]
## # A tibble: 6 x 4
##   source hospital     agecat  highpain
##   <chr>  <chr>        <fct>   <lgl>   
## 1 1      UCSF         (50,60] TRUE    
## 2 1      UCSF         (50,60] TRUE    
## 3 1      UCSF         (50,60] TRUE    
## 4 2      Cedars-Sinai <NA>    NA      
## 5 2      Cedars-Sinai <NA>    NA      
## 6 2      Cedars-Sinai <NA>    NA

# this will work because we restrict d to only variables common to both
drbind <- rbind(d[,1:24], d2)

Merging datasets with dplyr joins

Appending adds more rows of observations, whereas merging adds more columns of variables. Datasets to be merged should be matched on some id variable(s).

The dplyr “join” functions perform such merges and will use any same-named variables between the datasets as the id variables by default. Use the by= argument to specify the matching id variables explicitly.

These joins all return a table with all columns from x and y, but differ in how they deal with mismatched rows:

  • inner_join(x, y): returns all rows from x where there is a matching value in y (returns only matching rows).

  • left_join(x, y): returns all rows from x, unmatched rows in x will have NA in the columns from y. Unmatched rows in y not returned.

  • full_join(x, y): returns all rows from x and from y; unmatched rows in either will have NA in new columns

Each of these joins is possible with base R function merge(), but merge() is many times slower than the joins.

Example of merging with inner_join()

We begin our demonstration of merging by loading a dataset of doctor variables that describes each doctor’s years of experience, schooling, number of lawsuits, etc.

d_doc <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/doctor_dm.csv")

It is easiest to show how the joins differ if we work on smaller, more manageable datasets.

We will be merging a subset of the d_doc dataset we just imported into a subset of the d3 dataset we created earlier (the two patient datasets appended together). We will select two doctors from d3, doctors “1-21” and “2-178”, and one matching doctor and one non-matching doctor from d_doc, “2-178” and “3-407”.

# select one non-matching and one matching doctor from each to demo joins
#   just a few variables from d3
d3.1 <- select(filter(d3, docid == "1-21" | docid == "2-178"),
               docid, sex, age, test1, test2)
d_doc.1 <- filter(d_doc, docid == "3-407" | docid == "2-178")

Let’s take a look at our small datasets

# first a look at the two datasets to be merged
#  2 rows for docid 2-178 in d3.1
d3.1
## # A tibble: 3 x 5
##   docid sex     age test1 test2
##   <chr> <chr> <dbl> <dbl> <dbl>
## 1 1-21  male   48.0  3.39 NA   
## 2 2-178 male   34.2  4.63  3.26
## 3 2-178 male   48.9  4.15  3.04
d_doc.1
## # A tibble: 2 x 5
##   docid experience school  lawsuits medicaid
##   <chr>      <dbl> <chr>      <dbl>    <dbl>
## 1 2-178         15 average        4    0.817
## 2 3-407         23 average        3    0.343

The merging id variable “docid” appears in both datasets, so we do not need to specify the merging variable explicitly.

Notice how inner_join(), left_join() and full_join() differ in their behavior:

# only matching rows returned
#   2-178 from d_doc.1 matched twice to 2-178 in d3.1
inner_join(d3.1, d_doc.1)
## # A tibble: 2 x 9
##   docid sex     age test1 test2 experience school  lawsuits medicaid
##   <chr> <chr> <dbl> <dbl> <dbl>      <dbl> <chr>      <dbl>    <dbl>
## 1 2-178 male   34.2  4.63  3.26         15 average        4    0.817
## 2 2-178 male   48.9  4.15  3.04         15 average        4    0.817
# all rows from d3.1 returned
left_join(d3.1, d_doc.1)
## # A tibble: 3 x 9
##   docid sex     age test1 test2 experience school  lawsuits medicaid
##   <chr> <chr> <dbl> <dbl> <dbl>      <dbl> <chr>      <dbl>    <dbl>
## 1 1-21  male   48.0  3.39 NA            NA <NA>          NA   NA    
## 2 2-178 male   34.2  4.63  3.26         15 average        4    0.817
## 3 2-178 male   48.9  4.15  3.04         15 average        4    0.817
# all rows from both returned
full_join(d3.1, d_doc.1)
## # A tibble: 4 x 9
##   docid sex     age test1 test2 experience school  lawsuits medicaid
##   <chr> <chr> <dbl> <dbl> <dbl>      <dbl> <chr>      <dbl>    <dbl>
## 1 1-21  male   48.0  3.39 NA            NA <NA>          NA   NA    
## 2 2-178 male   34.2  4.63  3.26         15 average        4    0.817
## 3 2-178 male   48.9  4.15  3.04         15 average        4    0.817
## 4 3-407 <NA>   NA   NA    NA            23 average        3    0.343
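If we want to be explicit about the matching variable, we can supply it with by= (same result as the default here):

# explicitly specify the merging id variable
inner_join(d3.1, d_doc.1, by="docid")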

Tidy data

Tidy data philosophy

We now explore the tools in the tidyverse package tidyr (loaded with library(tidyverse)), which make datasets “tidy”.

Tidy datasets adhere to a simple set of structural rules:

  • A dataset is a set of values organized into variables (columns) and observations (rows)
  • A variable should measure the same attribute across observations
  • An observation should represent the same unit measured across variables.

Tidy data conventions provide a standard organization that prevents having to "reinvent the wheel" each time data need to be organized.

Doing a little more work upfront to make data “tidy” will pay off with time saved later on.

Unit of Analysis

Each row of the dataset should represent one unit of analysis. Units may be subjects, or trials within subjects, or whole groups of subjects.

Some analyses may involve both variables describing individual units and aggregated units. For example, a student-level analysis may involve school-level variables. Longitudinal studies may involve within-subject and between-subject variables. For multilevel analyses, the unit of analysis is generally the lowest level.

# data at multiple levels - patients nested in doctors
d %>% 
  inner_join(pat_summ) %>% 
  select(sex, age, lengthofstay, docid, avg_age, longest_los) %>% 
  head(n=3)
## # A tibble: 3 x 6
##   sex      age lengthofstay docid avg_age longest_los
##   <chr>  <dbl>        <dbl> <chr>   <dbl>       <dbl>
## 1 male    65.0            6 1-1      52.5           6
## 2 female  53.9            6 1-1      52.5           6
## 3 male    41.4            5 1-1      52.5           6

Untidy data

Because analysts often work with data they did not collect themselves, encountering untidy data is more common than we would like.

Data can be messy in many ways, but we focus on two ways that can be tidied with tidyr tools:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.

Let’s take a look at some untidy datasets and see how the tidyr functions pivot_longer() and pivot_wider() will help us to tidy them.

Column headers are values, not variable names.

We first load a dataset of the number of graduates from 3 academic departments across 3 years. The planned analysis will examine whether the number of graduates has grown over the years.

The unit of analysis here is the number of graduates in a particular department in a particular year. Each row should thus represent one year in one department.

dept <- read_csv("https://stats.idre.ucla.edu/stat/data/rdm/dept1.csv")
dept
## # A tibble: 3 x 4
##   id      `2015` `2016` `2017`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 biology    207    211    259
## 2 math        96     75     99
## 3 physics    112    126    125

Year is a predictor in our analysis, so it should be a column variable. Each department has 3 years of data, so each requires 3 rows. Additionally, every value in the table measures the same thing, the number of graduates in one department in one year, so we can put all the values in one column.

Use pivot_longer() to create a variable out of column headings and restructure the dataset

To use pivot_longer(), we select a set of column variables for reshaping:

  • the selected column headings are stacked repeatedly into a new column variable
  • the values in the selected columns are stacked into another new column variable

This process is also known as “reshaping long”.

Arguments to pivot_longer():

  • the column variables
  • names_to=: name of the new column variable that will hold the stacked selected column headings
  • values_to=: the name of the new column variable that will hold the stacked values of the selected columns

By default, pivot_longer() will select all columns for stacking. However, columns that do not vary with the values in the new names_to= column should not be stacked.

Example of using pivot_longer()

Here again is the data set we want to reshape:

dept
## # A tibble: 3 x 4
##   id      `2015` `2016` `2017`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 biology    207    211    259
## 2 math        96     75     99
## 3 physics    112    126    125

In the newly reshaped dataset, we want to put years and graduation counts into 2 columns. Since years are the column headings, we will name the names_to= variable "year". The graduation counts in the columns 2015, 2016, and 2017 will be stacked into another new variable whose name we specify in values_to= as "grad".

Department name (id) does not vary with year, so we do not need to stack it. We can specify which columns are to be reshaped either by listing the columns to be collapsed, as in c(`2015`, `2016`, `2017`)*, or by listing the columns not to be collapsed after the - symbol, as in -id.

*Note that the backticks (`) are needed to specify these column names because they do not start with a letter.

# the new column "year" uses the column headings as values,
#  the new column "graduates" will be the collapsed values
#  we do not want to collapse id
dept_by_year <- dept %>% 
  pivot_longer(names_to="year", values_to="grad", -id)

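A quick check of the reshaped data (output omitted): it now has one row per department-year combination.

# 9 rows (3 departments x 3 years), with columns id, year, and grad
dept_by_year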