How can I read in multiple files?

Version info: Code for this page was tested in R Under development (unstable) (2012-07-05 r59734) On: 2012-08-08 With: knitr 0.6.3

You may at times wish to read a set of data files into R. The code below demonstrates how to do so by looping through the names of the files to be read. The datasets we read here are in Stata format, so we use the foreign package. All of the datasets are stored in one list, each element of which is a data frame holding one dataset. We use the file.path() function to combine the path to the files with the individual file names.

library(foreign)

## create and view an object with file names and full paths
(f <- file.path("https://stats.idre.ucla.edu/stat/data", c("auto.dta",
    "cancer.dta", "efa_cfa.dta", "hsbmar.dta")))

## [1] "https://stats.idre.ucla.edu/stat/data/auto.dta"   
## [2] "https://stats.idre.ucla.edu/stat/data/cancer.dta" 
## [3] "https://stats.idre.ucla.edu/stat/data/efa_cfa.dta"
## [4] "https://stats.idre.ucla.edu/stat/data/hsbmar.dta"
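As a small illustration of why file.path() is convenient here, the sketch below shows that it is vectorized: a single directory is combined with each file name in turn. The directory "data" and the file names are made-up placeholders, not files used elsewhere on this page.

```r
## file.path() is vectorized: one directory is combined with each file name
## ("data" and the file names are made-up placeholders)
files <- c("a.dta", "b.dta", "c.dta")
p <- file.path("data", files)
p

## the same vector built by hand with paste()
identical(p, paste("data", files, sep = "/"))
```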

Now we would like to read all of these files from the internet and store them in one list, d. Note that this would work just as well if the files were on a local disk, since R can read from either a local path or a URL. lapply loops through each element of f, passes it to the specified function (in this case read.dta), and returns all of the results as a list, which is then assigned to d.

d <- lapply(f, read.dta)

## view the structure of d
str(d, give.attr = FALSE)

## List of 4
##  $ :'data.frame':	74 obs. of  12 variables:
##   ..$ make        : chr [1:74] "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
##   ..$ price       : int [1:74] 4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
##   ..$ mpg         : int [1:74] 22 17 22 20 15 18 26 20 16 19 ...
##   ..$ rep78       : int [1:74] 3 3 NA 3 4 3 NA 3 3 3 ...
##   ..$ headroom    : num [1:74] 2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
##   ..$ trunk       : int [1:74] 11 11 12 16 20 21 10 16 17 13 ...
##   ..$ weight      : int [1:74] 2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
##   ..$ length      : int [1:74] 186 173 168 196 222 218 170 200 207 200 ...
##   ..$ turn        : int [1:74] 40 40 35 40 43 43 34 42 43 42 ...
##   ..$ displacement: int [1:74] 121 258 121 196 350 231 304 196 231 231 ...
##   ..$ gear_ratio  : num [1:74] 3.58 2.53 3.08 2.93 2.41 ...
##   ..$ foreign     : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
##  $ :'data.frame':	48 obs. of  7 variables:
##   ..$ id       : num [1:48] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ studytime: int [1:48] 1 1 2 3 4 4 5 5 8 8 ...
##   ..$ died     : int [1:48] 1 1 1 1 1 1 1 1 1 0 ...
##   ..$ drug     : num [1:48] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ age      : int [1:48] 61 65 59 52 56 67 63 58 56 58 ...
##   ..$ distime  : num [1:48] 1 1 1 1 1 1 1 1 2 2 ...
##   ..$ censor   : num [1:48] 0 0 0 0 0 0 0 0 0 1 ...
##  $ :'data.frame':	500 obs. of  6 variables:
##   ..$ y1: num [1:500] -0.535 0.522 1.581 -0.458 0.276 ...
##   ..$ y2: num [1:500] 0.0996 0.959 2.4845 0.8851 -0.8402 ...
##   ..$ y3: num [1:500] -0.887 2.619 3.085 0.25 -2.08 ...
##   ..$ y4: num [1:500] -1.245 2.668 0.908 -1.076 -0.441 ...
##   ..$ y5: num [1:500] -3.493 0.438 2.218 -1.365 -2.67 ...
##   ..$ y6: num [1:500] 1.4408 0.7939 -0.0673 -0.9105 -1.2395 ...
##  $ :'data.frame':	200 obs. of  13 variables:
##   ..$ id     : int [1:200] 52 155 33 78 17 189 70 101 36 199 ...
##   ..$ female : Factor w/ 2 levels "male","female": NA NA NA NA NA NA NA NA NA NA ...
##   ..$ ses    : Factor w/ 3 levels "low","middle",..: 1 2 1 2 2 2 1 3 1 3 ...
##   ..$ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 2 1 1 1 2 ...
##   ..$ prog   : Factor w/ 3 levels "general","academic",..: 2 1 2 2 2 2 1 2 1 2 ...
##   ..$ read   : int [1:200] 50 44 57 39 47 47 57 60 44 NA ...
##   ..$ write  : int [1:200] 46 44 65 54 57 59 52 62 49 59 ...
##   ..$ math   : int [1:200] 53 46 72 54 48 63 41 67 44 50 ...
##   ..$ science: int [1:200] 53 39 54 53 44 53 47 50 35 61 ...
##   ..$ socst  : int [1:200] 66 51 56 41 41 46 57 56 51 61 ...
##   ..$ honors : Factor w/ 2 levels "not enrolled",..: 1 1 2 1 1 1 1 2 1 1 ...
##   ..$ awards : int [1:200] 0 0 5 1 2 2 1 3 0 2 ...
##   ..$ cid    : int [1:200] 9 3 18 9 8 13 8 16 3 13 ...
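To make clearer what lapply does, here is a minimal sketch comparing it with an explicit for loop. So that it runs without network access, it assumes two small built-in datasets (cars and women) written to temporary CSV files and read with read.csv rather than read.dta; the directory name and the objects paths, d1, and d2 are made up for this example.

```r
## write two small built-in datasets to temporary CSV files
## (the directory and file names are made up for this example)
tmp <- file.path(tempdir(), "lapply_demo")
dir.create(tmp, showWarnings = FALSE)
paths <- file.path(tmp, c("cars.csv", "women.csv"))
write.csv(cars, paths[1], row.names = FALSE)
write.csv(women, paths[2], row.names = FALSE)

## one-line version, the same pattern as lapply(f, read.dta)
d1 <- lapply(paths, read.csv)

## explicit loop that builds the same list element by element
d2 <- vector("list", length(paths))
for (i in seq_along(paths)) {
    d2[[i]] <- read.csv(paths[i])
}

## check that both approaches give the same result
identical(d1, d2)
```

The lapply version is shorter and does not require creating the empty list in advance, which is why we use it on this page.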

You can then operate on the list as a whole or on specific elements. For example, you could return a list of all the variable names in each dataset by applying the function names to each element of d with lapply.

lapply(d, names)

## [[1]]
##  [1] "make"         "price"        "mpg"          "rep78"       
##  [5] "headroom"     "trunk"        "weight"       "length"      
##  [9] "turn"         "displacement" "gear_ratio"   "foreign"     
## 
## [[2]]
## [1] "id"        "studytime" "died"      "drug"      "age"       "distime"  
## [7] "censor"   
## 
## [[3]]
## [1] "y1" "y2" "y3" "y4" "y5" "y6"
## 
## [[4]]
##  [1] "id"      "female"  "ses"     "schtyp"  "prog"    "read"    "write"  
##  [8] "math"    "science" "socst"   "honors"  "awards"  "cid"    
##
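The same idea works with any function that accepts a data frame. The sketch below uses sapply, which simplifies the result to a vector or matrix when it can. To keep it self-contained, it assumes a small list, dl, of built-in data frames standing in for d.

```r
## a small named list of built-in data frames standing in for d
dl <- list(cars = cars, women = women)

## rows and columns of every element, returned as a matrix
sapply(dl, dim)

## number of rows only, returned as a named vector
sapply(dl, nrow)
```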

You can also work with just one element. For example, to view a summary of the first three variables of the first dataset, first use [[ to extract the first element of the list (auto.dta), then use [ to extract the first 3 columns, and call summary on the result.

summary(d[[1]][, 1:3])

##      make               price            mpg      
##  Length:74          Min.   : 3291   Min.   :12.0  
##  Class :character   1st Qu.: 4220   1st Qu.:18.0  
##  Mode  :character   Median : 5006   Median :20.0  
##                     Mean   : 6165   Mean   :21.3  
##                     3rd Qu.: 6332   3rd Qu.:24.8  
##                     Max.   :15906   Max.   :41.0
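The difference between [ and [[ matters here, so a short sketch may help. It again assumes a stand-in list, dl, of built-in data frames rather than the list d read above.

```r
## a small named list of built-in data frames standing in for d
dl <- list(cars = cars, women = women)

## [ keeps the list wrapper: the result is still a list (of length 1)
class(dl[1])

## [[ returns the element itself: here a data frame
class(dl[[1]])

## so column indexing only makes sense after [[
summary(dl[[1]][, 1:2])
```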

To make working with the list easier, we might want to name the elements. That way, we know which dataset is in which element and can refer to them by name instead of just by position if we want. To do this, let’s use regular expressions to parse the full path names and extract the portion after the last “/” and before the “.”. Then we assign those as the names of d.

names(d) <- gsub(".*/(.*)\\..*", "\\1", f)

## view the structure of d again
str(d, give.attr = FALSE)

## List of 4
##  $ auto   :'data.frame':	74 obs. of  12 variables:
##   ..$ make        : chr [1:74] "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
##   ..$ price       : int [1:74] 4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
##   ..$ mpg         : int [1:74] 22 17 22 20 15 18 26 20 16 19 ...
##   ..$ rep78       : int [1:74] 3 3 NA 3 4 3 NA 3 3 3 ...
##   ..$ headroom    : num [1:74] 2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
##   ..$ trunk       : int [1:74] 11 11 12 16 20 21 10 16 17 13 ...
##   ..$ weight      : int [1:74] 2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
##   ..$ length      : int [1:74] 186 173 168 196 222 218 170 200 207 200 ...
##   ..$ turn        : int [1:74] 40 40 35 40 43 43 34 42 43 42 ...
##   ..$ displacement: int [1:74] 121 258 121 196 350 231 304 196 231 231 ...
##   ..$ gear_ratio  : num [1:74] 3.58 2.53 3.08 2.93 2.41 ...
##   ..$ foreign     : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cancer :'data.frame':	48 obs. of  7 variables:
##   ..$ id       : num [1:48] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ studytime: int [1:48] 1 1 2 3 4 4 5 5 8 8 ...
##   ..$ died     : int [1:48] 1 1 1 1 1 1 1 1 1 0 ...
##   ..$ drug     : num [1:48] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ age      : int [1:48] 61 65 59 52 56 67 63 58 56 58 ...
##   ..$ distime  : num [1:48] 1 1 1 1 1 1 1 1 2 2 ...
##   ..$ censor   : num [1:48] 0 0 0 0 0 0 0 0 0 1 ...
##  $ efa_cfa:'data.frame':	500 obs. of  6 variables:
##   ..$ y1: num [1:500] -0.535 0.522 1.581 -0.458 0.276 ...
##   ..$ y2: num [1:500] 0.0996 0.959 2.4845 0.8851 -0.8402 ...
##   ..$ y3: num [1:500] -0.887 2.619 3.085 0.25 -2.08 ...
##   ..$ y4: num [1:500] -1.245 2.668 0.908 -1.076 -0.441 ...
##   ..$ y5: num [1:500] -3.493 0.438 2.218 -1.365 -2.67 ...
##   ..$ y6: num [1:500] 1.4408 0.7939 -0.0673 -0.9105 -1.2395 ...
##  $ hsbmar :'data.frame':	200 obs. of  13 variables:
##   ..$ id     : int [1:200] 52 155 33 78 17 189 70 101 36 199 ...
##   ..$ female : Factor w/ 2 levels "male","female": NA NA NA NA NA NA NA NA NA NA ...
##   ..$ ses    : Factor w/ 3 levels "low","middle",..: 1 2 1 2 2 2 1 3 1 3 ...
##   ..$ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 2 1 1 1 2 ...
##   ..$ prog   : Factor w/ 3 levels "general","academic",..: 2 1 2 2 2 2 1 2 1 2 ...
##   ..$ read   : int [1:200] 50 44 57 39 47 47 57 60 44 NA ...
##   ..$ write  : int [1:200] 46 44 65 54 57 59 52 62 49 59 ...
##   ..$ math   : int [1:200] 53 46 72 54 48 63 41 67 44 50 ...
##   ..$ science: int [1:200] 53 39 54 53 44 53 47 50 35 61 ...
##   ..$ socst  : int [1:200] 66 51 56 41 41 46 57 56 51 61 ...
##   ..$ honors : Factor w/ 2 levels "not enrolled",..: 1 1 2 1 1 1 1 2 1 1 ...
##   ..$ awards : int [1:200] 0 0 5 1 2 2 1 3 0 2 ...
##   ..$ cid    : int [1:200] 9 3 18 9 8 13 8 16 3 13 ...
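If you prefer to avoid regular expressions, the names can also be derived with basename() and tools::file_path_sans_ext(). This is offered as an alternative sketch, not as the method used on this page; nm_regex and nm_base are names made up for the comparison.

```r
## the same URLs as in the text
f <- file.path("https://stats.idre.ucla.edu/stat/data", c("auto.dta",
    "cancer.dta", "efa_cfa.dta", "hsbmar.dta"))

## regular expression approach used above
nm_regex <- gsub(".*/(.*)\\..*", "\\1", f)

## alternative: drop the directory part, then drop the extension
nm_base <- tools::file_path_sans_ext(basename(f))

## both give the same names for these files
identical(nm_regex, nm_base)
```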

## reference list elements by name: create a summary of the cancer dataset
## in d
summary(d$cancer)

##        id         studytime         died            drug      
##  Min.   : 1.0   Min.   : 1.0   Min.   :0.000   Min.   :0.000  
##  1st Qu.:12.8   1st Qu.: 7.8   1st Qu.:0.000   1st Qu.:0.000  
##  Median :24.5   Median :12.5   Median :1.000   Median :1.000  
##  Mean   :24.5   Mean   :15.5   Mean   :0.646   Mean   :0.583  
##  3rd Qu.:36.2   3rd Qu.:23.0   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :48.0   Max.   :39.0   Max.   :1.000   Max.   :1.000  
##       age          distime         censor     
##  Min.   :47.0   Min.   :1.00   Min.   :0.000  
##  1st Qu.:50.8   1st Qu.:2.00   1st Qu.:0.000  
##  Median :56.0   Median :2.50   Median :0.000  
##  Mean   :55.9   Mean   :2.98   Mean   :0.354  
##  3rd Qu.:60.0   3rd Qu.:4.00   3rd Qu.:1.000  
##  Max.   :67.0   Max.   :6.00   Max.   :1.000

This code easily extends to other types of files. CSV text files can be read using read.csv, and general delimited text files with read.table. If you want to read all of the files in a particular directory, first get a vector of all the file names using list.files(), then simply read them in as before.
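Here is a minimal sketch of reading every file in a directory, assuming CSV files and using list.files() to collect the names. The temporary directory, the file names, and the objects dat_dir, csv_files, and dl are made up so that the example runs on its own.

```r
## set up a temporary directory holding two small CSV files
## (the directory and file names are made up for this example)
dat_dir <- file.path(tempdir(), "read_all_demo")
dir.create(dat_dir, showWarnings = FALSE)
write.csv(cars, file.path(dat_dir, "cars.csv"), row.names = FALSE)
write.csv(women, file.path(dat_dir, "women.csv"), row.names = FALSE)

## full paths of every file in the directory whose name ends in .csv
csv_files <- list.files(dat_dir, pattern = "\\.csv$", full.names = TRUE)

## read them all and name each element after its file
dl <- lapply(csv_files, read.csv)
names(dl) <- tools::file_path_sans_ext(basename(csv_files))

## view the structure of the resulting list
str(dl, give.attr = FALSE)
```

Setting full.names = TRUE returns complete paths, so the files can be read regardless of the current working directory. The pattern argument restricts the result to files with the given extension.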