IDRE Statistical Consulting Group
R as a programming environment
R is a programming environment for data analysis and graphics.
Some features:
These tools are distributed as packages, which any user can download to customize the R environment.
Use the function install.packages() to download and install packages on to your computer. The package name should be surrounded by quotes.
Once installed, you then use library() to load the package into the current R session.
For this seminar, we will not be using any external packages.
# example code to install a package (and any dependencies)
install.packages("tidyverse", dependencies=TRUE)
# load the package into the current R session
library(tidyverse)
You can work directly in R, but most users prefer a graphical interface. We highly recommend using RStudio, an integrated development environment (IDE) that features:
R code can be entered into the Console or into the script Editor; scripts written in the Editor can be saved for later use.
You can run a command directly from a script by placing the cursor on the same line as the command or highlighting the command and hitting Ctrl-Enter (Command-Enter on Macs). This will advance the cursor to the next command, where you can hit Ctrl-Enter again to run it, advancing the cursor to the next command…
Commands are separated either by a ; or by a newline.
R is case sensitive.
The # character at the beginning of a line signifies a comment, which is not executed.
Commands can extend beyond one line of text. Put operators like + at the end of lines for multi-line commands.
Try adding 2 and 3 together on 2 separate lines in the Console:
Type 2 +, hit Enter, type 3 and hit Enter
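As a minimal sketch of the syntax rules above (the object names and values are arbitrary and used only for illustration):

```r
# Two commands on one line, separated by a semicolon
a <- 1; b <- 2

# R is case sensitive: a and A are different objects
A <- 10

# A command can span several lines when each unfinished line ends with an operator
total <- a +
  b +
  A

# Print the result
total
```

If the + were placed at the start of the next line instead, R would treat the first line as a complete command.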
R stores both data and output from data analysis (as well as everything else) in objects.
Objects can be vectors of numbers or strings, or matrices that form datasets, or functions, among many other types.
Data are assigned to and stored in objects using the <- or = operator.
To assign a string value to an object, use quotes.
To print the contents of an object, specify the object’s name alone.
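The assignment rules can be sketched as follows; the object names here are hypothetical and chosen only for illustration:

```r
# Assign a number with <- (the = operator works the same way here)
n_students <- 25
n_classes = 2

# String values must be surrounded by quotes
greeting <- "good morning"

# Typing an object's name alone prints its contents
n_students
greeting
```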
A. Create an object named A in the Console and assign it 3
B. Create an object named B in the script editor and assign it the word hello
C. View the contents of A and B
Functions perform most of the work on data in R.
Functions accept inputs, called arguments, and return some object as output.
Use the log() function to calculate the log of 10.
Help files for R functions are accessed by preceding the name of the function with ? (e.g. ?seq).
In the help file, under Usage is a list of arguments, which are explained in detail in the Arguments section. Values for arguments to functions can be specified either by name or position, so note the order.
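A small sketch of the difference between naming arguments and relying on their position, using seq() (whose from, to, by and length.out arguments are listed in its help file):

```r
# Arguments specified by name can appear in any order
by_name <- seq(by = 2, to = 9, from = 1)

# Arguments specified by position must follow the order shown under Usage
by_position <- seq(1, 9, 2)

# Positional and named arguments can be mixed
mixed <- seq(1, 9, length.out = 5)

by_name
by_position
mixed
```

All three calls should produce the same sequence; naming arguments tends to make code easier to read when a function has many of them.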
Examine the help file for the function seq().
Vectors, the fundamental data structure in R, are one-dimensional and homogeneous.
Vectors contain either logical (TRUE/FALSE), numeric, or character (string) values.
The c() function combines values of common type together to form a vector.
You can also create vectors of sequential values with seq() or repeating values with rep().
# create a vector with c()
x <- c(1, 3, 5)
x
## [1] 1 3 5
# seq(from=, to=, by=) to create a sequence
seq(from=1,to=7,by=2)
## [1] 1 3 5 7
# you can omit the argument names
seq(10,0,-5)
## [1] 10 5 0
# rep(x=, times=) for repeating values
rep("yes", 2)
## [1] "yes" "yes"
A. Create a vector which contains the sequence (1,2,3) repeated twice. Try thinking of a second or third way to accomplish this.
Datasets for statistical analysis are typically stored in data frames in R.
Data frames are rectangular, where the columns are variables and the rows are observations of those variables.
Each column of a data frame is a vector, and the columns can be different types.
We can create data frames manually with data.frame(), but typically we will load them from files.
Fig 2.5. Data frame with 4 observations of 5 variables
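For illustration, here is a minimal sketch of building a data frame by hand with data.frame(). The data are made up and the column names are hypothetical:

```r
# Each argument becomes one column; columns may have different types
students <- data.frame(
  id     = 1:4,                           # integer
  name   = c("Ana", "Ben", "Cal", "Dee"), # character
  passed = c(TRUE, FALSE, TRUE, TRUE)     # logical
)

# Print the data frame, then show the type of each column
students
str(students)
```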
We recommend storing datasets as comma-separated values (CSV) files, which are text files where the fields are delimited by commas. Use read.csv(file) to load the data stored in a CSV file (whose name should be specified inside quotes).
Data files should be set up in this way:
(_ and . are fine also)
Use read.csv() to read in the school.csv dataset and assign it to an object named school. The only argument that is required is the data file's name.
If the data file is not in the current working directory, you can specify the file’s path as well.
Note: There are many more arguments to read.csv() to control how a CSV file is read into R.
Note: You can check what the current working directory is with getwd() and set it with setwd().
Other text formats that use different delimiters (e.g. tab, space) are easy to load as well with read.delim(). Use the sep= argument to specify the delimiter.
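The following self-contained sketch writes a tiny, made-up dataset to temporary files and reads it back, so it can be run without any of the seminar's data files. The file contents and object names are assumptions for illustration only:

```r
# Write a small, made-up dataset to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,read,write",
             "1,57,52",
             "2,68,59",
             "3,44,33"),
           csv_path)

# Read it back: the first row supplies the variable names
scores <- read.csv(csv_path)
scores

# Save the same data tab-delimited, then load it with read.delim()
tsv_path <- tempfile(fileext = ".txt")
write.table(scores, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
scores_tab <- read.delim(tsv_path, sep = "\t")
scores_tab
```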
For datasets stored in other statistical software formats (e.g. SPSS, Stata, SAS), see the haven package.
Use View(x) to open a spreadsheet-style view of data frame x. In RStudio, clicking on a data frame in the Environment pane will View() it.
View the school dataset.
You can also view the first few rows of a data frame with head() and the last few with tail(). Each function has an n= argument to control the number of rows displayed.
Display the first 10 rows of school.
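Finally, summary(x) will summarize the distribution of each column of a data frame x. Numeric variables are summarized by quantiles, mean and median, while factor (categorical) variables are described by frequency tables. Character variables are reported only by length and type, so convert them with factor() to get frequencies.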
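A brief sketch of head(), tail() and summary() on a hypothetical 20-row data frame. The group column is stored as a factor here so that summary() tabulates its categories:

```r
# Made-up data: 20 rows, one numeric score and one categorical variable
scores <- data.frame(
  id    = 1:20,
  read  = seq(40, 78, by = 2),
  group = factor(rep(c("a", "b"), times = 10))
)

head(scores, n = 3)   # first 3 rows
tail(scores, n = 2)   # last 2 rows
summary(scores)       # one summary per column
```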
Use summary() on school to learn about its variables.
The syntax dataframe[r,c] will extract the rows numbered r and the columns numbered c.
r and c can be single numbers, vectors of numbers, or names of columns or rows.
Omitting r specifies all rows and omitting c specifies all columns.
Extract the value in the first row, second column of school.
Extract all columns from the first 2 rows of school.
The syntax dataframe$column will extract the variable named column in the data frame.
Extract the variable math from the school dataset.
Data frame columns are vectors themselves, and can be subset with [].
Extract the second and third elements of the read variable in the school dataset.
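The three subsetting forms can be sketched on a small, made-up data frame (the names and values below are hypothetical):

```r
scores <- data.frame(
  id    = 1:4,
  read  = c(57, 68, 44, 63),
  write = c(52, 59, 33, 44)
)

scores[2, 3]               # single value: row 2, column 3
scores[1:2, ]              # rows 1 and 2, all columns
scores[, c("id", "read")]  # all rows, columns selected by name
scores$write               # one column, returned as a vector
scores$write[c(1, 4)]      # first and fourth elements of that vector
```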
We can add variables (columns) to a data frame directly by assigning something to a new column, for example dataframe$newcol <- x.
New variables are often transformations of existing variables. Here are some useful transformation functions:
log(x): logarithm of x
cut(x, breaks): cuts x into intervals at cut points specified in vector breaks, returning a factor whose levels identify the interval into which each value falls (integer codes if labels=FALSE)
scale(x): standardizes x (subtracts mean and divides by standard deviation)
rowMeans(x), rowSums(x): means and sums across the columns of each row of object x, which should have multiple columns
Use the rowMeans() function to add a new variable called avg to the school dataset that is the average of read, write, and math. The input should be a subset of school containing all rows for these 3 columns.
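A short sketch of adding transformed variables to a data frame. The data and the new column names are made up for illustration:

```r
scores <- data.frame(read = c(57, 68, 44, 63), write = c(52, 59, 33, 44))

# Natural log of an existing column
scores$log_read <- log(scores$read)

# Standardized version; scale() returns a matrix, so convert it to a plain vector
scores$read_z <- as.numeric(scale(scores$read))

# Intervals (0,50], (50,60], (60,100]; the result is a factor
scores$read_grp <- cut(scores$read, breaks = c(0, 50, 60, 100))

# Row-wise sum across the two test columns
scores$total <- rowSums(scores[, c("read", "write")])

scores
```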
Data sets are often split into multiple parts, because, for example, the observations are collected over time, or by multiple investigators.
Appending a data frame to another will add more rows of observations. The two data frames supplied as arguments should have the same variables with the same names.
We use the rbind() function to append data frames. Note: rbind() will produce an error if the two data frames have non-matching columns.
Load the data set stored in CSV file second_school.csv into R and name the object school2.
Use rbind() to append the school2 data frame to the school data frame and name the resulting object school_both.
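A minimal sketch of appending with rbind(), using two small, hypothetical data frames that share the same column names:

```r
# Two made-up batches of observations with identical variables
site_a <- data.frame(id = 1:2, read = c(57, 68))
site_b <- data.frame(id = 3:5, read = c(44, 63, 47))

# Stack the rows of site_b underneath the rows of site_a
both <- rbind(site_a, site_b)
both
```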
Merging another data set into our current data set adds more columns (variables) rather than rows. Merging typically depends on an identification variable present in both data sets that is used to match corresponding rows in the two datasets.
The first two arguments to merge() are the names of the two data frames to be merged. The merge() function will automatically use any variables common to both datasets as matching variables. If this is not desired, the matching variable(s) can be explicitly specified with the by= argument in merge().
Load the data in demographics.csv into a data frame named demo.
Merge the school_both and demo data frames using id as the matching variable, and name the resulting object school_final.
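A minimal sketch of merge() with an explicit matching variable. Both data frames are made up, and the rows of the second are deliberately in a different order to show that matching is done on id rather than on row position:

```r
scores     <- data.frame(id = c(1, 2, 3), read = c(57, 68, 44))
background <- data.frame(id = c(3, 1, 2), ses = c("low", "high", "middle"))

# Match rows on the shared id column and combine the variables
merged <- merge(scores, background, by = "id")
merged
```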
Often we need to combine the contents of separate columns into a single column. For example, we might have measured a single outcome at several instances over time, each represented by a separate column in a data frame. If we would like to analyze all instances of a single outcome (for instance to get the overall mean), we would need to combine the separate columns into one.
This process is often known as reshaping data from wide format to long format.
The R function reshape() reshapes data frames from wide to long (and vice-versa). The relevant arguments to reshape() for reshaping wide to long are:
data: the data frame to be reshaped
varying: a vector of column names to be combined into a single column
v.names: name of new single variable that contains the columns specified in varying
timevar: name of new variable that identifies originating column of combined values
times: values to use to identify originating column in the new timevar variable
direction: set to "long" to reshape from wide to long
Reshape the school_final data frame into data frame school_long that combines the variables math, read, and write into a single, new column called score (see arguments varying and v.names). Specify that a new column named test will hold values identifying from which columns the values originate (see arguments timevar and times).
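As a sketch of the arguments listed above, the following reshapes a made-up wide data frame in which a weight was recorded at three visits. The data, the column names (w1, w2, w3, weight, visit) and the use of idvar to name the subject identifier are assumptions for illustration:

```r
# Wide format: one row per subject, one column per visit
wide <- data.frame(
  id = 1:3,
  w1 = c(70, 82, 65),
  w2 = c(71, 80, 66),
  w3 = c(73, 79, 66)
)

# Long format: one row per subject per visit
long <- reshape(
  wide,
  varying   = c("w1", "w2", "w3"),  # columns to combine
  v.names   = "weight",             # name of the combined column
  timevar   = "visit",              # new column identifying the source column
  times     = 1:3,                  # values stored in visit
  idvar     = "id",                 # existing column that identifies subjects
  direction = "long"
)
long

# The overall mean across all visits is now a single call
mean(long$weight)
```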
Statistical functions for continuous, numeric variables:
mean(x), median(x): mean, median of x
var(x), sd(x): variance and standard deviation of x
Many of these functions will output NA if any of the input values are NA (use na.rm=TRUE to remove NA values first).
Create a vector named math.stats, where the first element is the mean of math in data set school_final, and the second is the standard deviation.
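A small sketch of how missing values affect these functions, using a made-up vector:

```r
x <- c(4, 8, NA, 6)

# With a missing value present, the result is NA
mean(x)

# Remove missing values before computing
mean_x   <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)
var_x    <- var(x, na.rm = TRUE)
sd_x     <- sd(x, na.rm = TRUE)

c(mean = mean_x, median = median_x, var = var_x, sd = sd_x)
```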
Frequency functions for categorical variables:
table(x): frequency table for x (counts of categories)
prop.table(table): proportions of table, a table created by table() (also available as proportions(table) in R 4.0.1 and later)
Create a frequency table of the categories of the variable ses.
Get the proportions for these categories.
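A minimal sketch of counting categories and converting the counts to proportions with the base R function prop.table(). The vector below is made up for illustration:

```r
status <- c("low", "high", "middle", "middle", "high", "middle")

# Counts of each category (categories are sorted alphabetically)
counts <- table(status)
counts

# Proportions, computed from the table of counts
props <- prop.table(counts)
props
```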
A two-sample, independent-samples t-test assesses whether the mean of a continuous variable is different between 2 groups.
The syntax t.test(y ~ x) will perform a test of the difference in means of variable y between 2 groups specified in variable x.
Determine whether there is evidence that the mean of math differs between the genders (variable female) in the school_final data set.
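A sketch of the formula syntax on made-up data with two hypothetical groups, a and b. Note that t.test() does not assume equal variances by default (it performs Welch's test); set var.equal=TRUE for the classical pooled-variance test:

```r
scores <- data.frame(
  group = rep(c("a", "b"), each = 6),
  score = c(52, 55, 49, 60, 58, 51,
            61, 66, 59, 70, 64, 63)
)

# Compare the mean of score between the two groups
result <- t.test(score ~ group, data = scores)
result

# Individual pieces can be extracted from the result object
result$estimate   # the two group means
result$p.value
```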
A chi-square test of independence is used to test for association between two categorical variables. It tests whether the proportions of membership to categories of one variable are related to the proportion of membership to categories of another variable. Use chisq.test(x, y) to test for association between categorical variables x and y.
Determine whether membership to socioeconomic status categories (variable ses) is equally distributed across genders in the school_final data set.
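A sketch of chisq.test() on two made-up categorical vectors of 100 observations each. The variable names and counts are assumptions for illustration; for a 2-by-2 table R applies a continuity correction by default:

```r
# 50 observations in each status group, with different gender splits
status <- rep(c("low", "high"), each = 50)
gender <- c(rep(c("f", "m"), times = c(30, 20)),   # low:  30 f, 20 m
            rep(c("f", "m"), times = c(20, 30)))   # high: 20 f, 30 m

# Cross-tabulate to inspect the counts being tested
table(status, gender)

# Test for association between the two variables
result <- chisq.test(status, gender)
result
```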
Linear regression relates one or more independent variables (IV) to a dependent variable (DV) through an estimated set of regression coefficients.
Regression models in R typically use a formula specification relating the IV to the DV, for example y ~ a + b, where y is the dependent variable and a and b are the independent variables.
Use the syntax lm(formula, data) to perform a linear regression on the equation specified in formula, whose variables appear in dataset data.
Typically, the results of lm() are saved in an object (often called a model object), and then various statistics can be extracted from this object through functions, such as:
summary(m): regression table (coefficients, standard errors, p-values, omnibus F-test) for model object m
coef(m): just coefficients for model object m
predict(m): predicted values based on model
residuals(m): estimated residuals
Perform a linear regression of math on write and female, variables in data set school_final, and save the results to an object named m1.
Use summary() on m1 to view the results of the regression.
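A sketch of fitting and inspecting a model with two predictors, following the y ~ a + b pattern. The data are made up, and the object name fit is hypothetical:

```r
study <- data.frame(
  y = c(10.2, 11.9, 14.1, 15.8, 18.3, 19.7, 22.1, 24.0),
  a = c(1, 2, 3, 4, 5, 6, 7, 8),
  b = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# Fit the model and store it in a model object
fit <- lm(y ~ a + b, data = study)

summary(fit)          # regression table
coef(fit)             # coefficients only
head(residuals(fit))  # first few residuals

# Predicted value for a new, hypothetical observation
predict(fit, newdata = data.frame(a = 9, b = 0))
```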
Plots and graphics are one of the greatest strengths of R.
Common plot types available with base R:
plot(x, y): scatter plot of x versus y
hist(x): histogram of x
boxplot(formula): produces a boxplot of y for each level of x specified in formula y ~ x
barplot(height): produces a bar plot of the numbers specified in vector height, such as one created by table()
Create a scatter plot of read vs write.
Create a bar plot of the frequencies of ses using the output of table() as the input to barplot().
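A sketch of the base R plot types on a small, made-up data frame (hist() is the base R histogram function). The column names and values are hypothetical; each call draws a new plot on the active graphics device:

```r
scores <- data.frame(
  read  = c(57, 68, 44, 63, 47, 50, 60, 73),
  write = c(52, 59, 33, 44, 52, 49, 61, 67),
  level = c("low", "high", "middle", "middle", "low", "high", "middle", "high")
)

plot(scores$read, scores$write)        # scatter plot
hist(scores$read)                      # histogram
boxplot(read ~ level, data = scores)   # one box per category of level

# Bar plot of category counts
level_counts <- table(scores$level)
barplot(level_counts)
```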