Introduction to R

IDRE Statistical Consulting Group

Background

R as a programming environment

R is a programming environment for data analysis and graphics.

Some features:

These tools are distributed as packages, which any user can download to customize the R environment.

Use the function install.packages() to download and install packages onto your computer. The package name should be surrounded by quotes.

Once installed, you then use library() to load the package into the current R session.

For this seminar, we will not be using any external packages.

# example code to install a package (and any dependencies)
install.packages("tidyverse", dependencies=TRUE)
# load the package into the current R session
library(tidyverse)

RStudio

You can work directly in R, but most users prefer a graphical interface. We highly recommend using RStudio, an integrated development environment (IDE) that features:

R programming 1: Coding

R code can be entered into the Console or the script Editor, which can then be saved for later use.

You can run a command directly from a script by placing the cursor on the same line as the command or highlighting the command and hitting Ctrl-Enter (Command-Enter on Macs). This will advance the cursor to the next command, where you can hit Ctrl-Enter again to run it, advancing the cursor to the next command…

Commands are separated either by a ; or by a newline.

R is case sensitive.

The # character begins a comment: everything from the # to the end of the line is not executed.

Commands can extend beyond one line of text. Put operators like + at the end of lines for multi-line commands.

Try adding 2 and 3 together on 2 separate lines in the Console:
Type 2 +, hit Enter, type 3 and hit Enter
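
A minimal sketch of these coding rules. The object names a, b, and total are made up for illustration:

```r
# everything after a # is a comment and is not executed
a <- 1; b <- 2      # two commands on one line, separated by ;

# a command continued over two lines: the trailing + tells R more is coming
total <- 2 +
  3
total
## [1] 5

# R is case sensitive: total and Total would be two different objects
```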

R programming 2: Objects

R stores both data and output from data analysis (as well as everything else) in objects.

Objects can be vectors of numbers or strings, or matrices that form datasets, or functions, among many other types.

Data are assigned to and stored in objects using the <- or = operator.

To assign a string value to an object, use quotes.

To print the contents of an object, specify the object’s name alone.
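
A short sketch of assignment and printing. The object names and values here are hypothetical:

```r
# assign a number with <-
n_students <- 25
# = also assigns; a string value must be quoted
greeting = "good morning"

# typing an object's name alone prints its contents
n_students
## [1] 25
greeting
## [1] "good morning"
```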

A. Create an object named A in the Console and assign it 3
B. Create an object named B in the script editor and assign it the word hello
C. View the contents of A and B

R programming 3: Functions

Functions perform most of the work on data in R.

Functions accept inputs, called arguments, and return some object as output.

Use the log() function to calculate the log of 10

Help files for R functions are accessed by preceding the name of the function with ? (e.g. ?seq).

In the help file, under Usage is a list of arguments, which are explained in detail in the Arguments section. Values for arguments to functions can be specified either by name or position, so note the order.
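
As an illustration of naming versus position, here is log() called several ways. Its help file lists the usage log(x, base = exp(1)), so x comes first and base second:

```r
log(100)                 # natural log, because base defaults to e
log(100, base = 10)      # base specified by name
## [1] 2
log(100, 10)             # the same call, arguments matched by position
## [1] 2
log(base = 10, x = 100)  # named arguments can be given in any order
## [1] 2
```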

Examine the help file for the function seq()

Getting data into R

Vectors

Vectors, the fundamental data structure in R, are one-dimensional and homogeneous.

Vectors contain either logical (TRUE/FALSE), numeric, or character (string) values.

The c() function combines values of common type together to form a vector.

You can also create vectors of sequential values with seq() or repeating values with rep().

# create a vector with c()
x <- c(1, 3, 5)
x
## [1] 1 3 5

# seq(from=, to=, by=) to create a sequence
seq(from=1,to=7,by=2)
## [1] 1 3 5 7

# you can omit the argument names
seq(10,0,-5)
## [1] 10  5  0

# rep(x=, times=) for repeating values
rep("yes", 2)
## [1] "yes" "yes"

A. Create a vector which contains the sequence (1,2,3) repeated twice. Try thinking of a second or third way to accomplish this.

Data frames

Datasets for statistical analysis are typically stored in data frames in R.

Data frames are rectangular, where the columns are variables and the rows are observations of those variables.

Each column of a data frame is a vector, and the columns can be different types.

We can create data frames manually with data.frame(), but typically we will load them from files.
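
A minimal sketch of building a data frame by hand. The pets data are invented for illustration and are not part of the seminar files:

```r
# each argument to data.frame() becomes a column; columns may differ in type
pets <- data.frame(
  name   = c("Rex", "Tom", "Polly"),   # strings
  age    = c(3, 7, 2),                 # numeric
  indoor = c(FALSE, TRUE, TRUE)        # logical
)
pets
##    name age indoor
## 1   Rex   3  FALSE
## 2   Tom   7   TRUE
## 3 Polly   2   TRUE

# 3 rows (observations) and 3 columns (variables)
dim(pets)
## [1] 3 3
```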

Fig 2.5. Data frame with 4 observations of 5 variables

Reading in text data

We recommend storing datasets as comma-separated values (CSV) files, which are text files where the fields are delimited by commas. Use read.csv(file) to load the data stored in a CSV file (whose name should be specified inside quotes).

Data files should be set up in this way:

Use read.csv() to read in the school.csv dataset and assign to an object named school. The only argument that is required is the data file’s name.
If the data file is not in the current working directory, you can specify the file’s path as well.

Note: There are many more arguments to read.csv() to control how a CSV file is read into R.
Note: You can check what the current working directory is with getwd() and set it with setwd().
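
To try read.csv() without any seminar files, the sketch below first writes a tiny made-up CSV to a temporary file. The file contents and the names csv_path and scores are assumptions for illustration only:

```r
# write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,52,a",
             "2,61,b",
             "3,47,a"), csv_path)

# the file name (here a full path) is the only required argument
scores <- read.csv(csv_path)

# the first line of the file supplies the column names
names(scores)
## [1] "id"    "score" "group"
nrow(scores)
## [1] 3

# the directory R searches when a file is given without a path
getwd()
```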

Other data set file formats

Other text formats that use different delimiters (e.g. tab, space) are easy to load as well with read.delim(). Use the sep= argument to specify the delimiter.
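
A small sketch of the sep= argument, again using a made-up file written to a temporary location (the semicolon delimiter and column names are hypothetical):

```r
# a semicolon-delimited text file
txt_path <- tempfile(fileext = ".txt")
writeLines(c("id;height", "1;170", "2;182"), txt_path)

# read.delim() assumes tab-delimited input unless sep= says otherwise
heights <- read.delim(txt_path, sep = ";")
heights$height
## [1] 170 182
```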

For datasets stored in other statistical software formats (e.g. SPSS, Stata, SAS), see the haven package.

Quick looks at the dataset

Use View(x) to open a spreadsheet-style view of data frame x. In RStudio, clicking on a data frame in the Environment pane will View() it.

View the school dataset.

You can also view the first few rows of a data frame with head() and the last few with tail(). Each function has an n= argument to control the number of rows displayed.

Display the first 10 rows of school.

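Finally, summary(x) will summarize the distribution of each column of a data frame x. Numeric variables are summarized by quantiles, mean and median, while factor variables are described by frequency tables. Character variables are reported only by their length and type, so convert them with factor() first if you want frequencies.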
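
A quick sketch of these functions on a made-up data frame (d and its columns are hypothetical stand-ins for a real dataset):

```r
# 20 rows of invented data
d <- data.frame(id = 1:20, score = seq(50, 88, by = 2))

head(d, n = 3)     # first 3 rows
tail(d, n = 2)     # last 2 rows
summary(d)         # quantiles and mean of each numeric column

# View(d) opens the spreadsheet-style viewer in an interactive session
```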

Use summary() on school to learn about its variables.

Data Management

Subsetting rows and columns from a data frame

The syntax dataframe[r,c] will extract the rows numbered r and the columns numbered c.

Extract the value in the first row, second column of school.

Extract all columns from the first 2 rows of school.

The syntax dataframe$column will extract the variable named column in the data frame.

Extract the variable math from the school dataset.

Data frame columns are vectors themselves, and can be subset with [].
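
The subsetting forms above, sketched on a small made-up data frame (the column names quiz1 and quiz2 are hypothetical):

```r
d <- data.frame(id    = 1:4,
                quiz1 = c(57, 68, 44, 63),
                quiz2 = c(41, 53, 54, 47))

d[1, 2]                  # row 1, column 2
## [1] 57
d[1:2, ]                 # rows 1 and 2, all columns (empty column slot)
d[, c("id", "quiz2")]    # all rows, columns selected by name
d$quiz2                  # one column, returned as a vector
## [1] 41 53 54 47
d$quiz1[2:3]             # elements 2 and 3 of that vector
## [1] 68 44
```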

Extract the second and third elements of the read variable in the school dataset.

Adding new variables

We can add variables (columns) to a data frame directly by assigning something to a new column, for example dataframe$newcol <- x.

New variables are often transformations of existing variables. Here are some useful transformation functions:

Use the rowMeans() function to add a new variable called avg to the school dataset that is the average of read, write, and math. The input should be a subset of school containing all rows for these 3 columns.
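
A minimal sketch of adding columns, using invented quiz scores rather than the school data:

```r
d <- data.frame(quiz1 = c(8, 6, 9), quiz2 = c(7, 9, 10))

# assigning to a column name that does not exist yet creates the column
d$total <- d$quiz1 + d$quiz2

# transformations of existing columns work the same way
d$log_total <- log(d$total)

# rowMeans() averages across the selected columns within each row
d$avg <- rowMeans(d[, c("quiz1", "quiz2")])
d$avg
## [1] 7.5 7.5 9.5
```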

Appending datasets

Data sets are often split into multiple parts, because, for example, the observations are collected over time, or by multiple investigators.

Appending a data frame to another will add more rows of observations. The two data frames supplied as arguments should have the same variables with the same names.

We use the rbind() function to append data frames. Note: rbind() will produce an error if the two data frames have non-matching columns.
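
A short sketch of appending, with made-up data frames. The try() wrapper is only there so the failing call does not stop the script:

```r
first  <- data.frame(id = 1:2, score = c(10, 12))
second <- data.frame(id = 3:4, score = c(9, 15))

# same column names, so the rows are stacked
both <- rbind(first, second)
nrow(both)
## [1] 4

# a data frame with a differently named column cannot be appended
other  <- data.frame(id = 5, points = 11)
result <- try(rbind(first, other), silent = TRUE)
inherits(result, "try-error")
## [1] TRUE
```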

Load the data set stored in CSV file second_school.csv into R and name the object school2.

Use rbind() to append the school2 data frame to the school data frame and name the resulting object school_both.

Merging datasets

Merging another data set into our current data set adds more columns (variables) rather than rows. Merging typically depends on an identification variable present in both data sets that is used to match corresponding rows in the two datasets.

The first two arguments to merge() are the names of the two data frames to be merged. The merge() function will automatically use any variables common to both datasets as matching variables. If this is not desired, the matching variable(s) can be explicitly specified with the by= argument in merge().
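
A minimal merge() sketch on invented data. Note that the rows of the two data frames need not be in the same order:

```r
scores <- data.frame(id = c(1, 2, 3), score = c(88, 92, 75))
ages   <- data.frame(id = c(3, 1, 2), age   = c(31, 25, 40))

# match rows on id; by default only ids found in both data frames are kept
merged <- merge(scores, ages, by = "id")
merged
##   id score age
## 1  1    88  25
## 2  2    92  40
## 3  3    75  31
```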

Load the data in demographics.csv into a data frame named demo.

Merge the school_both and demo data frames using id as the matching variable, and name the resulting object school_final.

Reshaping data

Often we need to combine the contents of separate columns into a single column. For example, we might have measured a single outcome at several instances over time, each represented by a separate column in a data frame. If we would like to analyze all instances of a single outcome (for instance to get the overall mean), we would need to combine the separate columns into one.

This process is often known as reshaping data from wide format to long format.

The R function reshape() reshapes data frames from wide to long (and vice-versa). The relevant arguments to reshape() for reshaping wide to long are:

Reshape the school_final data frame into data frame school_long that combines the variables math, read, and write, into a single, new column called score (see arguments varying and v.names). Specify that a new column named test will hold values identifying from which columns the values originate (see arguments timevar and times).
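
A hedged sketch of a wide-to-long reshape on a tiny made-up data frame. The column names t1, t2, value, and occasion are hypothetical choices, not part of the seminar data:

```r
wide <- data.frame(id = 1:2,
                   t1 = c(5, 7),
                   t2 = c(6, 9))

long <- reshape(wide,
                direction = "long",
                varying   = c("t1", "t2"),  # columns to stack
                v.names   = "value",        # name of the stacked column
                timevar   = "occasion",     # new column naming the source
                times     = c("t1", "t2"),  # labels stored in that column
                idvar     = "id")           # identifies each original row

# 2 rows x 2 stacked columns = 4 rows
nrow(long)
## [1] 4
mean(long$value)   # overall mean across both occasions
## [1] 6.75
```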

Basic statistical analysis

Descriptives

Statistical functions for continuous, numeric variables:

Many of these functions will output NA if any of the input values are NA (use na.rm=TRUE to remove NA values first).
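
A small sketch of the NA behavior, using mean() and sd() on a made-up vector:

```r
x <- c(4, 8, NA, 6)

mean(x)                 # one missing value makes the result NA
## [1] NA
mean(x, na.rm = TRUE)   # drop NA values before computing
## [1] 6
sd(x, na.rm = TRUE)
## [1] 2
```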

Create a vector named math.stats, where the first element is the mean of math in data set school_final, and the second is the standard deviation.

Frequency functions for categorical variables:

Create a frequency table of the categories of the variable ses.

Get the proportions for these categories.
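
A sketch of counting categories, assuming the base R functions table() and prop.table() are the ones intended here. The size vector is made up:

```r
size <- c("small", "large", "small", "medium", "small")

# counts of each category
counts <- table(size)
counts[["small"]]
## [1] 3

# convert the counts to proportions that sum to 1
props <- prop.table(counts)
props[["small"]]
## [1] 0.6
```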

Basic statistical tests

A two-sample independent t-test assesses whether the mean of a continuous variable is different between 2 groups.

The syntax t.test(y ~ x) will perform a test of the difference in means of variable y between 2 groups specified in variable x.
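
A minimal sketch with invented data. The data= argument tells t.test() where to find the variables named in the formula:

```r
d <- data.frame(
  group = rep(c("a", "b"), each = 5),
  y     = c(10, 12, 11, 13, 9,  14, 15, 13, 16, 17)
)

# by default this is Welch's t-test, which does not assume equal variances
res <- t.test(y ~ group, data = d)

res$estimate   # the mean of y in each group
res$p.value    # p-value for the difference in means
```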

Determine whether there is evidence that the mean of math differs between the genders (variable female) in the school_final data set.

A chi-square test of independence is used to test for association between two categorical variables. It tests whether the proportions of membership to categories of one variable are related to the proportion of membership to categories of another variable. Use chisq.test(x, y) to test for association between categorical variables x and y.
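
A sketch on two made-up categorical vectors of 60 observations each (answer and side are hypothetical names):

```r
answer <- rep(c("yes", "no"), times = c(30, 30))
side   <- c(rep(c("left", "right"), times = c(20, 10)),   # the 30 "yes"
            rep(c("left", "right"), times = c(10, 20)))   # the 30 "no"

# the 2 x 2 table of counts being tested
table(answer, side)

res <- chisq.test(answer, side)
res$parameter   # degrees of freedom
res$p.value
```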

Determine whether membership to socioeconomic status categories (variable ses) is equally distributed across genders in the school_final data set.

Linear regression (optional)

Linear regression relates one or more independent variables (IV) to a dependent variable (DV) through an estimated set of regression coefficients.

Regression models in R typically use a formula specification relating the IV to the DV, for example y ~ a + b, where y is the dependent variable and a and b are the independent variables.

Use the syntax lm(formula, data) to perform a linear regression on the equation specified in formula, whose variables appear in dataset data.

Typically, the results of lm() are saved in an object (often called a model object), and then various statistics can be extracted from this object through functions, such as:

Perform a linear regression of math on write and female, variables in data set school_final and save the results to an object named m1.

Use summary() on m1 to view the results of the regression.
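
A minimal regression sketch on invented data. The extractor functions coef() and residuals() shown here are two common choices and are my assumption of what such a list would include:

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                g = c(0, 1, 0, 1, 0),
                y = c(2.1, 4.9, 6.2, 9.1, 9.8))

# y is the dependent variable; x and g are the independent variables
m <- lm(y ~ x + g, data = d)

coef(m)        # estimated regression coefficients
residuals(m)   # observed minus fitted values
summary(m)     # coefficient table, standard errors, R-squared
```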

Plots

Plots and graphics are one of the greatest strengths of R.

Common plot types available with base R:

Create a scatter plot of read vs write

Create a bar plot of the frequencies of ses using the output of table() as the input to barplot().
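
A short sketch of the two plot types just mentioned, drawn from made-up data. When run in a non-interactive session, R writes the plots to a file instead of opening a window:

```r
d <- data.frame(x    = c(1, 2, 3, 4),
                y    = c(2, 3, 5, 4),
                kind = c("a", "b", "a", "a"))

# scatter plot: first argument on the x-axis, second on the y-axis
plot(d$x, d$y, xlab = "x", ylab = "y")

# bar plot: table() supplies the count for each category
kind_counts <- table(d$kind)
barplot(kind_counts)
```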

Getting help online

End