UCLA OARC Statistical Methods and Data Analytics

This seminar introduces the funtionality of `R`

, with a
focus on data analysis.

In this seminar we will learn:

- how to interact with
`R`

through`RStudio`

- a bit of
`R`

coding - tools to import, clean, manage, and export data
- simple data analysis functions
- the basics of two
`R`

graphical systems - how to use
`R`

to share your data analysis with others

Coding instructions to help you learn

`R`

will appear in boxes like this.

`R`

as a programming environment`R`

is a programming environment for statistical computing
and graphics.

`R`

- serves as a data analysis and storage facility
- is designed to perform operations on vectors and matrices
- uses a well-developed but simple programming language
- allows for rapid development of new tools according to user demand

These tools are distributed as packages, which any user can download
to customize the `R`

environment.

`R`

?`R`

has many advantages as data analysis software:

- Free
- Free online books to learn R
- online community is much larger than those for other statistical software

- Powerful, intuitive graphics systems make it easy to produce publication-quality graphics
- Easily create data analysis reports as documents and presentations for reproducibility with R Markdown
- Because it is a true programming language,
`R`

makes it much easier to add functionality and flexibility as needed compared to other data analysis software - Many specialized packages featuring analysis tools not available in other software

`R`

also has a few disadvantages:

- somewhat more difficult to master than other statistical software
- frequent updates require maintenance

You can work directly in `R`

, but most users prefer a
graphical interface. We *highly* recommend using
RStudio, an integrated
development environment (IDE) that features:

- a console
- a powerful code/script editor featuring
- special tools for plotting, viewing R objects and code history
- cheatsheets for R programming
- tab-completion for object names and function arguments (enough reason by itself!)

You can input and execute commands directly in the console.

Output of commands will typically be displayed in the console.

The `RStudio`

console features tab-code-completion and
live help for function coding.

Type

`rno`

into the`RStudio`

console, wait a second, and when a window with`rnorm`

appears, hit theTabkey.

Once you have the `rnorm()`

function name completed, if
the cursor is inside of `()`

, `RStudio`

will
remind you of the the function’s arguments in a window . You can also
hit the `Tab`

key to see and issue the function argument
names.

Place the cursor inside of

`rnorm()`

. Hit theTabkey twice, and then type the number`10`

. HitEnterto execute the code.

You have just drawn 10 random numbers from a normal distribution.

Most `R`

programs written for data analysis consist of
many commands, making entering code line-by-line into the console
inefficient.

Instead, we use the script editor to save our commands as a record of
the steps we took to analyze our data. We can also issue `R`

commands directly from the editor.

The script editor features the same tab-code-completion and function help as the console, as well as syntax highlighting.

If you do not see the script editor already open, open it now by selecting

`File`

>`New File`

>`R Script`

.

Write the code

`1 + 2`

in the script editor.

On the next line, type the code`log`

. When the tab-completion window appears with a list of possible commands, hitTabto choose`log()`

. Enter 10 inside the`()`

.

To execute code directly from the script editor, place the cursor
inside the command (or highlight the entire command), and then hit
`Ctrl-Enter`

(on PCs, use `Command-Enter`

on
Macs). This will advance the cursor to the next command, where you can
hit Ctrl-Enter again to run it, advancing the cursor to the next
command…

Execute the two commands in the script editor using the keyboard shortcuts.

Make sure to periodically save your scripts. Most R scripts are saved
with the `.R`

extension.

Save your R script with the name

`mycode.R`

.

Unless otherwise specified, all coding instructions for this seminar should be entered into the script editor.

`R`

stores both data and output from data analysis (as
well as everything else) in *objects*.

Data are assigned to and stored in objects using the
`<-`

or `=`

operator.

In the script editor, type and run the code

`x <- 5`

to create our first object.

Once you create an object, it should appear in the
`RStudio`

`Environment`

pane.

To print the contents of an object to the console, specify the object’s name alone.

Specify and execute the code

`x`

to view the contents of`x`

in the console.

`R`

and `RStudio`

The `RStudio`

`Help`

menu contains links to
many documents for help with both `R`

(select
`R Help`

) and `RStudio`

(see
`RStudio Docs`

and `RStduio Community Forum`

).

We particularly like the `Cheatsheets`

, which are compact
documents crammed with useful information on how to use various products
made by the `RStudio`

group.

Open the cheatsheet for

`RStudio`

by selecting the`Help`

menu ->`Cheat Sheets`

->`RStudio IDE Cheat Sheet`

. Note the cheatsheet will usually be downloaded in a web browser as a .pdf.

`R`

packages`R`

and packagesBase `R`

and most `R`

packages are available
for download from the Comprehensive R Archive Network (CRAN)

- cran.r-project.org
- base
`R`

comes with a number of basic data management, analysis, and graphical tools - However,
`R`

’s power and flexibility lie in its array of packages (currently more than 15,000 on CRAN!)

To use packages in `R`

, we must first install them using
the `install.packages()`

function, which typically downloads
the package from CRAN and installs it for use.

Use the argument `dependencies=TRUE`

to load all other
packages required by the targeted package.

```
# uncomment (remove ##) to run
install.packages("dplyr", dependencies=TRUE)
install.packages("ggplot2", dependencies=TRUE)
install.packages("rmarkdown", dependencies=TRUE)
```

If you have not installed them already, please install

`dplyr`

,`ggplot2`

, and`rmarkdown`

with the necessary dependencies.

After installing a package, we can load it into the R environment
using the `library()`

or `require()`

functions,
which more or less do the same thing.

Functions and data structures within the package will then be available for use.

Load the package

`dplyr`

into your session now with`library()`

.

Many packages include vignettes – longer, tutorial style guides for a package.

To see a list of available vignettes for the packages that are
loaded, use `vignette()`

with no arguments. Then to view a
vignette, place its name inside `vignette()`

:

View the “Introduction to dplyr” vignette by issuing the command

`vignette("dplyr")`

.

Remember that we assign data to objects with `<-`

or
`=`

.

Character data (i.e. strings) are surrounded by `"`

or
`'`

.

In the script editor, create an object named

`a`

and assign it the character string “hello”.

Commands are separated either by a `;`

or by a
newline.

`R`

is case sensitive.

The `#`

character at the beginning of a line signifies a
comment, which is not executed.

On the next line, create an object named

`b`

, assign it the logarithm of 10 (using the`log()`

function), but put a`#`

at the beginning of the line. What happens when you try to execute this line?

Commands can extend beyond one line of text. Put operators like
`+`

at the end of lines for multi-line commands.

Functions perform most of the work on data in `R`

.

Functions in `R`

are much the same as they are in math –
they perform some operation on an input and return some output. For
example, the mathematical function \(f(x) =
x^2\), takes an input \(x\), and
returns its square. Similarly, the `mean()`

function in
`R`

takes a vector of numbers and returns its mean.

The inputs to functions are often referred to as
*arguments*.

Help files for `R`

functions are accessed by preceding the
name of the function with `?`

.

Try opening the help file for

`log()`

with the code`?log`

.

In the help file, we will find the following useful sections:

**Description**: overview of the function**Usage**: syntax, with list of arguments in particular order**Arguments**: description of arguments**Details**: in depth description of function’s operation**Value**: output of the function**Examples**: copy-and-pasteable examples

Values for arguments to functions can be specified either by name or position.

For `log()`

, we see the first argument is `x`

,
the number whose log we want to take, and the second is
`base`

, the base of the logarithm.

```
# specifying arguments by name
log(x=100, base=10)
## [1] 2
# specifying arguments by position
log(8, 2)
## [1] 3
```

In the help file `Usage`

section, if something is
specified after an argument’s name and `=`

, it is the default
value.

In the `log()`

help file, we see that the default value
for `base`

is `exp(1)`

, or \(e\), making `log()`

a natural
logarithm function by default.

Vectors, the fundamental data structure in `R`

, are
one-dimensional and homogeneous.

A single variable can usually be represented by one of the following vector data types:

- logical: TRUE or FALSE (1 or 0)
- integer: integers only (represented by a number followed by L; e.g. 10L is the integer 10)
- double: real numbers, also known as
*numeric* - character: strings

A single value is a vector of length one in `R`

.

The `c()`

function combines values of common type together
to form a vector.

```
# create a vector
first_vec <- c(1, 3, 5)
first_vec
## [1] 1 3 5
# length() returns the number of elements
char_vec <- c("these", "are", "some", "words")
length(char_vec)
## [1] 4
# the result of this comparison is a logical vector
first_vec > c(2, 2, 2)
## [1] FALSE TRUE TRUE
```

To create vectors with a predictable sequence of elements, use
`rep()`

for repeating elements and `seq()`

for
sequential elements.

The expression `m:n`

will generate a vector of integers
from `m`

to `n`

```
# first argument to rep is what to repeate
# the second argument is number of repetitions
rep(0, times=3)
## [1] 0 0 0
rep("abc", 4)
## [1] "abc" "abc" "abc" "abc"
# arguments for seq are from, to, by
seq(from=1, to=5, by=2)
## [1] 1 3 5
seq(10, 0, -5)
## [1] 10 5 0
# colon operator
3:7
## [1] 3 4 5 6 7
# you can nest functions
rep(seq(1,3,1), times=2)
## [1] 1 2 3 1 2 3
```

Create the vector (4,5,6) in three different ways using

`c()`

,`seq()`

, and the`:`

operator.

Try creating the vector (2,2,1,1) in at least two different ways.

Elements of a vector can be accessed or subset by specifying a vector
of numbers (of length 1 or greater) inside `[]`

.

```
# create a vector 10 to 1
# putting () around a command will cause the result to be printed
(a <- seq(10,1,-1))
## [1] 10 9 8 7 6 5 4 3 2 1
# second element
a[2]
## [1] 9
# first 5 elements
a[seq(1,5)]
## [1] 10 9 8 7 6
# first, third, and fourth elements
a[c(1,3,4)]
## [1] 10 8 7
```

Create the vector

`y`

as the integers counting down from 10 to 1. Extract the second, fifth, and seventh element of this vector`y`

.

Vectors elements can also be subset with a logical (TRUE/FALSE)
vector, known as *logical subsetting*.

This allows us to subset a vector by checking if a condition is satisifed:

```
# this returns a logical vector...
scores < 30
## [1] FALSE TRUE FALSE TRUE
# ...that we can now use to subset
scores[scores<30]
## [1] 24 10
```

Use conditional selection to find the numbers in

`y`

(integers from 10 to 1) that when multiplied by 2, the result is greater than 15.

`R`

works most easily with datasets stored as text files.
Typically, values in text files are separated, or delimited, by tabs or
spaces:
or by commas (CSV file):gender id race ses schtyp prgtype read write math science socst 0 70 4 1 1 general 57 52 41 47 57 1 121 4 2 1 vocati 68 59 53 63 31 0 86 4 3 1 general 44 33 54 58 31 0 141 4 3 1 vocati 63 44 47 53 56

gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst 0,70,4,1,1,general,57,52,41,47,57 1,121,4,2,1,vocati,68,59,53,63,61 0,86,4,3,1,general,44,33,54,58,31 0,141,4,3,1,vocati,63,44,47,53,56

R provides several related functions to read data stored as files.

Use `read.csv()`

to read in data stored as CSV and
`read.delim()`

to read in text data delimited by other
characters (such as tabs or spaces).

For `read.delim()`

, specify the delimiter in the
`sep=`

argument.

Both `read.csv()`

and `read.delim()`

assume the
first row of the text file is a row of variable names. If this is not
true, use the argument `header=FALSE`

.

Although we are retrieving files over the internet for this class, these functions are typically used for files saved to disk.

Note how we are assigning the loaded data to objects.

```
# basic syntax of read.csv, not run
data <- read.csv("/path/to/file.csv")
# specification for tab-delimited file
dat.tab <- read.delim("/path/to/file.txt", sep="\t")
```

Create a dataset called

`dat_csv`

by loading a dataset from our server at this address: https://stats.oarc.ucla.edu/stat/data/hsbraw.csv .

We can export our data to a .csv file with
`write.csv()`

.

If you need to save multiple objects from your session, you can save
whatever objects you need with `save()`

, which creates a
binary `.Rdata`

file, which can loaded for later use with
`load()`

.

We did not specify realistic path names below.

Packages to read and write data in other software formats:

`readxl`

: Excel files`haven`

: Stata, SAS, and SPSS

Data sets for statistical analysis are typically stored in *data
frames* in `R`

. The objects created by
`read.csv()`

and `read.table()`

are data
frames.

Data frames are rectangular, where the columns are variables and the rows are observations of those variables.

Data frame columns can be of different data types (some double, some character, etc.) – but they must be equal length

Real datasets usually combine variables of different types, so data frames are well suited for storage.

`View()`

,
`head()`

and `tail()`

Use `View()`

on a dataset to open a spreadsheet-style view
of a dataset. In RStuido, clicking on a dataset in the
**Environment** pane will `View()`

it.

View the dataset

`dat_csv`

.

For large data files, use `head()`

and `tail()`

to look at a specified number of rows at the begininning or end of a
dataset, respectively.

```
# first 2 rows
head(dat_csv, 2)
## id female ses schtyp prog read write math science socst honors
## 1 45 female low public vocation 34 35 41 29 26 not enrolled
## 2 108 male middle public general 34 33 41 36 36 not enrolled
## awards cid
## 1 0 1
## 2 0 1
```

```
# last 8 rows
tail(dat_csv, 8)
## id female ses schtyp prog read write math science socst
## 193 174 male middle private academic 68 59 71 66 56
## 194 95 male high public academic 73 60 71 61 71
## 195 61 female high public academic 76 63 60 -99 -99
## 196 100 female high public academic 63 65 71 69 71
## 197 143 male middle public vocation 63 63 75 72 66
## 198 68 male middle public academic 73 67 71 63 66
## 199 57 female middle public academic 71 65 72 66 56
## 200 132 male middle public academic 73 62 73 69 66
## honors awards cid
## 193 not enrolled 2 20
## 194 enrolled 2 20
## 195 enrolled 4 20
## 196 -99 20
## 197 enrolled 4 20
## 198 enrolled 7 20
## 199 enrolled 5 20
## 200 enrolled 3 20
```

With a two-dimensional structure, data frames can be subset with
matrix notation `[rows, columns]`

.

Use vectors to subset multiple rows/columns.

Omitting `rows`

or `columns`

specifies all rows
and columns, respectively.

```
# use data.frame() to create a data frame manually
mydata <- data.frame(patient=c("Smith", "Jones", "Williams"),
height=c(72, 61, 66),
diabetic=c(1, 0, 0))
# row 3 column 2
mydata[3,2]
## [1] 66
# first two rows of column height
mydata[1:2, "height"]
## [1] 72 61
# all rows of columns patient and diabetic
mydata[,c("patient", "diabetic")]
## patient diabetic
## 1 Smith 1
## 2 Jones 0
## 3 Williams 0
```

Extract the 2nd, 5th, and 10th rows of the variable

`math`

in the`dat_csv`

data set.

Variables, or columns, of a data frame can be selected with the
`$`

operator, and the resulting object is a vector.

We can further subset elements of the selected column vector using
`[]`

.

```
# subsetting creates a numeric vector
mydata$height
## [1] 72 61 66
# just the second and third elements
mydata$height[2:3]
## [1] 61 66
```

Extract the 2nd, 5th, and 10th rows of the variable

`math`

in the`dat_csv`

data set using the`$`

operator

`colnames(data_frame)`

returns the column names of
*data_frame* (or matrix).

`colnames(data_frame) <- c("some", "names")`

assigns
column names to *data_frame*.

```
# get column names
colnames(mydata)
## [1] "patient" "height" "diabetic"
# assign column names (capitalizing them)
colnames(mydata) <- c("Patient", "Height", "Diabetic")
colnames(mydata)
## [1] "Patient" "Height" "Diabetic"
# to change one variable name, just use vector indexing
colnames(mydata)[3] <- "Diabetes"
colnames(mydata)
## [1] "Patient" "Height" "Diabetes"
```

Use `dim()`

on two-dimensional objects to get the number
of rows and columns.

Use `str()`

, to see the structure of the object, including
its *class* and the data types of elements. We also see the first
few rows of each variable.

```
# number of rows and columns
dim(mydata)
## [1] 3 3
#d is of class "data.frame"
#all of its variables are of type "integer"
str(mydata)
## 'data.frame': 3 obs. of 3 variables:
## $ Patient : chr "Smith" "Jones" "Williams"
## $ Height : num 72 61 66
## $ Diabetes: num 1 0 0
```

Examine the structure of

`dat_csv`

with`str()`

.

You can add variables to data frames by declaring them to be column variables of the data frame as they are created.

Trying to add a column of the wrong length will result in an error.

```
# this will add a column variable called logwrite to d
mydata$logHeight <- log(mydata$Height)
# now we see logwrite as a column in d
colnames(mydata)
## [1] "Patient" "Height" "Diabetes" "logHeight"
# d has 200 rows, and the rep vector has 300
mydata$z <- rep(0, 5)
## Error in `$<-.data.frame`(`*tmp*`, z, value = c(0, 0, 0, 0, 0)): replacement has 5 rows, data has 3
```

Here are some useful functions to create variables from existing variables:

`log()`

: logarithm`min_rank()`

: rank values`cut()`

: cut a continuous variable into intervals with new integer value signifying into which interval original value falls

`scale()`

: standardizes variable (substracts mean and divides by standard deviation)`lag()`

,`lead()`

: lag and lead a variable`cumsum()`

: cumulative sum`rowMeans()`

,`rowSums()`

: means and sums of several columns

Add a variable to

`dat_csv`

called`cmath`

that is the mean-centered math variable. To create this variable, subtract the mean of the`math`

variable from the`math`

variable itself.

Add a variable to

`dat_csv`

called`zmath`

that is the standardized math variable. To create this variable, divide (using`/`

) the`cmath`

variable by its standard deviation, using th`sd()`

function.

Add a variable to

`dat_csv`

called`zmath2`

that is also the standardized math variable by using the`scale()`

function on`dat_csv$math`

.

Use

`head()`

on`dat_csv`

to examine the first few rows of the variables you just created.

Taking the time to prepare your data before analysis can save you frustration and time spent cleaning up errors:

- Cut the data set down to only those observations (rows) and variables (columns) needed for analysis
- Combining data sets
- Make sure variable values are free of errors, such as impossible values or numeric codes for missing values

The package `dplyr`

contains several easy-to-use data
management functions that we will learn to use to manage our data.

If you have not already, load

`dplyr`

into the current session with`library()`

.

`filter()`

The `dplyr`

function `filter()`

will subset the
rows (observations) of a data frame according to some condition
(e.g. all rows where a column value is greater than *x*, all rows
where a column value is equal to *y*).

```
# creating some data manually
dog_data <- data.frame(id = c("Duke", "Lucy", "Buddy", "Daisy", "Bear", "Stella"),
weight = c(25, 12, 58, 67, 33, 9),
sex=c("M", "F", "M", "F", "M", "F"),
location=c("north", "west", "north", "south", "west", "west"))
# dogs weighing more than 40
filter(dog_data, weight > 40)
## id weight sex location
## 1 Buddy 58 M north
## 2 Daisy 67 F south
# female dogs in the north or south locations
filter(dog_data, (location == "north" | location == "south") & sex == "F")
## id weight sex location
## 1 Daisy 67 F south
```

Create a data set from

`dat_csv`

called`low_read`

that contains observations where the`read`

score is less than or equal to 50.

Create a data set from

`dat_csv`

called`mid_read`

that contains observations where the`read`

score is greater than 50 but also less than or equal to 60.

Often, datasets come with many more variable than we want. We can use
the `dplyr`

function `select()`

to keep only the
variables we need.

```
# select 2 variables
select(dog_data, id, sex)
## id sex
## 1 Duke M
## 2 Lucy F
## 3 Buddy M
## 4 Daisy F
## 5 Bear M
## 6 Stella F
```

Use `-`

to unselect (select out) columns.

Sometimes we are given our dataset in parts, with observations spread over many files (collected by different researchers, for example). To create one dataset, we need to append the datasets together row-wise.

The function `rbind()`

appends data frames together. The
variables must be the same between datasets.

Here, we `rbind()`

a new data set of dogs,
`more_dogs`

, to our current `dog_data`

.

```
# make a data.frame of new dogs
more_dogs <- data.frame(id = c("Jack", "Luna"),
weight=c(38, -99),
sex=c("M", "F"),
location=c("east", "east"))
# make sure that data frames have the same columns
names(dog_data)
## [1] "id" "weight" "sex" "location"
names(more_dogs)
## [1] "id" "weight" "sex" "location"
# appended dataset combines rows
all_dogs <- rbind(dog_data, more_dogs)
all_dogs
## id weight sex location
## 1 Duke 25 M north
## 2 Lucy 12 F west
## 3 Buddy 58 M north
## 4 Daisy 67 F south
## 5 Bear 33 M west
## 6 Stella 9 F west
## 7 Jack 38 M east
## 8 Luna -99 F east
```

Append

`low_read`

and`mid_read`

and call the resulting data set`low_and_mid_read`

. Check in the`Environment`

pane that`low_and_mid_read`

has the correct number of observations.

We often receive separate datasets with different variables (columns)
that must be *merged* on a key variable.

Merging is an involved topic, with many different kinds of merges possible, depending on whether every observation in one dataset can be matched to an observation in the other dataset. Sometimes, you’ll want to keep observations in one dataset, even if it is not matched. Other times, you will not.

We will solely demonstrate merges where only matched observations are kept.

We will use the `dplyr`

function `inner_join()`

to perform the merge. The base `R`

function
`merge()`

can be used for the same type of merge.

`inner_join()`

will search both datasets for any variables
with the same name, and will use those as matching variables. If you
need to control which variables are used to match, use the
`by=`

argument.

```
# new dog variable
# matching variables do not have to be sorted
dog_vax <- data.frame(id = c("Luna", "Duke", "Buddy", "Stella", "Daisy", "Lucy", "Jack", "Bear"),
vaccinated = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
# id appears in both datasets, so will be used to match observations
dogs <- inner_join(all_dogs, dog_vax)
## Joining, by = "id"
dogs
## id weight sex location vaccinated
## 1 Duke 25 M north TRUE
## 2 Lucy 12 F west FALSE
## 3 Buddy 58 M north TRUE
## 4 Daisy 67 F south TRUE
## 5 Bear 33 M west FALSE
## 6 Stella 9 F west TRUE
## 7 Jack 38 M east FALSE
## 8 Luna -99 F east TRUE
```

Missing values in `R`

are represented by the reserved
symbol `NA`

(cannot be used for variable names).

Blank fields in a text file will generally be converted to
`NA`

when loaded into `R`

.

Often, datasets use codes, such as impossible numeric values (e.g. -99) to denote missing values.

We can convert missing data codes like -99 in variables to
`NA`

with conditional selection.

In `R`

, `NA`

means “undefined”, so most
operations involving an `NA`

value will result in
`NA`

(as the result is “undefined”):

```
# a sum involving "undefined" is "undefined"
1 + 2 + NA
## [1] NA
# NA could be larger or smaller or equal to 2
c(1, 2, 3, NA) > 2
## [1] FALSE FALSE TRUE NA
# mean is undefined because of the presence of NA
dogs$weight
## [1] 25 12 58 67 33 9 38 NA
mean(dogs$weight)
## [1] NA
```

However, many functions allow the argument `na.rm`

(or
soemthing similar) to be set to `TRUE`

, which will first
remove any `NA`

values from the operation before calculating
the result:

You cannot check for equality to `NA`

because means
“undefined”. It will always result in `NA`

.

Use `is.na()`

instead.

```
# one of the values is NA
x <- c(1, 2, NA)
# check for equality to NA using == is wrong
# RStudio may give you a warning about this (to use is.na() instead)
x == NA
## [1] NA NA NA
# this is correct
is.na(x)
## [1] FALSE FALSE TRUE
```

In

`dat_csv`

, the variable`science`

contains -99 values to signify missing. How can you identify which rows have -99 values?

Convert all of these -99 values to`NA`

.

Calculate the mean of`science`

ignoring the missing values.

Common numeric summaries for continuous variables are the mean,
median, and variance, obtained with `mean()`

,
`median()`

, and `var()`

(`sd()`

for
standard deviation), respectively.

`summary()`

on a numeric vector provides the min, max,
mean, median, and first and third quartiles (interquartile range).

```
# create a new bloodtest data set
bloodtest <- data.frame(id = 1:10,
gender = c("female", "male", "female", "female", "female", "male", "male", "female", "male", "female"),
hospital = c("CLH", "MH", "MH", "MH", "CLH", "MH", "MDH", "MDH", "CLH", "MH"),
doc_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
insured = c(0, 1, 1, 1, 0, 1, 1, 0, 1, 1),
age = c(23, 45, 37, 49, 51, 55, 56, 37, 26, 40),
test1 = c(47, 67, 41, 65, 60, 52, 68, 37, 44, 44),
test2 = c(46, 57, 47, 65, 62, 51 ,62 ,44 ,46, 61),
test3 = c(49, 73, 50, 64, 77, 57, 75, 55, 62, 55),
test4 = c(61, 61, 51, 71, 56, 57, 61, 46, 46, 46))
mean(bloodtest$age)
## [1] 41.9
median(bloodtest$age)
## [1] 42.5
var(bloodtest$age)
## [1] 130.5444
summary(bloodtest$test1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.00 44.00 49.50 52.50 63.75 68.00
```

Correlations provide quick assessments of whether two continuous variables are linearly related to one another.

The `cor()`

function estimates correlations. If supplied
with 2 vectors, `cor()`

will estimate a single correlation.
If supplied a data frame with several variables, `cor()`

will
estimate a correlation matrix.

```
# just a single correlation
cor(bloodtest$test1, bloodtest$test2)
## [1] 0.7725677
# use dplyr select() to pull out just the test variables
scores <- select(bloodtest, test1, test2, test3, test4)
cor(scores)
## test1 test2 test3 test4
## test1 1.0000000 0.7725677 0.8174523 0.7959618
## test2 0.7725677 1.0000000 0.6691743 0.5298743
## test3 0.8174523 0.6691743 1.0000000 0.3612138
## test4 0.7959618 0.5298743 0.3612138 1.0000000
```

Create a correlation table of the

`dat_csv`

variables`read`

,`write`

, and`math`

.

The statistics mean, median and variance cannot be calculated meaningfully for categorical variables (unless just 2 categories).

Instead, we often present frequency tables of the distribution of membership to each category.

Use `table()`

to produce frequency tables.

Use `prop.table()`

on the tables produced by
`table()`

(i.e. the output) to see the frequencies expressed
as proportions.

Some of the categorical variables in this dataset are:

Two-way and multi-way frequency tables (crosstabs) are used to explore the relationships between categorical variables.

We can use `table()`

and `prop.table()`

again.
Within `prob.table()`

, use `margin=1`

for row
proportions and `margin=2`

for column proportions. Omitting
`margin=`

will give proportions of the total.

```
# this time saving the freq table to an object
my2way <- table(bloodtest$gender, bloodtest$hospital)
# counts in each crossing of prog and ses
my2way
##
## CLH MDH MH
## female 2 1 3
## male 1 1 2
# row proportions,
# proportion of prog that falls into ses
prop.table(my2way, margin=1)
##
## CLH MDH MH
## female 0.3333333 0.1666667 0.5000000
## male 0.2500000 0.2500000 0.5000000
# columns proportions,
# proportion of ses that falls into prog
prop.table(my2way, margin=2)
##
## CLH MDH MH
## female 0.6666667 0.5000000 0.6000000
## male 0.3333333 0.5000000 0.4000000
```

Determine the proportion of each socio-economic group (variable

`ses`

) within each school type (variable`schtyp`

) in the`dat_csv`

data set.

`R`

The `stats`

package, loaded with base `R`

,
provides a wide array of commonly used statistical tools, including:

- chi-square tests and several related/similar tests
- t-tests
- correlation and covariance
- ANOVA and linear regression models
- generalized linear regression models (logistic, poisson, etc.)
- time series analysis
- statistical distributions (random numbers, density, distribution, and quantile functions)

Even more tools are available in various downloadable packages.

We will be covering only tools available in `stats`

in
this seminar.

Chi-square tests are often used to test for association between two categorical variables. It tests whether the proportions of membership to categories of one variable are related to the proportion of membership to categories of another variable.

Use `chisq.test()`

to perform the chi-square test of
independence. Supply two categorical variables (can be numeric or
character) as arguments.

Here we test whether `hospital`

and being
`insured`

are independent.

```
# program and ses class appear to be associated
chisq.test(bloodtest$hospital, bloodtest$insured)
## Warning in chisq.test(bloodtest$hospital, bloodtest$insured): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: bloodtest$hospital and bloodtest$insured
## X-squared = 4.4444, df = 2, p-value = 0.1084
```

Because some of our expected cell sizes are less than 5,
`R`

warns us that the chi-squared test may not be close to
exact.

Many of R statistical modeling functions use a common formula syntax. At its most basic:

`y ~ a`

Here `y`

represents an outcome (DV) and `a`

represents a predictor (IV, covariate)

We add more predictors to the model with `+`

:

`y ~ a + b`

Interactions terms can be specified with `:`

between the
interacting variables:

`y ~ a + b + a:b`

A short-hand to request both the “main effects” and the interaction
of 2 variables uses `*`

. The following is equivalent to the
formula immediately above:

`y ~ a*b`

Two sample t-tests model a simple relationship – that the mean of an outcome variable (assumed to be normally distributed) is associated with a two-group predictor variable.

Use `t.test()`

with formula notation to perform an
independent samples t-test of whether test1 score is associated with
gender:

```
# formula notation for independent samples t-test
t.test(test1 ~ gender, data=bloodtest)
##
## Welch Two Sample t-test
##
## data: test1 by gender
## t = -1.1813, df = 6.2951, p-value = 0.2802
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -26.670132 9.170132
## sample estimates:
## mean in group female mean in group male
## 49.00 57.75
```

Test1 score does not appear to differ between the genders.

Perform a t-test to determine whether

`math`

scores are different between genders (variable`female`

) with data set`dat_csv`

.

With a paired samples (dependent samples) t-test, we test whether the means of two possibly correlated variables are different.

Below we use `t.test()`

to test whether the means of
patients’ test1 and test3 scores tend to be different. For a paired
test, do not use formula notation, but instead specify two vectors of
equal length and `paired=TRUE`

.

```
t.test(bloodtest$test1, bloodtest$test3, paired=TRUE)
##
## Paired t-test
##
## data: bloodtest$test1 and bloodtest$test3
## t = -4.3231, df = 9, p-value = 0.001925
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -14.014139 -4.385861
## sample estimates:
## mean difference
## -9.2
```

The paired t-test suggests that test3 scores are significantly different from test1 scores.

Linear regression expands the simple predictor-outcome model of t-tests by allowing more predictors. These predictors be distribtued in any way (not just binary predictors).

The `lm()`

function is used to fit linear regression
models.

Numeric and character variable predictors are acceptable. Character variables are essentially treated as factors (categorical variables), where by default, a dummy (0/1) variable is entered into the model for each level except for the first.

Below we fit a model where we predict a patients’s test1 score from their age and gender, and store the model as an object.

Model objects, the output of regression model fitting functions like
`lm()`

, are usually too complex to examine directly, as they
contain an abundance of diverse information related to the the fitted
model.

Instead, we often use a set of extractor functions to pull out the desired information from a model object.

`summary()`

: regression table of coefficients, standard errors, test statistics, and p-values, as well as overall model fit statistics`coef()`

: just model coefficients`residuals()`

: residuals`predict()`

: predictions based on fitted model`confint()`

: confidence intervals for coefficients

These functions will often (but not always) work with model objects
fit with functions other than `lm()`

.

```
# summary produces regression table and model fit stats
summary(m1)
##
## Call:
## lm(formula = test1 ~ age + gender, data = bloodtest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.646 -6.164 1.043 7.146 10.104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.4871 11.7845 2.078 0.0763 .
## age 0.6206 0.2824 2.198 0.0639 .
## gendermale 5.0265 6.2477 0.805 0.4475
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.316 on 7 degrees of freedom
## Multiple R-squared: 0.4981, Adjusted R-squared: 0.3547
## F-statistic: 3.474 on 2 and 7 DF, p-value: 0.08957
```

Peform a linear regression of the outcome

`read`

with predictors`math`

,`female`

, and`ses`

using`dat_csv`

. Call the model object`m1`

. Interpret your results.

```
# just the coefficients
coef(m1)
## (Intercept) age gendermale
## 24.4871383 0.6205788 5.0265273
# 95% confidence intervals
confint(m1)
## 2.5 % 97.5 %
## (Intercept) -3.37869862 52.352975
## age -0.04713382 1.288291
## gendermale -9.74700686 19.800062
# first 5 observed values, predicted values and residuals
# cbind() joins column vectors into a matrix
cbind(bloodtest$test1, predict(m1), residuals(m1))
## [,1] [,2] [,3]
## 1 47 38.76045 8.239550
## 2 67 57.43971 9.560289
## 3 41 47.44855 -6.448553
## 4 65 54.89550 10.104502
## 5 60 56.13666 3.863344
## 6 52 63.64550 -11.645498
## 7 68 64.26608 3.733923
## 8 37 47.44855 -10.448553
## 9 44 45.64871 -1.648714
## 10 44 49.31029 -5.310289
```

When an object of class `lm`

is supplied to
`anova()`

, an analysis of variance table of the fitted model
is produced. The ANOVA partioning uses sequential sums of squares (Type
I SS), so if other sums of squares are desired, see the
`Anova()`

function in the `car`

package.

```
# ANOVA sequential sums of squares
anova(m1)
## Analysis of Variance Table
##
## Response: test1
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 546.77 546.77 6.2997 0.0404 *
## gender 1 56.18 56.18 0.6473 0.4475
## Residuals 7 607.55 86.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The `anova()`

function is often used to conduct a
likelihood ratio test, that compares the fit of nested models to the
data. This test allows one to assess whether the addition of several
predictors improves the fit of the model.

Simply specify two nested models to `anova()`

to conduct
the likelihood ratio test:

```
# fit another linear regression model, adding hosiptal as predictor (two parameters added to model):
m2 <- lm(test1 ~ age + gender + hospital, data=bloodtest)
# printing an lm object will list the coefficients only
anova(m2, m1)
## Analysis of Variance Table
##
## Model 1: test1 ~ age + gender + hospital
## Model 2: test1 ~ age + gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5 525.14
## 2 7 607.55 -2 -82.414 0.3923 0.6946
```

Hospital does not appear to improve the fit of the model
significantly, so we would typically choose `m1`

, the more
parsimoniuous model.

Use

`anova()`

to determine whether adding predictor`prog`

(requiring 2 parameters) significantly improves the fit over model`m1`

.

Supplying a model object to the generic `plot()`

function
will usually produce a set of regression diagnostics helpful for
assessing the assumptions of the regression model.

For `lm`

objects, we get the following plots:

- residual vs fitted
- normal q-q-plot of residuals
- scale-location
- residuals vs leverage

Let’s take a look at these 4 plots for our model. They will not show any strong evidence that the linear regression model is inappropriate.

```
# plots all 4 plots at once (otherwise one at a time)
layout(matrix(c(1,2,3,4),2,2))
# 4 diagnostic plots
plot(m1)
```

Inspect the diagnostic plots model

`m1`

.

Logistic regression models how variation in a binary outcome can be explained by a set of predictors.

We can model variation in non-normally distributed outcomes with
*generalized linear models*.

We use `glm()`

for generalized linear models, including
logistic regression. Specify the distribution of the outcome in the
`family=`

argument. A link function can be specified within
`family=`

, but the canonical link function
(e.g. `link="logit"`

link for `family=binomail`

)
is used as the default, so is not necessary.

Here we model the probability of being insured as predicted by age.
The `glm()`

function will model the probability (odds) of
`insured=1`

.

```
# family=binomail uses link=logit by default
m_ins <- glm(insured ~ age, data=bloodtest, family=binomial)
```

We use the same functions as in linear regression to extract
information from our `glm`

model object.

Using `summary()`

on an object of class `glm`

produces a table of regression coefficients again:

```
summary(m_ins)
##
## Call:
## glm(formula = insured ~ age, family = binomial, data = bloodtest)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8364 -0.6739 0.6244 0.8312 1.1948
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.62402 2.75298 -0.590 0.555
## age 0.06089 0.06739 0.904 0.366
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12.217 on 9 degrees of freedom
## Residual deviance: 11.339 on 8 degrees of freedom
## AIC: 15.339
##
## Number of Fisher Scoring iterations: 4
```

Odds ratios for logistic regression are obtained by exponentiating the coefficients and confidence intervals:

```
# ORs
exp(coef(m2))
## (Intercept) age gendermale hospitalMDH hospitalMH
## 7.879529e+09 2.177465e+00 1.241206e+02 1.387734e-04 2.267828e-03
# confidence intervals on ORs
exp(confint(m2))
## 2.5 % 97.5 %
## (Intercept) 1.843467e-05 3.367946e+24
## age 8.576561e-01 5.528270e+00
## gendermale 2.452115e-06 6.282708e+09
## hospitalMDH 3.162568e-16 6.089375e+07
## hospitalMH 5.916464e-13 8.692766e+06
```

A huge number of statistically-related packages make `R`

perhaps the most powerful and comprehensive statistical software
currently available. See the following packages for common statistical
analyses:

`lme4`

,`MCMCglmm`

: mixed (multilevel) analysis`survival`

: survival analysis`ordinal`

: ordinal logistic regression`nnet`

: multinomial logistic regression`lavaan`

: latent variable and structural equation modeling (SEM)`survey`

: complex survey analysis`MICE`

,`amelia`

: multiple imputation`boot`

: bootstrapping`rstan`

,`rethinking`

,`brms`

: Bayesian models

`R`

graphicsBase `R`

comes with several powerful graphical functions
that give the user a great deal of control over the appearance of the
graph.

The graphing functions most commonly used are:

`plot()`

: scatter plots`hist()`

: histograms`boxplot()`

: box plots (box-and-whisker)`barplot()`

: bar plots

Scatter plots visualize the covariation of two variables, both typically continuous.

Specify two variables to `plot()`

to produce a scatter
plot:

Changing the axis labels and adding a title are easy with
`xlab=`

, `ylab=`

, and `main=`

:

We can change the color of the symbols with `col=`

and the
shape of the symbols with `pch=`

.

- Run the function
`colors()`

to see a list of color names - See
`?pch`

for a list of plotting symbols

```
plot(bloodtest$test1, bloodtest$test2,
xlab="Test 1",
ylab="Test 2",
main="Plot of Test1 vs Test2",
col="steelblue",
pch=17)
```

Create a scatter plot of the

`dat_csv`

variables`read`

(x-axis) and`write`

(y-axis). Change the shape of the symbols to solid squares and their color to red.

Boxplots are often used to compare the distribution of a continuous variable across the levels of a categorical variable.

Use formula notation in `boxplot()`

to specify boxplots of
the variable on the left side of `~`

by the variable on the
right side.

The `R`

plotting functions share many of the same
arguments. Here we use `xlab=`

, `ylab=`

, and
`main=`

again to change the titles, and `col=`

to
change the fill color of the boxes.

```
boxplot(bloodtest$test2 ~ bloodtest$insured,
xlab="Insured",
ylab="Test 2",
main = "Boxplots of Test2 by Insurance Status",
col="lightblue")
```

One common use of bar plots is to visualize the frequencies of levels of grouping variables, where the height of the bar represents the number of observations falling into that grouping.

We can thus think of bar plots as graphical representations of frequency tables.

`R`

makes this connection explicit by allowing the output
of `table()`

to be used as the input to
`barplot()`

.

For example, let’s plot the frequencies of groupings made by the
variables `gender`

and `hospital`

. We add a legend
to the plot with `legend.text=TRUE`

.

Now let’s request side-by-side bars instead of stacked bars with
`beside=TRUE`

:.

We also use `col=`

again to color the bars, but now we
have 2 colors to specify as a vector:

```
tab <- table(bloodtest$gender, bloodtest$hospital)
barplot(tab,
legend.text = TRUE,
beside=TRUE,
col=c("lawngreen", "sandybrown"),
xlab="hospital")
```

Create a bar plot of

`ses`

by`prog`

in the data set`dat_csv`

. Use the colors red, green, and blue to color the bars. Add a legend.

`ggplot2`

for graphicsAlthough Base `R`

graphics functions are powerful and can
be used to make publication-quality graphics, for new `R`

users they are mostly ideal for creating quick, exploratory graphs.

Some common plotting features such as legends and multiple panels can
be difficult to construct in base `R`

graphics.

The package `ggplot2`

is a better alternative to create
complex, layered, and publication-quality graphics.

`ggplot2`

uses a structured*grammar of graphics*that provides an intuitive framework for building graphics layer-by-layer, rather than memorizing lots of plotting commands and options`ggplot2`

graphics often require less work to make beautiful and eye-catching

Please load the

`ggplot2`

package into your session now with`library()`

.

`ggplot2`

plotThe basic specification for a `ggplot2`

plot is to specify
which variables are mapped to which aspects of the graph (called
*aesthetics*) and then to choose a shape (called a *geom*)
to display on the graph.

Although the full syntax of `ggplot2`

is beyond the scope
of this seminar, we can produce many plots with some variation of the
following syntax:

`ggplot(`

*dataset*, aes(x=*xvar*, y=*yvar*))
+ *geom_function()*

(Note that the package is named `ggplot2`

while this
function is called `ggplot()`

)

That syntax uses the data in

, puts
*dataset*

on the x-axis, *xvar*

on
the y-axis, and uses *yvar*

to produce
shapes for the graph. For example *geom_function()*`geom_point()`

will produce
a scatter plot, while `geom_bar()`

produces bar plots.

Inside `aes()`

, we can map more variables to graphical
aspects of the plot, such as the `color`

, `size`

and `shape`

of plotted objects.

For a much more detailed explanation of the grammar of graphics
underlying `ggplot2`

, see our
Introduction
to ggplot2 seminar.

`ggplot2`

To demonstrate how we layer graphics with the `ggplot2`

package, we will be using the `dat_csv`

data set.

Here we specify a scatter plot of math vs write.

We add layers with `+`

.

Here we add a layer that produces a regression line (and confidence
interval) with `geom_smooth()`

.

```
# a scatterplot of math vs write with best fit line
ggplot(dat_csv, aes(x=math, y=write)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
```

With `ggplot`

, it is easy to link the `color`

,
`shape`

, and `size`

of the symbols to a
variable.

Notice how `color`

and `fill`

will also group
the data to produce separate plots.

Also notice that `ggplot`

will insert legends for you by
default.

```
# a scatterplot and best fit line, by gender
# color affects the best fit line, fill affects the confidence intervals
ggplot(dat_csv, aes(x=math, y=write, color=female, fill=female)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
```

We can easily make a panel of graphs with the function
`facet_wrap`

. Here we create separate plots for each
program.

```
# panel of scatterplot and best fit line, colored by gender, paneled by prog
ggplot(dat_csv, aes(x=math, y=write, color=female, fill=female)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(~prog)
## `geom_smooth()` using formula 'y ~ x'
```

`ggplot2`

comes with several *complete theme*
functions that change the overall appearance of the plot. We try 2
themes below:

```
# panel of scatterplot and best fit line, colored by gender, paneled by prog
ggplot(dat_csv, aes(x=math, y=write, color=female, fill=female)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(~prog) +
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
```

```
# panel of scatterplot and best fit line, colored by gender, paneled by prog
ggplot(dat_csv, aes(x=math, y=write, color=female, fill=female)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(~prog) +
theme_dark()
## `geom_smooth()` using formula 'y ~ x'
```

Use ggplot2 to make a scatter plot of

`read`

(x-axis) vs`write`

(y-axis) from data set`dat_csv`

. Color the dots by`ses`

.

Now add a best fit line with`geom_smooth()`

. Why are there 3 lines?

R Markdown allows users to create reproducible documents that weave narrative text together with R code and the output it produces when executed.

One of R Markdown’s great strengths is the variety of output documents it can produce such as HTML webpages, LaTeX documents, and Word documents.

These documents are dynamically generated – whenever we need to change code or data, we can simply update the R Markdown file, compile it, and the output will be automatically updated in the resulting docuemnt.

These documents can then be shared with an audience to provide the most up-to-date content.

For example, this slide-show was entirely produced in R Markdown.
When R Markdown encounters the following code block in the file used to
create this slide-show, `RMarkdown`

embeds both the code
block itself and the output (a graph) into the slide show:

We can open a new Markdown file template through the
`File`

menu in RStudio.

A. Choose

`File`

->`New File`

->`R Markdown...`

B. Fill the`Title`

field and`Author`

fields with “Practice” and your name, respectively.

C. In the left menu, select`Document`

, and for`Default Output Format`

select option`HTML`

(these are the defaults).

D. Click OK

R Markdown files typically use the extension `.Rmd`

or
`.rmd`

R Markdown files consist of 3 elements:

- YAML header: The header between
`---`

at the top; determines document type and overall appearance - Markdown: code tags like
`##`

and`**`

used to format text - R code chunks: delineated by ``` and used to process data and output results

Once we are pleased with its contents, we can compile the R Markdown
file and render it into its final output format by clicking the
`Knit`

button.

Click the

`Knit`

button on your R Markdown document. Save the file with the name`myfirst.rmd`

. Observe how each element in the R Markdown file becomes an element in the output document.Note:If`R`

tells you that you need to install some packages before knitting, you must install the packages for`R`

to output the documents.

Markdown is a *lightweight* markup language, which is a system
of code tags used to format documents. Tags are used to define sections,
change the appearance of text, build tables, link images, and so on.

Hypertext Markup Language (HTML) is a markup language designed to be used for web pages. The placement and appearance of text, images, frames, tables, and other elements can all be specified in HTML.

Markdown was originally designed to be a shorthand for HTML. For
example in HTML, to italicize text, you enclose it in
`<em> </em>`

. In Markdown, you just enclose it in
`* *`

.

Some important Markdown tags:

`*italics*`

produces*italics*`**bold**`

produces**bold**`# Header`

produces a level 1 header (largest); add more`#`

for lower level headers.

Do not insert spaces in between formatting tags and the text.

Identify which text in your .rmd file has been formatted with Markdown tags.

Try changing the bolded word to italics.

Add a level 4 header to the document called “Header 4”, knit the document, and compare its size to the other 2 headers.

For R Markdown files, R code is placed in code chunks that can be
interleaved with the text of the document. All of the code chunks
execute sequentially in one session when the `.rmd`

file is
rendered, so objects created in one chunk are available to all
subsequent chunks.

R Markdown actually uses the package `knitr`

to process
the R code chunks.

A large array of
`knitr`

options allows some control over the appearance
of R code, text output and graphical output in the final document.

R code chunks are delimited by ````{r `

at the beginning and *chunk_label*,
*options*}`````

at the
end.

To create a code chunk:

- Use the shortcut
`Ctrl + Alt + I`

(`Cmd + Option + I`

on Macs) - Use the green Insert code button towards the top right of the script
editor (will appear if the file extension is
`.rmd`

) - Type out the tags

A. Add an R code chunk using the keyboard shortcut or the Insert button to your R Markdown file.

B. Inside this code chunk, load the trees data set with the code`data("trees")`

.

C. Produce a correlation matrix of the 3 variables in the`trees`

data using the code`cor(trees)`

.

D. Knit the .Rmd file and take note of the correlation matrix.

E. Then, restrict the trees dataset to observations with

`Height`

greater than 80 by adding the code`trees <- subset(trees, trees$Height>80)`

immediately after`data("trees")`

.

F. Knit the .Rmd file again and notice how the correlation matrix changes.

The YAML header controls the overall appearance of the document,
including the document type. It is located between the `---`

at the top of the document.

The YAML (Yet Another Markup Language) header works with a program called pandoc, that can translate code tags between different document types.

The default document type is an `html_document`

, which is
typically a webpage. Other document types include:

- HTML slideshows
- LaTeX pdf documents
- Word documents
- PowerPoint presentations
- Books

In the YAML header of your R Markdown file, change the

`output:`

to`word_document`

.

Now change`output:`

to`slidy_presentation`

.

If you have LaTeX installed on your computer, you can change the`output:`

to`pdf_document`

.

- UCLA OARC Statistical Methods and Data Analytics Website, our website with thousands of pages for help with statistical analysis
- StackOverflow, a message board where R-related and statistics-related questions are answered
*R for Data Science*, an online book written by the people who make RStudio and the tidyverse, that discusses how to use R for all steps of data analysis, with special emphasis on using the tidyverse