R Graphics: Introduction to ggplot2

Lionel Messi’s Majestic Season

What Are the Demographics Of Heaven?

The Nation’s Best Punter Is Changing The Game

A Nerd’s Guide To The 2,229 Paintings at MoMA

Purpose of this seminar

This seminar introduces how to use the R ggplot2 package, particularly for producing statistical graphics for data analysis.

Today, we will:

Text in this font signifies R code or variables in a data set

Text that appears like this represents an instruction to practice ggplot2 coding

Seminar packages

Load the packages into the current R session with library(). In addition to ggplot2, we load package MASS (installed with R) for data sets.

#load libraries into R session
library(ggplot2)
library(MASS)

Please use library() to load packages ggplot2 and MASS.

The ggplot2 package

ggplot2 documentation

https://ggplot2.tidyverse.org/reference/

The official reference webpage for ggplot2 has help files for its many functions and operators, with many examples.

The grammar of graphics

What is a grammar of graphics?

A grammar of a language defines the rules of structuring words and phrases into meaningful expressions.

A grammar of graphics defines the rules of structuring mathematical and aesthetic elements into a meaningful graph.

Leland Wilkinson (2005) designed the grammar upon which ggplot2 is based.

Elements of grammar of graphics

  1. Data: variables mapped to aesthetic features of the graph.
  2. Geoms: objects/shapes on the graph.
  3. Stats: statistical transformations that summarize data (e.g., mean, confidence intervals).
  4. Scales: mappings of aesthetic values to data values. Legends and axes visualize scales.
  5. Coordinate systems: the plane on which data are mapped on the graphic.
  6. Faceting: splitting the data into subsets to create multiple variations of the same graph (paneling).
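As a rough sketch of how the six elements fit together in code, the example below builds one plot in which each line supplies one element. It assumes the mpg data set bundled with ggplot2 (not otherwise used in this seminar); the variables, colors, and faceting variable are arbitrary choices for illustration only.

```r
library(ggplot2)

# Each line contributes one element of the grammar
p_grammar <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + # 1. data mapped to aesthetics
  geom_point() +                                                  # 2. geom: points
  stat_smooth(method = "lm", se = FALSE) +                        # 3. stat: fitted regression lines
  scale_color_manual(values = c("black", "blue", "red")) +        # 4. scale: colors for the 3 drv levels
  coord_cartesian() +                                             # 5. coordinate system
  facet_wrap(~year)                                               # 6. faceting: one panel per year
p_grammar
```

Removing any one of the last four lines still produces a valid plot, because ggplot2 supplies defaults for stats, scales, coordinates, and faceting.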

The ggplot() function and aesthetics

All graphics begin with a call to the ggplot() function (note: not ggplot2, which is the name of the package).

In the ggplot() function we specify:

Example syntax for ggplot() specification (italicized words are to be filled in by you):


ggplot(data, aes(x=xvar, y=yvar))


Notice that the aesthetics are specified inside aes(), which is itself nested inside of ggplot().

Aesthetics specified inside of ggplot() are inherited by subsequent layers:

# scatter plot of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point() 

geom_point() inherits x and y aesthetics

Layers and overriding aesthetics

Specifying x and y aesthetics alone, without a geom or stat, produces a plot with just the 2 axes.

ggplot(data = txhousing, aes(x=volume, y=sales))

without a geom or stat, just axes

Add layers with + to add graphical components.

Layers consist of geoms, stats, scales, faceting, and themes, which we will discuss in detail.

Remember that subsequent layers inherit aesthetics from ggplot(). However, specifying new aesthetics in a layer will override the aesthetics specified in ggplot().

# scatter plot of volume vs sales
#  with rug plot colored by number of listings
ggplot(txhousing, aes(x=volume, y=sales)) + # x=volume and y=sales inherited by all layers  
  geom_point() +
  geom_rug(aes(color=listings))   # color will only apply to the rug plot because not specified in ggplot()

both geoms inherit aesthetics from ggplot(), but geom_rug() also adds color aesthetic

Aesthetics

Aesthetics are the visual properties of objects on the graph.

Which aesthetics are required and which are allowed vary by geom.

Commonly used aesthetics:

Mapping vs setting

Map aesthetics to variables inside the aes() function. By mapping, we mean the aesthetic will vary as the variable varies.

For example, mapping x=volume causes the position of the plotted data to vary with values of variable “volume”. Similarly, mapping color=listings causes the color of objects to vary with values of variable “listings”.

# mapping color to listings inside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(aes(color=listings))

color of points varies with number of listings

Set aesthetics to a constant outside the aes() function.

Compare the following graphs:

# setting color to green outside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(color="green")

color of points set to constant green

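Setting an aesthetic to a constant within aes() can lead to unexpected results: the constant is treated as a variable with a single value, so the aesthetic takes the first value of the default scale rather than the value specified.

# color="green" inside of aes()
# aes() treats "green" as a constant variable with the single value "green",
#   so geom_point() uses the first default color and adds a legend labeled "green"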
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(aes(color="green"))

aesthetic set to constant within aes() leads to unexpected results
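One way to see the difference between setting and mapping is to inspect the colors ggplot2 actually assigns to the points. The sketch below uses layer_data(), a ggplot2 helper that returns the computed data for a layer; the object names p_set and p_map are made up for this illustration.

```r
library(ggplot2)

# color set to a constant outside aes()
p_set <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point(color = "green")

# the same string placed inside aes() is treated as data
p_map <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point(aes(color = "green"))

# layer_data() returns the data ggplot2 computed for a layer,
# including the colour assigned to each point
unique(layer_data(p_set)$colour)  # the constant we set
unique(layer_data(p_map)$colour)  # a default scale color, not green
```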

Exercise 1

To practice using the grammar of graphics, we will use the Sitka dataset (from the MASS package).

library(MASS)

Note: Data sets that are loaded into R with a package are immediately available for use. To see the object appear in RStudio’s Environment pane (so you can click to view it), run data() on the data set, and then click the data set name in the Environment pane.

Run the command data(Sitka) and then click on Sitka in the Environment pane.

The Sitka dataset describes the growth of trees over time, some of which were grown in ozone-enriched chambers. The data frame contains 395 rows of the following 4 columns:

A. Create a scatter plot of Time vs size to view the growth of trees over time.

B. Color the scatter plot points by the variable treat.

C. Add an additional geom_smooth() (loess) layer to the graph.

D. SET the color of the loess smooth to “green” rather than have it colored by treat. Why is there only one smoothed curve now?

Geoms

geoms: bar, boxplot, density, histogram, line, point

Geom functions differ in the geometric shapes produced for the plot.

Some example geoms:

Geoms and aesthetics

Each geom has required aesthetics. For example, geom_point() requires both x and y, the minimal specification for a scatterplot.

Geoms differ in which aesthetics they accept. For example, geom_point() accepts the aesthetic shape, which defines the shapes of points on the graph, while geom_bar() does not accept shape.

Check the geom help files for required and understood aesthetics. In the Aesthetics section of the geom’s help file, required aesthetics are bolded.
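As a quick programmatic complement to the help files, the aesthetics a geom understands can also be queried from the Geom objects that ggplot2 exports. This is only a sketch of one way to check; it assumes the exported objects GeomPoint and GeomBar, and the help files remain the authoritative listing.

```r
library(ggplot2)

# Geom objects store their aesthetics; these can be inspected directly
GeomPoint$required_aes               # aesthetics geom_point() must be given
"shape" %in% GeomPoint$aesthetics()  # does geom_point() accept shape?
"shape" %in% GeomBar$aesthetics()    # does geom_bar() accept shape?
```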

We will tour some commonly used geoms.

Histograms

# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=median)) + 
  geom_histogram() 

histograms visualize distribution of variable mapped to x

Histograms are popular choices to depict the distribution of a continuous variable.

geom_histogram() cuts the continuous variable mapped to x into bins, and counts the number of values within each bin.
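By default geom_histogram() uses 30 bins and prints a message suggesting that a better value be chosen. A minimal sketch of controlling the binning follows, assuming a bin width of 10,000 dollars is reasonable for median sale price (an arbitrary choice for illustration):

```r
library(ggplot2)

# binwidth sets the width of each bin in units of the x variable;
# alternatively, bins = 50 would request a fixed number of bins
p_hist <- ggplot(txhousing, aes(x = median)) +
  geom_histogram(binwidth = 10000)
p_hist
```

Narrower bins reveal more detail but produce a noisier shape, so it is worth trying a few values.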

Density plots

# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=median)) + 
  geom_density() 

density plots visualize smoothed distribution of variable mapped to x

Density plots are basically smoothed histograms.

Density plots for several groups can be overlaid on the same axes by mapping a grouping variable to color, which makes them easier to compare across groups than histograms.

# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=median, color=factor(month))) + 
  geom_density() 

densities of median price by month

Boxplots

# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=factor(year), y=median)) + 
  geom_boxplot() 

boxplots are useful to compare distribution of y variable across levels of x variable

Boxplots compactly visualize particular statistics of a distribution:

Boxplots are particularly useful for comparing distributions between groups.

geom_boxplot() will create boxplots of the variable mapped to y for each group defined by the values of the x variable.

Bar plots

ggplot(diamonds, aes(x=cut)) + 
  geom_bar() 

geom_bar displays frequencies of levels of x variable

Bar plots are often used to display frequencies of factor (categorical) variables.

geom_bar() by default produces a bar plot where the height of the bar represents counts of each x-value.

The color that fills the bars is controlled not by color but by fill. For geom_bar(), fill should be mapped to a factor (categorical) variable so that each bar is split by its levels. We can visualize a crosstabulation of variables by mapping one of them to fill in geom_bar():

ggplot(diamonds, aes(x=cut, fill=clarity)) + 
  geom_bar() 

frequencies of cut by clarity

Scatter plots

# scatter of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) + 
  geom_point() 

scatter plot of volume vs sales

Scatter plots depict the covariation between pairs of variables (typically both continuous).

geom_point() depicts covariation between variables mapped to x and y.

Scatter plots are among the most flexible graphs, as variables can be mapped to many aesthetics such as color, shape, size, and alpha.

ggplot(txhousing, aes(x=volume, y=sales, 
                      color=listings, alpha=median, size=inventory)) + 
  geom_point() 

scatter plot of volume vs sales, colored by number of listings, transparent by median price, and sized by inventory

Line graphs

ggplot(txhousing, aes(x=date, y=sales, group=city)) + 
  geom_line() 

line graph of sales over time, separate lines by city

Line graphs depict covariation between variables mapped to x and y with lines instead of points.

geom_line() will treat all data as belonging to one line unless a variable is mapped to one of the following aesthetics to group the data into separate lines:

Let’s first examine a line graph with no grouping:

ggplot(txhousing, aes(x=date, y=sales)) + 
  geom_line() 

line graph of sales over time, no grouping results in garbled graph

As you can see, unless the data represent a single series, line graphs usually call for some grouping.

Using color or linetype in geom_line() will implicitly group the lines.

ggplot(txhousing, aes(x=date, y=sales, color=city)) + 
  geom_line() 

line graph of sales over time, colored and grouped by city

Exercise 2

We will be using the Sitka data set again for this exercise.

A. Using 2 different geoms, compare the distribution of size between the two levels of treat. Use a different color for each distribution.

B. Use a bar plot to visualize the crosstabulation of Time and treat. Put Time on the x-axis.

C. Create a line graph of size over Time, with separate lines by tree and lines colored by treat.

D. Imagine you plan to submit your line graph of size over time by tree to a journal that does not print color graphs. How else can you distinguish between the 2 treatments on the graph?

Stats, Scales, Coordinate Systems, and Faceting

Stats

The stat functions statistically transform data, usually as some form of summary, such as the mean, standard deviation, or a confidence interval.

Each stat function is associated with a default geom, so no geom is required for shapes to be rendered.

stat_summary(), perhaps the most useful of all stat functions, applies a summary function to the variable mapped to y for each value of the x variable. The default summary function is mean_se(), with associated geom geom_pointrange(), which will produce a plot of the mean (dot) and standard error (lines) of the variable mapped to y for each value of the x variable.

# summarize sales (y) for each year (x)
ggplot(txhousing, aes(x=year, y=sales)) + 
  stat_summary() 

mean and standard errors of sales by year

Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.

What makes stat_summary() so powerful is that you can use any function that accepts a vector as the summary function (e.g. mean(), var(), max(), etc.) and the geom can also be changed to adjust the shapes plotted.
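For instance, the following sketch swaps in median() as the summary function and changes the geom so that the yearly medians are drawn as points joined by a line. The choice of median and of the two geoms is just one possibility; the object name p_median is made up for this example.

```r
library(ggplot2)

# median of sales (y) for each year (x), drawn as points joined by a line
# (the fun argument was named fun.y before ggplot2 3.3.0)
p_median <- ggplot(txhousing, aes(x = year, y = sales)) +
  stat_summary(fun = median, geom = "line") +
  stat_summary(fun = median, geom = "point")
p_median
```

Rows with missing sales are dropped (with a warning) before the summary function is applied.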

Scales

Scales define which aesthetic values are mapped to the data values.

Here is an example of a color scale that defines which colors are mapped to values of treat:

color   treat
red     ozone
blue    control


We can use a scale function to change the colors to “green” and “orange”.

Or, we might have treat mapped to shape, and instead of squares and circles we want to use triangles and stars.

Scale functions have names with structure scale_aesthetic_suffix, where aesthetic is the name of an aesthetic like color or shape or x, and suffix is some descriptive word that defines the functionality of the scale.
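A small sketch of this naming pattern, using the treat variable from the Sitka data described above: scale_color_manual() and scale_shape_manual() both follow scale_aesthetic_suffix, with "manual" meaning that we supply the aesthetic values ourselves. The shape codes 17 and 8 are one possible choice for triangles and stars.

```r
library(ggplot2)
library(MASS)  # for the Sitka data set

# scale_<aesthetic>_manual() functions take one value per level of the mapped variable
p_scales <- ggplot(Sitka, aes(x = Time, y = size, color = treat, shape = treat)) +
  geom_point() +
  scale_color_manual(values = c(control = "green", ozone = "orange")) +
  scale_shape_manual(values = c(control = 17, ozone = 8))  # 17 = filled triangle, 8 = star
p_scales
```

Naming the elements of values by factor level, as done here, avoids relying on the order of the levels.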

Some example scale functions:

See the ggplot2 documentation page section on scales to see a full list of scale functions.

To control the aesthetic values used by the scale, pass a vector to the appropriate argument of the scale function (usually values).

Here is a color scale that ggplot2 chooses for us:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point()

default color scale

We can use scale_color_manual() to specify which colors we want to use in values=:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) 

using scale_color_manual to respecify colors

Scale functions for the axes

Remember that x and y are aesthetics, and the two axes visualize the scale for these aesthetics.

Thus, we use scale functions to control the scaling of these axes.

When y is mapped to a continuous variable, we will typically use scale_y_continuous() to control its scaling (use scale_y_discrete() if y is mapped to a factor). Similar functions exist for the x aesthetic.

A description of some of the important arguments to scale_y_continuous():

Our current graph of carat vs price has y-axis tick marks at 0, 5000, 10000, and 15000:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) 

default y-axis tick marks

Let’s put tick marks at all grid lines along the y-axis using the breaks argument of scale_y_continuous:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500))

changing y-axis tick marks

Now let’s relabel the tick marks to reflect units of thousands (of dollars) using labels:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
                     labels=c(0,2.5,5,7.5,10,12.5,15,17.5))

relabeling y-axis tick marks

And finally, we’ll retitle the y-axis using the name argument to reflect the units:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
                     labels=c(0,2.5,5,7.5,10,12.5,15,17.5),
                     name="price(thousands of dollars)")

new y-axis title

Modifying axis limits and titles

Although we can use scale functions like scale_x_continuous() to control the limits and titles of the x-axis, we can also use the following shortcut functions:

To set axis limits, supply a vector of 2 numbers (inside c(), for example) to one of the limits functions:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  xlim(c(1,3)) # carat ranges from about 0.2 to 5 in the data

restricting axis limits will zoom in
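Note that xlim() does more than zoom: values outside the limits are treated as missing and dropped from the plot, and ggplot2 warns about the removed rows. When the goal is only to magnify a region while keeping all of the data (which matters for layers that compute statistics, such as smoothers), coord_cartesian() is an alternative. A hedged sketch comparing the two, with made-up object names p_lim and p_zoom:

```r
library(ggplot2)

# xlim() replaces carat values outside 1 to 3 with NA before plotting
p_lim <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  xlim(c(1, 3))

# coord_cartesian() changes only the visible window; no data are dropped
p_zoom <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  coord_cartesian(xlim = c(1, 3))

# compare how many x values were set to missing in each plot
sum(is.na(layer_data(p_lim)$x))
sum(is.na(layer_data(p_zoom)$x))
```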

We can use labs() to specify titles for the overall graph, the axes, and the legends (guides).

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  labs(x="CARAT", y="PRICE", color="CUT", title="CARAT vs PRICE by CUT")

respecifying all titles with labs

Guides visualize scales

Guides (axes and legends) visualize a scale, displaying data values and their matching aesthetic values.

Most guides are displayed by default. The guides() function sets and removes guides for each scale.

Here we use guides() to remove the color scale legend:

# notice no legend on the right anymore
ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  guides(color="none")

color legend removed

Coordinate systems

Coordinate systems define the planes on which objects are positioned in space on the plot. Most plots use Cartesian coordinate systems, as do all the plots in the seminar. Nevertheless, ggplot2 provides multiple coordinate systems, including polar, flipped Cartesian and map projections.
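Although the seminar stays with Cartesian coordinates, a brief sketch may help show what changing the coordinate system does. The same bar plot of cut is drawn below with flipped Cartesian and with polar coordinates; the object names are made up for this example.

```r
library(ggplot2)

# a standard bar plot of cut in the default Cartesian coordinates
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

p_flip  <- p_bar + coord_flip()   # flipped Cartesian: horizontal bars
p_polar <- p_bar + coord_polar()  # polar: bar heights become radii

p_flip
p_polar
```

The data, geom, and stat are identical in both plots; only the plane on which they are drawn differs.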

Faceting (paneling)

Split plots into small multiples (panels) with the faceting functions, facet_wrap() and facet_grid(). The resulting graph shows how each plot varies along the faceting variable(s).

facet_wrap() wraps a ribbon of plots into a multirow panel of plots. Inside facet_wrap(), specify ~, then a list of splitting variables, separated by +. The number of rows and columns can be specified with arguments nrow and ncol.

ggplot(diamonds, aes(x=carat, y=price)) + 
  geom_point() + 
  facet_wrap(~cut) # create a ribbon of plots using cut

carat vs price, paneled by cut with facet_wrap()

facet_grid() allows direct specification of which variables are used to split the data/plots along the rows and columns. Put the row-splitting variable before ~, and the column-splitting variable after. The character . specifies no faceting along that dimension.

ggplot(diamonds, aes(x=carat, y=price)) + 
  geom_point() + 
  facet_grid(clarity~cut) # split by clarity along rows and by cut along columns

carat vs price, paneled by clarity and cut with facet_grid()
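Two further variations, sketched under the same diamonds example: using . in facet_grid() to facet along columns only, and using nrow in facet_wrap() to control the panel layout. The object names are hypothetical.

```r
library(ggplot2)

p_base <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# . before ~ means no row-splitting variable: one row of panels, one column per cut
p_cols <- p_base + facet_grid(. ~ cut)

# facet_wrap() with nrow: the same 5 panels arranged in 2 rows
p_wrap <- p_base + facet_wrap(~cut, nrow = 2)

p_cols
p_wrap
```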

Exercise 3

We will again use the Sitka data set.

A. Recreate the line plot of Time vs size, with the color of the lines mapped to treat. Use scale_color_manual() to change the colors to “orange” and “purple”.

B. Use scale_x_continuous() to convert the x-axis from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis “time(months)”.

C. Split the scatter plot into a panel of scatter plots by tree. (Note: Make the graph area large; graph may take a few seconds to appear)

Themes

Themes

Themes control elements of the graph not related to the data. For example:

To modify these, we use the theme() function, which has a large number of arguments called theme elements, which control various non-data elements of the graph.

Some example theme() arguments and what aspect of the graph they control:

A full description of theme elements can be found on the ggplot2 documentation page.

Specifying theme() arguments

Most non-data elements of the graph can be categorized as either a line (e.g. axes, tick marks), a rectangle (e.g. the background), or text (e.g. axes titles, tick labels). Each of these categories has an associated element_ function to specify the parameters controlling its appearance:


Inside theme() we control the properties of a theme element using the proper element_ function.

For example, the x- and y-axes are lines and are both controlled by theme() argument axis.line, so their visual properties, such as color and thickness, are specified as arguments to element_line():

ggplot(txhousing, aes(x=volume, y=sales, color=listings)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", linewidth=2)) # linewidth in mm; ggplot2 versions before 3.4.0 use size instead

using theme argument axis.line to modify x-axis and y-axis lines

However, the background of the graph, controlled by theme() argument panel.background, is a rectangle, so parameters like fill color and border color are specified with element_rect().

ggplot(txhousing, aes(x=volume, y=sales, color=listings)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", linewidth=2),
        panel.background=element_rect(fill="white", color="gray")) # color is the border color

using theme argument panel.background to modify the background of the graph

With element_text() we can control properties such as the font family or face ("bold", "italic", "bold.italic") of text elements like title, which controls all titles on the graph (plot, axes, and legend).

ggplot(txhousing, aes(x=volume, y=sales, color=listings)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", linewidth=2),
        panel.background=element_rect(fill="white", color="gray"),
        title=element_text(family="serif", face="bold")) 

using theme argument title to adjust fonts of all titles

Note: "sans", "serif", and "mono" are the only font family choices available for ggplot2 without downloading additional R packages. See this RPubs webpage for more information.

Finally, some theme() arguments do not use element_ functions to control their properties, like legend.position, which simply accepts values "none", "left", "right", "bottom", and "top".

ggplot(txhousing, aes(x=volume, y=sales, color=listings)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", linewidth=2),
        panel.background=element_rect(fill="white", color="gray"),
        title=element_text(family="serif", face="bold"),
        legend.position="bottom") 

using theme argument legend.position to position legend

We could then use legend.text=element_text() in theme() to rotate the legend labels (not shown).
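For readers who want to try the legend label rotation, here is a minimal sketch. The 45 degree angle and the hjust value are arbitrary illustrative choices, and p_legend is a made-up object name.

```r
library(ggplot2)

# rotate legend labels by 45 degrees; angle and hjust are element_text() arguments
p_legend <- ggplot(txhousing, aes(x = volume, y = sales, color = listings)) +
  geom_point() +
  theme(legend.position = "bottom",
        legend.text = element_text(angle = 45, hjust = 1))
p_legend
```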

Remember to use the ggplot2 theme documentation page when using theme().

Changing the overall look with complete themes

The ggplot2 package provides a few complete themes which make several changes to the overall background look of the graphic (see here for a full description).

Some examples:

The themes usually adjust the color of the background and most of the lines that make up the non-data portion of the graph.

theme_classic() mimics the look of base R graphics:

ggplot(txhousing, aes(x=volume, y=sales, color=listings)) + 
  geom_point() +
  theme_classic()

theme_classic()

theme_dark() makes a dramatic change to the look:

ggplot(txhousing, aes(x=volume, y=sales, color=listings)) + 
  geom_point() +
  theme_dark()

theme_dark()

Saving plots to files

ggsave() makes saving plots easy. The last plot displayed is saved by default, but we can also save a plot stored to an R object.

ggsave attempts to guess the device to use to save the image from the file extension, so use a meaningful extension. Available devices include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf.

Other important arguments to ggsave():

#save last displayed plot as pdf
ggsave("plot.pdf")

#if you're working with lots of graphs, you can store them in R objects
p <- ggplot(Sitka, aes(x=Time, y=size)) + 
  geom_point()
#You can then use the plot argument of ggsave() to specify which plot to save instead of the last
ggsave("myplot.png", plot=p)
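The size and resolution of the saved image can be controlled with the width, height, units, and dpi arguments of ggsave(). The sketch below assumes a 6 by 4 inch figure at 300 dpi is wanted, and writes to a temporary directory; the file name and dimensions are arbitrary.

```r
library(ggplot2)
library(MASS)  # for the Sitka data set

p <- ggplot(Sitka, aes(x = Time, y = size)) +
  geom_point()

# write to a temporary directory so no existing file is overwritten
out_file <- file.path(tempdir(), "myplot_6x4.png")

# a 6 x 4 inch image at 300 dots per inch
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```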

Exercise 4 dataset

For this exercise, we will use everything we have learned up to this point to create a graph for a new dataset, the Rabbit data set, a data set stored on the UCLA IDRE website, which we load with the following code:

Rabbit <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2024/05/Rabbit.csv")

The Rabbit data set describes an experiment where:


The data set contains 30 rows (5 rabbits measured 6 times) of the following 4 variables:

Exercise 4 task

Goal: create a meal-response curve for each rabbit under each treatment (Sedentary/Exercise), resulting in 10 curves (2 each for 5 rabbits)

Constraints: no color, but publication quality (imagine submitting to a journal that only accepts non-color figures)

We will build this graph in steps.

A. First, try creating a line graph with Meal on the x-axis and BPchange on the y-axis, with separate linetypes by Rabbit. Also, specify group=Rabbit.

Why does this graph look wrong?

B. Draw separate lines by Treatment. How can we accomplish this without color?

Some of the line patterns still look a little too similar to distinguish between rabbits.

C. Add a scatter plot where the shape of the points is mapped to Rabbit.

Next we will change the shapes used. See ?pch for a list of codes for shapes.

D. Use scale_shape_manual() to change the shapes of the points. Use the shapes corresponding to the codes (0, 3, 8, 16, 17).

Ok, the graph has all the data we want on it. Now, we’ll prepare it for publication.

E. Change the x-axis title to “Type of meal” and the y-axis title to “Change in blood pressure”.

Finally, we will change some of the theme() elements.

F. First, change the background from gray to white (or remove it) using panel.background in theme().

G. Next, change the color of the grid lines to “gray90”, a light gray, using panel.grid.

H. Use title to change the titles (axes and legend) to bold face.

I. Use strip.text to change the facet titles to bold face.

J. Save your last plot as mygraph.png.

Factors vs numeric variables in ggplot2

New dataset birthwt {MASS}

The birthwt data set contains data regarding risk factors associated with low infant birth weight.

The data consist of 189 observations of 10 variables, all numeric:

Let’s take a look at the structure of the birthwt data set first, to get an idea of how the variables are measured.

Run data(birthwt). Click its name in the Environment pane of RStudio.

Aesthetics, numeric and factor (categorical) variables

For plotting, variables are either numeric variables, where the number value is a meaningful representation of a quantity, or factor variables, R’s representation of categorical variables.

In R, we can encode variables as “factors” with the factor() function.

Some aesthetics can be mapped to either numeric or categorical variables, and will be scaled differently.

These include:

Other aesthetics can only be mapped to categorical variables:

And finally, some aesthetics should only be mapped to numeric variables (a warning is issued if mapped to a categorical variable):

Aesthetic scales are formed differently for numeric and factor variables

Let’s examine how aesthetics behave when mapped to different types of variables using the birthwt dataset, in which all of the variables are numeric initially.

When color is mapped to a numeric variable, a color gradient scale is used:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point()

color gradient is appropriate for a continuous variable

Note: even though we just used race as a numeric variable to demonstrate how ggplot handles it, we do not recommend treating categorical variables as numeric.

When color is instead mapped to a factor variable, a color scale of evenly spaced hues is used. We can convert a numeric variable to a factor inside of aes() with factor():

ggplot(birthwt, aes(x=age, y=bwt, color=factor(race))) +
  geom_point()

evenly spaced hues emphasize contrasts between groups of a factor

An error results if we try to map shape to a numeric version of race, because shape only accepts factor variables.
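To see this error without halting a script, the plot can be built inside tryCatch(). This is only a sketch; the exact wording of the message depends on the ggplot2 version, and the names p_shape and msg are made up here.

```r
library(ggplot2)
library(MASS)  # for the birthwt data set

# race is stored as an integer, so ggplot2 treats it as continuous
p_shape <- ggplot(birthwt, aes(x = age, y = bwt, shape = race)) +
  geom_point()

# the error is raised when the plot is built, so capture it with tryCatch()
msg <- tryCatch({
  ggplot_build(p_shape)
  "no error"
}, error = function(e) conditionMessage(e))
msg
```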

Shape accepts the factor representation of race:

ggplot(birthwt, aes(x=age, y=bwt, shape=factor(race))) +
  geom_point()

shape of points varies with the factor representation of race

Finally, some aesthetics like alpha and size should really only be used with truly numeric variables, and a warning will be issued if the variable is a factor:

ggplot(birthwt, aes(x=age, y=bwt, size=factor(race))) +
  geom_point()
## Warning: Using size for a discrete variable is not advised.

size and alpha should only be mapped to numeric variables

Convert categorical variables to factors before graphing

We recommend converting all categorical variables to factors prior to graphing for several reasons:

For example, we can convert the 0/1 smoke variable and the 1/2/3 race variable to factors smokef and racef, respectively, and label the values for each:

birthwt$smokef <- factor(birthwt$smoke, levels=0:1, labels=c("did not smoke", "smoked"))
birthwt$racef <- factor(birthwt$race, levels=1:3, labels=c("white", "black", "other"))

Now the labels will appear in the graph legend:

ggplot(birthwt, aes(y=bwt, fill=smokef)) +
  geom_boxplot()

birth weight distribution by mother’s smoking status

Exercise 5

A. For the birthwt data, convert ht to a factor and label the values 0 and 1 “non-hyper” and “hyper”, respectively.

B. Create boxplots of bwt (birth weight), colored by ht, with separate panels by smoke.

Overlapping graphics

Overlapping data points in scatter plots

When 2 data points have the same values plotted on the graphs, they will generally occupy the same position, causing one to obscure the other.

Here is an example where we map racef, a factor variable, to x and map age (in years) to y in a scatter plot:

ggplot(birthwt, aes(x=racef, y=age)) +
  geom_point()

too many repeated discrete values lead to overlapping points

There are 189 data points in the data set, but far fewer than 189 points visible in the graph, because many are completely overlapping.

To address this problem, we have a choice of “position adjustments” which can be specified to the position argument in a geom function.

For geom_point(), we usually use either:

By adding position="jitter" to the previous scatter plot, we can better see how many points there are at each age:

ggplot(birthwt, aes(x=racef, y=age)) +
  geom_point(position="jitter")

jittering adds random variation to the position of the points
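The amount of jitter can be tuned with position_jitter() in place of the string "jitter". The sketch below jitters horizontally only, so each point keeps its true age; the width of 0.2 and the seed are arbitrary choices, and factor(race) is used so the example does not depend on racef having been created.

```r
library(ggplot2)
library(MASS)  # for the birthwt data set

# jitter horizontally only: height = 0 leaves each age at its true value,
# and seed makes the random offsets reproducible
p_jitter <- ggplot(birthwt, aes(x = factor(race), y = age)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))
p_jitter
```

Restricting jitter to the categorical axis is often preferable, since vertical jitter would slightly misstate the plotted ages.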

Overlapping bars in bar graphs

Remember that geom_bar() will plot the frequencies of the variable mapped to x as bars. If we map a second variable to fill, the bars will be colored by the second variable.

We can use the position argument in geom_bar() to control the placement of the bars with the same x value.

The following adjustments are generally used for geom_bar():

Each position adjustment emphasizes different quantities.

By default, geom_bar uses position="stack", a compromise where we can see both the counts and proportions well:

ggplot(birthwt, aes(x=low, fill=racef)) +
  geom_bar()

geom_bar() will stack bars with the same x-position

If we instead want to emphasize counts, we use position="dodge", which places the bars side-by-side:

ggplot(birthwt, aes(x=low, fill=racef)) +
  geom_bar(position="dodge")
dodging emphasizes counts


Proportions are emphasized with position="fill", where the bars are stacked and their heights are standardized:

ggplot(birthwt, aes(x=low, fill=racef)) +
  geom_bar(position="fill")
filling emphasizes proportions
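
To check what the bars represent, the underlying counts and proportions can be tabulated directly. This is a minimal sketch in base R, again assuming a local copy bw with racef rebuilt for illustration. The counts correspond to the bar segments under position="stack" and position="dodge", while the row proportions correspond to the segments under position="fill".

```r
library(MASS)

# Local copy of the data with racef rebuilt
# (assumption: race codes 1, 2, 3 correspond to white, black, other)
bw <- MASS::birthwt
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# Counts of each race within each value of low (what stack and dodge display)
counts <- table(low=bw$low, racef=bw$racef)
counts

# Proportions of each race within each value of low (what fill displays);
# margin=1 standardizes each row, so every row sums to 1
props <- prop.table(counts, margin=1)
round(props, 2)
```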


*Error bars and confidence bands*

Error bars and confidence bands are both used to express ranges of statistics. To draw these, we’ll use geom_errorbar() and geom_ribbon(), respectively.

To use both geoms, the following aesthetics are required: x (the position along the x-axis), and ymin and ymax (the lower and upper limits of the range).

For example, the following code estimates the mean birth weight and the 95% confidence interval for the mean for the three races in data set birthwt. The means and confidence limits are stored in a new data.frame called bwt_by_racef.

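# mean_cl_normal() (from ggplot2; requires package Hmisc to be installed)
# returns the mean and 95% confidence limits, computed here for each race
bwt_by_racef <- do.call(rbind, 
                        tapply(birthwt$bwt, birthwt$racef, mean_cl_normal))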
bwt_by_racef$racef <- row.names(bwt_by_racef)
names(bwt_by_racef) <- c("mean", "lower", "upper", "racef")
bwt_by_racef
##           mean    lower    upper racef
## white 3102.719 2955.235 3250.202 white
## black 2719.692 2461.722 2977.662 black
## other 2805.284 2629.127 2981.441 other
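
If package Hmisc is not available, the same kind of summary can be sketched in base R. The helper mean_ci() below is hypothetical (it is not part of ggplot2) and computes the usual t-based confidence interval for a mean; bw is a local copy of the data with racef rebuilt for illustration. The results should closely match the table above.

```r
library(MASS)

# Local copy of the data with racef rebuilt
# (assumption: race codes 1, 2, 3 correspond to white, black, other)
bw <- MASS::birthwt
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# Hypothetical helper: mean and t-based confidence interval for the mean
mean_ci <- function(x, level=.95) {
  n <- length(x)
  half <- qt(1 - (1 - level)/2, df=n - 1) * sd(x) / sqrt(n)
  c(mean=mean(x), lower=mean(x) - half, upper=mean(x) + half)
}

# One row per race, then the race labels as a column for plotting
ci_by_race <- as.data.frame(do.call(rbind, tapply(bw$bwt, bw$racef, mean_ci)))
ci_by_race$racef <- row.names(ci_by_race)
ci_by_race
```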

Now we can plot the means by race with geom_point() and the confidence limits with geom_errorbar():

ggplot(bwt_by_racef, aes(x=racef, y=mean)) +
  geom_point() +
  geom_errorbar(aes(ymin=lower, ymax=upper))
mean birthweight by race


Use width= to adjust the width of the error bars:

ggplot(bwt_by_racef, aes(x=racef, y=mean)) +
  geom_point() +
  geom_errorbar(aes(ymin=lower, ymax=upper), width=.1)
mean birthweight by race, narrower error bars


Confidence bands work similarly. We’ll need values for the maximum and minimum again for geom_ribbon().

This time, we’ll create a plot of predicted values with confidence bands from a regression of birthweight on age. First, we’ll run the model and add the predicted values and the confidence limits to the original data set for plotting:

# linear regression of birth weight on age
m <- lm(bwt ~ age, data=birthwt)
# get predicted values (fit) and confidence limits (lwr and upr)
preddata <- predict(m, interval="confidence")
# add predicted values to original data
birthwt <- cbind(birthwt, preddata)
head(birthwt)
##    low age lwt race smoke ptl        ht ui ftv  bwt        smokef racef
## 85   0  19 182    2     0   0 non-hyper  1   0 2523 did not smoke black
## 86   0  33 155    3     0   0 non-hyper  0   3 2551 did not smoke other
## 87   0  20 105    1     1   0 non-hyper  0   1 2557        smoked white
## 88   0  21 108    1     1   0 non-hyper  1   2 2594        smoked white
## 89   0  18 107    1     1   0 non-hyper  1   0 2600        smoked white
## 91   0  21 124    3     0   0 non-hyper  0   0 2622 did not smoke other
##         fit      lwr      upr
## 85 2891.909 2757.969 3025.849
## 86 3065.925 2846.442 3285.408
## 87 2904.339 2781.794 3026.883
## 88 2916.768 2803.295 3030.242
## 89 2879.479 2732.358 3026.600
## 91 2916.768 2803.295 3030.242

Now we’ll use geom_line() to show the best fit line, and geom_ribbon() to show the confidence bands:

ggplot(birthwt, aes(x=age, y=fit)) + 
  geom_line() +
  geom_ribbon(aes(ymin=lwr, ymax=upr))
best fit line with confidence bands


Yikes! That confidence band is too dark. Use alpha to lighten the bands by making them more transparent. Remember, because we are setting the entire band to be a constant transparency, we will specify alpha outside of aes().

ggplot(birthwt, aes(x=age, y=fit)) + 
  geom_line() +
  geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=.5)
best fit line with confidence bands
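
Two variations on this plot may be worth trying. The sketch below assumes a local copy bw of the original data and a separate data frame pred (both names chosen here for illustration), so that birthwt itself is not modified. The first plot draws the ribbon before the line so that the line sits on top of the band; the second lets geom_smooth() fit the linear model and draw the line and 95% confidence band in one step.

```r
library(ggplot2)
library(MASS)

# Local copy of the data, so the original birthwt is left unchanged
bw <- MASS::birthwt

# Same model as above; predictions are kept in a separate data frame
m <- lm(bwt ~ age, data=bw)
pred <- cbind(bw, predict(m, interval="confidence"))

# Variation 1: layers are drawn in order, so adding the ribbon first
# keeps the fitted line visible on top of the band
p_band <- ggplot(pred, aes(x=age, y=fit)) +
  geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=.5) +
  geom_line()
p_band

# Variation 2: geom_smooth() fits the linear model itself and draws
# the fitted line together with its confidence band
p_lm <- ggplot(bw, aes(x=age, y=bwt)) +
  geom_smooth(method="lm", formula=y ~ x, level=.95, color="black", alpha=.5)
p_lm
```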


Annotating a graph

At times we need to add notes or annotations directly to the graph that are not represented by any variables in the graph data set. Examples:

To add non-data related elements, use the annotate() function.

To use annotate(), the first argument is the name of a geom (for example "text" or "rect"). Subsequent arguments are positioning aesthetics such as x= and y= and any additional aesthetics needed for that particular geom.

Let’s imagine that we want to annotate the data point in the far upper right corner of this graph we have seen before:

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point()
we want to annotate point in upper right


Suppose we want to label the outlier as a possible data error. To add annotation text, we will use the "text" geom in annotate(). We will need to specify x= and y= positions for the text, and the contents of the text in label=.

We see that the outlier lies at x=45, y=5000. To place the text a little to the left of the point, we will use x=42 and y=5000. Proper positioning will take some experimentation. We specify the text to be displayed with label="Data error?".

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() + 
  annotate("text", x=42, y=5000, label="Data error?")  # notice first argument is "text", not geom_text
Annotating outlier as possible data entry error


As another example, let’s highlight a portion of the graph that features birthweights within 1 standard deviation of the mean weight. We will create a rectangle using the "rect" geom in annotate() that spans the x-axis for its full width, from xmin=13 to xmax=46, and the y-axis from ymin=2215 to ymax=3673 (mean-sd, mean+sd). We will set alpha=.2 to make the box transparent.

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() + 
  annotate("rect", xmin=13, xmax=46, ymin=2215, ymax=3673, alpha=.2)  # notice first argument is "rect", not geom_rect
Birthweights within one standard deviation of mean
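
Rather than typing the limits by hand, they can be computed from the data. The following is a sketch under the assumption of a local copy bw with racef rebuilt for illustration; the names lo and hi are chosen here. Using -Inf and Inf for xmin and xmax extends the rectangle across the full width of the panel, whatever the range of age.

```r
library(ggplot2)
library(MASS)

# Local copy of the data with racef rebuilt
# (assumption: race codes 1, 2, 3 correspond to white, black, other)
bw <- MASS::birthwt
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# Limits of the highlighted region: one standard deviation around the mean
lo <- mean(bw$bwt) - sd(bw$bwt)
hi <- mean(bw$bwt) + sd(bw$bwt)

# -Inf and Inf stretch the rectangle to the left and right panel edges
p_rect <- ggplot(bw, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  annotate("rect", xmin=-Inf, xmax=Inf, ymin=lo, ymax=hi, alpha=.2)
p_rect
```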


*Working with colors*

Specifying colors in R

We can specify a color in R in several ways: by name, by hex code, or by RGB values.

We have already used string names like “white” and “green” to specify colors.

You can issue colors() in R to see a full list of available color names. See here for a chart of these colors. Here we show the first 30 names out of 657:

head(colors(), n=30)
##  [1] "white"          "aliceblue"      "antiquewhite"   "antiquewhite1" 
##  [5] "antiquewhite2"  "antiquewhite3"  "antiquewhite4"  "aquamarine"    
##  [9] "aquamarine1"    "aquamarine2"    "aquamarine3"    "aquamarine4"   
## [13] "azure"          "azure1"         "azure2"         "azure3"        
## [17] "azure4"         "beige"          "bisque"         "bisque1"       
## [21] "bisque2"        "bisque3"        "bisque4"        "black"         
## [25] "blanchedalmond" "blue"           "blue1"          "blue2"         
## [29] "blue3"          "blue4"

We can also use hex color codes. These hex codes usually consist of # followed by 6 numbers/letters (each a hexadecimal digit ranging from 0 to F), where the first two digits represent redness, the second two greenness, and the last two blueness.

For example, the hex code #009900 would represent a green shade, while hex code #FF00EE would represent a purple shade. Tools like this can help you identify the hex code for a particular color.

ggplot(birthwt, aes(x=age, y=bwt)) +
  geom_point(color="#E36D11")
Using a hex code to specify a shade of orange


Finally, we can use RGB (red, green, blue) values to specify a color. Supply three numbers between 0 and 1 to the rgb() function, and it will return the hex code for that color. Let’s try a purple:

# rgb() returns a hex code
rgb(.75, 0, 1)
## [1] "#BF00FF"

ggplot(birthwt, aes(x=age, y=bwt)) +
  geom_point(color=rgb(.75, 0, 1))
Using rgb() to specify a shade of purple
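
Base R can also convert between these representations, which helps when a color is known in one form but needed in another. The short sketch below uses col2rgb() to split a color into its red, green and blue components on the 0 to 255 scale, and rgb() with maxColorValue=255 to go the other way; the object names are chosen here for illustration.

```r
# col2rgb() accepts a color name or a hex code and returns
# the red, green and blue components on a 0-255 scale
green_parts <- col2rgb("#009900")
green_parts

# the same works for named colors
honeydew_parts <- col2rgb("honeydew")
honeydew_parts

# rgb() accepts 0-255 values when maxColorValue=255 is set;
# these three values reproduce the orange hex code used above
orange_hex <- rgb(227, 109, 17, maxColorValue=255)
orange_hex
```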


We have already used scale_color_manual() to alter color scales, but note that you can use hex codes and rgb() to specify the colors:

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  scale_color_manual(values=c(rgb(.5,.5,.2), "#FF1035", "blue"))
manually changing colors with hex codes and rgb()


Color scales by variable type

Part of the challenge of making effective and attractive color graphs is choosing a color palette that serves both purposes of representing variation and catching the eye.

When you map a variable to color or fill, the ggplot2 package will use the variable’s type (e.g. numeric, factor, ordinal) to choose a color scale.

If you map a numeric variable to color, you will usually get a color gradient based on a single hue:

ggplot(birthwt, aes(x=lwt, y=bwt, color=as.numeric(racef))) + 
  geom_point() 
color gradient scale for numeric variables


A color gradient is a natural analog to a numeric variable. In the above graph, as the color becomes “bluer”, the race value becomes higher (assuming the value has numeric meaning).

On the other hand, if we map a factor variable to color, we get a set of distinct hues evenly spaced around the color wheel:

ggplot(birthwt, aes(x=lwt, y=bwt, color=racef)) + 
  geom_point() 
evenly spaced distinct hues for factor variables
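
To see which colors make up such a set of evenly spaced hues, the hex codes can be generated directly. This sketch assumes package scales is installed (it is normally installed together with ggplot2) and uses its hue_pal() function with default settings; default_hues is a name chosen here for illustration.

```r
library(scales)

# hue_pal() returns a function; calling it with 3 gives
# three evenly spaced hues as hex codes
default_hues <- hue_pal()(3)
default_hues

# show_col() draws the colors as labeled swatches
show_col(default_hues)
```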


The categories of a factor variable are considered unordered, so using completely different hues to represent them makes sense.

Those were just the default color scales that ggplot2 chooses for you, by guessing the appropriate scale from the variable’s type. There are many ways to form color scales using ggplot2 so you have lots of options when choosing a palette.

Color scale functions

Here are some color scale functions used to form color scales (there is an analogous scale function for the fill aesthetic for each of the below):

With scale_color_gradient, we can define the colors that define the ends of the gradient with arguments low and high. The default gradient runs from a blueish-black at the low end to a light-blue at the high end. We can redefine the scale to go from a very light green (honeydew) to a dark green:

ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
  geom_point() +
  scale_color_gradient(low="honeydew", high="darkgreen")
Defining our own color gradient


Because we are only changing a single hue with color gradients, perhaps it is easier to use rgb(), where we can specify the intensity of each hue:

ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
  geom_point() +
  scale_color_gradient(low=rgb(.1, .2, .1), high=rgb(.1, 1, .1))
Defining our own color gradient with rgb()


With scale_color_hue(), we define a color scale by specifying a range of colors to use, and then evenly spaced hues will be selected from this range. Here are the relevant arguments to scale_color_hue():

Varying any of the 3 above will alter the color palette.

First, we change the range of colors with h to be much smaller:

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  scale_color_hue(h=c(0,90))
restricting range of colors with scale_color_hue()


We can also use the original range, but change the starting hue with h.start to get a completely different set:

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  scale_color_hue(h.start=20)
changing starting hue with scale_color_hue()
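
Besides h and h.start, scale_color_hue() also accepts c (chroma, the intensity of the color) and l (luminance, the lightness). The sketch below keeps the default hues but makes them darker and more saturated; bw is a local copy of the data with racef rebuilt for illustration, and the particular values of c and l are arbitrary choices.

```r
library(ggplot2)
library(MASS)

# Local copy of the data with racef rebuilt
# (assumption: race codes 1, 2, 3 correspond to white, black, other)
bw <- MASS::birthwt
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# Lower luminance (l) darkens the colors; higher chroma (c) intensifies them
p_hue <- ggplot(bw, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  scale_color_hue(l=40, c=150)
p_hue
```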


ColorBrewer

ColorBrewer is a webpage resource designed by Cynthia Brewer that lists many color schemes designed for different purposes:

The ColorBrewer palettes are not only designed to be highly functional, they are also very attractive, with colors that complement each other well.

The ColorBrewer palettes have been integrated into R, and are available in ggplot2 through scale_color_brewer() and scale_fill_brewer().

Arguments to scale_color_brewer() and scale_fill_brewer():

We’ll use a sequential palette first, although it should not be used with race since race does not progress from low to high values:

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  scale_color_brewer(type="seq", palette="RdPu")
Sequential palette not a great choice for race


Instead, we should use a qualitative palette with race:

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  scale_color_brewer(type="qual", palette=8) # requests the 8th qualitative palette
Qualitative palette better for race
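
A palette can also be requested by name, which is often easier to read than a number. The sketch below assumes package RColorBrewer is installed (it normally comes along with ggplot2) and uses the qualitative palette "Dark2" as an example; bw is a local copy of the data with racef rebuilt for illustration.

```r
library(ggplot2)
library(MASS)
library(RColorBrewer)

# Local copy of the data with racef rebuilt
# (assumption: race codes 1, 2, 3 correspond to white, black, other)
bw <- MASS::birthwt
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# brewer.pal() returns the hex codes of a named palette;
# here, the first three colors of the qualitative palette "Dark2"
dark2 <- brewer.pal(3, "Dark2")
dark2

# The same palette requested by name in a ggplot2 scale
p_brewer <- ggplot(bw, aes(x=age, y=bwt, color=racef)) +
  geom_point() +
  scale_color_brewer(palette="Dark2")
p_brewer
```

Running display.brewer.all() from RColorBrewer draws every available palette with its name, which is a convenient way to browse the options.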


The ggplot2 book

For more in-depth information, read ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, creator of the ggplot2 package:

This section of the seminar describing the grammar summarizes bits and pieces of chapter 3.

The ggplot2 extensions

Additional packages enhance the functionality and features of ggplot2.

Final exercises

New data set hsb

For the final set of exercises, we will be using a data set stored on the UCLA IDRE website, which we load with the following code:

hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")

This data set contains demographic and academic data for 200 high school students. We will be using the following variables:

Use the following code to load the hsb data set:
hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")

Exercise 6

Create a scatter plot of math (x) vs read (y), with different shapes by prog. Color all of the points red.

Find the outlier at math=35, read=63. Add annotation text next to this outlier that says “error?”

Exercise 7

Create a bar graph that displays the number of students that fall into groups defined by the following 4 variables: female, prog, schtyp, ses.

From such a graph we can know, for example, how many female students in the academic program who go to public school who are of high socioeconomic status are in the data set.

Exercise 8

Try to recreate this graph:

Note that the background has been entirely removed and that the axis and legend titles are red and in “mono” font.

END THANK YOU!

Solutions to exercises

Exercise 1

#A
ggplot(Sitka, aes(x=Time, y=size)) +
  geom_point() 


#B
ggplot(Sitka, aes(x=Time, y=size, color=treat)) +
  geom_point() 


#C
ggplot(Sitka, aes(x=Time, y=size, color=treat)) +
  geom_point() +
  geom_smooth()


#D
ggplot(Sitka, aes(x=Time, y=size, color=treat)) +
  geom_point() +
  geom_smooth(color="green")

Exercise 2

#A
ggplot(Sitka, aes(x=size, color=treat)) +
  geom_density()


ggplot(Sitka, aes(y=size, fill=treat)) +
  geom_boxplot()


#B
ggplot(Sitka, aes(x=Time, fill=treat)) + 
  geom_bar()


#C
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
  geom_line()


#D
ggplot(Sitka, aes(x=Time, y=size, group=tree, linetype=treat)) +
  geom_line()

Exercise 3

#A
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
  geom_line() +
  scale_color_manual(values=c("orange", "purple"))


#B
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
  geom_line() +
  scale_color_manual(values=c("orange", "purple")) +
  scale_x_continuous(breaks=c(150,180,210,240), labels=c("5", "6", "7", "8")) +
  labs(x="time(months)")


#C
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
  geom_line() +
  scale_color_manual(values=c("orange", "purple")) +
  scale_x_continuous(breaks=c(150,180,210,240), labels=c("5", "6", "7", "8")) +
  labs(x="time(months)") +
  facet_wrap(~tree)

Exercise 4

#A
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, linetype=Animal)) +
  geom_line()


#B
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, linetype=Animal)) +
  geom_line() +
  facet_wrap(~Treatment)


#C
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, linetype=Animal, shape=Animal)) +
  geom_line() +
  facet_wrap(~Treatment) +
  geom_point()


#D
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, shape=Animal, linetype=Animal)) +
  geom_line() +
  facet_wrap(~Treatment) + 
  geom_point() + 
  scale_shape_manual(values=c(0, 3, 8, 16, 17)) 


#E
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, shape=Animal, linetype=Animal)) +
  geom_point() + 
  geom_line() +
  facet_wrap(~Treatment) + 
  scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
  labs(x="Dose(mcg)", y="Change in blood pressure")


#F,G,H,I
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, shape=Animal, linetype=Animal)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Treatment) +
  scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
  labs(x="Dose(mcg)", y="Change in blood pressure") +
  theme(panel.background = element_rect(fill="white"),
        panel.grid=element_line(color="gray90"),
        title=element_text(face="bold"),
        strip.text=element_text(face="bold"))



#J
ggsave("mygraph.png")

Exercise 5

#A
birthwt$ht <- factor(birthwt$ht, levels=0:1, labels=c("non-hyper", "hyper"))

#B
ggplot(birthwt, aes(y=bwt, fill=ht)) +
  geom_boxplot() +
  facet_wrap(~smoke)

Exercise 6

hsb <- read.csv("https://stats.oarc.ucla.edu/stat/data/hsbdemo.csv")

ggplot(hsb, aes(x=math, y=read, shape=prog)) +
  geom_point(color="red") +
  annotate("text", x=35, y=64, label="error?") 

Exercise 7

# don't forget to use position="dodge" for counts
ggplot(hsb, aes(x=female, fill=prog)) +
  geom_bar(position="dodge", width=.5) +
  facet_grid(schtyp ~ ses)

Exercise 8

ggplot(hsb, aes(x=read, y=write, color=math)) +
  geom_point() +
  geom_smooth(color="red") +
  labs(x="Reading Score", y="Writing Score", color="Math Score") +
  theme(title=element_text(family="mono", color="red"),
        panel.background=element_blank())