This seminar introduces how to use the R ggplot2
package, particularly for producing statistical graphics for data analysis.
Text in this font
signifies R
code or variables in a data set
Text that appears like this represents an instruction to practice
ggplot2
coding
Next we load the packages into the current R
session with library()
. In addition to ggplot2
, we load package MASS
(installed with R
) for data sets.
#load libraries into R session
library(ggplot2)
library(MASS)
Please use library() to load packages ggplot2 and MASS.
ggplot2 package
https://ggplot2.tidyverse.org/reference/
The official reference webpage for ggplot2
has help files for its many functions and operators. Many examples are provided in each help file.
A grammar of a language defines the rules of structuring words and phrases into meaningful expressions.
A grammar of graphics defines the rules of structuring mathematical and aesthetic elements into a meaningful graph.
Leland Wilkinson (2005) designed the grammar upon which ggplot2
is based.
Sitka dataset
To practice using the grammar of graphics, we will use the Sitka
dataset (from the MASS
package).
Note: Data sets that are loaded into R
with a package are immediately available for use. To see the object appear in RStudio’s Environment
pane (so you can click to view it), run data()
on the data set, and then another function like str()
on the data set.
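As a minimal sketch of the note above, the following loads Sitka into the workspace and inspects it. The call to head() is our addition for previewing rows; it is not required for the object to appear in the Environment pane.

```r
library(MASS)   # provides the Sitka data set

data(Sitka)     # registers Sitka in the current workspace
str(Sitka)      # using the object makes it appear in RStudio's Environment pane
head(Sitka)     # preview the first few rows
```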
Use data() and then str() on Sitka to make it appear in the Environment pane.
The Sitka
dataset describes the growth of trees over time, some of which were grown in ozone-enriched chambers. The data frame contains 395 rows and 4 columns: size, Time, tree, and treat.
Here are the first few rows of Sitka
:
size | Time | tree | treat |
---|---|---|---|
4.51 | 152 | 1 | ozone |
4.98 | 174 | 1 | ozone |
5.41 | 201 | 1 | ozone |
5.90 | 227 | 1 | ozone |
6.15 | 258 | 1 | ozone |
4.24 | 152 | 2 | ozone |
ggplot() function and aesthetics
All graphics begin with a call to the ggplot() function (note: not ggplot2, which is the name of the package).
In the ggplot()
function we specify the data set that holds the variables we will be mapping to aesthetics, the visual properties of the graph. The data set must be a data.frame
object.
Example syntax for ggplot()
specification (italicized
words are to be filled in by you):
ggplot(data, aes(x=xvar, y=yvar))
- data: name of the data.frame that holds the variables to be plotted
- x and y: aesthetics that position objects on the graph
- xvar and yvar: names of variables in data mapped to x and y
Notice that the aesthetics are specified inside aes()
, which is itself nested inside of ggplot()
.
The aesthetics specified inside of ggplot()
are inherited by subsequent layers:
# scatter plot of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point()
geom_point() inherits x and y aesthetics
Initiate a graph of Time vs size by mapping Time to x and size to y from the data set Sitka.
Without any additional layers, no data will be plotted.
Specifying just x
and y
aesthetics alone will produce a plot with just the 2 axes.
ggplot(data = txhousing, aes(x=volume, y=sales))
without a geom or stat, just axes
We add layers to the graph with the + operator; each layer adds a graphical component.
Layers consist of geoms, stats, scales, and themes, which we will discuss in detail.
Remember that each subsequent layer inherits its aesthetics from ggplot()
. However, specifying new aesthetics in a layer will override the aesthetics specified in ggplot()
.
# scatter plot of volume vs sales
# with rug plot colored by median sale price
ggplot(txhousing, aes(x=volume, y=sales)) + # x=volume and y=sales inherited by all layers
geom_point() +
geom_rug(aes(color=median)) # color will only apply to the rug plot because not specified in ggplot()
both geoms inherit aesthetics from ggplot(), but geom_rug() also adds the color aesthetic
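Inheritance also works in the other direction: a layer can opt out of an aesthetic declared in ggplot(). The sketch below assumes we want colored points but a plain rug, and relies on setting the aesthetic to NULL inside the layer's aes() to drop the inherited mapping for that layer only.

```r
library(ggplot2)

# color=median is declared in ggplot(), so every layer inherits it...
p <- ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
  geom_point() +
  # ...unless a layer removes the inherited mapping by setting it to NULL
  geom_rug(aes(color=NULL))
p
```

layer_data(p, 1) and layer_data(p, 2) return the data computed for the point and rug layers, which is a convenient way to check what each layer actually inherited.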
Add a geom_point() layer to the Sitka graph we just initiated.
Add an additional
geom_smooth()
layer to the graph.
Both geom layers inherit x
and y
aesthetics from ggplot()
.
Specify aes(color=treat) inside of geom_point().
Notice that the coloring only applies to geom_point()
.
Aesthetics are the visual properties of objects on the graph.
Which aesthetics are required and which are allowed vary by geom.
Commonly used aesthetics:
- x: positioning along x-axis
- y: positioning along y-axis
- color: color of objects; for 2-d objects, the color of the object’s outline (compare to fill below)
- fill: fill color of objects
- linetype: how lines should be drawn (solid, dashed, dotted, etc.)
- shape: shape of markers in scatter plots
- size: how large objects appear
- alpha: transparency of objects (a value between 0, transparent, and 1, opaque; the inverse of how many stacked objects it will take to be opaque)

Change the aesthetic color mapped to treat in our previous graph to shape.
Map aesthetics to variables inside the aes()
function. By mapping, we mean the aesthetic will vary as the variable varies. For example, mapping x=time
causes the position of the plotted data to vary with values of variable “time”. Similarly, mapping color=group
causes the color of objects to vary with values of variable “group”.
# mapping color to median inside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point(aes(color=median))
color of points varies with median price
Set aesthetics to a constant outside the aes()
function.
Compare the following graphs:
# setting color to green outside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point(color="green")
color of points set to constant green
Create a new graph for data set Sitka, a scatter plot of Time (x-axis) vs size (y-axis), where all the points are colored “green”.
Setting an aesthetic to a constant within aes()
can lead to unexpected results: the constant is treated as a one-level data variable and mapped through the default scale, so the aesthetic takes a default value rather than the specified value.
# color="green" inside of aes()
# geom_point() treats "green" as a data value, not a color name,
# and uses the first default color of the scale instead
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point(aes(color="green"))
aesthetic set to constant within aes() leads to unexpected results
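To see the difference between mapping and setting without relying on the rendered picture, we can inspect the colors each layer computes. This is a small sketch using layer_data(); the object names p_mapped and p_set are ours.

```r
library(ggplot2)

# "green" inside aes(): treated as data and passed through the default color scale
p_mapped <- ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(aes(color="green"))

# "green" outside aes(): used directly as the color of the points
p_set <- ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(color="green")

unique(layer_data(p_mapped)$colour)  # a default palette color, not "green"
unique(layer_data(p_set)$colour)     # "green"
```

The mapped version also draws a legend with a single key labelled green, which is another hint that the value was treated as data.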
geoms: bar, boxplot, density, histogram, line, point
Geom functions differ in the geometric shapes produced for the plot.
Some example geoms:
- geom_bar(): bars with bases on the x-axis
- geom_boxplot(): boxes-and-whiskers
- geom_errorbar(): T-shaped error bars
- geom_density(): density plots
- geom_histogram(): histogram
- geom_line(): lines
- geom_point(): points (scatterplot)
- geom_ribbon(): bands spanning y-values across a range of x-values
- geom_smooth(): smoothed conditional means (e.g. loess smooth)
- geom_text(): text

Each geom is defined by the aesthetics required for it to be rendered. For example, geom_point() requires both x and y, the minimal specification for a scatterplot.
Geoms differ in which aesthetics they accept as arguments. For example, geom_point()
accepts the aesthetic shape
, which defines the shapes of points on the graph, while geom_bar()
does not accept shape
.
Check the geom function help files for required and understood aesthetics. In the Aesthetics section of the geom’s help file, required aesthetics are bolded.
We will tour some commonly used geoms.
ggplot(txhousing, aes(x=median)) +
geom_histogram()
histograms visualize distribution of variable mapped to x
Histograms are popular choices to depict the distribution of a continuous variable.
geom_histogram()
cuts the continuous variable mapped to x
into bins, and counts the number of values within each bin.
Create a histogram of size from data set Sitka.
ggplot2
issues a message urging you to pick a number of bins for the histogram (it defaults to 30), using the bins
argument.
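As an illustration on the txhousing example, the sketch below sets the number of bins directly and, alternatively, fixes the width of each bin with the binwidth argument. The value 10000 (dollars) is an arbitrary choice for illustration, and na.rm=TRUE only silences the warning about missing median prices.

```r
library(ggplot2)

# choose the number of bins
p_bins <- ggplot(txhousing, aes(x=median)) +
  geom_histogram(bins=20, na.rm=TRUE)

# or choose the width of each bin, in units of the x variable
p_width <- ggplot(txhousing, aes(x=median)) +
  geom_histogram(binwidth=10000, na.rm=TRUE)

nrow(layer_data(p_bins))  # one row of computed counts per bin
```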
Specify bins=20 inside of geom_histogram(). Note: bins is not an aesthetic, so should not be specified within aes().
ggplot(txhousing, aes(x=median)) +
geom_density()
density plots visualize smoothed distribution of variable mapped to x
Density plots are basically smoothed histograms.
Density plots, unlike histograms, can be plotted separately by group by mapping a grouping variable to color
.
ggplot(txhousing, aes(x=median, color=factor(month))) +
geom_density()
densities of median price by month
ggplot(txhousing, aes(x=factor(year), y=median)) +
geom_boxplot()
boxplots are useful to compare distribution of y variable across levels of x variable
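Boxplots compactly visualize particular statistics of a distribution: the median (middle line), the first and third quartiles (edges of the box), whiskers extending to the most extreme values within 1.5 times the interquartile range of the box, and any points beyond the whiskers, plotted individually as outliers.
Boxplots are particularly useful for comparing whole distributions of a continuous variable between groups.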
geom_boxplot()
will create boxplots of the variable mapped to y
for each group defined by the values of the x
variable.
Create a new graph where we compare distributions of size across levels of treat from dataset Sitka.
ggplot(diamonds, aes(x=cut)) +
geom_bar()
geom_bar displays frequencies of levels of x
variable
Bar plots are often used to display frequencies of factor (categorical) variables.
geom_bar()
by default produces a bar plot where the height of the bar represents counts of each x-value.
Start a new graph where the frequencies of treat from data set Sitka are displayed as a bar graph. Remember to map x to treat.
The color that fills the bars is not controlled by aesthetic color
, but instead by fill
. In geom_bar() it is typically mapped to a factor (categorical) variable, which splits each bar by the levels of that variable. We can visualize a crosstabulation of variables by mapping one of them to fill
in geom_bar()
:
ggplot(diamonds, aes(x=cut, fill=clarity)) +
geom_bar()
frequencies of cut by clarity
Add the aesthetic mapping fill=factor(Time) to aes() inside of ggplot() of the previous graph.
# scatter of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point()
scatter plot of volume vs sales
Scatter plots depict the covariation between pairs of variables (typically both continuous).
geom_point()
depicts covariation between variables mapped to x
and y
.
Scatter plots are among the most flexible graphs, as variables can be mapped to many aesthetics such as color
, shape
, size
, and alpha
.
ggplot(txhousing, aes(x=volume, y=sales,
color=median, alpha=listings, size=inventory)) +
geom_point()
scatter plot of volume vs sales, colored by median price, transparent by number of listings, and sized by inventory
ggplot(txhousing, aes(x=date, y=sales, group=city)) +
geom_line()
line graph of sales over time, separate lines by city
Line graphs depict covariation between variables mapped to x
and y
with lines instead of points.
geom_line()
will treat all data as belonging to one line unless a variable is mapped to one of the following aesthetics to group the data into separate lines:
- group: lines will look the same
- color: line colors will vary with mapped variable
- linetype: line patterns will vary with mapped variable

Let’s first examine a line graph with no grouping:
ggplot(txhousing, aes(x=date, y=sales)) +
geom_line()
line graph of sales over time, no grouping results in garbled graph
As you can see, unless the data represent a single series, line graphs usually call for some grouping.
Using color
or linetype
in geom_line()
will implicitly group the lines.
ggplot(txhousing, aes(x=date, y=sales, color=city)) +
geom_line()
line graph of sales over time, colored and grouped by city
Let’s try graphing separate lines (growth curves) for each tree.
Create a new line graph for data set Sitka with Time on the x-axis and size on the y-axis, but also map group to tree.
We can specify color
and linetype
in addition to group
. The lines will still be separately drawn by group
, but can be colored or patterned by additional variables.
In our data, we might want to compare trajectories of growth between treatments.
Now add a specification mapping color to treat (in addition to group=tree).
Finally, map treat to linetype instead of color.
The stat functions statistically transform data, usually as some form of summary, such as the mean, or standard deviation, or a confidence interval.
Each stat function is associated with a default geom, so no geom is required for shapes to be rendered.
stat_summary()
, perhaps the most useful of all stat functions, applies a summary function to the variable mapped to y
for each value of the x
variable. The default summary function is mean_se()
, with associated geom geom_pointrange()
, which will produce a plot of the mean (dot) and standard error (lines) of the variable mapped to y
for each value of the x
variable.
# summarize sales (y) for each year (x)
ggplot(txhousing, aes(x=year, y=sales)) +
stat_summary()
mean and standard errors of sales by year
Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.
What makes stat_summary()
so powerful is that you can use any function that accepts a vector as the summary function (e.g. mean()
, var()
, max()
, etc.) and the geom can also be changed to adjust the shapes plotted.
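A minimal sketch of swapping both the summary function and the geom: here we plot the median of sales per year as a line with points. It assumes ggplot2 3.3.0 or later, where the argument is named fun (older releases used fun.y); na.rm=TRUE only silences the warning about missing sales values.

```r
library(ggplot2)

# median of sales (y) for each year (x), drawn as a line plus points
p <- ggplot(txhousing, aes(x=year, y=sales)) +
  stat_summary(fun=median, geom="line", na.rm=TRUE) +
  stat_summary(fun=median, geom="point", na.rm=TRUE)
p
```

Any function that takes a numeric vector and returns a single number can be supplied to fun in the same way.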
Scales define which aesthetic values are mapped to the data values.
Here is an example of a color scale that defines which colors are mapped to values of treat
:
color | treat |
---|---|
red | ozone |
blue | control |
Imagine that we might want to change the colors to “green” and “orange”.
The scale_
functions allow the user to control the scales for each aesthetic. These scale functions have names with structure scale_aesthetic_suffix
, where aesthetic
is the name of an aesthetic like color
or shape
or x
, and suffix
is some descriptive word that defines the functionality of the scale.
Then, to specify the aesthetic values to be used by the scale, supply a vector of values to the values
argument (usually) of the scale function.
Some example scale functions:
- scale_color_manual(): define an arbitrary color scale by specifying each color manually
- scale_color_hue(): define an evenly-spaced color scale by specifying a range of hues and the number of colors on the scale
- scale_shape_manual(): define an arbitrary shape scale by specifying each shape manually

See the ggplot2
documentation page section on scales to see a full list of scale functions.
Here is a color scale that ggplot2
chooses for us:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point()
default color scale
We can use scale_color_manual()
to specify which colors we want to use:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
using scale_color_manual to respecify colors
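When the values vector is unnamed, colors are assigned to the levels of cut in order. A slightly more defensive sketch names each color by its level, so the assignment does not depend on ordering; the vector name cut_colors is ours.

```r
library(ggplot2)

# names tie each color to a level of cut, so the order of the vector does not matter
cut_colors <- c("Ideal"="violet", "Premium"="blue", "Very Good"="green",
                "Good"="yellow", "Fair"="red")

p <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
  geom_point() +
  scale_color_manual(values=cut_colors)
p
```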
Create a new graph for data set Sitka, a scatter plot of Time on x and size on y, with the color of the points mapped to treat.
Now use scale_color_manual(), and inside specify the values argument to be “orange” and “purple”.
Remember that x
and y
are aesthetics, and the two axes visualize the scale for these aesthetics.
Thus, we use scale functions to control the scaling of these axes.
When y
is mapped to a continuous variable, we will typically use scale_y_continuous()
to control its scaling (use scale_y_discrete()
if y
is mapped to a factor). Similar functions exist for the x
aesthetic.
A description of some of the important arguments to scale_y_continuous()
:
- breaks: at what data values along the range of the axis to place tick marks and labels
- labels: what to label the tick marks
- name: what to title the axis

Our current graph of carat vs price has y-axis tick marks at 0, 5000, 10000, and 15000
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
default y-axis tick marks
Let’s put tick marks at all grid lines along the y-axis using the breaks
argument of scale_y_continuous
:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500))
changing y-axis tick marks
Now let’s relabel the tick marks to reflect units of thousands (of dollars) using labels
:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
labels=c(0,2.5,5,7.5,10,12.5,15,17.5))
relabeling y-axis tick marks
And finally, we’ll retitle the y-axis using the name
argument to reflect the units:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
labels=c(0,2.5,5,7.5,10,12.5,15,17.5),
name="price(thousands of dollars)")
new y-axis title
Use
scale_x_continuous()
to convert the x-axis of the previous graph from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis “time(months)”.
Although we can use scale functions like scale_x_continuous()
to control the limits and titles of the x-axis, we can also use the following shortcut functions:
- lims(), xlim(), ylim(): set axis limits
- xlab(), ylab(), ggtitle(), labs(): give labels (titles) to x-axis, y-axis, or graph; labs() can set labels for all aesthetics and title

To set axis limits, supply a vector of 2 numbers (inside c(), for example) to one of the limits functions:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
xlim(c(1,3)) # carat ranges from 0.2 to about 5 in the data
restricting axis limits will zoom in
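One caveat worth sketching: xlim() sets the limits of the x scale, so observations outside the limits are dropped before any layer is drawn (ggplot2 warns about removed rows). If the goal is only to zoom while keeping all data available to stats such as geom_smooth(), coord_cartesian() is an alternative. The object names below are ours.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
  geom_point()

# scale limits: points with carat outside 1 to 3 are treated as missing
p_drop <- base + xlim(c(1,3))

# coordinate limits: the view is restricted, but no data are discarded
p_zoom <- base + coord_cartesian(xlim=c(1,3))
```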
We can use labs()
to specify titles for the overall graph, the axes, and legends (guides).
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
labs(x="CARAT", y="PRICE", color="CUT", title="CARAT vs PRICE by CUT")
respecifying all titles with labs
Guides (axes and legends) visualize a scale, displaying data values and their matching aesthetic values. The x-axis, a guide, visualizes the mapping of data values to position along the x-axis. A color scale guide (legend) displays which colors map to which data values.
Most guides are displayed by default. The guides()
function sets and removes guides for each scale.
Here we use guides()
to remove the color
scale legend:
# notice no legend on the right anymore
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
guides(color="none")
color legend removed
Coordinate systems define the planes on which objects are positioned in space on the plot. Most plots use Cartesian coordinate systems, as do all the plots in the seminar. Nevertheless, ggplot2
provides multiple coordinate systems, including polar, flipped Cartesian, and map projections.
Split plots into small multiples (panels) with the faceting functions, facet_wrap()
and facet_grid()
. The resulting graph shows how each plot varies along the faceting variable(s).
facet_wrap()
wraps a ribbon of plots into a multirow panel of plots. Inside facet_wrap()
, specify ~
, then a list of splitting variables, separated by +
. The number of rows and columns can be specified with arguments nrow
and ncol
.
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point() +
facet_wrap(~cut) # create a ribbon of plots using cut
carat vs price, paneled by cut with facet_wrap()
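As a brief sketch of the nrow argument mentioned above, the same five panels can be wrapped into two rows instead of the default layout:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point() +
  facet_wrap(~cut, nrow=2)  # five panels of cut arranged in 2 rows
p
```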
facet_grid()
allows direct specification of which variables are used to split the data/plots along the rows and columns. Put the row-splitting variable before ~
, and the column-splitting variable after. The character .
specifies no faceting along that dimension.
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point() +
facet_grid(clarity~cut) # split using clarity along rows and cut along columns
carat vs price, paneled by clarity and cut with facet_grid()
Create a panel of scatter plots of Time vs size (same as above), split by treat along the rows using facet_grid().
Themes control elements of the graph not related to the data, such as the background, the grid lines, the axis lines, and the fonts of titles and labels.
To modify these, we use the theme()
function, which has a large number of arguments called theme elements, which control various non-data elements of the graph.
Some example theme()
arguments and what aspect of the graph they control:
- axis.line: lines forming x-axis and y-axis
- axis.line.x: just the line for x-axis
- legend.position: positioning of the legend on the graph
- panel.background: the background of the graph
- panel.border: the border around the graph
- title: all titles on the graph

A full description of theme elements can be found on the ggplot2
documentation page.
theme() arguments
Most non-data elements of the graph can be categorized as either a line (e.g. axes, tick marks), a rectangle (e.g. the background), or text (e.g. axes titles, tick labels). Each of these categories has an associated element_
function to specify the parameters controlling its appearance:
- element_line(): can specify color, size, linetype, etc.
- element_rect(): can specify fill, color, size, etc.
- element_text(): can specify family, face, size, color, angle, etc.
- element_blank(): removes theme elements from graph

Inside theme() we control the properties of a theme element using the proper element_ function.
For example, the x- and y-axes are lines and are both controlled by theme()
argument axis.line
, so their visual properties, such as color
and size
(thickness), are specified as arguments to element_line()
:
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2)) # size in mm
using theme argument axis.line to modify x-axis and y-axis lines
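A note on versions, offered as a sketch rather than a correction: as far as we are aware, ggplot2 3.4.0 and later deprecate the size argument of element_line() and element_rect() in favor of linewidth, and the code above will then run with a deprecation warning. Assuming such a version is installed, the equivalent call is:

```r
library(ggplot2)

# assumes ggplot2 >= 3.4.0, where line thickness is given by linewidth
p <- ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
  geom_point() +
  theme(axis.line=element_line(color="black", linewidth=2))  # linewidth in mm
```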
On the other hand, the background of the graph, controlled by theme()
argument panel.background
is a rectangle, so parameters like fill
color and border color
can be specified in element_rect()
.
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2),
panel.background=element_rect(fill="white", color="gray")) # color is the border color
using theme argument panel.background to modify the graph background
With element_text()
we can control properties such as the font family
or face
("bold"
, "italic"
, "bold.italic"
) of text elements like title
, which controls the titles of both axes.
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2),
panel.background=element_rect(fill="white", color="gray"),
title=element_text(family="serif", face="bold"))
using theme argument title to adjust fonts of all titles
Note: "sans"
, "serif"
, and "mono"
are the only font family
choices available for ggplot2
without downloading additional R
packages. See this RPubs webpage for more information.
Finally, some theme()
arguments do not use element_
functions to control their properties, like legend.position
, which simply accepts values "none"
, "left"
, "right"
, "bottom"
, and "top"
.
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2),
panel.background=element_rect(fill="white", color="gray"),
title=element_text(family="serif", face="bold"),
legend.position="bottom")
using theme argument legend.position to position legend
We could then use legend.text=element_text()
in theme()
to rotate the legend labels (not shown).
Remember to use the ggplot2
theme
documentation page when using theme()
.
Create a scatter plot of Time vs size from the Sitka data set. Then use theme() argument axis.ticks to “erase” the tick marks by coloring them white with element_line().
The ggplot2
package provides a few complete themes which make several changes to the overall background look of the graphic (see here for a full description).
Some examples:
theme_bw()
theme_light()
theme_dark()
theme_classic()
The themes usually adjust the color of the background and most of the lines that make up the non-data portion of the graph.
theme_classic()
mimics the look of base R
graphics:
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme_classic()
theme_classic()
theme_dark()
makes a dramatic change to the look:
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme_dark()
theme_dark()
ggsave()
makes saving plots easy. The last plot displayed is saved by default, but we can also save a plot stored to an R
object.
ggsave
attempts to guess the device to use to save the image from the file extension, so use a meaningful extension. Available devices include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf.
Other important arguments to ggsave()
:
- width
- height
- units: units of width and height of plot file ("in", "cm" or "mm")
- dpi: plot resolution in dots per inch
- plot: name of object with stored plot

#save last displayed plot as pdf
ggsave("plot.pdf")
#if you're working with lots of graphs, you can store them in R objects
p <- ggplot(Sitka, aes(x=Time, y=size)) +
geom_point()
#You can then use the plot argument of ggsave() to specify which plot to save instead of the last
ggsave("myplot.png", plot=p)
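The size and resolution arguments listed above can be combined in one call. The following sketch writes a 6 x 4 inch PNG at 300 dpi to R's temporary directory; the file name sitka_growth.png and the dimensions are arbitrary choices for illustration.

```r
library(ggplot2)
library(MASS)

p <- ggplot(Sitka, aes(x=Time, y=size)) +
  geom_point()

# hypothetical output path inside the session's temporary directory
out_file <- file.path(tempdir(), "sitka_growth.png")

# the png device is chosen from the file extension
ggsave(out_file, plot=p, width=6, height=4, units="in", dpi=300)

file.exists(out_file)
```

At 300 dpi, a 6 x 4 inch plot corresponds to an image of 1800 x 1200 pixels.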
Save your last plot as
mygraph.png
. View the file on your computer.
Rabbit data set {MASS}
To practice using the elements of the grammar of graphics, we will begin with an idea of what we want to display, and step-by-step, we will add to and adjust the graphic until we feel it is ready to share with an audience.
For our next graph, we will be visualizing the data in the Rabbit
data set, also loaded with the MASS
package.
Run data(Rabbit) and then str(Rabbit) to look at the structure of Rabbit and to bring it into the RStudio Environment.
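The Rabbit data set describes an experiment in which 5 rabbits were each given ascending doses of phenylbiguanide on two occasions, once after treatment with saline (control) and once after treatment with the serotonin antagonist MDL 72222, and the change in blood pressure was measured after each dose.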
The purpose was to test whether blood pressure changes were dependent on activation of serotonin receptors.
The data set contains 60 rows (5 rabbits measured 12 times) of the following 5 variables:
- BPchange: change in blood pressure relative to the start of the experiment
- Dose: dose of phenylbiguanide in micrograms
- Run: label of trial
- Treatment: Control or MDL
- Animal: animal ID (“R1” through “R5”)

As is typical in drug studies, we want to create a dose-response curve. For this study, we want to see how blood pressure changes are related to the dose of phenylbiguanide, the blood-pressure raising drug, in the presence of both saline (control) and the serotonin antagonist MDL 72222 (treatment).
Issues to think about:
So, we want a graph that represents individual dose-response curves for each rabbit, and ideally separate curves for treatment and control conditions per rabbit.
Let’s build this graph step-by-step.
What geom will be required to create the dose-response curve?
geom_line()
What are its required aesthetics?
x
and y
Initiate a new graph plotting the relationship between Dose and BPchange using geom_line() for data set Rabbit. For now, we will not group the data for instructive purposes.
ggplot(Rabbit, aes(x=Dose, y=BPchange)) +
geom_line()
That obviously isn’t what we want – this graph treats the data as if it all belongs on one line. We want separate lines by rabbit (variable Animal
). How can we specify that?
Map Animal to a grouping aesthetic, like group, color, or linetype.
Let’s imagine that we are constrained to produce a colorless figure, so that we cannot use color
. We will instead use linetype
to separate by Animal
.
Modify the previous plot by mapping linetype to Animal.
ggplot(Rabbit, aes(x=Dose, y=BPchange, linetype=Animal)) +
geom_line()
That still doesn’t look right. Why?
Each rabbit was measured under both treatments, so each line still joins the control and MDL curves for that rabbit. We thus need to separate the treatment curve from the control curve for each animal. How can we accomplish that?
Map Treatment to a grouping aesthetic like color. That certainly works:
ggplot(Rabbit, aes(x=Dose, y=BPchange, color=Treatment, linetype=Animal)) +
geom_line()
However, we are constraining ourselves to colorless graphs. Can you think of another way to separate the curves?
Use facet_wrap() to split the previous graph by Treatment. Remember the ~.
ggplot(Rabbit, aes(x=Dose, y=BPchange, linetype=Animal)) +
geom_line() +
facet_wrap(~Treatment)
Now we’re finally starting to see the graph that we want!
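Faceting is the approach we continue with. For comparison only, here is a sketch of another colorless way to get one line per rabbit per treatment: map group to the combination of Animal and Treatment using base R's interaction(). All ten curves then share one panel, but nothing visually marks which treatment a curve belongs to, which is why faceting reads better here.

```r
library(ggplot2)
library(MASS)

# one line for each Animal-by-Treatment combination (5 rabbits x 2 treatments)
p <- ggplot(Rabbit, aes(x=Dose, y=BPchange, linetype=Animal,
                        group=interaction(Animal, Treatment))) +
  geom_line()
p
```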
It can be a bit difficult to distinguish between the lines just based on linetype. Let’s add some points to the graph, but vary the shapes of the points by Animal.
How do we add points to the graph that vary by shape?
Use geom_point() with shape mapped to Animal.
Add a scatter plot with the shape of points varying with Animal to the previous graph.
ggplot(Rabbit, aes(x=Dose, y=BPchange, linetype=Animal, shape=Animal)) +
geom_line() +
facet_wrap(~Treatment) +
geom_point()
Making good progress!
Now, imagine we want to select the shapes that are plotted rather than use the defaults. What function (or what kind of function) will we need to change the shapes?
scale_shape_manual(). The shapes are specified with integer codes ranging from 0 to 25. See the help page for points (?points or ?pch) to see a table of codes and shapes.
Use scale_shape_manual() to change the shapes to those corresponding to the codes (0, 3, 8, 16, 17) in the previous graph.
ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
geom_line() +
facet_wrap(~Treatment) +
geom_point() +
scale_shape_manual(values=c(0, 3, 8, 16, 17))
Ok! Now we are done adding data to the graph. Now we move on to some fine tuning.
First, let’s change the titles of the x-axis and the y-axis. What function can change both of these?
labs()
For the previous graph, change the title of the x-axis to “Dose(mcg)” and the title of the y-axis to “Change in blood pressure”.
ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
geom_point() +
geom_line() +
facet_wrap(~Treatment) +
scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
labs(x="Dose(mcg)", y="Change in blood pressure")
Almost there!
Imagine we don’t like the gray background with white gridlines, and instead want to use a white background with gray gridlines.
What function will we use to adjust each of these elements?
theme()
The theme()
argument that controls the graph background is panel.background
. Which element_
function should we use to specify the parameters for panel.background
?
element_rect()
theme()
argument panel.grid
controls the grid lines. Which element_
function should we use to specify parameters for gridlines?
element_line()
Use theme() with arguments panel.background and panel.grid to change the background fill to "white" and the gridline color to "gray90".
ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
geom_point() +
geom_line() +
facet_wrap(~Treatment) +
scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
labs(x="Dose(mcg)", y="Change in blood pressure") +
theme(panel.background = element_rect(fill="white"),
panel.grid=element_line(color="gray90"))
One final step!
Now we want to use theme()
to adjust the axis, legend, and panel titles (“Control” and “MDL”) to use bold fonts.
The theme arguments we will use are title
and strip.text
. Which element_
function should we use to specify parameters for these?
element_text()
We will use the face
argument in element_text
to set the titles to “bold”.
Use
theme()
to bold the axes titles, the legend title, and the panel titles in the previous graph.
ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
geom_point() +
geom_line() +
facet_wrap(~Treatment) +
scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
labs(x="Dose(mcg)", y="Change in blood pressure") +
theme(panel.background = element_rect(fill="white"),
panel.grid=element_line(color="gray90"),
title=element_text(face="bold"),
strip.text=element_text(face="bold"))
ggplot2
birthwt {MASS}
The birthwt
data set contains data regarding risk factors associated with low infant birth weight.
The data consist of 189 observations of 10 variables, all numeric:
- low: 0/1 indicator of birth weight < 2.5 kg
- age: mother’s age
- lwt: mother’s weight in pounds
- race: mother’s race (1=white, 2=black, 3=other)
- smoke: 0/1 indicator of smoking during pregnancy
- ptl: number of previous premature labors
- ht: 0/1 indicator of history of hypertension
- ui: 0/1 indicator of uterine irritability
- ftv: number of physician visits during first trimester
- bwt: birth weight

Let’s take a look at the structure of the birthwt data set first, to get an idea of how the variables are measured.
Run data(birthwt). Then use str() to examine the structure of the birthwt dataset.
Variables in a dataset can generally be divided into numeric variables, where the number value is a meaningful representation of a quantity, and factor (categorical) variables, where number values are usually codes representing membership to a category rather than quantities.
In R
, we can encode variables as “factors” with the factor()
function.
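As a small illustration of factor(), the sketch below converts a toy vector of numeric codes into a labeled factor. The vector race_codes is made up for this example; it only mimics how race is coded in birthwt.

```r
# Numeric codes, in the style of the race variable (toy values for illustration)
race_codes <- c(1, 3, 2, 1)

# levels= lists the codes, labels= supplies the category names in the same order
race_factor <- factor(race_codes, levels=1:3, labels=c("white", "black", "other"))

is.numeric(race_codes)  # the codes are plain numbers
is.factor(race_factor)  # the converted variable is a factor
levels(race_factor)     # category names, in the order given by levels=
table(race_factor)      # counts per category
```

Once converted, the values print as category names rather than as numbers.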
Some aesthetics can be mapped to either numeric or categorical variables, and will be scaled differently.
These include:
- x and y: continuous or discrete axes
- color and fill: color gradient scales or evenly-spaced hue scales
Other aesthetics can only be mapped to categorical variables:
- shape
- linetype
And finally some aesthetics should only be mapped to numeric variables (a warning is issued if mapped to a categorical variable):
- size
- alpha
Let’s examine how aesthetics behave when mapped to different types of variables using the birthwt dataset, in which all of the variables are numeric initially.
When color
is mapped to a numeric variable, a color gradient scale is used:
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point()
color gradient is appropriate for a continuous variable
Note: even though we just used race as a numeric variable to demonstrate how ggplot2 handles it, we do not recommend treating categorical variables as numeric.
When color
is instead mapped to a factor variable, a color scale of evenly spaced hues is used. We can convert a numeric variable to a factor inside of aes()
with factor()
:
ggplot(birthwt, aes(x=age, y=bwt, color=factor(race))) +
geom_point()
evenly spaced hues emphasize contrasts between groups of a factor
An error results if we try to map shape
to a numeric version of race
, because shape
only accepts factor variables.
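One way to see this error without stopping a script is to build the plot inside tryCatch(). This is only a sketch: it assumes a fresh copy of the data taken with MASS::birthwt, in which race is still numeric, and the names bw and result are illustrative.

```r
library(ggplot2)

bw <- MASS::birthwt  # untouched copy of the data: race is numeric here

p <- ggplot(bw, aes(x=age, y=bwt, shape=race)) +
  geom_point()

# The error is raised when the plot is built, not when it is defined,
# so we build it explicitly and capture the condition
result <- tryCatch(ggplot_build(p), error=function(e) e)

if (inherits(result, "error")) {
  message("shape rejected the numeric variable: ", conditionMessage(result))
}
```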
Shape accepts the factor representation of race
:
ggplot(birthwt, aes(x=age, y=bwt, shape=factor(race))) +
geom_point()
distinct shapes identify the groups of a factor
Finally, some aesthetics like alpha
and size
should really only be used with truly numeric variables, and a warning will be issued if the variable is a factor:
ggplot(birthwt, aes(x=age, y=bwt, size=factor(race))) +
geom_point()
## Warning: Using size for a discrete variable is not advised.
size and alpha should only be mapped to numeric variables
We recommend converting all categorical variables to factors prior to graphing for several reasons:
- we avoid calling factor() each time we create a graph
- ggplot2 will discourage mapping a factor variable to an inappropriate aesthetic like size
Use the following code to convert 4 of the variables in birthwt to factors:
birthwt$low <- factor(birthwt$low, levels=0:1, labels=c("now low", "low"))
birthwt$race <- factor(birthwt$race, levels=1:3, labels=c("white", "black", "other"))
birthwt$ht <- factor(birthwt$ht, levels=0:1, labels=c("non-hypert", "hypert"))
birthwt$smoke <- factor(birthwt$smoke, levels=0:1, labels=c("did not smoke", "smoked"))
Issue str(birthwt) again to examine the class of each variable.
Then create a scatter plot of x=age and y=bwt for data set birthwt. Try mapping an appropriate variable (besides race) to shape and another to alpha.
When 2 data points have the same values plotted on the graphs, they will generally occupy the same position, causing one to obscure the other.
Here is an example where we map race
, a factor variable, to x
and map age (in years) to y
in a scatter plot:
ggplot(birthwt, aes(x=race, y=age)) +
geom_point()
too many discrete values leads to overlapping points
There are 189 data points in the data set, but far fewer than 189 points visible in the graph, because many are completely overlapping.
To address this problem, we have a choice of “position adjustments” which can be specified to the position
argument in a geom function.
For geom_point()
, we usually use either:
- position="jitter": add a little random noise to position
- position="identity": overlay points (the default for geom_point())
By adding position="jitter" to the previous scatter plot, we can better see how many points there are at each age:
ggplot(birthwt, aes(x=race, y=age)) +
geom_point(position="jitter")
jittering adds random variation to the position of the points
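If we need more control over the noise than position="jitter" gives, one option is the position_jitter() function, which accepts the amount of horizontal and vertical jitter and a random seed. The sketch below uses a fresh copy of the data named bw (an illustrative name) and the values .2 and 123, which are arbitrary choices.

```r
library(ggplot2)

bw <- MASS::birthwt
bw$race <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# width: maximum horizontal shift in either direction
# height=0: leave the age values exactly where they are
# seed: makes the random noise reproducible from run to run
p <- ggplot(bw, aes(x=race, y=age)) +
  geom_point(position=position_jitter(width=.2, height=0, seed=123))
p
```

Jittering only along the categorical axis keeps the numeric values readable while still separating overlapping points.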
Remember that geom_bar()
will plot the frequencies of the variable mapped to x
as bars. If we map a second variable to fill
, the bars will be colored by the second variable.
We can use the position
argument in geom_bar()
to control the placement of the bars with the same x
value.
The following adjustments are generally used for geom_bar()
:
- position="stack": stack elements vertically (the default for geom_bar())
- position="dodge": move elements side-by-side (the default for geom_boxplot())
- position="fill": stack elements vertically, standardize heights to 1
Each position adjustment emphasizes different quantities.
By default, geom_bar
uses position="stack"
, a compromise where we can see both the counts and proportions well:
ggplot(birthwt, aes(x=low, fill=race)) +
geom_bar()
geom_bar() will stack bars with the same x-position
If we instead want to emphasize counts, we use position="dodge"
, which places the bars side-by-side:
ggplot(birthwt, aes(x=low, fill=race)) +
geom_bar(position="dodge")
dodging emphasizes counts
Proportions are emphasized with position="fill"
, where the bars are stacked and their heights are standardized:
ggplot(birthwt, aes(x=low, fill=race)) +
geom_bar(position="fill")
filling emphasizes proportions
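To check what the bars represent, we can compute the same quantities directly with base R. This sketch works on a fresh copy of the data named bw (an illustrative name) and leaves low as its original 0/1 codes.

```r
bw <- MASS::birthwt
bw$race <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# Counts of race within each value of low:
# these are the heights of the stacked and dodged bar segments
counts <- table(race=bw$race, low=bw$low)
counts

# Proportions of race within each value of low (columns sum to 1):
# these are the segment heights under position="fill"
props <- prop.table(counts, margin=2)
props
```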
Use geom_bar() with data set birthwt variables low and smoke together with position adjustments to answer two questions:
A. Are there more low birth weight or non-low birth weight babies with mothers who smoked in the dataset?
B. Are babies from mothers who smoked proportionally more likely to be low birth weight or non-low birth weight?
Error bars and confidence bands are both used to express ranges of statistics. To draw these, we’ll use geom_errorbar() and geom_ribbon(), respectively.
To use both geoms, the following aesthetics are required:
- x: horizontal positioning of error bar or band
- ymin: vertical position of lower error bar or band
- ymax: vertical position of upper error bar or band
For example, the following code estimates the mean birth weight and 95% confidence interval for the mean for the three races in data set birthwt. The means and confidence limits are stored in a new data.frame called bwt_by_race.
# mean_cl_normal() is provided by ggplot2 and requires the Hmisc package to be installed
bwt_by_race <- do.call(rbind,
tapply(birthwt$bwt, birthwt$race, mean_cl_normal))
bwt_by_race$race <- row.names(bwt_by_race)
names(bwt_by_race) <- c("mean", "lower", "upper", "race")
bwt_by_race
## mean lower upper race
## white 3102.719 2955.235 3250.202 white
## black 2719.692 2461.722 2977.662 black
## other 2805.284 2629.127 2981.441 other
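If Hmisc is not available, similar limits can be computed with base R. The sketch below assumes the limits are the usual t-based interval for a mean (mean plus or minus a t quantile times the standard error); mean_ci is a hypothetical helper written for this example, and bw is a fresh copy of the data.

```r
bw <- MASS::birthwt
bw$race <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# Mean and t-based confidence limits for one group of values
# (mean_ci is an illustrative helper, not part of ggplot2)
mean_ci <- function(x, level=.95) {
  n <- length(x)
  half_width <- qt((1 + level)/2, df=n - 1) * sd(x)/sqrt(n)
  c(mean=mean(x), lower=mean(x) - half_width, upper=mean(x) + half_width)
}

# Apply the helper to birth weight within each race and stack the results
ci_by_race <- as.data.frame(do.call(rbind, tapply(bw$bwt, bw$race, mean_ci)))
ci_by_race$race <- row.names(ci_by_race)
ci_by_race
```

Under that assumption, the result should agree with the table above up to rounding.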
Now we can plot the means by race with geom_point()
and the confidence limits with geom_errorbar()
:
ggplot(bwt_by_race, aes(x=race, y=mean)) +
geom_point() +
geom_errorbar(aes(ymin=lower, ymax=upper))
mean birthweight by race
Use width=
to adjust the width of the error bars:
ggplot(bwt_by_race, aes(x=race, y=mean)) +
geom_point() +
geom_errorbar(aes(ymin=lower, ymax=upper), width=.1)
mean birthweight by race, narrower error bars
Confidence bands work similarly. We’ll need values for the maximum and minimum again for geom_ribbon().
This time, we’ll create a plot of predicted values with confidence bands from a regression of birthweight on age. First, we’ll run the model and add the predicted values and the confidence limits to the original data set for plotting:
# linear regression of birth weight on age
m <- lm(bwt ~ age, data=birthwt)
# get predicted values (fit) and confidence limits (lwr and upr)
preddata <- predict(m, interval="confidence")
# add predicted values to original data
birthwt <- cbind(birthwt, preddata)
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt fit
## 85 now low 19 182 black did not smoke 0 non-hypert 1 0 2523 2891.909
## 86 now low 33 155 other did not smoke 0 non-hypert 0 3 2551 3065.925
## 87 now low 20 105 white smoked 0 non-hypert 0 1 2557 2904.339
## 88 now low 21 108 white smoked 0 non-hypert 1 2 2594 2916.768
## 89 now low 18 107 white smoked 0 non-hypert 1 0 2600 2879.479
## 91 now low 21 124 other did not smoke 0 non-hypert 0 0 2622 2916.768
## lwr upr
## 85 2757.969 3025.849
## 86 2846.442 3285.408
## 87 2781.794 3026.883
## 88 2803.295 3030.242
## 89 2732.358 3026.600
## 91 2803.295 3030.242
Now we’ll use geom_line()
to show the best fit line, and geom_ribbon()
to show the confidence bands:
ggplot(birthwt, aes(x=age, y=fit)) +
geom_line() +
geom_ribbon(aes(ymin=lwr, ymax=upr))
best fit line with confidence bands
Yikes! That confidence band is too dark. Use alpha
to lighten the bands by making them more transparent. Remember, because we are setting the entire band to be a constant transparency, we will specify alpha
outside of aes()
.
ggplot(birthwt, aes(x=age, y=fit)) +
geom_line() +
geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=.5)
best fit line with confidence bands
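For a simple linear fit like this one, geom_smooth() offers a shortcut that fits the model and draws both the line and the band in one step. The following is a sketch of that alternative, not a replacement for the approach above, which also works for models that geom_smooth() cannot fit. It uses a fresh copy of the data named bw (an illustrative name).

```r
library(ggplot2)

bw <- MASS::birthwt

# method="lm" requests a linear regression of y on x
# se=TRUE draws the confidence band, level sets its coverage
# alpha is set outside aes() to make the band transparent
p <- ggplot(bw, aes(x=age, y=bwt)) +
  geom_point() +
  geom_smooth(method="lm", formula=y ~ x, se=TRUE, level=.95, alpha=.5)
p
```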
Use the following code to run a logistic regression of low birth weight (low) on various terms in the birthwt data set and then convert the coefficients and their confidence intervals to odds ratios.
m2 <- glm(low ~ smoke + age + ptl + ui, family="binomial", data=birthwt)
ci <- confint(m2)
odds_ratios <- data.frame(coef = exp(coef(m2)[-1]),
lower = exp(ci[-1,1]),
upper = exp(ci[-1,2]))
odds_ratios$term <- row.names(odds_ratios)
Graph the odds ratios and their confidence intervals using geom_point() and geom_errorbar().
At times we need to add notes or annotations directly to the graph that are not represented by any variables in the graph data set. For example, we may want to add a text label to a single point on a scatter plot, or perhaps highlight a portion of the graph with a colored box.
For this, we can use the annotate()
function. To use annotate()
, the first argument is the name of a geom (for example "text"
or "rect"
). Subsequent arguments are positioning aesthetics such as x=
and y=
and any additional aesthetics needed for that particular geom.
Let’s imagine that we want to annotate the data point in the far upper right corner of this graph we have seen before:
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point()
we want to annotate point in upper right
Suppose we want to label the outlier as a possible data error. To add annotation text, we will use the "text" geom in annotate(). We will need to specify x= and y= positions for the text, and the contents of the text in label=.
We see that the outlier lies at about x=45, y=5000. To place the text a little to the left of the point, we will use x=42 and y=5000. Proper positioning will take some experimentation. We specify the text to be displayed with label="Data error?".
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point() +
annotate("text", x=42, y=5000, label="Data error?") # notice first argument is "text", not geom_text
Annotating outlier as possible data entry error
As another example, let’s highlight a portion of the graph that features birthweights within 1 standard deviation of the mean weight. We will create a rectangle using the "rect" geom that spans the x-axis for its full width from xmin=13 to xmax=46, and the y-axis from ymin=2215 to ymax=3673 (mean-sd, mean+sd). We will set alpha=.2 to make the box transparent.
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point() +
annotate("rect", xmin=13, xmax=46, ymin=2215, ymax=3673, alpha=.2) # notice first argument is "rect", not geom_rect
Birthweights within one standard deviation of mean
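Rather than typing the limits by hand, we can compute them from the data. The sketch below does this on a fresh copy of the data named bw; the names band_lower and band_upper are illustrative. It also uses -Inf and Inf for the x limits, which stretches the rectangle to the panel edges.

```r
library(ggplot2)

bw <- MASS::birthwt
bw$race <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# Limits of the band: mean minus and plus one standard deviation
band_lower <- mean(bw$bwt) - sd(bw$bwt)
band_upper <- mean(bw$bwt) + sd(bw$bwt)

# -Inf and Inf extend the rectangle across the full width of the panel
p <- ggplot(bw, aes(x=age, y=bwt, color=race)) +
  geom_point() +
  annotate("rect", xmin=-Inf, xmax=Inf, ymin=band_lower, ymax=band_upper, alpha=.2)
p
```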
We can specify a color in R in several ways:
- string names
- hex codes
- RGB values with rgb()
We have already used string names like “white” and “green” to specify colors.
You can issue colors()
in R
to see a full list of available color names. See here for a chart of these colors. Here we show the first 30 names out of 657:
head(colors(), n=30)
## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"
## [9] "aquamarine1" "aquamarine2" "aquamarine3" "aquamarine4"
## [13] "azure" "azure1" "azure2" "azure3"
## [17] "azure4" "beige" "bisque" "bisque1"
## [21] "bisque2" "bisque3" "bisque4" "black"
## [25] "blanchedalmond" "blue" "blue1" "blue2"
## [29] "blue3" "blue4"
We can also use hex color codes. These hex codes usually consist of #
followed by 6 numbers/letters (each a hexadecimal digit ranging from 0 to F), where the first two digits represent redness, the second two greenness, and the last two blueness.
For example, the hex code #009900
would represent a green shade, while hex code #FF00EE
would represent a purple shade. Tools like this can help you identify the hex code for a particular color.
ggplot(birthwt, aes(x=age, y=bwt)) +
geom_point(color="#E36D11")
Using a hex code to specify a shade of orange
Finally, we can use RGB (red, green, blue) values to specify a color. Specify three numbers between 0 and 1 to rgb()
function, and it will return the hex code for that color. Let’s try a purple:
# rgb() returns a hex code
rgb(.75, 0, 1)
## [1] "#BF00FF"
ggplot(birthwt, aes(x=age, y=bwt)) +
geom_point(color=rgb(.75, 0, 1))
Using rgb() to specify a shade of purple
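The three ways of specifying a color can be converted into one another with base R functions. As a short sketch, col2rgb() breaks a color name or hex code into its red, green, and blue components, and rgb() puts components back together. The object names here are illustrative.

```r
# col2rgb() returns the red, green, and blue components on a 0-255 scale
orange_parts <- col2rgb("#E36D11")
orange_parts
col2rgb("honeydew")  # also works with color names

# rgb() accepts 0-255 values when maxColorValue=255
orange_hex <- rgb(227, 109, 17, maxColorValue=255)
orange_hex

# by default rgb() expects values between 0 and 1
purple_hex <- rgb(.75, 0, 1)
purple_hex
```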
Part of the challenge of making effective and attractive color graphs is choosing a color palette that serves both purposes of representing variation and catching the eye.
When you map a variable to color
or fill
, the ggplot2
package will use the variable’s type (i.e. numeric, factor, ordinal) to choose a color scale.
If you map a numeric variable to color, you will usually get a color gradient based on a single hue:
ggplot(birthwt, aes(x=lwt, y=bwt, color=as.numeric(race))) +
geom_point()
color gradient scale for numeric variables
A color gradient is a natural analog to a numeric variable. In the above graph, as the color becomes “bluer”, the race
value becomes higher (assuming the value has numeric meaning).
On the other hand, if we map a factor variable to color
, we get a set of distinct hues evenly spaced around the color wheel:
ggplot(birthwt, aes(x=lwt, y=bwt, color=factor(race))) +
geom_point()
evenly spaced distinct hues for factor variables
The categories of a factor
variable are considered unordered, so using completely different hues to represent them makes sense.
Those were just the default color scales that ggplot2
chooses for you, by guessing the appropriate scale from the variable’s type. There are many ways to form color scales using ggplot2
so you have lots of options when choosing a palette.
Here are some color
scale functions used to form color scales (there is an analogous scale function for the fill
aesthetic for each of the below):
- scale_color_brewer: use a ColorBrewer sequential, diverging, or qualitative scale (described below)
- scale_color_gradient: create a low-high color gradient scale (default for numeric variables)
- scale_color_hue: create a scale of evenly spaced hues around the color wheel (default for factors)
- scale_color_manual: manually create color scale
With scale_color_gradient, we can define the colors that define the ends of the gradient with arguments low and high. The default gradient runs from a blueish-black at the low end to a light-blue at the high end. We can redefine the scale to go from a very light green (honeydew) to a dark green:
ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
geom_point() +
scale_color_gradient(low="honeydew", high="darkgreen")
Defining our own color gradient
Because we are only changing a single hue with color gradients, perhaps it is easier to use rgb()
, where we can specify the intensity of each hue:
ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
geom_point() +
scale_color_gradient(low=rgb(.1, .2, .1), high=rgb(.1, 1, .1))
Defining our own color gradient with rgb()
With scale_color_hue()
, we define a color scale by specifying a range of colors to use, and then evenly spaced hues will be selected from this range. Here are the relevant arguments to scale_color_hue()
:
- h: range of hues to use (on color wheel), should be a vector of 2 numbers between 0 and 360
- h.start: starting hue (first color of palette, then next colors are chosen to be equally spaced apart)
- direction: direction to travel around color wheel, either 1=clockwise or -1=counter-clockwise
Varying any of the 3 above will alter the color palette.
First, we change the range of colors with h to be much smaller:
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point() +
scale_color_hue(h=c(0,90))
restricting range of colors with scale_color_hue()
We can also use the original range, but change the starting hue with h.start
to get a completely different set:
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point() +
scale_color_hue(h.start=20)
changing starting hue with scale_color_hue()
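To preview the colors a hue scale will produce before drawing a graph, one option is the hue_pal() function from the scales package, which takes the same h, h.start, and direction arguments. This is a sketch that assumes scales is installed (it normally is, as a dependency of ggplot2); it is not loaded elsewhere in this seminar, and the object names are illustrative.

```r
library(scales)

# hue_pal() returns a function; calling it with n gives n hex codes
default_hues <- hue_pal()(3)
narrow_hues  <- hue_pal(h=c(0, 90))(3)   # restricted range of hues
shifted_hues <- hue_pal(h.start=20)(3)   # same range, different starting hue

default_hues
narrow_hues
shifted_hues

# show_col() draws the colors as labeled swatches
show_col(c(default_hues, narrow_hues, shifted_hues))
```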
We have already used scale_color_manual()
to alter color scales, but note that you can use hex codes and rgb()
to specify the colors:
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point() +
scale_color_manual(values=c(rgb(.5,.5,.2), "#FF1035", "blue"))
manually changing colors with hex codes and rgb()
ColorBrewer is a web resource designed by Cynthia Brewer that lists many color schemes designed for different purposes: sequential, diverging, and qualitative.
The ColorBrewer palettes are not only designed to be highly functional, they are also very attractive, with colors that complement each other well.
The ColorBrewer palettes have been integrated into R, and are available in ggplot2 through scale_color_brewer() and scale_fill_brewer().
Arguments to scale_color_brewer()
and scale_fill_brewer()
:
- type: one of "seq", "div", or "qual"
- palette: the name of the palette (e.g. “YlGnBu”) or a number indicating the index of the palette on ColorBrewer
- direction: 1=default order, -1=reverse order
We’ll use a sequential palette first, although it should not be used with race since race does not progress from low to high values:
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point() +
scale_color_brewer(type="seq", palette="RdPu")
Sequential palette not a great choice for race
Instead, we should use a qualitative palette with race
:
ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
geom_point() +
scale_color_brewer(type="qual", palette=8) # requests the 8th qualitative palette
Qualitative palette better for race
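Because a palette index depends on the order in which the palettes are listed, requesting a palette by name can be easier to read. The sketch below uses "Set2", one of the ColorBrewer qualitative palettes, on a fresh copy of the data named bw (an illustrative name).

```r
library(ggplot2)

bw <- MASS::birthwt
bw$race <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# When the palette is given by name, type= is not needed
p <- ggplot(bw, aes(x=age, y=bwt, color=race)) +
  geom_point() +
  scale_color_brewer(palette="Set2")
p
```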
Create a new bar graph of data set birthwt, where we have counts of x=low colored by fill=race. Use scale_fill_hue() and scale_fill_brewer() to adjust the color scales to your liking.
For more in-depth information, read ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, creator of the ggplot2 package:
This section of the seminar describing the grammar summarizes bits and pieces of chapter 3.
hsb
For the final set of exercises, we will be using a data set stored on the UCLA IDRE website, which we load with the following code:
hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
This data set contains demographic and academic data for 200 high school students. We will be using the following variables:
- read, write, math, science: academic test scores
- female: gender, factor with levels “female” and “male”
- honors: enrollment in honors program, factor with 2 levels “enrolled” and “not enrolled”
- ses: socioeconomic status, factor with 3 levels, “low”, “middle”, “high”
- schtyp: school type, factor with 2 levels, “private” and “public”
Use the following code to load the hsb data set:
hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
Create a graph of boxplots of the variable math across levels of the variable honors. Color the inside of the boxes by female. Change the inside colors to “blue” and “gold”.
ggplot(hsb, aes(x=honors, y=math, fill=female)) +
geom_boxplot() +
scale_fill_manual(values=c("blue", "gold"))
Create a bar graph that displays the counts of students that fall into groups made up of the following 4 variables: female, prog, schtyp, ses. For example, from such a graph we can know how many female students in the academic program who go to public school and are of high socioeconomic status are in the data set.
Hint: map one variable to x, one to fill, and 2 can be used for faceting!
# don't forget to use position="dodge" for counts
ggplot(hsb, aes(x=female, fill=prog)) +
geom_bar(position="dodge", width=.5) +
facet_grid(schtyp ~ ses)
Try to recreate this graph:
Note that the background has been entirely removed and that the axis and legend titles are red and in “mono” font.
This is just one solution:
ggplot(hsb, aes(x=read, y=write, color=math)) +
geom_point() +
geom_smooth(color="red") +
labs(x="Reading Score", y="Writing Score", color="Math Score") +
theme(title=element_text(family="mono", color="red"),
panel.background=element_blank())