This seminar introduces how to use the R ggplot2
package, particularly for producing statistical graphics for data
analysis.
Today, we will:
ggplot2
Text in this font signifies R code or variables in a data set.
Text that appears like this represents an instruction to practice ggplot2 coding.
Load the packages into the current R session with library(). In addition to ggplot2, we load package MASS (installed with R) for data sets.
Please use library() to load packages ggplot2 and MASS.
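A minimal sketch of the loading step, assuming both packages are already installed (MASS ships with standard R distributions; ggplot2 may first need `install.packages("ggplot2")`):

```r
# attach both packages to the current R session
library(ggplot2)  # plotting functions
library(MASS)     # example data sets such as Sitka and birthwt
```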
ggplot2 package
https://ggplot2.tidyverse.org/reference/
The official reference webpage for ggplot2 has help files for its many functions and operators, with many examples.
A grammar of a language defines the rules of structuring words and phrases into meaningful expressions.
A grammar of graphics defines the rules of structuring mathematical and aesthetic elements into a meaningful graph.
Leland Wilkinson (2005) designed the grammar upon which ggplot2 is based.
ggplot() function and aesthetics
All graphics begin with specifying the ggplot() function (note: not ggplot2, the name of the package).
In the ggplot() function we specify:
data.frame object
Example syntax for ggplot() specification (italicized words are to be filled in by you):
ggplot(data, aes(x=xvar, y=yvar))
data: name of the data.frame that holds the variables to be plotted
x and y: aesthetics that position objects on the graph
xvar and yvar: names of variables in data mapped to x and y
Notice that the aesthetics are specified inside aes(), which is itself nested inside of ggplot().
Aesthetics specified inside of ggplot() are inherited by subsequent layers:
Specifying just x and y aesthetics alone will produce a plot with just the 2 axes.
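For instance, here is a sketch using the txhousing data set that ships with ggplot2 (the object name `p_axes` is only illustrative). With aesthetics but no layers, only the axes are drawn:

```r
library(ggplot2)

# x and y are mapped, but no geom layer has been added yet
p_axes <- ggplot(txhousing, aes(x=volume, y=sales))
p_axes
```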
Add layers with + to add graphical components.
Layers consist of geoms, stats, scales, faceting, and themes, which we will discuss in detail.
Remember that subsequent layers inherit aesthetics from ggplot(). However, specifying new aesthetics in a layer will override the aesthetics specified in ggplot().
# scatter plot of volume vs sales
# with rug plot colored by number of listings
ggplot(txhousing, aes(x=volume, y=sales)) + # x=volume and y=sales inherited by all layers
geom_point() +
geom_rug(aes(color=listings)) # color will only apply to the rug plot because not specified in ggplot()
Aesthetics are the visual properties of objects on the graph.
Which aesthetics are required and which are allowed vary by geom.
Commonly used aesthetics:
x: positioning along x-axis
y: positioning along y-axis
color: color of objects; for 2-d objects, the color of the object’s outline (compare to fill below)
fill: fill color of objects
linetype: how lines should be drawn (solid, dashed, dotted, etc.)
shape: shape of markers in scatter plots
size: how large objects appear
alpha: transparency of objects (value between 0, transparent, and 1, opaque – inverse of how many stacked objects it will take to be opaque)
Map aesthetics to variables inside the aes() function. By mapping, we mean the aesthetic will vary as the variable varies.
For example, mapping x=volume causes the position of the plotted data to vary with values of variable “volume”. Similarly, mapping color=listings causes the color of objects to vary with values of variable “listings”.
# mapping color to listings inside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point(aes(color=listings))
Set aesthetics to a constant outside the aes() function.
Compare the following graphs:
# setting color to green outside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point(color="green")
Setting an aesthetic to a constant within aes() can lead to unexpected results: the constant is treated as a data value to be mapped rather than as a color specification, so the color scale assigns its default color instead of the specified one.
# color="green" inside of aes()
# geom_point() treats "green" as a single data value, not a color name,
# so the default color scale picks the color and a legend is drawn
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(aes(color="green"))
To practice using the grammar of graphics, we will use the Sitka dataset (from the MASS package).
Note: Data sets that are loaded into R with a package are immediately available for use. To see the object appear in RStudio’s Environment pane (so you can click to view it), run data() on the data set, and then click the data set name in the Environment pane.
Run the command data(Sitka) and then click on Sitka in the Environment pane.
The Sitka dataset describes the growth of trees over time, some of which were grown in ozone-enriched chambers. The data frame contains 395 rows of the following 4 columns:
A. Create a scatter plot of Time vs size to view the growth of trees over time.
B. Color the scatter plot points by the variable treat.
C. Add an additional geom_smooth() (loess) layer to the graph.
D. SET the color of the loess smooth to “green” rather than have it colored by treat. Why is there only one smoothed curve now?
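One possible approach to parts A through D is sketched below. It assumes color is mapped to treat inside ggplot(), and it is not the only valid answer; the object names are illustrative.

```r
library(ggplot2)
library(MASS)  # provides Sitka

# A-C: scatter plot colored by treat, plus a loess smooth for each treat group
p_by_treat <- ggplot(Sitka, aes(x=Time, y=size, color=treat)) +
  geom_point() +
  geom_smooth(method="loess")

# D: color is SET to a constant in geom_smooth(), which replaces the
# inherited color=treat mapping for that layer only
p_green <- ggplot(Sitka, aes(x=Time, y=size, color=treat)) +
  geom_point() +
  geom_smooth(method="loess", color="green")
p_green
```

Because the smooth layer in the second plot no longer has treat mapped to any of its aesthetics, its data are not split into groups, and a single curve is fit to all trees.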
Geom functions differ in the geometric shapes produced for the plot.
Some example geoms:
geom_bar(): bars with bases on the x-axis
geom_boxplot(): boxes-and-whiskers
geom_errorbar(): T-shaped error bars
geom_density(): density plots
geom_histogram(): histogram
geom_line(): lines
geom_point(): points (scatterplot)
geom_ribbon(): bands spanning y-values across a range of x-values
geom_smooth(): smoothed conditional means (e.g. loess smooth)
geom_text(): text
Each geom has required aesthetics. For example, geom_point() requires both x and y, the minimal specification for a scatterplot.
Geoms differ in which aesthetics they accept. For example, geom_point() accepts the aesthetic shape, which defines the shapes of points on the graph, while geom_bar() does not accept shape.
Check the geom help files for required and understood aesthetics. In the Aesthetics section of the geom’s help file, required aesthetics are bolded.
We will tour some commonly used geoms.
# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=median)) +
geom_histogram()
Histograms are popular choices to depict the distribution of a continuous variable.
geom_histogram() cuts the continuous variable mapped to x into bins, and counts the number of values within each bin.
# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=median)) +
geom_density()
Density plots are basically smoothed histograms.
Density plots, unlike histograms, can be plotted separately by group by mapping a grouping variable to color.
# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=median, color=factor(month))) +
geom_density()
# A variable "median" represents the median sale price
ggplot(txhousing, aes(x=factor(year), y=median)) +
geom_boxplot()
Boxplots compactly visualize particular statistics of a distribution:
Boxplots are particularly useful for comparing distributions between groups.
geom_boxplot() will create boxplots of the variable mapped to y for each group defined by the values of the x variable.
Bar plots are often used to display frequencies of factor (categorical) variables.
geom_bar() by default produces a bar plot where the height of the bar represents counts of each x-value.
The color that fills the bars is not controlled by color, but instead by fill, which can only be mapped to a factor (categorical) variable. We can visualize a crosstabulation of variables by mapping one of them to fill in geom_bar():
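As an illustration, here is a sketch using the diamonds data set that ships with ggplot2. The choice of cut and clarity (both factors) is an assumption made for this example:

```r
library(ggplot2)

# counts of each cut, with each bar subdivided (filled) by clarity
p_bar <- ggplot(diamonds, aes(x=cut, fill=clarity)) +
  geom_bar()
p_bar
```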
Scatter plots depict the covariation between pairs of variables (typically both continuous).
geom_point() depicts covariation between variables mapped to x and y.
Scatter plots are among the most flexible graphs, as variables can be mapped to many aesthetics such as color, shape, size, and alpha.
ggplot(txhousing, aes(x=volume, y=sales,
color=listings, alpha=median, size=inventory)) +
geom_point()
Line graphs depict covariation between variables mapped to x and y with lines instead of points.
geom_line() will treat all data as belonging to one line unless a variable is mapped to one of the following aesthetics to group the data into separate lines:
group: lines will look the same
color: line colors will vary with mapped variable
linetype: line patterns will vary with mapped variable
Let’s first examine a line graph with no grouping:
As you can see, unless the data represent a single series, line graphs usually call for some grouping.
Using color or linetype in geom_line() will implicitly group the lines.
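For example, here is a sketch with txhousing (which records monthly sales for each Texas city) that maps city to color, so that one line is drawn per city. The choice of variables is an assumption for illustration:

```r
library(ggplot2)

# color=city both colors the lines and splits the data into one line per city
p_lines <- ggplot(txhousing, aes(x=date, y=sales, color=city)) +
  geom_line()
p_lines
```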
We will be using the Sitka data set again for this exercise.
A. Using 2 different geoms, compare the distribution of size between the two levels of treat. Use a different color for each distribution.
B. Use a bar plot to visualize the crosstabulation of Time and treat. Put Time on the x-axis.
C. Create a line graph of size over Time, with separate lines by tree and lines colored by treat.
D. Imagine you plan to submit your line graph of size over time by tree to a journal that does not print color graphs. How else can you distinguish between the 2 treatments on the graph?
The stat functions statistically transform data, usually as some form of summary, such as the mean, standard deviation, or a confidence interval.
Each stat function is associated with a default geom, so no geom is required for shapes to be rendered.
stat_summary(), perhaps the most useful of all stat functions, applies a summary function to the variable mapped to y for each value of the x variable. The default summary function is mean_se(), with associated geom geom_pointrange(), which will produce a plot of the mean (dot) and standard error (lines) of the variable mapped to y for each value of the x variable.
Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.
What makes stat_summary() so powerful is that you can use any function that accepts a vector as the summary function (e.g. mean(), var(), max(), etc.) and the geom can also be changed to adjust the shapes plotted.
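A sketch of both ideas with the Sitka data, assuming ggplot2 3.3.0 or later, where the summary function is passed through the `fun` argument (older releases used `fun.y`):

```r
library(ggplot2)
library(MASS)  # provides Sitka

# plot the maximum size at each Time, drawn as a line instead of pointranges
p_max <- ggplot(Sitka, aes(x=Time, y=size)) +
  stat_summary(fun=max, geom="line")
p_max
```

Swapping `max` for another vector function, or "line" for another geom name such as "point", changes the statistic or the shape without touching the rest of the plot.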
Scales define which aesthetic values are mapped to the data values.
Here is an example of a color scale that defines which colors are
mapped to values of treat
:
color | treat |
---|---|
red | ozone |
blue | control |
We can use a scale function to change the colors to “green” and
“orange”.
Or, we might have treat mapped to shape, and instead of squares and circles we want to use triangles and stars.
Scale functions have names with structure scale_aesthetic_suffix, where aesthetic is the name of an aesthetic like color or shape or x, and suffix is some descriptive word that defines the functionality of the scale.
Some example scale functions:
scale_color_manual(): define an arbitrary color scale by specifying each color manually
scale_color_hue(): define an evenly-spaced color scale by specifying a range of hues and the number of colors on the scale
scale_shape_manual(): define an arbitrary shape scale by specifying each shape manually
See the ggplot2 documentation page section on scales to see a full list of scale functions.
To control the aesthetic values to be used by the scale, supply a vector to (usually) the values argument of the scale function.
Here is a color scale that ggplot2 chooses for us:
We can use scale_colour_manual() to specify which colors we want to use in values=:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
Remember that x and y are aesthetics, and the two axes visualize the scale for these aesthetics.
Thus, we use scale functions to control the scaling of these axes.
When y is mapped to a continuous variable, we will typically use scale_y_continuous() to control its scaling (use scale_y_discrete() if y is mapped to a factor). Similar functions exist for the x aesthetic.
A description of some of the important arguments to scale_y_continuous():
breaks: at what data values along the range of the axis to place tick marks and labels
labels: what to label the tick marks
name: what to title the axis
Our current graph of carat vs price has y-axis tick marks at 0, 5000, 10000, and 15000:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
Let’s put tick marks at all grid lines along the y-axis using the breaks argument of scale_y_continuous():
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500))
Now let’s relabel the tick marks to reflect units of thousands (of dollars) using labels:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
labels=c(0,2.5,5,7.5,10,12.5,15,17.5))
And finally, we’ll retitle the y-axis using the name argument to reflect the units:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
labels=c(0,2.5,5,7.5,10,12.5,15,17.5),
name="price(thousands of dollars)")
Although we can use scale functions like scale_x_continuous() to control the limits and titles of the x-axis, we can also use the following shortcut functions:
lims(), xlim(), ylim(): set axis limits
xlab(), ylab(), ggtitle(), labs(): give labels (titles) to x-axis, y-axis, or graph; labs() can set labels for all aesthetics and title
To set axis limits, supply a vector of 2 numbers (inside c(), for example) to one of the limits functions:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
  xlim(c(1,3)) # carat ranges from about 0.2 to 5 in the data
We can use labs() to specify titles for the overall graph, the axes, and legends (guides).
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
labs(x="CARAT", y="PRICE", color="CUT", title="CARAT vs PRICE by CUT")
Guides (axes and legends) visualize a scale, displaying data values and their matching aesthetic values.
Most guides are displayed by default. The guides() function sets and removes guides for each scale.
Here we use guides() to remove the color scale legend:
# notice no legend on the right anymore
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
guides(color="none")
Coordinate systems define the planes on which objects are positioned in space on the plot. Most plots use Cartesian coordinate systems, as do all the plots in the seminar. Nevertheless, ggplot2 provides multiple coordinate systems, including polar, flipped Cartesian and map projections.
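As a brief illustration, the sketch below flips the Cartesian axes of the earlier boxplot of median sale price by year with coord_flip(). This is one of several coord_ functions and is shown only as an example:

```r
library(ggplot2)

# same boxplots as before, but drawn horizontally
p_flip <- ggplot(txhousing, aes(x=factor(year), y=median)) +
  geom_boxplot() +
  coord_flip()
p_flip
```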
Split plots into small multiples (panels) with the faceting functions, facet_wrap() and facet_grid(). The resulting graph shows how each plot varies along the faceting variable(s).
facet_wrap() wraps a ribbon of plots into a multirow panel of plots. Inside facet_wrap(), specify ~, then a list of splitting variables, separated by +. The number of rows and columns can be specified with arguments nrow and ncol.
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point() +
facet_wrap(~cut) # create a ribbon of plots using cut
facet_grid() allows direct specification of which variables are used to split the data/plots along the rows and columns. Put the row-splitting variable before ~, and the column-splitting variable after. The character . specifies no faceting along that dimension.
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point() +
  facet_grid(clarity~cut) # split using clarity along rows and cut along columns
We will again use the Sitka data set.
A. Recreate the line plot of Time vs size, with the color of the lines mapped to treat. Use scale_color_manual() to change the colors to “orange” and “purple”.
B. Use scale_x_continuous() to convert the x-axis from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis “time(months)”.
C. Split the scatter plot into a panel of scatter plots by tree. (Note: Make the graph area large; graph may take a few seconds to appear)
Themes control elements of the graph not related to the data. For example:
To modify these, we use the theme() function, which has a large number of arguments called theme elements, which control various non-data elements of the graph.
Some example theme() arguments and what aspect of the graph they control:
axis.line: lines forming x-axis and y-axis
axis.line.x: just the line for x-axis
legend.position: positioning of the legend on the graph
panel.background: the background of the graph
panel.border: the border around the graph
title: all titles on the graph
A full description of theme elements can be found on the ggplot2 documentation page.
theme() arguments
Most non-data elements of the graph can be categorized as either a line (e.g. axes, tick marks), a rectangle (e.g. the background), or text (e.g. axes titles, tick labels). Each of these categories has an associated element_ function to specify the parameters controlling its appearance:
element_line() - can specify color, size, linetype, etc.
element_rect() - can specify fill, color, size, etc.
element_text() - can specify family, face, size, color, angle, etc.
element_blank() - removes theme elements from graph
Inside theme() we control the properties of a theme element using the proper element_ function.
For example, the x- and y-axes are lines and are both controlled by theme() argument axis.line, so their visual properties, such as color and size (thickness), are specified as arguments to element_line():
ggplot(txhousing, aes(x=volume, y=sales, color=listings)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2)) # size in mm
However, the background of the graph, controlled by theme() argument panel.background, is a rectangle, so parameters like fill color and border color can be specified in element_rect().
ggplot(txhousing, aes(x=volume, y=sales, color=listings)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2),
panel.background=element_rect(fill="white", color="gray")) # color is the border color
With element_text() we can control properties such as the font family or face ("bold", "italic", "bold.italic") of text elements like title, which controls the titles of both axes.
ggplot(txhousing, aes(x=volume, y=sales, color=listings)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2),
panel.background=element_rect(fill="white", color="gray"),
title=element_text(family="serif", face="bold"))
Note: "sans", "serif", and "mono" are the only font family choices available for ggplot2 without downloading additional R packages. See this RPubs webpage for more information.
Finally, some theme() arguments do not use element_ functions to control their properties, like legend.position, which simply accepts values "none", "left", "right", "bottom", and "top".
ggplot(txhousing, aes(x=volume, y=sales, color=listings)) +
geom_point() +
theme(axis.line=element_line(color="black", size=2),
panel.background=element_rect(fill="white", color="gray"),
title=element_text(family="serif", face="bold"),
legend.position="bottom")
We could then use legend.text=element_text() in theme() to rotate the legend labels (not shown).
Remember to use the ggplot2 theme documentation page when using theme().
The ggplot2 package provides a few complete themes which make several changes to the overall background look of the graphic (see here for a full description).
Some examples:
theme_bw()
theme_light()
theme_dark()
theme_classic()
The themes usually adjust the color of the background and most of the lines that make up the non-data portion of the graph.
theme_classic() mimics the look of base R graphics:
theme_dark() makes a dramatic change to the look:
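Complete themes are added to a plot like any other component. Here is a sketch using the txhousing scatter plot from earlier (the object name `p_base` is illustrative):

```r
library(ggplot2)

p_base <- ggplot(txhousing, aes(x=volume, y=sales, color=listings)) +
  geom_point()

p_base + theme_classic()  # white background, axis lines, no grid lines
p_base + theme_dark()     # dark gray panel background
```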
ggsave() makes saving plots easy. The last plot displayed is saved by default, but we can also save a plot stored to an R object.
ggsave() attempts to guess the device to use to save the image from the file extension, so use a meaningful extension. Available devices include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf.
Other important arguments to ggsave():
width
height
units: units of width and height of plot file ("in", "cm" or "mm")
dpi: plot resolution in dots per inch
plot: name of object with stored plot
#save last displayed plot as pdf
ggsave("plot.pdf")
#if you're working with lots of graphs, you can store them in R objects
p <- ggplot(Sitka, aes(x=Time, y=size)) +
geom_point()
#You can then use the plot argument of ggsave() to specify which plot to save instead of the last
ggsave("myplot.png", plot=p)
For this exercise, we will use everything we have learned up to this point to create a graph for a new dataset, the Rabbit data set, a data set stored on the UCLA IDRE website, which we load with the following code:
The Rabbit data set describes an experiment where:
The data set contains 30 rows (5 rabbits measured 6 times) of the following 4 variables:
Rabbit: animal ID (“R1” through “R5”)
Treatment: Sedentary or Exercise
Meal: Control, Fruit, High-sodium
BPchange: change in blood pressure relative to the start of the experiment
Goal: create a meal-response curve for each rabbit under each treatment (Sedentary/Exercise), resulting in 10 curves (2 each for 5 rabbits)
Constraints: no color, but publication quality (imagine submitting to a journal that only accepts non-color figures)
We will build this graph in steps.
A. First, try creating a line graph with Meal on the x-axis and BPchange on the y-axis, with separate linetypes by Rabbit. Also, specify group=Rabbit.
Why does this graph look wrong?
B. Draw separate lines by Treatment. How can we accomplish this without color?
Some of the line patterns still look a little too similar to distinguish between rabbits.
C. Add a scatter plot where the shape of the points is mapped to Rabbit.
Next we will change the shapes used. See ?pch for a list of codes for shapes.
D. Use scale_shape_manual() to change the shapes of the points. Use the shapes corresponding to the codes (0, 3, 8, 16, 17).
Ok, the graph has all the data we want on it. Now, we’ll prepare it for publication.
E. Change the x-axis title to “Type of meal” and the y-axis title to “Change in blood pressure”.
Finally, we will change some of the theme() elements.
F. First, change the background from gray to white (or remove it) using panel.background in theme().
G. Next, change the color of the grid lines to “gray90”, a light gray, using panel.grid.
H. Use title to change the titles (axes and legend) to boldface.
I. Use strip.text to change the facet titles to boldface.
J. Save your last plot as mygraph.png.
ggplot2
birthwt {MASS}
The birthwt data set contains data regarding risk factors associated with low infant birth weight.
The data consist of 189 observations of 10 variables, all numeric:
low: 0/1 indicator of birth weight < 2.5 kg
age: mother’s age
lwt: mother’s weight in pounds
race: mother’s race (1=white, 2=black, 3=other)
smoke: 0/1 indicator of smoking during pregnancy
ptl: number of previous premature labors
ht: 0/1 indicator of history of hypertension
ui: 0/1 indicator of uterine irritability
ftv: number of physician visits during first trimester
bwt: birth weight
Let’s take a look at the structure of the birthwt data set first, to get an idea of how the variables are measured.
Run data(birthwt). Click its name in the Environment pane of RStudio.
For plotting, variables are either numeric variables, where the number value is a meaningful representation of a quantity, or factor variables, R’s representation of categorical variables.
In R, we can encode variables as “factors” with the factor() function.
Some aesthetics can be mapped to either numeric or categorical variables, and will be scaled differently.
These include:
x and y: continuous or discrete axes
color and fill: color gradient scales or evenly-spaced hue scales
Other aesthetics can only be mapped to categorical variables:
shape
linetype
And finally some aesthetics should only be mapped to numeric variables (a warning is issued if mapped to a categorical variable):
size
alpha
Let’s examine how aesthetics behave when mapped to different types of variables using the birthwt dataset, in which all of the variables are numeric initially.
When color is mapped to a numeric variable, a color gradient scale is used:
Note: even though we just used race as a numeric variable to demonstrate how ggplot handles it, we do not recommend treating categorical variables as numeric.
When color is instead mapped to a factor variable, a color scale of evenly spaced hues is used. We can convert a numeric variable to a factor inside of aes() with factor():
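A sketch of this conversion, assuming the MASS package is available for the birthwt data (the object name `p_race` is illustrative):

```r
library(ggplot2)
library(MASS)  # provides birthwt

# factor(race) is treated as categorical, so each race gets a distinct hue
p_race <- ggplot(birthwt, aes(x=age, y=bwt, color=factor(race))) +
  geom_point()
p_race
```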
An error results if we try to map shape to a numeric version of race, because shape only accepts factor variables.
Shape accepts the factor representation of race:
Finally, some aesthetics like alpha and size should really only be used with truly numeric variables, and a warning will be issued if the variable is a factor:
ggplot(birthwt, aes(x=age, y=bwt, size=factor(race))) +
geom_point()
## Warning: Using size for a discrete variable is not advised.
We recommend converting all categorical variables to factors prior to graphing for several reasons:
no need to call factor() each time we create a graph
ggplot2 will discourage mapping a factor variable to an inappropriate aesthetic like size
For example, we can convert the 0/1 smoke variable and the 1/2/3 race variable to factors smokef and racef, respectively, and label the values for each:
birthwt$smokef <- factor(birthwt$smoke, levels=0:1, labels=c("did not smoke", "smoked"))
birthwt$racef <- factor(birthwt$race, levels=1:3, labels=c("white", "black", "other"))
Now the labels will appear in the graph legend:
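For instance, here is a sketch that repeats the factor conversion on a copy of the data (`bw` is an illustrative name) and maps the new factors to color and shape:

```r
library(ggplot2)
library(MASS)  # provides birthwt

# work on a copy so the example is self-contained
bw <- birthwt
bw$smokef <- factor(bw$smoke, levels=0:1, labels=c("did not smoke", "smoked"))
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# legends show the factor labels rather than the numeric codes
p_labeled <- ggplot(bw, aes(x=age, y=bwt, color=racef, shape=smokef)) +
  geom_point()
p_labeled
```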
A. For the birthwt data, convert ht to a factor and label the values 0 and 1 “non-hyper” and “hyper”, respectively.
B. Create boxplots of bwt (birth weight), colored by ht, with separate panels by smoke.
When 2 data points have the same values plotted on the graphs, they will generally occupy the same position, causing one to obscure the other.
Here is an example where we map racef, a factor variable, to x and map age (in years) to y in a scatter plot:
There are 189 data points in the data set, but far fewer than 189 points visible in the graph, because many are completely overlapping.
To address this problem, we have a choice of “position adjustments” which can be specified to the position argument in a geom function.
For geom_point(), we usually use either:
position="jitter": add a little random noise to position
position="identity": overlay points (the default for geom_point())
By adding position="jitter" to the previous scatter plot, we can better see how many points there are at each age:
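A sketch of the two versions side by side, rebuilding the racef factor on a copy of the data so the example runs on its own (`bw`, `p_overlap` and `p_jitter` are illustrative names):

```r
library(ggplot2)
library(MASS)  # provides birthwt

bw <- birthwt
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

# default position="identity": identical (racef, age) pairs overlap exactly
p_overlap <- ggplot(bw, aes(x=racef, y=age)) +
  geom_point()

# position="jitter": random noise separates the overlapping points
p_jitter <- ggplot(bw, aes(x=racef, y=age)) +
  geom_point(position="jitter")
p_jitter
```

Because the noise is random, the jittered plot looks slightly different each time it is drawn.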
Remember that geom_bar() will plot the frequencies of the variable mapped to x as bars. If we map a second variable to fill, the bars will be colored by the second variable.
We can use the position argument in geom_bar() to control the placement of the bars with the same x value.
The following adjustments are generally used for geom_bar():
position="stack": stack elements vertically (the default for geom_bar())
position="dodge": move elements side-by-side (the default for geom_boxplot())
position="fill": stack elements vertically, standardize heights to 1
Each position adjustment emphasizes different quantities.
By default, geom_bar() uses position="stack", a compromise where we can see both the counts and proportions well:
If we instead want to emphasize counts, we use position="dodge", which places the bars side-by-side:
Proportions are emphasized with position="fill", where the bars are stacked and their heights are standardized:
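The three adjustments can be compared with a sketch like the following, which assumes we cross-tabulate race and smoking status from birthwt (the factors are rebuilt on a copy named `bw` so the example is self-contained):

```r
library(ggplot2)
library(MASS)  # provides birthwt

bw <- birthwt
bw$smokef <- factor(bw$smoke, levels=0:1, labels=c("did not smoke", "smoked"))
bw$racef <- factor(bw$race, levels=1:3, labels=c("white", "black", "other"))

p_stack <- ggplot(bw, aes(x=racef, fill=smokef)) +
  geom_bar()                  # default: position="stack"
p_dodge <- ggplot(bw, aes(x=racef, fill=smokef)) +
  geom_bar(position="dodge")  # side-by-side bars emphasize counts
p_fill <- ggplot(bw, aes(x=racef, fill=smokef)) +
  geom_bar(position="fill")   # heights standardized to 1 emphasize proportions
p_fill
```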
Error bars and confidence bands are both used to express ranges of statistics. To draw these, we’ll use geom_errorbar() and geom_ribbon(), respectively.
To use both geoms, the following aesthetics are required:
x: horizontal positioning of error bar or band
ymin: vertical position of lower error bar or band
ymax: vertical position of upper error bar or band
For example, the following code estimates the mean birth weight and 95% confidence interval for the mean for the three races in data set birthwt. The means and confidence limits are stored in a new data.frame called bwt_by_racef.
bwt_by_racef <- do.call(rbind,
tapply(birthwt$bwt, birthwt$racef, mean_cl_normal))
bwt_by_racef$racef <- row.names(bwt_by_racef)
names(bwt_by_racef) <- c("mean", "lower", "upper", "racef")
bwt_by_racef
## mean lower upper racef
## white 3102.719 2955.235 3250.202 white
## black 2719.692 2461.722 2977.662 black
## other 2805.284 2629.127 2981.441 other
Now we can plot the means by race with geom_point() and the confidence limits with geom_errorbar():
ggplot(bwt_by_racef, aes(x=racef, y=mean)) +
geom_point() +
geom_errorbar(aes(ymin=lower, ymax=upper))
Use width= to adjust the width of the error bars:
ggplot(bwt_by_racef, aes(x=racef, y=mean)) +
geom_point() +
geom_errorbar(aes(ymin=lower, ymax=upper), width=.1)
Confidence bands work similarly. We’ll need values for the maximum and minimum again for geom_ribbon().
This time, we’ll create a plot of predicted values with confidence bands from a regression of birthweight on age. First, we’ll run the model and add the predicted values and the confidence limits to the original data set for plotting:
# linear regression of birth weight on age
m <- lm(bwt ~ age, data=birthwt)
# get predicted values (fit) and confidence limits (lwr and upr)
preddata <- predict(m, interval="confidence")
# add predicted values to original data
birthwt <- cbind(birthwt, preddata)
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt smokef racef
## 85 0 19 182 2 0 0 non-hyper 1 0 2523 did not smoke black
## 86 0 33 155 3 0 0 non-hyper 0 3 2551 did not smoke other
## 87 0 20 105 1 1 0 non-hyper 0 1 2557 smoked white
## 88 0 21 108 1 1 0 non-hyper 1 2 2594 smoked white
## 89 0 18 107 1 1 0 non-hyper 1 0 2600 smoked white
## 91 0 21 124 3 0 0 non-hyper 0 0 2622 did not smoke other
## fit lwr upr
## 85 2891.909 2757.969 3025.849
## 86 3065.925 2846.442 3285.408
## 87 2904.339 2781.794 3026.883
## 88 2916.768 2803.295 3030.242
## 89 2879.479 2732.358 3026.600
## 91 2916.768 2803.295 3030.242
Now we’ll use geom_line() to show the best fit line, and geom_ribbon() to show the confidence bands:
Yikes! That confidence band is too dark. Use alpha to lighten the bands by making them more transparent. Remember, because we are setting the entire band to be a constant transparency, we will specify alpha outside of aes().
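A self-contained sketch of the fitted line with a lightened band. It refits the regression and stores the predictions in a separate data frame (`plotdata`, an illustrative name); the value alpha=.5 is an arbitrary choice:

```r
library(ggplot2)
library(MASS)  # provides birthwt

# refit the regression and attach fit, lwr and upr columns to a copy of the data
m <- lm(bwt ~ age, data=birthwt)
plotdata <- cbind(birthwt, predict(m, interval="confidence"))

p_band <- ggplot(plotdata, aes(x=age, y=fit)) +
  geom_line() +
  geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=.5)  # alpha set, not mapped
p_band
```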
At times we need to add notes or annotations directly to the graph that are not represented by any variables in the graph data set. Examples:
To add non-data related elements, use the annotate() function.
To use annotate(), the first argument is the name of a geom (for example "text" or "rect"). Subsequent arguments are positioning aesthetics such as x= and y= and any additional aesthetics needed for that particular geom.
Let’s imagine that we want to annotate the data point in the far upper right corner of this graph we have seen before:
Suppose we want to label the outlier as a possible data error. To add annotation text, we will use geom_text() in annotate(). We will need to specify x= and y= positions for the text, and the contents of the text in label=.
We see that the outlier lies at x=45, y=5000. To place the text a little to the left of the point, we will use x=42 and y=5000. Proper positioning will take some experimentation. We specify the text to be displayed with label="Data error?".
ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
geom_point() +
annotate("text", x=42, y=5000, label="Data error?") # notice first argument is "text", not geom_text
As another example, let’s highlight a portion of the graph that features birthweights within 1 standard deviation of the mean weight. We will create a rectangle using geom_rect() that spans the x-axis for its full width from xmin=13 to xmax=46, and the y-axis from ymin=2215 to ymax=3673 (mean-sd, mean+sd). We will set alpha=.2 to make the box transparent.
ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
geom_point() +
annotate("rect", xmin=13, xmax=46, ymin=2215, ymax=3673, alpha=.2) # notice first argument is "rect", not geom_rect
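The y limits of the rectangle can also be computed from the data instead of being typed in by hand. This is a minimal sketch, assuming the birthwt data set from MASS; the names bwt_mean, bwt_sd, ymin_band, ymax_band, and p_rect are illustrative, and the color mapping is left out to keep the example self-contained.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set

# Mean and standard deviation of birthweight
bwt_mean <- mean(birthwt$bwt)
bwt_sd <- sd(birthwt$bwt)

# Limits of the band: one standard deviation either side of the mean
ymin_band <- bwt_mean - bwt_sd
ymax_band <- bwt_mean + bwt_sd

p_rect <- ggplot(birthwt, aes(x = age, y = bwt)) +
  geom_point() +
  annotate("rect", xmin = 13, xmax = 46,
           ymin = ymin_band, ymax = ymax_band, alpha = .2)
p_rect
```

Computing the limits this way keeps the rectangle correct if the data set changes.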
We can specify a color in R in several ways:
color name strings
hex color codes
RGB values with rgb()
We have already used string names like “white” and “green” to specify colors.
You can issue colors() in R to see a full list of available color names. See here for a chart of these colors. Here we show the first 30 names out of 657:
head(colors(), n=30)
## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"
## [9] "aquamarine1" "aquamarine2" "aquamarine3" "aquamarine4"
## [13] "azure" "azure1" "azure2" "azure3"
## [17] "azure4" "beige" "bisque" "bisque1"
## [21] "bisque2" "bisque3" "bisque4" "black"
## [25] "blanchedalmond" "blue" "blue1" "blue2"
## [29] "blue3" "blue4"
We can also use hex color codes. These hex codes usually consist of # followed by 6 numbers/letters (each a hexadecimal digit ranging from 0 to F), where the first two digits represent redness, the second two greenness, and the last two blueness.
For example, the hex code #009900 would represent a green shade, while hex code #FF00EE would represent a purple shade.
Tools like this can help you identify the hex code for a particular color.
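To check how a hex code breaks down into its red, green, and blue parts, base R's col2rgb() can decode it. This is a small sketch, assuming the birthwt data set from MASS; the names green_channels and p_hex are illustrative.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set

# col2rgb() decodes a color into red, green, blue values on a 0-255 scale;
# hex 99 is decimal 153, so only the green row is non-zero here
green_channels <- col2rgb("#009900")
green_channels

# A hex code can be used anywhere a color name is accepted
p_hex <- ggplot(birthwt, aes(x = age, y = bwt)) +
  geom_point(color = "#009900")
p_hex
```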
Finally, we can use RGB (red, green, blue) values to specify a color.
Pass three numbers between 0 and 1 to the rgb() function, and it will return the hex code for that color. Let’s try a purple:
# rgb() returns a hex code
rgb(.75, 0, 1)
## [1] "#BF00FF"
ggplot(birthwt, aes(x=age, y=bwt)) +
geom_point(color=rgb(.75, 0, 1))
We have already used scale_color_manual() to alter color scales, but note that you can use hex codes and rgb() to specify the colors:
ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
geom_point() +
scale_color_manual(values=c(rgb(.5,.5,.2), "#FF1035", "blue"))
Part of the challenge of making effective and attractive color graphs is choosing a color palette that serves both purposes of representing variation and catching the eye.
When you map a variable to color or fill, the ggplot2 package will use the variable’s type (e.g. numeric, factor, ordinal) to choose a color scale.
If you map a numeric variable to color you will usually get a color gradient based on a single hue:
A color gradient is a natural analog to a numeric variable. In the above graph, as the color becomes a lighter blue, the race value becomes higher (assuming the value has numeric meaning).
On the other hand, if we map a factor variable to color, we get a set of distinct hues evenly spaced around the color wheel:
The categories of a factor variable are considered unordered, so using completely different hues to represent them makes sense.
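The two default behaviors can be compared side by side with a short sketch. It assumes the birthwt data set from MASS, where race is stored as a number; wrapping it in factor() is used here as a stand-in for a factor version of the variable, and the names p_numeric and p_factor are illustrative.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set

# race is numeric (1, 2, 3), so ggplot2 chooses a continuous gradient
p_numeric <- ggplot(birthwt, aes(x = age, y = bwt, color = race)) +
  geom_point()

# factor(race) is categorical, so ggplot2 chooses distinct hues
p_factor <- ggplot(birthwt, aes(x = age, y = bwt, color = factor(race))) +
  geom_point()

p_numeric
p_factor
```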
Those were just the default color scales that ggplot2 chooses for you, by guessing the appropriate scale from the variable’s type. There are many ways to form color scales using ggplot2 so you have lots of options when choosing a palette.
Here are some color scale functions used to form color scales (there is an analogous scale function for the fill aesthetic for each of the below):
scale_color_brewer: use a ColorBrewer sequential, diverging, or qualitative scale (see next slide)
scale_color_gradient: create a low-high color gradient scale (default for numeric variables)
scale_color_hue: create a scale of evenly spaced hues around the color wheel (default for factors)
scale_color_manual: manually create color scale
With scale_color_gradient, we can define the colors at the two ends of the gradient with the arguments low and high. The default gradient runs from a blueish-black at the low end to a light-blue at the high end. We can redefine the scale to go from a very light green (honeydew) to a dark green:
ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
geom_point() +
scale_color_gradient(low="honeydew", high="darkgreen")
Because a color gradient varies only a single hue, it may be easier to use rgb(), where we can specify the intensity of each of the red, green, and blue channels:
ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
geom_point() +
scale_color_gradient(low=rgb(.1, .2, .1), high=rgb(.1, 1, .1))
With scale_color_hue(), we define a color scale by specifying a range of hues to use, and then evenly spaced hues will be selected from this range. Here are the relevant arguments to scale_color_hue():
h: range of hues to use (on color wheel), should be a vector of 2 numbers between 0 and 360
h.start: starting hue (first color of palette, then next colors are chosen to be equally spaced apart)
direction: direction to travel around color wheel, either 1=clockwise or -1=counter-clockwise
Varying any of the 3 above will alter the color palette.
First, we change the range of hues with h to be much smaller:
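The code for this plot is not shown here, so this is a minimal sketch of one way to narrow the range. The range c(0, 90) is an arbitrary choice for illustration, factor(race) stands in for a factor version of race, and the name p_hue_range is made up.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set

# Restrict the hues to a quarter of the color wheel (0 to 90 degrees)
p_hue_range <- ggplot(birthwt, aes(x = age, y = bwt, color = factor(race))) +
  geom_point() +
  scale_color_hue(h = c(0, 90))
p_hue_range
```

With such a narrow range the three colors end up close to one another, which can make the groups harder to tell apart.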
We can also use the original range, but change the starting hue with h.start to get a completely different set:
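Again the original code is not shown, so this is only a sketch. It keeps the default hue range and shifts the starting hue; the value 120 is an arbitrary choice for illustration, factor(race) stands in for a factor version of race, and the name p_hue_start is made up.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set

# Keep the default range of hues but start 120 degrees around the wheel
p_hue_start <- ggplot(birthwt, aes(x = age, y = bwt, color = factor(race))) +
  geom_point() +
  scale_color_hue(h.start = 120)
p_hue_start
```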
ColorBrewer is a webpage resource designed by Cynthia Brewer that lists many color schemes designed for different purposes:
The ColorBrewer palettes are not only designed to be highly functional, they are also very attractive, with colors that complement each other well.
The ColorBrewer palettes have been integrated into R, and are available in ggplot2 through scale_color_brewer() and scale_fill_brewer().
Arguments to scale_color_brewer() and scale_fill_brewer():
type: one of "seq", "div", or "qual"
palette: the name of the palette (e.g. “YlGnBu”) or a number indicating the index of the palette on ColorBrewer
direction: 1=default order, -1=reverse order
We’ll use a sequential palette first, although it should not be used with race since race does not progress from low to high values:
ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
geom_point() +
scale_color_brewer(type="seq", palette="RdPu")
Instead, we should use a qualitative palette with race:
ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
geom_point() +
scale_color_brewer(type="qual", palette=8) # requests the 8th qualitative palette
For more in-depth information, read ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, creator of the ggplot2 package:
This section of the seminar describing the grammar summarizes bits and pieces of chapter 3.
Additional packages that enhance the functionality and features of the ggplot2
hsb
For the final set of exercises, we will be using a data set stored on the UCLA IDRE website, which we load with the following code:
This data set contains demographic and academic data for 200 high school students. We will be using the following variables:
read, write, math, science: academic test scores
female: gender, factor with levels “female” and “male”
honors: enrollment in honors program, factor with 2 levels “enrolled” and “not enrolled”
ses: socioeconomic status, factor with 3 levels, “low”, “middle”, “high”
schtyp: school type, factor with 2 levels, “private” and “public”
Use the following code to load the hsb data set:
hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
Create a scatter plot of math (x) vs read (y), with different shapes by prog. Color all of the points red.
Find the outlier at math=35, read=63. Add annotation text next to this outlier that says “error?”
Create a bar graph that displays the number of students that fall into groups made up of the following 4 variables: female, prog, schtyp, ses.
From such a graph we can see, for example, how many female students in the academic program who attend public school and are of high socioeconomic status are in the data set.
Try to recreate this graph:
Note that the background has been entirely removed and that the axis and legend titles are red and in “mono” font.
#A
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
geom_line() +
scale_color_manual(values=c("orange", "purple"))
#B
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, linetype=Animal)) + # MASS::Rabbit columns are Dose and Animal
geom_line() +
facet_wrap(~Treatment)
#C
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, linetype=Animal, shape=Animal)) + # MASS::Rabbit columns are Dose and Animal
geom_line() +
facet_wrap(~Treatment) +
geom_point()
#D
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, shape=Animal, linetype=Animal)) + # MASS::Rabbit columns are Dose and Animal
geom_line() +
facet_wrap(~Treatment) +
geom_point() +
scale_shape_manual(values=c(0, 3, 8, 16, 17))
#E
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, shape=Animal, linetype=Animal)) + # MASS::Rabbit columns are Dose and Animal
geom_point() +
geom_line() +
facet_wrap(~Treatment) +
scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
labs(x="Dose(mcg)", y="Change in blood pressure")
#F,G,H,I
ggplot(Rabbit, aes(x=Dose, y=BPchange, group=Animal, shape=Animal, linetype=Animal)) + # MASS::Rabbit columns are Dose and Animal
geom_point() +
geom_line() +
facet_wrap(~Treatment) +
scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
labs(x="Dose(mcg)", y="Change in blood pressure") +
theme(panel.background = element_rect(fill="white"),
panel.grid=element_line(color="gray90"),
title=element_text(face="bold"),
strip.text=element_text(face="bold"))