install.packages("medicaldata")
Lesson 3: Exploring geom_ functions and customizing ggplots
1 Prepare the R environment for this lesson
For this lesson we’ll need four packages:
- medicaldata (new package)
- palmerpenguins
- ggplot2
- dplyr
First, we need to install the medicaldata package, using the install.packages()
function.
Now we use the library()
function to load all of these packages.
library(medicaldata)
library(palmerpenguins)
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
2 The medicaldata Package and the covid_testing Dataset
In this lesson we’ll be using a dataset from the medicaldata package. The medicaldata package contains several real medical datasets, intended for use in teaching R. We can browse the list of available datasets on the medicaldata package website. We’ll be using the covid_testing
dataset for this lesson. To get a feel for this dataset, take a look at the “covid_testing” documentation pages on the medicaldata package website.
We can also access a quick description of the covid_testing data frame through R’s help docs.
help("covid_testing")
3 Data Visualization with the ggplot2 Package
ggplot2 is one of the core tidyverse packages. It is a design system allowing us to assemble complex figures from simple components that we “layer” on top of each other. The documentation for this package is excellent. We can access it either through the RStudio interface, or through the ggplot2 website
3.1 Refresher
In the first lesson, we used ggplot2 to generate a scatterplot from the penguin dataset. With this code block, we tell ggplot2 to use values from the ‘bill_length_mm’ column as x-axis coordinates, and values from the ‘flipper_length_mm’ column as y-axis coordinates. Remember, the aes()
or “aesthetic” function tells ggplot2 how to map the columns in our data to visual components of the graph.
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = flipper_length_mm)) +
geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
We can streamline this code by using the R pipe (|>
) and by specifying data
and mapping
as positional arguments. Refer to the ggplot()
documentation for the order of its arguments.
|>
penguins ggplot(aes(x = bill_length_mm,
y = flipper_length_mm)) +
geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Note, when we’re constructing figures with ggplot2, we connect the ggplot2 functions using the ‘+’ sign, instead of the R pipe. This is because ggplot2 was written before there were any pipes in R. If we accidentally use an R pipe instead of a ‘+’ sign, we’ll get an error
|>
penguins ggplot(aes(x = bill_length_mm,
y = flipper_length_mm)) |>
geom_point()
Error in `geom_point()`:
! `mapping` must be created by `aes()`.
ℹ Did you use `%>%` or `|>` instead of `+`?
3.2 Aesthetic Mapping
Let’s modify the code to map the ‘species’ column to the ‘color’ aesthetic.
|>
penguins ggplot(aes(x = bill_length_mm,
y = flipper_length_mm,
color = species)) +
geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
What happens if we map the ‘body_mass_g’ column to color instead?
|>
penguins ggplot(aes(x = bill_length_mm,
y = flipper_length_mm,
color = body_mass_g)) +
geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Notice that we see a color gradient. When we used species to color the points, we ended up with three discrete colors. This is because the values in the body_mass_g and species columns have different data types, which affect how R interacts with them.
A value’s data type determines how it’s stored in memory and what type of operations we can perform with it. R supports several basic variable types:
- double - A number or numeric type that has decimals. This is the default type for any number in R. We can use doubles with all arithmetic operators.
- integer - A number without any decimal places. If we want an integer, we need to create one explicitly using a function like
as.integer()
. Integers also work with all arithmetic operators. - character - Text data, enclose by quotation marks, that can contain letters, numbers, and special characters. These are called “strings” in other programming languages. Characters are the most flexible type in terms of what they can store, but the don’t work with arithmetic operators.
- factor - R’s representation of categorical data. These are similar to characters in that they can contain letters, numbers, and special characters. This also means it’s generally easy to make factors out of character data. When we create a factor variable or column , we define all possible categories, or “levels”, that variable/column can assume. There’s additional complexity to factors that we’ll get into later.
- logical - A binary value that can only be
TRUE
orFALSE
. We’ve seen logical values when we used the relational operators (e.g. ==, >, !=). We can tell these apart from characters, becauseTRUE
andFALSE
aren’t surrounded by any quotation marks. Lastly, we can use these with arithmetic operators.TRUE
is treated like 1, andFALSE
is treated like 0
We can check the type of any object in R using the class()
function.
Species is a factor type while body mass is a numeric type. Factors are discrete variables, so ggplot2 maps them to discrete aesthetics. Numeric types are continuous variables, so ggplot2 cannot map them to discrete values.
We can use also use expressions as aesthetic mappings.
|>
penguins ggplot(aes(x = bill_length_mm,
y = flipper_length_mm,
color = bill_length_mm > 45)) +
geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
3.3 Mapping vs Setting Aesthetics
Above we used the aes()
function to map columns in our penguin data to the aesthetics in our plots. Let’s see what happens when we move the aesthetic arguments outside of the aes()
function.
|>
penguins ggplot(aes(x = bill_length_mm,
y = flipper_length_mm)) +
geom_point(color = "darkorange")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
This is called setting an aesthetic, because we’re not mapping any of our data to the aesthetic. When we set aesthetic arguments, we provide them with scalar values. The specific values will depend upon the aesthetic (i.e. some are numbers, some require known values).
Look at the geom_point()
documentation, and scroll down to the Aesthetics section. This lists all the different aesthetics this function supports. Try adding some of these aesthetics to the above plot, both in and out of the aes()
function. Hint: we can print a list of R color constants with the colors()
function.
|>
penguins ggplot(aes(x = bill_length_mm,
y = flipper_length_mm)) +
geom_point(alpha = 0.5,
color = "#900C3F")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
This vignette from the ggplot package describes all of the aesthetics and the various values they can accept. We can find this vignette on the ggplot2 CRAN page, or by using the vignette()
function.
vignette("ggplot2-specs")
Tutorials demonstrating the functionality of a package are called vignettes. They’re another way to familiarize ourselves with a new package. These are downloaded to your system when you install a package and are accessible through the package website (on CRAN or Bioconductor). Not every package comes with a vignette, since they are not a required by CRAN or Bioconductor.
4 Data Exploration with geom_
Functions
So far we’ve seen two examples of geom_
functions: geom_point()
and geom_smooth()
. These functions control how our data are represented, or painted on the ggplot2 canvas. The ggplot2 package includes many different geom_
functions that draw different shapes and extract different types of summary information from our data. Here are a few of the functions you’ll use most commonly for exploratory data analyses.
4.1 Bar chart
The geom_bar()
function generates bar charts, which are useful for plotting the distributions of categorical data. Let’s use this to get a breakdown of the number of COVID-19 tests by their result (positive, negative, invalid).
|>
covid_testing ggplot(aes(x = result)) +
geom_bar()
When we use the geom_bar()
function, we only provide it with an aesthetic mapping for one of the axes (x or y). It automatically calculates the values for the other axis, based on the distribution of our data across the axis we specified.
4.2 Histogram & Freqpoly
The geom_histogram()
function calculates discrete density bins, given just an x-coordinate mapping. Its useful for visualizing the distributions of continuous numerical variables. Let’s use the geom_histogram()
function to plot the distribution of COVID-19 tests over the first 100 days of the covid_testing
dataset.
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note, ggplot2 printed a status message with the histogram, pointing us toward two arguments for the geom_histogram()
function: bins
, and binwidth
.
The bins
argument controls the number of bins it groups our data into. The above plot sets the bins
argument to a value of 30 (the default). Try adjusting it below. What do you think will happen to the graph if we increase the bins
value? What if we decrease it?
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_histogram(bins = 60)
Alternatively, we can use the binwidth
argument to set the size of the interval on the x-axis that defines a bin. Whatever value we enter for the binwidth
uses the same units as the x-axis (days, in our case). What do you think will happen to the histogram as we increase the binwidth
? And if we decrease it?
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_histogram(binwidth = 1)
The geom_freqpoly()
function generates the same discrete distribution graph as geom_histogram()
, except it represents the data as lines, instead of bars.
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_freqpoly()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can confirm these two functions work the same way by plotting them together. Try that here:
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_histogram() +
geom_freqpoly()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The line representation produced by geom_freqpoly()
can make it easier to plot multiple distributions together.
|>
covid_testing ggplot(aes(x = pan_day, color = result)) +
geom_freqpoly()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For comparison, here’s the same data plotted as a histogram:
|>
covid_testing ggplot(aes(x = pan_day, fill = result)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
4.3 Density
The geom_density()
function generates a smoothed density estimate for the distribution of a continuous variable. This is similar to the histogram and freqpoly geoms, except those geoms generate binned representations of the distribution, while geom_density()
generates a continuous representation of the distribution.
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_density()
We can think of this like a probability distribution estimate for all the possible values our x-axis variable can assume (the area under the curve should sum approximately to 1).
after_stat()
function
We can use the after_stat()
function to access the derived stats that the geom_density()
function is calculating behind the scenes. Here we use this function to change the y-value to the number of COVID-19 tests, instead of the density value.
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_density(aes(y = after_stat(count)))
We can refer to the the “Computed variables” section of a geom function’s help documentation to find the list of summary variables we can access with after_stat()
.
Here we see that the geom_density()
curve is effectively a smoothed version of the geom_freqpoly()
line.
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_freqpoly(binwidth = 1) +
geom_density(aes(y = after_stat(count)))
Why do you think these curves diverge when we increase the ‘binwidth’ for geom_freqpoly()
?
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_freqpoly(binwidth = 3) +
geom_density(aes(y = after_stat(count)))
4.4 Boxplot & Violin plot
The geom_boxplot()
function offers another way to summarize the distributions of continuous data. It calculates and plots the median and quartiles of a continuous variable mapped to the x or y aesthetic.
|>
covid_testing ggplot(aes(y = age)) +
geom_boxplot()
However, what makes boxplots truly useful is their ability to represent the relationship numerical and categorical variables. Here we break up the age distribution we plotted above, by COVID-19 test result.
|>
covid_testing ggplot(aes(x = result, y = age)) +
geom_boxplot()
Looking at these boxplots, we see a subtle indication that patients with positive COVID-19 tests tend to skew a little older than those with negative tests (we’ll need to actually apply a statistical test if we want to say anything more conclusive).
Violin plots are a combination of the geom_boxplot()
and geom_density()
functions. The geom_violing()
function plots a sideways, mirrored representation of a numeric variable’s distribution.
|>
covid_testing ggplot(aes(x = result, y = age)) +
geom_violin()
Violin plots can reveal more subtle differences in relationships between numerical and categorical variables than we can see with a boxplot, like biomodality in the underlying distributions.
Here we’ll graph the geom_violin()
and geom_boxplot()
plots together. How are the boxplots and violin plots similar? How are they different?
|>
covid_testing ggplot(aes(x = result, y = age)) +
geom_violin() +
geom_boxplot(alpha = 0.5)
Here we used the alpha
aesthetic to control the transparency of the boxplot geom. Try changing the alpha
value to see how it affects the graph.
4.5 Horizontal and Vertical Lines
The geom_hline()
and geom_vline()
functions plot horizontal and vertical lines, respectively. They’re useful for marking cutoffs and regions of interest in our plots.
Here we use geom_hline()
to add a line to our age distribution boxplots that marks the division between patients under and over 5 years of age.
|>
covid_testing ggplot(aes(x = result, y = age)) +
geom_violin() +
geom_hline(yintercept = 5)
In this case, we set the value for the yintercept
aesthetic, rather than mapping it to a column in our data.
4.6 Coordinate Transformations
The coord_cartesian()
function allows us to manually set the x- and y-axis ranges of our graphs. This allows us to focus on particular regions of interest.
|>
covid_testing ggplot(aes(x = pan_day)) +
geom_histogram(binwidth = 1) +
coord_cartesian(xlim = c(10, NA),
expand = FALSE)
The coord_flip()
function allows us to swap the x- and y-axes.
|>
covid_testing ggplot(aes(x = result, y = age)) +
geom_boxplot() +
coord_flip()
There are many more coord_
functions that give us fine control coordinate systems we use to create our ggplots.
4.7 Setting Color Scales
Thus far we’ve just used the default color schemes ggplot2 provides. They’re fine for prototyping our analyses, but we’ll probably want to change these to something that looks better if we want to publish these figures. Now that you’ve gained some experience with the look and feel of the default settings for ggplot2 figures, you’ll start to notice them when you read papers.
To specify which colors get mapped to our data, we need to prepare a vector containing all of the colors we want to use.
<- c("grey","dodgerblue","orangered") colors_for_covid_result
In our work so far, we’ve largely worked with tabular data stored in a data frame. A data frame is an example of a data structure. These are constructs in a programming language that are specially designed to store and organize data. We’ll interact with several types of data structures over the course of these lessons:
- Vectors: Store ordered sequences of values, all of which must have the same type. We create vectors using the
c()
function (“combine”). The individual values in a vector are called “vector elements”. - Data frame: Rectangular data structure where columns are vectors (i.e. all values in a column need to have the same type) that all have the same length.
- Matrix: Rectangular like a data frame, except every single values in a matrix must have the same type. We often store metadata (e.g. gene IDs, sample labels) as the row/column names, using the
rownames()
andcolnames()
functions. Matrices are used for performing matrix math and are often much faster for computations than data frames.
With our color vector in hand, we use the scale_fill_manual()
function to match these colors to the three COVID-19 test results in our data. Try changing the result mapping from fill to color and see what happens.
|>
covid_testing filter(age < 100) |>
ggplot(aes(x = result, y = age, fill = result)) +
geom_boxplot() +
scale_fill_manual(values = colors_for_covid_result)
The colors are assigned to the results in alphabetical order.
Instead of manually specifying the colors, we can also use existing collections of colors (“palettes”) that come from various packages. On of the palettes packaged with ggplot2 is Color Brewer. We need to enter the name for one of Color Brewer’s palettes. We can view the options with the RColorBrewer::display.brewer.all()
function.
|>
covid_testing filter(age < 100) |>
ggplot(aes(x = result, y = age, fill = result)) +
geom_boxplot() +
scale_fill_brewer(palette = "Dark2")
Another way to view the Color Brewer palettes is through the Color Brewer website. There are many options on this site that can help you find a good palette, including a toggle to limit your options to colorblind safe options.
4.8 Facets
When we’re plotting complex data, we often run into cases where it would be helpful to break up a figure into smaller sub-figures, each displaying a portion of the input data. We do this with the facet_grid()
and facet_wrap()
functions. In this example, we use facets to compare the distribution of ages by COVID-19 test results between the tests collected in the drive-thru or in the clinic.
|>
covid_testing filter(age < 100) |>
ggplot(aes(x = result, y = age, fill = result)) +
geom_boxplot() +
scale_fill_manual(values = colors_for_covid_result) +
facet_wrap(facets = vars(drive_thru_ind))
The vars()
function is like a version of the aes()
function that’s specific to facet_wrap()
and facet_grid()
. They tell ggplot2 to map a variable from our dataset to the facets. You can think of the facet functions like graphical versions of the group_by()
function we saw previously. They group data and aesthetics according to another variable in our data, and generate separate graph panels for the data in each group.
Facets give use the ability to represent complex datasets with multiple variables in a more digestible manner. Here, we use facet_grid()
to look for differences in age distributions by COVID-19 test results, across patient groups, and by whether or not the test samples as collected at a drive-thru.
|>
covid_testing filter(age < 100) |>
ggplot(aes(x = result, y = age, fill = result)) +
geom_boxplot() +
scale_fill_manual(values = colors_for_covid_result) +
facet_grid(rows = vars(demo_group),
cols = vars(drive_thru_ind), scales = "free_y")
4.9 Saving ggplots
So far we’ve only created figures inside these markdown (Quarto) documents. We can render this file into an HTML or PDF output and then extract the images from the output file, but that’s a lot of extra work if we just want one figure. For ggplot2 figures, we can use the ggsave()
function to save figures to files on disk. The help doc for ggsave()
contains information about how we can control the dimensions, scale, resolution, and format of the saved figure. Here, we’ll save one of our faceted boxplot figures to the IMAGES/
directory as a PNG.
<-
covid_figure |>
covid_testing filter(age < 100) |>
# The "0" / "1" labels for the drive-thru status are not particularly clear.
# Here we create a new column with informative labels for drive-thru status.
# We'll use this new column for faceting, so the facet labels are clearer
# for potential readers.
mutate(
drive_thru_status =
case_match(drive_thru_ind,
0 ~ "In clinic",
1 ~ "Drive-thru")
|>
) ggplot(aes(x = result, y = age, fill = result)) +
geom_boxplot() +
scale_fill_manual(values = colors_for_covid_result) +
facet_grid(rows = vars(demo_group),
cols = vars(drive_thru_status),
scales = "free_y")
ggsave(filename = "IMAGES/COVID-test-result-age-distribution_By-patient-group-and-drive-thru-status_Boxplots.png",
plot = covid_figure,
units = "in",
width = 6,
height = 6)
ggsave()
automatically saves the last plot we generated. Alternatively, we can save our ggplot graph to a variable, and pass that variable to the plot
argument of the ggsave()
function.
5 ggplot2 cheatsheet
These are great reference materials for each of the main Tidyverse packages, including ggplot2. For day-to-day programming, it’s helpful to print some of these out and keep them at your workstation. These offer a great way to quickly look up the info you need to run a function.
6 ggplot2 extensions
There are many packages that extend the functionality of ggplot2, either by adding additional functionality to ggplot2, or implementing new graphing functions based on ggplot2’s conventions and grammar. You can browse a gallery of these extensions here.
7 R session information
Here we report the version number for R and the package versions we used to perform the analyses in this document.
sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.4 ggplot2_3.5.1 palmerpenguins_0.1.1
[4] medicaldata_0.2.0
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.2 knitr_1.47 rlang_1.1.3
[5] xfun_0.44 generics_0.1.3 jsonlite_1.8.8 labeling_0.4.3
[9] glue_1.7.0 colorspace_2.1-0 htmltools_0.5.8.1 scales_1.3.0
[13] fansi_1.0.6 rmarkdown_2.27 grid_4.4.0 evaluate_0.23
[17] munsell_0.5.1 tibble_3.2.1 fastmap_1.2.0 yaml_2.3.8
[21] lifecycle_1.0.4 compiler_4.4.0 RColorBrewer_1.1-3 pkgconfig_2.0.3
[25] htmlwidgets_1.6.4 rstudioapi_0.16.0 farver_2.1.2 digest_0.6.35
[29] R6_2.5.1 tidyselect_1.2.1 utf8_1.2.4 pillar_1.9.0
[33] magrittr_2.0.3 withr_3.0.0 tools_4.4.0 gtable_0.3.5