library(ggplot2)
library(palmerpenguins)
Lesson 1: Visualizing data with ggplot2
1 Prepare the R environment for this lesson
When we start R, we only have access to the base R packages. In order to use any additional packages, we need to load them into memory with the library()
function.
For this lesson we’ll need two packages: ggplot2 and palmerpenguins. Run the following code chunk to load these two packages. Note, the second library command is empty. Modify the command so it loads the palmerpenguins package.
It’s good practice to start R code, stored in R scripts or quarto/Rmarkdown documents, by loading the packages we’ll be using in the body of the code. It gives readers, including future versions of ourselves, a consistent place to check all of the packages required to run our code.
2 A note about functions
Functions are how we do everything in R. This includes simple procedures, like calculating the sum of three numbers, and more complex operations, like performing a linear regression and plotting the resulting regression line against the input data. Above, we saw an example of the general convention for writing functions in R: function_name(arguments)
. Generally speaking, a function’s arguments provide it with input and can control how it runs.
3 R Documentation
When we download and install a package from CRAN, we’re also downloading documentation pages that describe how to run each function in the package.
In RStudio, we can access the documentation through the Help tab, located in the lower right pane of the RStudio window. Let’s use the sum()
function as an example. To access the documentation pages for the sum()
function:
- Click the Help tab in the lower right pane of the RStudio window.
- Click the search box in the upper right corner of the Help tab.
- Type “sum” in the search bar. Note, we don’t include the parentheses after the function name when we’re searching for documentation.
- Press “Enter”
The documentation page for the sum()
function should appear in the Help pane. It contains several sections describing what the function does, the names of its arguments and how they work, and the output of the function. The very end of the documentation page also contains example code that we can run to get a better feel for how the function works (these are very useful).
We can also access these same help pages directly through the R console by using the help()
function, or the “?” operator.
help("sum")
?sum
When learning about a new function, it’s often most helpful to pay special attention to the usage, arguments, and examples sections of the function’s documentation page.
As we proceed through these materials, don’t hesitate to use the help feature whenever we encounter a new function.
4 Plotting penguins
4.1 Look at the Palmer Penguins data table
After loading the palmerpenguins package, we now have access to table of penguin data. We can view the first 10 lines of this table directly in the R console:
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Each row in this table contains data from a single penguin measurement during the study period. Looking at the columns, we see all of the different information the researchers recorded about each penguin. We can also see that there are observations with missing information, represented by NA
values. In R terminology, this is called a data frame. It is a natural way of representing rectangular, spreadsheet-style data. The penguin data are technically stored as a tibble, which is a special type of data frame used by the tidyverse.
The palmerpenguins package also contains documentation for the penguin data frame. Look up “penguins” using the Help pane, or one of the other help functions. The documentation describes the contents of each column in the data frame.
In addition to viewing the data frame in the R console, we can use the View()
function to quickly examine its contents in a scrollable, spreadsheet-style window.
View(penguins)
Inside the view window, we can search the data frame for specific values, or re-order it by values in each of the columns. While the View()
function is useful for smaller datasets, it is not generally suitable for data frames with more than a few thousand rows.
Lastly, we can use the summary()
function to get summary statistics about the data in each column of our data frame.
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
These functions help us building intuition about new datasets and spot potential problems for downstream analyses.
4.2 Visualize the penguin data with ggplot2
The penguins
data frame contains measurements of each penguin’s body mass and flipper length. Here we will use the ggplot2 package to visualize these data and examine the relationship between these two anatomical measurements.
Below we’ll review the code we need to create this graph:
4.2.1 How to create a basic scatterplot
We start with the ggplot()
function, which creates our ggplot and specifies the data we want to use. Think of this like setting up a blank canvas before we start painting. Note, we’re giving the ggplot()
function access to the penguin data frame with the data
argument.
ggplot(data = penguins)
Next, we use the mapping
argument to tell the ggplot()
function how we want it to use the penguin data. In ggplot2, we always use the aes()
function to define how to map the variables (columns) in our data to the visual properties (aesthetics) of the shapes we want to paint. In this example we want to plot body mass with the x-axis and flipper length with the y-axis.
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm))
Note how mapping the x/y aesthetics has affected the graph. The axes are now labeled and the ggplot()
function has automatically set their ranges based on the range of body masses and flipper lengths in the penguin data frame.
Now that we’ve prepared our canvas, we can start adding layers of paint. We paint shapes in ggplot2 using geom functions. Since we want to create a scatterplot, we’ll use the geom_point()
function to add a layer of points to our plot. There is a whole family of different geom_
functions that we can use to plot different types of graphs (e.g. scatter plots, line graphs, bar graphs).
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Note that we combine the geom_point()
function with the ggplot()
function using the +
sign. We can treat this like we’re “adding” a layer of points on top of the canvas we created with ggplot()
.
Also, take note of the warning we get from the ggplot2 code. Recall that we saw some columns with NA
values when we were looking at the contents of the penguins data frame. The warning is telling us that two of the entries in the penguins data frame have NA
values in either the body_mass_g
or flipper_length_mm
columns. The geom_point()
function has no way of placing a point with a missing x- or y-coordinate, so ggplot2 automatically excludes rows with missing values before plotting. This filter only applies to columns (variables) in our data frame that we’re mapping to aesthetics. ggplot2 will still use rows in the penguins data frame that have NA
values in the bill_length_mm
and sex
columns, since we’re not currently using those to plot anything.
So the general procedure we follow for plotting data with ggplot2 is to prepare our canvas with the ggplot()
function, use the aes()
function to tell ggplot2 how we want to use our data to paint the canvas, and apply paint to our canvas with a geom_
function (geom_point in the example above).
4.2.2 Decorating our plots with more data
Now that we have a basic scatterplot in hand, we can try to gain deeper insights into our data by incorporating more information into our plot. Our current scatterplot shows a positive relationship between flipper length and body mass. We know the penguin data frame contains measurements from three different penguin species. Let’s map species to the color aesthetic to see how this trend looks across all three species.
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm,
color = species)) +
geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Not only did ggplot2 automatically assign a different color to each penguin species, it also added a legend to the graph. That means this same block of code can work for a penguin data frame that contains data from one, two, or ten species of penguin.
To get a clearer picture of the trend between flipper length and body mass, we’re no going to add a regression line to the plot. We’ll do this with the geom_smooth()
function. We’ll use the method = "lm"
argument to specify we want to plot the line we get from fitting our data with a linear model.
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm,
color = species)) +
geom_point() +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
It looks like geom_smooth()
fit separate linear models for each penguin species and plotted the lines using the same colors as the points. While this is a useful feature (we’ll make use of this below), we want to fit all of the data with a single linear model. When we define aesthetic mappings in the ggplot()
function, they apply to all geom_
functions in that plot. By mapping species to color, we told geom_smooth()
we want it to color the smoothed line by species, which means it needs to fit a separate line for each species. To solve this problem we can specify an aesthetic mapping within a specific geom_
function. Let’s try moving the species to color mapping to geom_point()
and geom_smooth()
. Which option still colors the points by species, but fits all of the data with a single linear model?
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(mapping = aes(color = species)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Now that we’ve sorted out the trend line, let’s return to how we’re plotting the points. The colors improve the readability of this graph, but could pose a problem if this figure is rendered in black and white, or if a reader has certain types of colorblindness. One solution is to map the points from different penguin species to different shapes. Try adding an aesthetic mapping of shape
to species.
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Note that the legend automatically updates to reflect the new shape mapping.
The are many more aesthetics beyond shape and color. The documentation for geom_
functions contains a special section listing the aesthetic mappings supported by that geom. Look up the help page for “geom_point” and find the list of supported aesthetics.
We’re almost done recreating the figure we saw at the beginning of this section. The axis labels our version of the figure are not as clean, and we’re still missing the plot title and subtitle. Collectively, these attributes are known as the graph’s “labels”, and we can modify them using the labs()
function.
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm") +
labs(title = "Relationship between flipper length and body mass",
subtitle = "Across three species of penguins studied at the Palmer Research Station",
x = "Body mass (g)",
y = "Flipper length (mm)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Lastly, we’ll change the color scheme to a different palette from the default. The ggplot2 package includes a few additional color palettes we can apply to our plot using scale_color_
functions. For now, we’ll select a palette from the ColorBrewer collection that is good for representing qualitative data and is more colorblind safe than the default palette.
ggplot(data = penguins,
mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm") +
labs(title = "Relationship between flipper length and body mass",
subtitle = "Across three species of penguins studied at the Palmer Research Station",
x = "Body mass (g)",
y = "Flipper length (mm)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
For now, we’re not going to worry about the specifics of the ColorBrewer collection of palettes (though feel free to examine the documentation for the scale_color_brewer()
function). But we can see that we can further adjust how ggplot2 displays our data using the scale_
family of functions.
With that last tweak, we’ve successfully recreated the figure we saw above.
4.2.3 Creating our own scatterplot
Now we’re going to put everything we just learned into practice. Using the penguins
data and the ggplot2 functions, we’re going to create a scatterplot of bill depth vs bill length. We can view the contents of penguins
data frame directly, or consult the documentation to find the names of the columns that contain this information. Again, we’ll add a regression line that we fit using all of the data.
ggplot(data = penguins,
mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = "Bill length vs bill depth in three penguin species",
x = "Bill depth (mm)",
y = "Bill length (mm)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Looking at the graph we created, what does it suggest about a relationship between bill depth and bill length? How might species affect this? How does the graph change if we fit separate regression lines within each species?
ggplot(data = penguins,
mapping = aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
geom_point(mapping = aes(shape = species)) +
geom_smooth(method = 'lm') +
labs(title = "Bill length vs bill depth in three penguin species",
x = "Bill depth (mm)",
y = "Bill length (mm)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
The difference between these two graphs is an example of Simpson’s paradox. Briefly, this is a phenomenon where a trend between two variables at the population level disappears, emerges, or reverses when we divide the population into groups. We were able to observe these changes by experimenting with the way we visualized the penguin data.
We’re only just scratching the surface of what we can do with ggplot2, but we’ve used it to rapidly visualize the penguin data and generate some interesting observations. In the next lesson, we will explore some of these observations further by working directly with the data to calculate summary statistics.
5 R session information
Just as it’s good practice for us to list all of the packages we load at the top of our code, it’s equally important to report the version number for R and the package versions we used at the end of our code. This will make it much easier to reproduce our results in the future. By running the sessionInfo()
function at the end of our code, we capture any changes to the environment that happened in the body of our code (e.g some function silently load additional packages when you run them).
sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] palmerpenguins_0.1.1 ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] Matrix_1.7-0 gtable_0.3.5 jsonlite_1.8.8 dplyr_1.1.4
[5] compiler_4.4.0 tidyselect_1.2.1 splines_4.4.0 scales_1.3.0
[9] yaml_2.3.8 fastmap_1.2.0 lattice_0.22-6 R6_2.5.1
[13] labeling_0.4.3 generics_0.1.3 knitr_1.47 htmlwidgets_1.6.4
[17] tibble_3.2.1 munsell_0.5.1 pillar_1.9.0 RColorBrewer_1.1-3
[21] rlang_1.1.3 utf8_1.2.4 xfun_0.44 cli_3.6.2
[25] withr_3.0.0 magrittr_2.0.3 mgcv_1.9-1 digest_0.6.35
[29] grid_4.4.0 rstudioapi_0.16.0 lifecycle_1.0.4 nlme_3.1-164
[33] vctrs_0.6.5 evaluate_0.23 glue_1.7.0 farver_2.1.2
[37] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.27 tools_4.4.0
[41] pkgconfig_2.0.3 htmltools_0.5.8.1