Lesson 1: Visualizing data with ggplot2

Published

October 19, 2023

1 Prepare the R environment for this lesson

When we start R, we only have access to the base R packages. In order to use any additional packages, we need to load them into memory with the library() function.

For this lesson we’ll need two packages: ggplot2 and palmerpenguins. Run the following code chunk to load these two packages. Note, the second library command is empty. Modify the command so it loads the palmerpenguins package.

library(ggplot2)
library(palmerpenguins)

It’s good practice to start R code, whether stored in R scripts or Quarto/R Markdown documents, by loading the packages we’ll be using in the body of the code. It gives readers, including future versions of ourselves, a consistent place to check all of the packages required to run our code.

2 A note about functions

Functions are how we do everything in R. This includes simple procedures, like calculating the sum of three numbers, and more complex operations, like performing a linear regression and plotting the resulting regression line against the input data. Above, we saw an example of the general convention for writing functions in R: function_name(arguments). Generally speaking, a function’s arguments provide it with input and can control how it runs.
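As a small illustration of this convention, the sketch below calls sum() with three numbers, then shows how its na.rm argument changes the way it runs. The variable names are made up for this example.

```r
# Call sum() with three numbers as its arguments
total <- sum(1, 2, 3)
total

# Arguments can also control how a function runs. With the default
# na.rm = FALSE, a single missing value (NA) makes the result NA.
with_na <- sum(1, NA, 3)
with_na

# Setting na.rm = TRUE tells sum() to drop missing values before adding
without_na <- sum(1, NA, 3, na.rm = TRUE)
without_na
```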

3 R Documentation

When we download and install a package from CRAN, we’re also downloading documentation pages that describe how to run each function in the package.

In RStudio, we can access the documentation through the Help tab, located in the lower right pane of the RStudio window. Let’s use the sum() function as an example. To access the documentation pages for the sum() function:

Click the Help tab in the lower right pane of the RStudio window.
Click the search box in the upper right corner of the Help tab.
Type “sum” in the search bar. Note, we don’t include the parentheses after the function name when we’re searching for documentation.
Press “Enter”

The documentation page for the sum() function should appear in the Help pane. It contains several sections describing what the function does, the names of its arguments and how they work, and the output of the function. The very end of the documentation page also contains example code that we can run to get a better feel for how the function works (these are very useful).

We can also access these same help pages directly through the R console by using the help() function, or the “?” operator.

help("sum")

?sum

When learning about a new function, it’s often most helpful to pay special attention to the usage, arguments, and examples sections of the function’s documentation page.
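Two base R helpers give quick access to those sections from the console. This is only a sketch of one possible workflow: args() prints a function’s argument list, and example() runs the code from the examples section of its help page.

```r
# Print the argument list of sum() without opening the full help page
args(sum)

# Run the code from the Examples section of the sum() help page
example("sum")
```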

Keep Using the Help Tab

As we proceed through these materials, don’t hesitate to use the help feature whenever we encounter a new function.

4 Plotting penguins

4.1 Look at the Palmer Penguins data table

After loading the palmerpenguins package, we now have access to a table of penguin data. We can view the first 10 lines of this table directly in the R console:

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Each row in this table contains data from a single penguin measurement during the study period. Looking at the columns, we see all of the different information the researchers recorded about each penguin. We can also see that there are observations with missing information, represented by NA values. In R terminology, a table like this is called a data frame. It is a natural way of representing rectangular, spreadsheet-style data. The penguin data are technically stored as a tibble, which is a special type of data frame used by the tidyverse.
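To check these points for ourselves, we can ask R about the structure of the table. The sketch below uses only base R functions, and assumes the palmerpenguins package is installed.

```r
library(palmerpenguins)

# A tibble is still a data frame: "data.frame" is one of its classes
class(penguins)
is.data.frame(penguins)

# Number of rows (measurements) and columns (variables)
dim(penguins)

# Column names, which we will need when mapping variables to a plot
names(penguins)
```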

The palmerpenguins package also contains documentation for the penguin data frame. Look up “penguins” using the Help pane, or one of the other help functions. The documentation describes the contents of each column in the data frame.

In addition to viewing the data frame in the R console, we can use the View() function to quickly examine its contents in a scrollable, spreadsheet-style window.

View(penguins)

Inside the view window, we can search the data frame for specific values, or re-order it by values in each of the columns. While the View() function is useful for smaller datasets, it is not generally suitable for data frames with more than a few thousand rows.

The “V” in the View() function is capitalized

If we try to use the View() function with a lower case “v”, we’ll generally get an error:

view(penguins)

Error in view(penguins): could not find function "view"

Lastly, we can use the summary() function to get summary statistics about the data in each column of our data frame.

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

These functions help us build intuition about new datasets and spot potential problems for downstream analyses.
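Missing values are one such potential problem. As a rough sketch, we can count the NA values in every column with two base R functions; the name na_counts is just a label chosen for this example.

```r
library(palmerpenguins)

# is.na() marks each missing cell as TRUE; colSums() counts them per column
na_counts <- colSums(is.na(penguins))
na_counts
```

The counts should agree with the NA's rows in the summary() output above.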

4.2 Visualize the penguin data with ggplot2

The penguins data frame contains measurements of each penguin’s body mass and flipper length. Here we will use the ggplot2 package to visualize these data and examine the relationship between these two anatomical measurements.

Below we’ll review the code we need to create this graph:

Figure: Scatterplot comparing flipper length and body mass across three penguin species

4.2.1 How to create a basic scatterplot

We start with the ggplot() function, which creates our ggplot and specifies the data we want to use. Think of this like setting up a blank canvas before we start painting. Note, we’re giving the ggplot() function access to the penguin data frame with the data argument.

ggplot(data = penguins)

Next, we use the mapping argument to tell the ggplot() function how we want it to use the penguin data. In ggplot2, we always use the aes() function to define how to map the variables (columns) in our data to the visual properties (aesthetics) of the shapes we want to paint. In this example we want to plot body mass on the x-axis and flipper length on the y-axis.

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm))

Note how mapping the x/y aesthetics has affected the graph. The axes are now labeled and the ggplot() function has automatically set their ranges based on the range of body masses and flipper lengths in the penguin data frame.

Now that we’ve prepared our canvas, we can start adding layers of paint. We paint shapes in ggplot2 using geom functions. Since we want to create a scatterplot, we’ll use the geom_point() function to add a layer of points to our plot. There is a whole family of different geom_ functions that we can use to plot different types of graphs (e.g. scatter plots, line graphs, bar graphs).

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
    geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Note that we combine the geom_point() function with the ggplot() function using the + sign. We can treat this like we’re “adding” a layer of points on top of the canvas we created with ggplot().

Also, take note of the warning we get from the ggplot2 code. Recall that we saw some columns with NA values when we were looking at the contents of the penguins data frame. The warning is telling us that two of the rows in the penguins data frame have NA values in the body_mass_g or flipper_length_mm columns. The geom_point() function has no way of placing a point with a missing x- or y-coordinate, so ggplot2 automatically excludes rows with missing values before plotting. This filter only applies to columns (variables) in our data frame that we’re mapping to aesthetics. ggplot2 will still use rows that have NA values only in columns we’re not plotting, such as sex.
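We can confirm the number in the warning by counting the affected rows ourselves. The sketch below also uses the na.rm argument of geom_point(), which drops rows with missing values without printing a warning. The names n_missing and p_quiet are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Count rows missing either of the two variables we map to x and y
n_missing <- sum(is.na(penguins$body_mass_g) |
                 is.na(penguins$flipper_length_mm))
n_missing

# na.rm = TRUE asks geom_point() to drop those rows silently
p_quiet <- ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
    geom_point(na.rm = TRUE)
p_quiet
```

Silencing the warning is a judgment call: it keeps output tidy, but it also hides the reminder that some rows were left out of the plot.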

So the general procedure we follow for plotting data with ggplot2 is to prepare our canvas with the ggplot() function, use the aes() function to tell ggplot2 how we want to use our data to paint the canvas, and apply paint to our canvas with a geom_ function (geom_point() in the example above).
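The same three steps carry over to other kinds of graphs. As a minimal sketch, swapping in geom_bar() and mapping only species to the x-axis gives a bar graph of the number of rows recorded for each species. The name p_bar is chosen just for this example.

```r
library(ggplot2)
library(palmerpenguins)

# 1. Prepare the canvas with ggplot()
# 2. Map a variable to an aesthetic with aes()
# 3. Add a layer of paint with a geom_ function
p_bar <- ggplot(data = penguins,
       mapping = aes(x = species)) +
    geom_bar()
p_bar
```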

4.2.2 Decorating our plots with more data

Now that we have a basic scatterplot in hand, we can try to gain deeper insights into our data by incorporating more information into our plot. Our current scatterplot shows a positive relationship between flipper length and body mass. We know the penguin data frame contains measurements from three different penguin species. Let’s map species to the color aesthetic to see how this trend looks across all three species.

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm,
                     color = species)) +
    geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Not only did ggplot2 automatically assign a different color to each penguin species, it also added a legend to the graph. That means this same block of code can work for a penguin data frame that contains data from one, two, or ten species of penguin.

To get a clearer picture of the trend between flipper length and body mass, we’re now going to add a regression line to the plot. We’ll do this with the geom_smooth() function. We’ll use the method = "lm" argument to specify we want to plot the line we get from fitting our data with a linear model.

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm,
                     color = species)) +
    geom_point() +
    geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

It looks like geom_smooth() fit separate linear models for each penguin species and plotted the lines using the same colors as the points. While this is a useful feature (we’ll make use of this below), we want to fit all of the data with a single linear model. When we define aesthetic mappings in the ggplot() function, they apply to all geom_ functions in that plot. By mapping species to color, we told geom_smooth() we want it to color the smoothed line by species, which means it needs to fit a separate line for each species. To solve this problem we can specify an aesthetic mapping within a specific geom_ function. Let’s try moving the species-to-color mapping into geom_point(), and then into geom_smooth(). Which option still colors the points by species, but fits all of the data with a single linear model?

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
    geom_point(mapping = aes(color = species)) +
    geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
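To see what these trend lines represent, we can fit the same kind of model directly with the lm() function from base R. This is a rough sketch rather than a description of what geom_smooth() does internally; the names fit_all and fits_by_species are made up for this example.

```r
library(palmerpenguins)

# One linear model fit to all penguins (the single trend line).
# lm() drops rows with missing values by default.
fit_all <- lm(flipper_length_mm ~ body_mass_g, data = penguins)
coef(fit_all)

# One linear model per species (the three separate lines we saw earlier)
fits_by_species <- lapply(
    split(penguins, penguins$species),
    function(d) lm(flipper_length_mm ~ body_mass_g, data = d)
)
sapply(fits_by_species, coef)
```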

Now that we’ve sorted out the trend line, let’s return to how we’re plotting the points. The colors improve the readability of this graph, but could pose a problem if this figure is rendered in black and white, or if a reader has certain types of colorblindness. One solution is to map the points from different penguin species to different shapes. Try adding an aesthetic mapping of shape to species.

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
    geom_point(mapping = aes(color = species, shape = species)) +
    geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Note that the legend automatically updates to reflect the new shape mapping.

There are many more aesthetics beyond shape and color. The documentation for geom_ functions contains a special section listing the aesthetic mappings supported by that geom. Look up the help page for “geom_point” and find the list of supported aesthetics.
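As a small sketch of two more aesthetics from that list, the code below uses size and alpha (transparency). Here they are set to fixed values outside of aes(), so they apply to every point instead of varying with a column in the data. The specific values are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_aes <- ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
    # color and shape are mapped to a column inside aes();
    # size and alpha are fixed values set outside aes()
    geom_point(mapping = aes(color = species, shape = species),
               size = 2, alpha = 0.7, na.rm = TRUE)
p_aes
```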

We’re almost done recreating the figure we saw at the beginning of this section. The axis labels in our version of the figure are not as clean, and we’re still missing the plot title and subtitle. Collectively, these attributes are known as the graph’s “labels”, and we can modify them using the labs() function.

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
    geom_point(mapping = aes(color = species, shape = species)) +
    geom_smooth(method = "lm") +
    labs(title = "Relationship between flipper length and body mass",
         subtitle = "Across three species of penguins studied at the Palmer Research Station",
         x = "Body mass (g)",
         y = "Flipper length (mm)")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Lastly, we’ll change the color scheme to a different palette from the default. The ggplot2 package includes a few additional color palettes we can apply to our plot using scale_color_ functions. For now, we’ll select a palette from the ColorBrewer collection that is good for representing qualitative data and is more colorblind safe than the default palette.

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
    geom_point(mapping = aes(color = species, shape = species)) +
    geom_smooth(method = "lm") +
    labs(title = "Relationship between flipper length and body mass",
         subtitle = "Across three species of penguins studied at the Palmer Research Station",
         x = "Body mass (g)",
         y = "Flipper length (mm)") +
    scale_color_brewer(palette = "Dark2")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

For now, we’re not going to worry about the specifics of the ColorBrewer collection of palettes (though feel free to examine the documentation for the scale_color_brewer() function). But we can see that we can further adjust how ggplot2 displays our data using the scale_ family of functions.
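For anyone curious, the RColorBrewer package stores a small table describing each ColorBrewer palette. The sketch below assumes RColorBrewer is installed (it appears among the packages loaded by ggplot2 in the session information at the end of this lesson).

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per ColorBrewer palette
brewer.pal.info["Dark2", ]

# Qualitative palettes that are flagged as colorblind friendly
subset(brewer.pal.info, category == "qual" & colorblind)
```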

With that last tweak, we’ve successfully recreated the figure we saw above.

4.2.3 Creating our own scatterplot

Diagram of penguin bill indicating which dimensions correspond to depth and length — Artwork by @allison_horst

Now we’re going to put everything we just learned into practice. Using the penguins data and the ggplot2 functions, we’re going to create a scatterplot of bill depth vs bill length. We can view the contents of the penguins data frame directly, or consult the documentation to find the names of the columns that contain this information. Again, we’ll add a regression line that we fit using all of the data.

ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
    geom_point() +
    geom_smooth(method = 'lm') +
    labs(title = "Bill length vs bill depth in three penguin species",
         x = "Bill depth (mm)",
         y = "Bill length (mm)")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Looking at the graph we created, what does it suggest about a relationship between bill depth and bill length? How might species affect this? How does the graph change if we fit separate regression lines within each species?

ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
    geom_point(mapping = aes(shape = species)) +
    geom_smooth(method = 'lm') +
    labs(title = "Bill length vs bill depth in three penguin species",
         x = "Bill depth (mm)",
         y = "Bill length (mm)") +
    scale_color_brewer(palette = "Dark2")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The difference between these two graphs is an example of Simpson’s paradox. Briefly, this is a phenomenon where a trend between two variables at the population level disappears, emerges, or reverses when we divide the population into groups. We were able to observe these changes by experimenting with the way we visualized the penguin data.
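We can back up what the graphs show with a quick calculation. The sketch below computes the correlation between bill depth and bill length for all penguins together, then separately within each species, using base R only. The names overall_cor and cor_by_species are made up for this example.

```r
library(palmerpenguins)

# Correlation across all penguins, skipping rows with missing values
overall_cor <- cor(penguins$bill_depth_mm, penguins$bill_length_mm,
                   use = "complete.obs")
overall_cor

# The same correlation computed within each species
cor_by_species <- sapply(
    split(penguins, penguins$species),
    function(d) cor(d$bill_depth_mm, d$bill_length_mm,
                    use = "complete.obs")
)
cor_by_species
```

The overall correlation is negative, while the correlation within each species is positive, which matches the reversal we saw between the two graphs.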

We’re only just scratching the surface of what we can do with ggplot2, but we’ve used it to rapidly visualize the penguin data and generate some interesting observations. In the next lesson, we will explore some of these observations further by working directly with the data to calculate summary statistics.

5 R session information

Just as it’s good practice for us to list all of the packages we load at the top of our code, it’s equally important to report the version number for R and the package versions we used at the end of our code. This will make it much easier to reproduce our results in the future. By running the sessionInfo() function at the end of our code, we capture any changes to the environment that happened in the body of our code (e.g., some functions silently load additional packages when we run them).

sessionInfo()

R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] palmerpenguins_0.1.1 ggplot2_3.5.1       

loaded via a namespace (and not attached):
 [1] Matrix_1.7-0       gtable_0.3.5       jsonlite_1.8.8     dplyr_1.1.4       
 [5] compiler_4.4.0     tidyselect_1.2.1   splines_4.4.0      scales_1.3.0      
 [9] yaml_2.3.8         fastmap_1.2.0      lattice_0.22-6     R6_2.5.1          
[13] labeling_0.4.3     generics_0.1.3     knitr_1.47         htmlwidgets_1.6.4 
[17] tibble_3.2.1       munsell_0.5.1      pillar_1.9.0       RColorBrewer_1.1-3
[21] rlang_1.1.3        utf8_1.2.4         xfun_0.44          cli_3.6.2         
[25] withr_3.0.0        magrittr_2.0.3     mgcv_1.9-1         digest_0.6.35     
[29] grid_4.4.0         rstudioapi_0.16.0  lifecycle_1.0.4    nlme_3.1-164      
[33] vctrs_0.6.5        evaluate_0.23      glue_1.7.0         farver_2.1.2      
[37] fansi_1.0.6        colorspace_2.1-0   rmarkdown_2.27     tools_4.4.0       
[41] pkgconfig_2.0.3    htmltools_0.5.8.1
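When we only need a version number or two, for example to mention in a report, base R offers a lighter option than the full sessionInfo() output. This is a small sketch; the variable names are made up for this example.

```r
# Version of R itself
R.version.string

# Versions of the two packages this lesson attaches
ggplot2_version <- packageVersion("ggplot2")
penguins_version <- packageVersion("palmerpenguins")
ggplot2_version
penguins_version
```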