Data Visualisation with R

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University


Grammar of Graphics

How do we plot?


Catalogue of plot types (not exhaustive)

How do we plot it all?

One function

One complete plot type


The number of plots that can be drawn


The number of plot functions

Summary of R graphics


ggplot2 R package

  • The ggplot2 R package is part of the tidyverse suite of R packages
  • ggplot2 is widely used by the scientific community and even by news outlets (e.g. Financial Times and BBC)

The grammar of graphics

  • In linguistics, we combine a finite number of words to construct a vast number of sentences under a shared understanding of the grammar.

Wilkinson (2005) introduced “the grammar of graphics” as a paradigm to describe plots by combining a finite number of components.

  • Wickham (2010) interpreted the grammar of graphics into the ggplot2 R package (as part of his PhD project).
  • The grammar of graphics paradigm has also been implemented in other programming languages such as Python (e.g. plotnine), Julia (e.g. Gadfly.jl, VegaLite.jl), and JavaScript (e.g. Vega-Lite).

Basic structure of ggplot


  • data as data.frame
  • a set of aesthetic mappings between variables in the data and visual properties
  • at least one layer which describes what to render
  • the coordinate system (explained later)
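
The four parts above can be sketched as follows. This is a minimal illustration rather than the code from the original slide: it uses the built-in faithful data, and the choice of variables and of geom_point() is an assumption made only for the example.

```r
library(ggplot2)

# 1. data as a data.frame, 2. aesthetic mappings
p <- ggplot(data = faithful, mapping = aes(x = waiting, y = eruptions)) +
  # 3. a layer describing what to render
  geom_point() +
  # 4. the coordinate system (Cartesian is the default, so this line is optional)
  coord_cartesian()

p
```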

Visualising distributions

  • geom_histogram()
  • geom_density()
  • stat_ecdf()
  • stat_qq()
  • geom_boxplot()
  • geom_violin()
  • geom_jitter()

Case study Faithful eruptions

  • faithful is a built-in data set in R
  • It contains the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

Layers for univariate data

Available geom layers in ggplot2

Available stat layers in ggplot2

A layer in ggplot


  • A layer in ggplot has five main components:
    • geom - the geometric object to use to display the data
    • stat - statistical transformation to use on the data
    • data to be displayed in this layer (usually inherited)
    • mapping - aesthetic mappings (usually inherited)
    • position - position adjustment
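
As a sketch of how the five components fit together, the geom_point() shortcut can be written out as an explicit layer() call. The faithful data and the scatterplot are chosen here only for illustration.

```r
library(ggplot2)

# geom_point() is shorthand for a layer() call with the components spelled out
p_short <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

p_long <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  layer(
    geom = "point",         # geometric object
    stat = "identity",      # no statistical transformation
    data = NULL,            # inherit the data from ggplot()
    mapping = NULL,         # inherit the aesthetic mappings from ggplot()
    position = "identity",  # no position adjustment
    params = list(na.rm = FALSE)
  )

# Compare the data computed for the layer in each plot
all.equal(layer_data(p_short), layer_data(p_long))
```

In practice the geom_*() and stat_*() shortcuts are preferred; layer() is mainly useful for seeing what a shortcut fills in for you.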

Layer data

Deconstructing histogram

Accessing layer data

  • y = after_stat(density) is equivalent to the old syntaxes
    • y = stat(density) and
    • y = ..density..

Density is calculated as the count divided by the product of the total number of observations and the bin width.
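
The density calculation can be checked against the layer data. The sketch below assumes a histogram of faithful$waiting with 20 bins (the bin count is arbitrary) and recomputes the density by hand from the count, xmin and xmax columns returned by layer_data().

```r
library(ggplot2)

p <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20)

# layer_data() returns the data computed for a layer (here the binned counts)
d <- layer_data(p, 1)

n     <- sum(d$count)      # total number of observations
width <- d$xmax - d$xmin   # bin width
manual_density <- d$count / (n * width)

# Compare the hand-computed density with the one computed by stat_bin()
all.equal(d$density, manual_density)
```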

Visualising multivariate relationships

  • geom_point()
  • geom_smooth()
  • geom_bin2d()
  • geom_hex()
  • geom_line()

Case study Palmer penguins

Mapping aesthetics to data

vignette("ggplot2-specs")

  • Aesthetic arguments for each layer are found in documentation (e.g. ?geom_point).

Common aesthetics include:


x and y

alpha

color

fill

size

Data variables:


species island bill_len bill_dep flipper_len body_mass sex year


Example: a scatterplot with geom_point()

Make the following target plot:

  • Notice that legends are automatically made for aesthetics
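
A minimal sketch of mapping an aesthetic to a data variable. It is not the target plot from the slide: the mpg data bundled with ggplot2 stands in for the penguins data so that the example runs on any R version.

```r
library(ggplot2)

# color is mapped to the data variable drv, so a legend is created automatically
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.5)  # alpha is set as a fixed attribute, so it gets no legend

p
```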

Aesthetic specification for points

shape

stroke vs size

  • The default shape is “circle”.
  • fill applies only to the “filled” shapes (21 to 25), where stroke controls the width of the border.
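
A short sketch of the point aesthetics for a filled shape. The colours and sizes are arbitrary values chosen for illustration.

```r
library(ggplot2)

p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(
    shape = "circle filled",  # same as shape = 21
    color = "black",          # border colour
    fill = "dodgerblue",      # interior colour
    size = 3,                 # overall point size
    stroke = 1.5              # border width
  )

p
```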

Aesthetic specifications for lines

color

linetype

linewidth

lineend

linejoin

Aesthetic or Attribute?

  • When you supply a value within aes(), ggplot2 assumes that it is a data variable.
  • The string "dodgerblue" gets converted into a variable with one level, which is then colored using ggplot’s default color palette.

When your input is an attribute

Don’t put attributes inside aes()!
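
The difference can be sketched by comparing the colours that ggplot2 actually assigns in each case. The faithful data is used here only for illustration.

```r
library(ggplot2)

# Inside aes(): "dodgerblue" is treated as a one-level data variable,
# so the points get a colour from the default palette (and a legend)
p_mapped <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(aes(color = "dodgerblue"))

# Outside aes(): "dodgerblue" is an attribute, so the points are that colour
p_attr <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(color = "dodgerblue")

unique(layer_data(p_mapped)$colour)
unique(layer_data(p_attr)$colour)
```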

Make this target plot:

“As-is” operator for attributes

Use the I() operator to mean “as-is” in an aesthetic mapping.
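
A minimal sketch of the “as-is” operator, assuming ggplot2 version 3.5.0 or later (earlier versions do not treat I() specially in aes()).

```r
library(ggplot2)

# I() marks the value "as-is", so no colour scale is applied to it
p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(aes(color = I("dodgerblue")))

p
```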

Attributes are for layers

Attributes should be defined in specific layers.

  • Notice how the points don’t have the “dodgerblue” color.
  • Layers inherit data and the mapping from ggplot() but not attributes.
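
A sketch of an attribute set in one layer only. The linear smoother is an assumption made to keep the example quiet and deterministic; any second layer would show the same behaviour.

```r
library(ggplot2)

p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  # no colour attribute here, so the points keep the default colour
  geom_point() +
  # the attribute applies to this layer only
  geom_smooth(method = "lm", formula = y ~ x, color = "dodgerblue")

p
```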

Summary

  • data as data.frame
  • a set of aesthetic mappings between variables in the data and visual properties
  • at least one layer (usually geom_ or stat_ functions) which describes what to render
  • the coordinate system (explained later)


x, y, color, fill, size, alpha, shape, linetype, linewidth, etc.

  • geom - the geometric object to use to display the data
  • stat - statistical transformation to use on the data
  • data to be displayed in this layer (usually inherited)
  • mapping - aesthetic mappings (usually inherited)
  • position - position adjustment

ggplot2 cheatsheet

Position Adjustments and Coordinate Systems

Visualising amounts and proportions

  • geom_bar()
  • geom_col()
  • geom_point()
  • geom_tile()
  • geom_density()

A barplot with geom_bar()

  • Here, stat = "count" computes the frequencies for each category for you.
  • You can alternatively use stat_count() and change the geom.
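
Both routes can be sketched with the mpg data bundled with ggplot2 (a stand-in for the data on the original slide).

```r
library(ggplot2)

# geom_bar() uses stat = "count" by default: one bar per category
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# The same counts drawn with a different geom via stat_count()
p_point <- ggplot(mpg, aes(x = class)) +
  stat_count(geom = "point")

# The computed frequencies are available in the layer data
layer_data(p_bar)$count
```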

Summary data

  • Sometimes your input data may already contain pre-computed counts.
  • What is the observational unit (row) for the datasets below?

A barplot with geom_col()

  • In this case, you don’t need stat = "count" to do the counting for you; use geom_col() instead.
  • This is essentially shorthand for geom_bar(stat = "identity"), where stat = "identity" means that the values are taken as supplied, without any statistical transformation.
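
A sketch with pre-computed counts. The summary table class_counts is a name introduced here for illustration; it is built from mpg so that each row is one category rather than one car.

```r
library(ggplot2)

# Pre-computed counts: one row per class
class_counts <- as.data.frame(table(class = mpg$class), responseName = "n")

# geom_col() takes the y values as supplied
p_col <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col()

# The longer spelling of the same layer
p_identity <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_bar(stat = "identity")
```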

Position adjustments in barplots

Position adjustments

  • position_dodge() for grouped barplots
  • position_dodge2() for improved grouped barplots
  • position_fill() for stacked percentage barplots
  • position_identity() to use the raw positions
  • position_jitter() to add random noise to points
  • position_jitterdodge() for jittered and dodged points
  • position_nudge() to shift the position by a fixed amount
  • position_stack() for stacked barplots
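
Three of these adjustments can be compared on the same barplot. The mpg data and the variables class and drv are used here only as an example.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # default: stacked counts
p_dodge <- base + geom_bar(position = "dodge")  # side-by-side bars
p_fill  <- base + geom_bar(position = "fill")   # stacked proportions, each bar sums to 1
```

Passing a string such as "dodge" is shorthand for position_dodge(); call the function directly when you need to set its arguments.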

Pie or donut charts with coord_polar()

  • The default coordinate system is the Cartesian coordinate system.
  • But you can change this to a polar coordinate system like below.
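
One common way to sketch a pie chart is to draw a single stacked bar and then map y to the angle. The mpg data is an assumption made for illustration.

```r
library(ggplot2)

# A stacked bar with a single x position ...
bar <- ggplot(mpg, aes(x = "", fill = class)) +
  geom_bar(width = 1)

# ... becomes a pie chart when y is mapped to the angle
pie <- bar + coord_polar(theta = "y")

pie
```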

Coordinate systems

  • coord_cartesian() for Cartesian coordinate systems (default)
  • coord_equal() is essentially coord_fixed(ratio = 1)
  • coord_fixed() to use a fixed aspect ratio
  • coord_flip() to flip the x and y
  • coord_map() to use projection based on mapproj
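  • coord_munch() to break lines into small pieces so that they can be drawn in non-linear coordinate systems (an internal helper rather than a coordinate system)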
  • coord_polar() to use polar coordinates
  • coord_quickmap() for quick map coordinate system
  • coord_radial() for radial coordinates
  • coord_sf() for spatial data frames
  • coord_transform() to transform the coordinates after the statistical transformation

Summary

  • position_dodge()
  • position_dodge2()
  • position_fill()
  • position_identity()
  • position_jitter()
  • position_jitterdodge()
  • position_nudge()
  • position_stack()
  • coord_cartesian()
  • coord_equal()
  • coord_fixed()
  • coord_flip()
  • coord_map()
  • coord_munch()
  • coord_polar()
  • coord_quickmap()
  • coord_radial()
  • coord_sf()
  • coord_transform()

Multiple Layers with ggplot2

Layering plots

  • You can add more than one layer.
  • The order of the layers matters.
  • A layer inherits the data and mapping from the initialised ggplot object by default.
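
A sketch of why the order matters: layers are drawn in the order they are added, so later layers sit on top of earlier ones. The data and the linear smoother are chosen only for illustration.

```r
library(ggplot2)

base <- ggplot(faithful, aes(x = waiting, y = eruptions))

# The line is added last, so it is drawn on top of the points
p1 <- base +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The points are added last, so they are drawn on top of the line
p2 <- base +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_point()
```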

Layer-specific data and aesthetic mapping

  • For each layer, the aesthetic mapping and/or the data can be overridden.

Case study 🚜 Iowa farmland values by county

Drawing maps

  • Drawing maps requires spatial/boundary data that defines the shapes of the regions.

Layer-specific aesthetics

  • Layer-specific aesthetics are not inherited by other layers.
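
A sketch using the mpg data as a stand-in for the case study data: color is mapped inside geom_point() only, so the smoother is neither coloured nor split by drv.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # color is mapped in this layer only
  geom_point(aes(color = drv)) +
  # this layer inherits x and y, but not color, so a single line is fitted
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p
```

Moving color = drv into ggplot(aes(...)) would instead give one fitted line per drv group.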

Layer-specific data

  • Layer-specific data can override the inherited data.

Layer-specific data as a function of inherited data
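
Both forms of layer-specific data can be sketched together. The mpg data stands in for the case study data, and the subsets used (two-seaters, and cars with hwy above 40) are arbitrary choices for illustration.

```r
library(ggplot2)

# A separate data frame to use in one layer
two_seaters <- mpg[mpg$class == "2seater", ]

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # a data frame given to a layer replaces the inherited data
  geom_point(data = two_seaters, color = "red", size = 3) +
  # a function receives the inherited data and returns the data for this layer
  geom_point(data = function(d) subset(d, hwy > 40), color = "dodgerblue", size = 3)

p
```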

Annotation layer

  • annotate() allows you to add elements to plots without a data.frame
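
A minimal sketch of an annotation layer. The label text and its position are made up for the example; the values are passed directly as vectors rather than through a data.frame.

```r
library(ggplot2)

p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  # x, y and label are given directly, not mapped from data
  annotate("text", x = 45, y = 5, label = "Old Faithful", hjust = 0)

p
```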

More features of ggplot2

  • theme() to modify non-data components of the plot
  • facet_wrap() and facet_grid() for small multiples
  • scale_*() to modify scales
  • guides() to modify legends
  • labs() to modify labels and titles
  • ggsave() to save plots to files
  • … etc.
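
Several of these features can be combined in one plot. This is only a sketch: the mpg data, the labels, the theme setting and the output file name are all choices made for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  facet_wrap(~drv) +                         # small multiples, one panel per drv
  scale_x_continuous(breaks = 2:7) +         # modify the x scale
  guides(color = guide_legend(nrow = 1)) +   # modify the legend layout
  labs(
    title = "Fuel economy by engine size",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  ) +
  theme(legend.position = "bottom")          # non-data component of the plot

# Save to a temporary file; the extension determines the file format
out <- file.path(tempdir(), "mpg-facets.pdf")
ggsave(out, plot = p, width = 8, height = 4)
```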

ggplot2 extensions

https://www.ggplot2-exts.org/

Featuring: ggincerta

  • Your tutor, Maggie Ma, developed an extension package ggincerta to visualise uncertainty in ggplot2 as part of her PhD work!

library(ggincerta)
ggplot(nc) + 
  aes(fill = duo(value, sd)) +
  geom_sf()

Summary

  • You can construct plots with multiple layers in ggplot2.
  • The order of the layers matters.
  • A layer inherits the data and mapping from the initialised ggplot object by default.
  • But the data and mappings for each layer can be overwritten.
  • There are many more features and extensions of ggplot2 to explore!