Graphics

Overview

This lesson introduces some basic visualization tools in R.

Objectives

After completing this lesson, students should be able to:

  1. Create histograms, box and scatter plots using base graphics.
  2. Create histograms, box and scatter plots using ggplot2.
  3. Manipulate settings and save graphic output.

Base Graphics

Visualization is one of the most effective ways to explore and get to know your data, as well as to generate hypotheses and present results. R has robust graphics built in, and even better graphics through the ggplot2 package. We’ll start with the base graphics, and then look a bit more closely at ggplot2.

In the previous module, we used aggregate to examine mean temperatures by month in the airquality dataset. We can get much richer information by visualizing these data.

Histogram

To display a single variable, we can generate a simple histogram using the base R function hist:

hist(airquality$Temp,main="Temperature Histogram for Airquality",xlab="Temperature")

The first input is the data vector; “main” is the figure title; and “xlab” is the label for the x axis.
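Other settings can be adjusted in the same way. The sketch below also passes `breaks` and `col`, which are standard `hist` arguments; the particular values (20 bins, "steelblue") are arbitrary choices for illustration and not part of the original example. Note that a single number given to `breaks` is only a suggestion: R picks "pretty" cut points near that count.

```r
# Histogram with a suggested number of bins and a fill colour.
# The values 20 and "steelblue" are illustrative assumptions.
h <- hist(airquality$Temp,
          breaks = 20,
          col = "steelblue",
          main = "Temperature Histogram for Airquality",
          xlab = "Temperature")

# hist() invisibly returns the bin edges and counts it actually used
h$breaks
h$counts
```

Inspecting `h$breaks` is a quick way to see which bin edges R chose.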

Boxplot

If we want a boxplot to summarize a single variable instead, we can use:

boxplot(airquality$Temp,main="Temperature Boxplot for Airquality",ylab="Temperature")

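The boxplot displays the median (dark line), a box spanning the 25th to 75th percentiles (i.e., the interquartile range, IQR), and whiskers that extend to the most extreme data points lying within 1.5 times the IQR of the box edges. Points beyond the whiskers are drawn individually as outliers.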
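The numbers behind the boxplot can be checked directly. This is a minimal sketch using base R's `quantile`, `IQR` and `boxplot.stats`; the variable names (`temp`, `lower_fence`, `upper_fence`, `b`) are made up for illustration. One caveat: `boxplot` computes the box edges as "hinges", which can differ slightly from `quantile` results for some sample sizes.

```r
temp <- airquality$Temp

# Quartiles and interquartile range
q <- quantile(temp, c(0.25, 0.5, 0.75))
iqr <- IQR(temp)

# Fences: the whiskers stop at the most extreme observations inside these limits
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# boxplot.stats() returns the five numbers that boxplot() draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
b <- boxplot.stats(temp)
b$stats
b$out   # observations beyond the whiskers, drawn as individual points
```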

Scatter plot

Finally, if we want the simplest two-variable scatter plot, we can use plot.

If we want to examine temperature as a function of date, we need to first construct a nice date variable, since right now the airquality data has dates by month number and day of the month. (Think about how paste is working here.)

# The airquality measurements were taken from May to September 1973
airdate <- as.Date(paste("1973","-",airquality$Month,"-",airquality$Day,sep=""))
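To see what paste is doing, it can help to run it on a tiny example first. The sketch below uses made-up month and day vectors and an arbitrary year (2000), so none of these values come from airquality.

```r
# Toy month and day vectors (hypothetical values, not from airquality)
month <- c(5, 5, 6)
day <- c(1, 2, 30)

# paste() recycles the single strings "2000" and "-" along the longer vectors,
# producing one string per month/day pair
date_strings <- paste("2000", "-", month, "-", day, sep = "")
date_strings

# as.Date() parses the resulting year-month-day strings into Date values
toy_dates <- as.Date(date_strings)
toy_dates

# Equivalent, slightly shorter form: let sep supply the dashes
paste("2000", month, day, sep = "-")
```

Because paste is vectorized, the same call works unchanged on the full Month and Day columns.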

Now we can plot temperature as a function of day:

plot(airdate,airquality$Temp,xlab="Date",ylab="Temperature",main="Temperature by Day")

ggplot2

Rather than go more deeply into the base graphics of R, it is worth turning immediately to the more robust and beautiful ggplot2 graphics package, which is now fairly standard for R visualization.

The fundamental idea of ggplot2 is that you build up an image by adding together various elements. The core function is ggplot, which takes your data plus a few settings (such as which variables go on which axes) and silently returns a plot object. The actual visualization is made by adding layers to that object with “+”: functions such as geom_histogram or geom_point take the structured data and draw the graphics when the plot is printed.
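The build-up can be made explicit by storing each step in a variable. This is a minimal sketch, assuming ggplot2 is already installed; the names `p` and `p_hist` and the choice of 20 bins are illustrative only.

```r
library(ggplot2)

# Step 1: ggplot() only records the data and the aesthetic mapping
p <- ggplot(airquality, aes(x = Temp))

# Step 2: adding a geom with + returns a new plot object with one layer
p_hist <- p + geom_histogram(bins = 20)

# Step 3: the plot is drawn only when the object is printed
print(p_hist)
```

Typing `p_hist` at the console has the same effect as `print(p_hist)`, which is why one-line ggplot calls display immediately.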

The easiest way to get ggplot2 is to install the tidyverse package, which also includes other packages such as dplyr. Alternatively, you can install ggplot2 on its own.
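One possible way to script the installation is sketched below. It installs ggplot2 only when it is missing; swapping in "tidyverse" would install the whole collection instead. The check with `requireNamespace` is a convenience added here, not something the lesson requires.

```r
# Install ggplot2 only if it is not already available (a one-time step).
# Replace "ggplot2" with "tidyverse" to install the whole collection.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package; this is needed once in every new R session
library(ggplot2)
```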

As always, this is best illustrated with a few examples!

Histogram

Here is our histogram from before, this time using ggplot2:

library(tidyverse)
ggplot(airquality,aes(x=Temp)) + geom_histogram(color="#0059b3", fill="#0059b3")

The first function, ggplot, declares that the dataframe we are using is “airquality” and the x variable we want is “Temp” (“aes” stands for aesthetics, and can take many different aesthetic settings besides the variable names). geom_histogram then takes that data and draws the histogram; it too can take various settings in the “()”. Here we set only the outline color and the fill, and leave everything else (such as the number of bins) at its default.
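As an example of a non-default setting, the sketch below fixes the bin width and adds a title and axis labels with `labs`. The bin width of 5 and the white outline are illustrative assumptions, and `p_bins` is a made-up name.

```r
library(ggplot2)

# binwidth is in the units of the x variable (here, degrees);
# the value 5 is an illustrative assumption
p_bins <- ggplot(airquality, aes(x = Temp)) +
  geom_histogram(binwidth = 5, color = "white", fill = "#0059b3") +
  labs(title = "Temperature Histogram for Airquality",
       x = "Temperature", y = "Count")
p_bins
```

Setting `binwidth` (or `bins`) explicitly also silences the message ggplot2 prints when it falls back to its default of 30 bins.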

Boxplot

To do a boxplot, we follow a similar syntax:

ggplot(data=airquality,aes(x=1,y=Temp)) + geom_boxplot(color="#4d4d4d", fill="#0059b3")

The reason we need both an x and a y is that geom_boxplot is by default designed to do boxplots over a number of groups. x=1 is just a way to get around this by making x a constant.

Boxplot over x

But if we wanted to boxplot temperature by month, we would write

ggplot(data=airquality,aes(x=as.factor(Month),y=Temp)) + geom_boxplot() + xlab("Month")

Note that we have to change the x variable to a factor so that ggplot knows how to group it. We also added another function, xlab, to change the x-axis label, which otherwise would be the ugly “as.factor(Month)”.
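An alternative, sketched here, is to create the factor once in a copy of the data and give it readable labels, so the axis shows month names rather than numbers. The names `aq` and `MonthName` are hypothetical; `month.abb` is R's built-in vector of abbreviated English month names.

```r
library(ggplot2)

# Copy the data and store Month as a labelled factor.
# airquality covers months 5 to 9 (May to September).
aq <- airquality
aq$MonthName <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

ggplot(aq, aes(x = MonthName, y = Temp)) +
  geom_boxplot() +
  labs(x = "Month", y = "Temperature")
```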

Scatter plot

Finally, to do a scatter plot, you could probably almost guess the syntax by now, except for one thing. ggplot likes to have its data all in a single dataframe, so first we have to add our “airdate” variable to the original dataframe:

airquality2 <- cbind(airquality,airdate)
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + xlab("Date") + ylab("Temperature")

Scatter plot with line

We can also add a line to this plot, which illustrates the power of ggplot to build up images by adding layers:

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point()  + geom_line() + xlab("Date") + ylab("Temperature")

Themes

You can customize pretty much every element of a graph with ggplot2. A theme is a collection of specific customizations that can be applied to a ggplot graph. There are many built-in themes that produce really nice output and require minimal effort on your part. For instance, let’s visualize the previous scatter plot using different themes:

# Black and White Theme
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + theme_bw()

# Classic Theme
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + theme_classic()

# Dark Theme
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + theme_dark()

# Gray Theme
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + theme_gray()

# Minimal Theme
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + theme_minimal()

# Light Theme
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + theme_light()

# Void Theme
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + theme_void()
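Built-in themes can also be adjusted element by element with `theme`. The sketch below is shown on the temperature histogram rather than the scatter plot, purely to keep the block self-contained; the specific tweaks (bold centred title, rotated tick labels, no minor grid lines) are arbitrary illustrations, and `p`, `p_custom` and `old_theme` are made-up names.

```r
library(ggplot2)

p <- ggplot(airquality, aes(x = Temp)) +
  geom_histogram(bins = 20, fill = "#0059b3") +
  labs(title = "Temperature Histogram", x = "Temperature", y = "Count")

# Start from a built-in theme, then override individual elements.
# theme() must come after theme_minimal(), or it would be overwritten.
p_custom <- p + theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),  # bold, centred title
    axis.text.x = element_text(angle = 45, hjust = 1),      # rotated tick labels
    panel.grid.minor = element_blank()                      # drop minor grid lines
  )
p_custom

# Optionally make a theme the default for all later plots in the session
old_theme <- theme_set(theme_bw())
p                     # now drawn with theme_bw()
theme_set(old_theme)  # restore the previous default
```

Using `theme_set` once at the top of a script avoids repeating the same theme call on every plot.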

Themes from publications

Many users like to reproduce the styles of graphs produced by publications like the Wall Street Journal, the website FiveThirtyEight, the statistical software Stata, the magazine The Economist, etc. You can find more examples online. Here I’ll just produce the previous scatterplot using the style of FiveThirtyEight, Wall Street Journal and The Economist for reference. You can get them using the ggthemes package.

FiveThirtyEight

library(ggthemes)
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") +
  theme_fivethirtyeight()

The Economist

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + 
  theme_economist()

Wall Street Journal

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature") + 
  theme_wsj()

Saving graphs

Finally, how do we save our plots?

Save file using ggsave

This is easily done with ggplot2 using the ggsave function, which can be executed after you have constructed your image:

ggsave("tempvsdate.pdf",width=6,height=4)

To save the file as a png (a good format for the web), we just change the file name to “tempvsdate.png”.
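By default ggsave writes the last plot that was displayed. A slightly more explicit pattern, sketched below, stores the plot in a variable and passes it through the `plot` argument. Writing to `tempdir()` and the file name "temphist.pdf" are assumptions made so the sketch leaves no files in your working directory.

```r
library(ggplot2)

p <- ggplot(airquality, aes(x = Temp)) + geom_histogram(bins = 20)

# tempdir() is used here only to avoid side effects;
# in a real script, use a path inside your project
out_file <- file.path(tempdir(), "temphist.pdf")

# plot = p saves this specific object rather than the last plot displayed;
# the file extension (.pdf, .png, ...) selects the output format
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```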

Save file with RStudio’s Export tab

Another alternative is to use RStudio to save graphics. This gives you a little more room to tinker with the size and format, although it is always best practice to include the save within your script for reproducibility.

To save the image in the “Plots” pane using RStudio, click on the “Export” tab right below the “Plots” tab in the “Plots” pane. To save as a PDF, for instance, choose “Save as PDF…”, which gives you the options to set the size and directory, and most importantly, to preview the results so you can get the size right. Different sizes, even with the same width-to-height ratio, produce differently sized text and other features, so it’s sometimes worth tinkering to get the most aesthetically pleasing output.
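ggsave only handles ggplot2 plots. For base graphics, the save can still be kept inside the script by opening a graphics device, drawing, and closing the device. This is a minimal sketch using the base `pdf` device; the output path in `tempdir()` and the file name are assumptions for illustration.

```r
# Illustrative output path; use a project path in a real script
out_file <- file.path(tempdir(), "temphist_base.pdf")

# Open a PDF graphics device; width and height are in inches
pdf(out_file, width = 6, height = 4)

# Anything drawn now goes to the file instead of the screen
hist(airquality$Temp,
     main = "Temperature Histogram for Airquality",
     xlab = "Temperature")

# Close the device; the file is only complete after dev.off()
dev.off()
```

The `png` function works the same way for raster output, with width and height given in pixels by default.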

Examples

Besides the basic histogram, boxplot and scatter plot graphs there are many more types of graphs that can be produced with ggplot2. In what follows I’ll provide some examples for you to use as reference.

Scatter plot with regression line

Assuming a Non-Linear Fit

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + xlab("Date") + ylab("Temperature") + geom_smooth(color = "red") + theme_economist()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Linear Fit

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + xlab("Date") + ylab("Temperature") + geom_smooth(method = "lm", color = "red") + theme_economist()
## `geom_smooth()` using formula 'y ~ x'

Linear Fit (No Confidence Interval)

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + xlab("Date") + ylab("Temperature") + geom_smooth(method = "lm", color = "red", se = FALSE) + theme_economist()
## `geom_smooth()` using formula 'y ~ x'

Pie Charts

data <- data.frame(
  group=LETTERS[1:4],
  value=c(60, 10, 20, 5)
)

ggplot(data, aes(x="", y=value, fill=group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) + theme_void() 

Correlation heatmap

To produce correlation matrix plots, we can use the ggcorrplot package.

library(ggcorrplot)
cormat <- cor(mtcars) # Producing a correlation matrix from the mtcars dataset

ggcorrplot(cormat)

ggcorrplot(cormat, hc.order = TRUE, outline.col = "white") # Ordering by correlation clusters

ggcorrplot(cormat, hc.order = TRUE, method = "circle") # Using "circles"

ggcorrplot(cormat, hc.order = TRUE, method = "circle", lab = TRUE) # Displaying the values

ggcorrplot(cormat, hc.order = TRUE, type = "lower", method = "circle", lab = TRUE) # Display only the lower triangle

ggcorrplot(cormat, hc.order = TRUE, type = "lower",
            ggtheme = ggplot2::theme_gray, method = "circle",
            colors = c("#6D9EC1", "white", "#E46726")) # Adjusting the color
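If ggcorrplot is not available, a similar heatmap can be approximated with plain ggplot2. The sketch below is one possible approach rather than what ggcorrplot does internally: it reshapes the correlation matrix to long format and draws it with `geom_tile`. The names `cor_long`, `var1`, `var2` and `r` are made up, and there is no clustering or ordering of the variables.

```r
library(ggplot2)

cormat <- cor(mtcars)

# as.table() + as.data.frame() reshapes the matrix to long format:
# one row per variable pair, with the correlation in the third column
cor_long <- as.data.frame(as.table(cormat))
names(cor_long) <- c("var1", "var2", "r")

ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#6D9EC1", mid = "white", high = "#E46726",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme_minimal()
```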

Help with ggplot2

  • It is almost impossible to learn the syntax by heart, at least for all the little settings you will invariably want.
  • Have a look at Cookbook for R, based on the excellent R Graphics Cookbook by Winston Chang. The online version makes it especially easy to find what you need with the minimum of hair-pulling. Crib sheets and online references are always useful in R, but they are essential for a visual medium like graphics, where you often do not know exactly what you need until you see it.
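R's built-in help is also a quick way to look up settings without leaving the console. A small sketch follows; the commented-out lines open help viewers and are meant for interactive sessions only.

```r
library(ggplot2)

# List the argument names a geom accepts
names(formals(geom_histogram))

# Open the help page for a function (interactive sessions)
# ?geom_histogram

# Search the package's help pages by keyword (interactive sessions)
# help.search("boxplot", package = "ggplot2")
```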