In this lesson we learn to aggreate, merge, and manipulate strings in datasets.
After completing this lesson, students should be able to:
We often wish to describe data aggregated over some variables: eg, mean income by political party affiliation, or mean wages by gender. Aggregate
does this using a very simple syntax.
R comes with many sample datasets built in, and many packages come with their own sample datasets to illustrate their functions. One such dataset is the airquality
data, which measures various weather qualities in New York in over a series of days in 1973. To view the first 6 rows of any dataset, one can use the head function:
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Obviously airquality[1:6,]
would also do the trick for the top 6. Tail
gives the last 6 rows.
To view, for instance, the mean temperature aggregated by month in this data, we would use aggregate:
aggregate(Temp ~ Month, data=airquality, mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
The first argument consists of the variables to be aggregated on the left of the ~
, and the variables by which to aggregate on the right. The second argument specifies the dataset name, and the third is the function to apply (which can be anything).
If we want to summarize mean temperature and ozone levels by month, we would write:
aggregate(cbind(Temp,Ozone) ~ Month, data=airquality,mean)
Month Temp Ozone
1 5 66.73077 23.61538
2 6 78.22222 29.44444
3 7 83.88462 59.11538
4 8 83.96154 59.96154
5 9 76.89655 31.44828
And if we want to aggregate over two variables, eg month and day, we would write:
aggregate(cbind(Temp,Ozone) ~ Month + Day, data=airquality,mean)
The notation here might seem a bit inconsistent, but the ~
is part of the “formula” notation in R that is used often in statistical models, as we will soon see.
Aggregate functions
You can use any function you like with aggregate()
as long as the function can take a vector of data and return a single number. R contains many aggregating functions:
One of the final basic tasks needed for data manipulation prior to serious analysis is merging different datasets.
R basic function:
Merge
Other packages (see advanced data manipulation):
plyr
/ dplyr
data.table
tidyverse
To show you how to merge data let’s first create two datasets, “authors” and “books”:
authors <- data.frame(
surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
nationality = c("US", "Australia", "US", "UK", "Australia"),
stringsAsFactors=FALSE)
books <- data.frame(
name = c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core"),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
stringsAsFactors=FALSE)
Now say we wish to merge them by author name (a common variable). We do this with the merge
function:
bookmerge <- merge(authors, books, by.x="surname",by.y="name")
bookmerge
surname nationality title
1 McNeil Australia Interactive Data Analysis
2 Ripley UK Spatial Statistics
3 Ripley UK Stochastic Simulation
4 Tierney US LISP-STAT
5 Tukey US Exploratory Data Analysis
6 Venables Australia Modern Applied Statistics ...
Did the merge work?
The first argument is the first data frame, the second argument is the second dataframe. by.x
is the column name(s) to match on from the first dataset, by.y
is the column name(s) from the second. If you wish to merge on multiple columns, by.x
and by.y
can of course be vectors, such as by.x=c("name","height")
. If no by.x
or by.y
arguments are supplied, merge will look for variables with identical names to match on, but it is generally better to include the names explicitly even if they are the same. By default merge omits rows that aren’t matched in both datasets; to include those that are in the first data frame but not the second, include all.x=TRUE
; all.y=TRUE
does the same for the second data frame.
Besides merge
, we can merge data using plyr
, dplyr
, and tidyverse
. These are packages that often make the process of merging, summarizing and cleaning easier.
Let’s use dplyr
with R’s UScrime
dataset. Disclaimer: R’s built-in datasets can be outdated and do not necessarily reflect contemporary realities.
library(dplyr)
Warning: package 'dplyr' was built under R version 3.5.2
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
crimedata <- MASS::UScrime
glimpse(crimedata)
Rows: 47
Columns: 16
$ M <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134,…
$ So <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
$ Ed <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 11…
$ Po1 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 6…
$ Po2 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61…
$ LF <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595,…
$ M.F <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 97…
$ Pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30,…
$ NW <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, …
$ U1 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, …
$ U2 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43,…
$ GDP <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580,…
$ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172,…
$ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201,…
$ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.69…
$ y <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674,…
Sources: Ehrlich, I. (1973) Participation in illegitimate activities: a theoretical and empirical investigation. Journal of Political Economy, 81, 521-565.
Vandaele, W. (1978) Participation in illegitimate activities: Ehrlich revisited. In Deterrence and Incapacitation, eds A. Blumstein, J. Cohen and D. Nagin, pp. 270-335. US National Academy of Sciences.
Dplyr comes with five key functions:
Select a number of columns and assign it to a temporary dataframe or tibble which is the name used by the dplyr
package:
TempData1 <- select(crimedata, M, So, LF, Pop, NW, Ineq, y)
The following variables are now selected from crimedata:
1. Percentage of males aged 14-24;
2. Indicator variable for a Southern state;
3. Labour force participation rate;
4. State population;
5. Number of non-whites per 1000 people;
6. Income inequality,
7. Rate of crimes in a particular category per head of population.
Dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:
Note that when you refer to columns directly inside select(), you don’t use quotes. When using the helper functions, you do use quotes.
Filter by values on a specific variable or set of variables, for example:
crimedata_select <- filter(crimedata, So==1) #selecting Southern states only
Arrange data by certain columns:
crimedata_arrange <- arrange(crimedata, Pop, y)
Add a new variable to a dataset. For example, let’s create the original variable that contains the number of crimes per state from y
(rate of crimes in a particular category per head of population):
crimedata_mutate <- mutate(crimedata, crimes = (y/100000)*Pop)
Print out a summary with minimum and maximum distance:
summarise(crimedata, min_y = min(y), max_y = max(y))
## min_y max_y
## 1 342 1993
Pipes existed long before they were implemented in the R language and the logic behind it is to write simple codes of chaining, i.e. it passes an intermediate result onto a next function. Click here for more information about the history of pipes in R.
In R, pipes were developed in 2013 by Hadley Wickham, Chief Scientist at RStudio, who also started the dplyr
package. With the operator %>
, we can now chain different steps in our code.
For example, let’s translate the following English sentence into R Code:
Take the crime data set and then …
filter the data by Southern states only and then …
report on the mean of y
.
crimedata %>%
filter(So==1) %>%
summarise(avg=mean(y))
avg
1 856.8125
Your turn: Use dplyr functions and a pipe operator that returns the mean of y
. Next, use mutate and the variable Time
(in days) to add a variable to the crimedata
dataset that calculates the number of crimes per week. Then summarise your findings of the mean number of reportings and and the mean number of reportings per week.
Now let’s spend some time learning the most common built-in functions that are designed to manipulate strings.
When we want to display something as text on the console we use the print()
function. As we have seen, print("Hello")
will output the string “Hello”; we could include print(i)
in a for
loop, for instance, to keep track of progress. For instance,
x <- c("h","e","l","l","o")
for(i in x) {print(i)}
## [1] "h"
## [1] "e"
## [1] "l"
## [1] "l"
## [1] "o"
To concatenate strings one uses the paste
function. Concatenation means that take a set of strings to form a new one by placing them together:
For example, say we want to print the name of different databases that differ by an increasing number.
for(i in 1:5){
filename = paste("datafile",i,".txt",sep="")
print(filename)
}
[1] "datafile1.txt"
[1] "datafile2.txt"
[1] "datafile3.txt"
[1] "datafile4.txt"
[1] "datafile5.txt"
All the arguments in paste
are pasted together except for the last one, which specifies what to put in between each argument – in this case it was nothing “”, but it could be a comma if we wanted to create a comma-separated data file (the hard way!).
The inverse of paste
is splitting a string with strsplit
:
sout <- strsplit("Get/vaccinated/as/soon/as/possible","/")
sout
[[1]]
[1] "Get" "vaccinated" "as" "soon" "as"
[6] "possible"
This splits the string(s) in the first argument at the string in the second argument, and saves the output as a list, which allows for easier subsequent processing.
Please note that this slide has been modified from what was presented and now provides additional explanation:
To test whether a string or value is in another set (including a vector or table), we can use %in%
:
c("fish","dog") %in% c("happy","fish","pie") #Prints whether or not fish and dog, respectively, occur at all in another set. The comma and quotation marks indicate that "fish" and "dog" are treated as two separate elements.
[1] TRUE FALSE
c("fish, dog") %in% c("happy","fish","pie") #Prints whether or not "fish, dog" (treated as one element) are included in another set.
[1] FALSE
c("fish","dog", 2) %in% c("happy","fish","pie", 2) # Prints whether the three elements "fish", "dog", and 2 are included in another set.
[1] TRUE FALSE TRUE
Note that in this example we do three searches (one for each element of the vector on the left of the %in%
), and it returns TRUE if a search element matches any of the elements of the vector being searched.
Partial matches don’t work with %in%
:
"fish" %in% "I would like to go fishing"
[1] FALSE
For partial matches, you can use grep
:
grep("fish", c("I would like to go fishing","dog my cats","fishsticks","I'm convinced there is something fishy going on"))
[1] 1 3 4
Note that grep
outputs the element numbers for the elements in the searched vector where it finds matches. Grep
uses the powerful regex
search language, which we won’t cover here, but which allows very complex (though often slow) searches for string patterns.
The last essential string tool is replacement, using gsub
:
gout <- gsub(c("Sick|Sad"), "Healthy", c("Sick baby","Sad puppy"))
gout
[1] "Healthy baby" "Healthy puppy"
The first argument is what to look for, the second is what to replace it with, and the third is what to search, which can of course be a vector of strings, not just a single string. Note that I use | to separate the different words that can be replaced.