Statistical Analysis 2.3: Data Manipulation

Data manipulation

Overview

In this lesson we learn to aggreate, merge, and manipulate strings in datasets.

Objectives

After completing this lesson, students should be able to:

Create summary datasets and statistics using “aggregate.”
Merge datasets.
Manipulate strings.

Aggregate

We often wish to describe data aggregated over some variables: eg, mean income by political party affiliation, or mean wages by gender. Aggregate does this using a very simple syntax.

R comes with many sample datasets built in, and many packages come with their own sample datasets to illustrate their functions. One such dataset is the airquality data, which measures various weather qualities in New York in over a series of days in 1973. To view the first 6 rows of any dataset, one can use the head function:

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Obviously airquality[1:6,] would also do the trick for the top 6. Tail gives the last 6 rows.

Aggregate over a variable

To view, for instance, the mean temperature aggregated by month in this data, we would use aggregate:

aggregate(Temp ~ Month, data=airquality, mean)

  Month     Temp
1     5 65.54839
2     6 79.10000
3     7 83.90323
4     8 83.96774
5     9 76.90000

The first argument consists of the variables to be aggregated on the left of the ~, and the variables by which to aggregate on the right. The second argument specifies the dataset name, and the third is the function to apply (which can be anything).

Aggregate over two variables

If we want to summarize mean temperature and ozone levels by month, we would write:

aggregate(cbind(Temp,Ozone) ~ Month, data=airquality,mean)

  Month     Temp    Ozone
1     5 66.73077 23.61538
2     6 78.22222 29.44444
3     7 83.88462 59.11538
4     8 83.96154 59.96154
5     9 76.89655 31.44828

And if we want to aggregate over two variables, eg month and day, we would write:

aggregate(cbind(Temp,Ozone) ~ Month + Day, data=airquality,mean)

The notation here might seem a bit inconsistent, but the ~ is part of the “formula” notation in R that is used often in statistical models, as we will soon see.

Aggregate functions

You can use any function you like with aggregate() as long as the function can take a vector of data and return a single number. R contains many aggregating functions:

min(x) - minimum value of vector x.
max(x) - maximum value of vector x.
mean(x) - mean value of vector x.
median(x) - median value of vector x.
quantile(x, p) - pth quantile of vector x.
sd(x) - standard deviation of vector x.
var(x) - variance of vector x.
IQR(x) - Inter Quartile Range (IQR) of vector x.
diff(range(x)) - total range of vector x.

Merging data

One of the final basic tasks needed for data manipulation prior to serious analysis is merging different datasets.

R basic function:

Merge

Other packages (see advanced data manipulation):

plyr / dplyr
data.table
tidyverse
and more…

Create dataset:

To show you how to merge data let’s first create two datasets, “authors” and “books”:

authors <- data.frame(
    surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"), 
    nationality = c("US", "Australia", "US", "UK", "Australia"),
    stringsAsFactors=FALSE)
books <- data.frame(
    name = c("Tukey", "Venables", "Tierney",
               "Ripley", "Ripley", "McNeil", "R Core"),
    title = c("Exploratory Data Analysis",
              "Modern Applied Statistics ...",
              "LISP-STAT",
              "Spatial Statistics", "Stochastic Simulation",
              "Interactive Data Analysis",
              "An Introduction to R"),
    stringsAsFactors=FALSE)

Merge by

Now say we wish to merge them by author name (a common variable). We do this with the merge function:

bookmerge <- merge(authors, books, by.x="surname",by.y="name")
bookmerge

   surname nationality                         title
1   McNeil   Australia     Interactive Data Analysis
2   Ripley          UK            Spatial Statistics
3   Ripley          UK         Stochastic Simulation
4  Tierney          US                     LISP-STAT
5    Tukey          US     Exploratory Data Analysis
6 Venables   Australia Modern Applied Statistics ...

Did the merge work?

The first argument is the first data frame, the second argument is the second dataframe. by.x is the column name(s) to match on from the first dataset, by.y is the column name(s) from the second. If you wish to merge on multiple columns, by.x and by.y can of course be vectors, such as by.x=c("name","height"). If no by.x or by.y arguments are supplied, merge will look for variables with identical names to match on, but it is generally better to include the names explicitly even if they are the same. By default merge omits rows that aren’t matched in both datasets; to include those that are in the first data frame but not the second, include all.x=TRUE; all.y=TRUE does the same for the second data frame.

Advanced manipulation

Besides merge, we can merge data using plyr, dplyr, and tidyverse. These are packages that often make the process of merging, summarizing and cleaning easier.

Dplyr

Let’s use dplyr with R’s UScrime dataset. Disclaimer: R’s built-in datasets can be outdated and do not necessarily reflect contemporary realities.

library(dplyr)

Warning: package 'dplyr' was built under R version 3.5.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

crimedata <- MASS::UScrime
glimpse(crimedata)

Rows: 47
Columns: 16
$ M    <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134,…
$ So   <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
$ Ed   <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 11…
$ Po1  <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 6…
$ Po2  <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61…
$ LF   <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595,…
$ M.F  <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 97…
$ Pop  <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30,…
$ NW   <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, …
$ U1   <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, …
$ U2   <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43,…
$ GDP  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580,…
$ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172,…
$ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201,…
$ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.69…
$ y    <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674,…

Sources: Ehrlich, I. (1973) Participation in illegitimate activities: a theoretical and empirical investigation. Journal of Political Economy, 81, 521-565.

Vandaele, W. (1978) Participation in illegitimate activities: Ehrlich revisited. In Deterrence and Incapacitation, eds A. Blumstein, J. Cohen and D. Nagin, pp. 270-335. US National Academy of Sciences.

Dplyr functions

Dplyr comes with five key functions:

select()
filter()
arrange()
mutate()
summarise()

select()

Select a number of columns and assign it to a temporary dataframe or tibble which is the name used by the dplyr package:

TempData1 <- select(crimedata, M, So, LF, Pop, NW, Ineq, y)

The following variables are now selected from crimedata:

1. Percentage of males aged 14-24; 
2. Indicator variable for a Southern state;
3. Labour force participation rate; 
4. State population; 
5. Number of non-whites per 1000 people; 
6. Income inequality, 
7. Rate of crimes in a particular category per head of population.

Dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

starts_with(“X”): every name that starts with “X”,
ends_with(“X”): every name that ends with “X”,
contains(“X”): every name that contains “X”,
matches(“X”): every name that matches “X”, where “X” can be a regular expression,
num_range(“x”, 1:5): the variables named x01, x02, x03, x04 and x05, one_of(x): every name that appears in x, which should be a character vector.

Note that when you refer to columns directly inside select(), you don’t use quotes. When using the helper functions, you do use quotes.

filter()

Filter by values on a specific variable or set of variables, for example:

crimedata_select <- filter(crimedata, So==1) #selecting Southern states only

arrange()

Arrange data by certain columns:

crimedata_arrange <- arrange(crimedata, Pop, y)

mutate()

Add a new variable to a dataset. For example, let’s create the original variable that contains the number of crimes per state from y(rate of crimes in a particular category per head of population):

crimedata_mutate <- mutate(crimedata, crimes = (y/100000)*Pop)

summarise()

Print out a summary with minimum and maximum distance:

summarise(crimedata, min_y = min(y), max_y = max(y))

##   min_y max_y
## 1   342  1993

Exercise

Pipes existed long before they were implemented in the R language and the logic behind it is to write simple codes of chaining, i.e. it passes an intermediate result onto a next function. Click here for more information about the history of pipes in R.

In R, pipes were developed in 2013 by Hadley Wickham, Chief Scientist at RStudio, who also started the dplyr package. With the operator %>, we can now chain different steps in our code.

For example, let’s translate the following English sentence into R Code:

Take the crime data set and then …
filter the data by Southern states only and then …
report on the mean of y.

crimedata %>%
  filter(So==1) %>%
  summarise(avg=mean(y))

       avg
1 856.8125

Your turn: Use dplyr functions and a pipe operator that returns the mean of y. Next, use mutate and the variable Time (in days) to add a variable to the crimedata dataset that calculates the number of crimes per week. Then summarise your findings of the mean number of reportings and and the mean number of reportings per week.

Show example answer

crimedata %>% 
  filter(So==1) %>%
  mutate(crime_stdbytime = (y/Time)*7) %>%
  summarise(crimereports_m=mean(y), averagecrimeperweek=mean(crime_stdbytime))

  crimereports_m averagecrimeperweek
1       856.8125            231.5108

String Functionatlity

Now let’s spend some time learning the most common built-in functions that are designed to manipulate strings.

When we want to display something as text on the console we use the print() function. As we have seen, print("Hello") will output the string “Hello”; we could include print(i) in a for loop, for instance, to keep track of progress. For instance,

x <- c("h","e","l","l","o")

for(i in x) {print(i)}

## [1] "h"
## [1] "e"
## [1] "l"
## [1] "l"
## [1] "o"

Paste

To concatenate strings one uses the paste function. Concatenation means that take a set of strings to form a new one by placing them together:

For example, say we want to print the name of different databases that differ by an increasing number.

for(i in 1:5){
  filename = paste("datafile",i,".txt",sep="")
  print(filename)
}

[1] "datafile1.txt"
[1] "datafile2.txt"
[1] "datafile3.txt"
[1] "datafile4.txt"
[1] "datafile5.txt"

All the arguments in paste are pasted together except for the last one, which specifies what to put in between each argument – in this case it was nothing “”, but it could be a comma if we wanted to create a comma-separated data file (the hard way!).

Split

The inverse of paste is splitting a string with strsplit:

sout <- strsplit("Get/vaccinated/as/soon/as/possible","/")
sout

[[1]]
[1] "Get"        "vaccinated" "as"         "soon"       "as"        
[6] "possible"

This splits the string(s) in the first argument at the string in the second argument, and saves the output as a list, which allows for easier subsequent processing.

Find

Please note that this slide has been modified from what was presented and now provides additional explanation:

To test whether a string or value is in another set (including a vector or table), we can use %in%:

c("fish","dog") %in% c("happy","fish","pie") #Prints whether or not fish and dog, respectively, occur at all in another set. The comma and quotation marks indicate that "fish" and "dog" are treated as two separate elements.

[1]  TRUE FALSE

c("fish, dog") %in% c("happy","fish","pie") #Prints whether or not "fish, dog" (treated as one element) are included in another set.

[1] FALSE

c("fish","dog", 2) %in% c("happy","fish","pie", 2) # Prints whether the three elements "fish", "dog", and 2 are included in another set.

[1]  TRUE FALSE  TRUE

Note that in this example we do three searches (one for each element of the vector on the left of the %in%), and it returns TRUE if a search element matches any of the elements of the vector being searched.

Partial finding

Partial matches don’t work with %in% :

"fish" %in% "I would like to go fishing"

[1] FALSE

For partial matches, you can use grep:

grep("fish", c("I would like to go fishing","dog my cats","fishsticks","I'm convinced there is something fishy going on"))

[1] 1 3 4

Note that grep outputs the element numbers for the elements in the searched vector where it finds matches. Grep uses the powerful regex search language, which we won’t cover here, but which allows very complex (though often slow) searches for string patterns.

Replacement

The last essential string tool is replacement, using gsub:

gout <- gsub(c("Sick|Sad"), "Healthy", c("Sick baby","Sad puppy"))
gout

[1] "Healthy baby"  "Healthy puppy"

The first argument is what to look for, the second is what to replace it with, and the third is what to search, which can of course be a vector of strings, not just a single string. Note that I use | to separate the different words that can be replaced.