In this lesson we learn how functions and packages power R.
After completing this lesson, students should be able to:
This is a great source for a more advanced and formal treatment of functions in R: https://www.stat.berkeley.edu/~statcur/Workshop2/Presentations/functions.pdf
Functions are pre-written pieces of code that we can use to make our work more efficient. A function has a name and, most of the time, a series of arguments or inputs. If a function is stored in memory, typing its name tells R to execute the code associated with it, using any arguments that you choose to pass along.
Let’s start with a quick example. Imagine that I have a database with a string variable, \(y\) (e.g. customers’ first and last names, product model, diagnosis, etc.), and a numeric variable, \(x\) (e.g. age, price, temperature, etc.). My goal is to filter \(y\) according to a specific range of \(x\). This is very simple to do in R. Say we want the values of \(y\) when \(x\) is greater than 10 and less than or equal to 30. This is what that code looks like:
# First I'm generating the y and x variables
y <- c("Anna", "Bertha", "Cecilia", "Diana",
       "Elizabeth", "Fran", "Gaby", "Helen")
x <- c(5, 10, 15, 20, 25, 30, 35, 40)
# Let's make the database now
db <- data.frame(y,x, stringsAsFactors = F)
db
## y x
## 1 Anna 5
## 2 Bertha 10
## 3 Cecilia 15
## 4 Diana 20
## 5 Elizabeth 25
## 6 Fran 30
## 7 Gaby 35
## 8 Helen 40
# Now let's filter the values of y according to x>10 and x<=30
db$y[x > 10 & x <= 30]
## [1] "Cecilia" "Diana" "Elizabeth" "Fran"
While working on the project I notice that I have to create several different ranges of \(x\) for different purposes. Writing db$y[x > 10 & x <= 30]
is not very long, but it is not the most readable piece of code, and I can certainly make mistakes/typos while copy-pasting this line all over a project. To avoid that, I’m going to write a function instead:
myFilter <- function(db, varY, varX, minX, maxX) {
  x <- db[, varX]
  return(db[x > minX & x <= maxX, varY])
}
myFilter(db, "y", "x", 10, 30)
## [1] "Cecilia" "Diana" "Elizabeth" "Fran"
# Now let's test it with different ranges
myFilter(db, "y", "x", 0, 20)
## [1] "Anna" "Bertha" "Cecilia" "Diana"
myFilter(db, "y", "x", 20, 35)
## [1] "Elizabeth" "Fran" "Gaby"
As you can see, this is easier to read and manage. I would probably not call it myFilter in an actual application, as that’s a very ambiguous name, but you get the point.
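One refinement worth knowing: arguments can be given default values, so a bound can be omitted entirely. Here is a sketch (filterRange and its defaults are illustrative, not part of the lesson):

```r
# Defaults make arguments optional: minX and maxX fall back to -Inf and
# Inf, so omitting a bound leaves that side of the range unrestricted.
# (filterRange is an illustrative name, not from the lesson.)
filterRange <- function(db, varY, varX, minX = -Inf, maxX = Inf) {
  x <- db[, varX]
  db[x > minX & x <= maxX, varY]
}

db <- data.frame(y = c("Anna", "Bertha", "Cecilia", "Diana"),
                 x = c(5, 10, 15, 20),
                 stringsAsFactors = FALSE)
filterRange(db, "y", "x", maxX = 10)  # only the lower range: "Anna" "Bertha"
filterRange(db, "y", "x")             # no bounds given: all four names
```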
R comes with a large set of functions pre-installed, although their naming is sufficiently idiosyncratic that you are unlikely to guess the name of a function (except for something as simple as mean) on your own; it is much better to look it up online.
v <- c(2,4,1,5)
mv <- mean(v)
mv
[1] 3
mean() is a function that takes a vector as its input and outputs the mean (a scalar).
To get a list of the functions in R’s base install, you can write
library(help = "base")
To get help on a given function (e.g., mean), write
help(mean)
# or
?mean
In RStudio that should automatically open the Help pane with the definitions for the given function. In the help file, the Arguments are the function’s inputs, and the Values are its outputs. In addition to the data taken as input, there are usually options for how the function operates. The Usage portion of the Help shows both how the function is used, and what the default inputs are for the various options.
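For example, the Usage entry for round() reads (approximately) round(x, digits = 0), so digits is optional; naming it in the call makes it clear which default you are overriding:

```r
# digits defaults to 0 per the Usage section of help(round),
# so a bare call rounds to a whole number.
round(3.14159)              # default digits = 0 gives 3
round(3.14159, digits = 2)  # override the default by name: 3.14
```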
The reason R has become one of the dominant statistical software tools, though, is not its built-in functions, which are mainly similar to those in many other statistics programs, but the fact that R is open-source and benefits from thousands of user-contributed functions across every domain of statistics and, increasingly, machine learning and computer science more generally.
Where to find packages:
There are two steps:
Install and load existing packages through the Packages pane: To load an existing package, check the box next to the package name in the Packages pane. To install a new package, click on “Install” in the upper-left corner of the Packages pane; this of course requires knowing the package name.
Install and load existing packages through script (recommended): It is usually better to load packages using commands in your R script rather than through the GUI, although this matters less for installing packages, which ideally only happens once. The command for installing a package (here, the ggplot2 package) is:
install.packages("ggplot2")
To load a package in order to be able to use the functions in it, do:
library(ggplot2)
require() also works, and you will see that in many scripts.
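The difference matters when a package might be absent: library() stops with an error, while require() returns FALSE (with a warning), so it can be used inside a condition. A minimal sketch using only base packages, so nothing needs to be installed:

```r
# library() errors out if the package is missing; require() instead
# returns FALSE with a warning, which makes it testable in code.
has_stats <- require("stats", quietly = TRUE)  # base package, always available
ok <- suppressWarnings(require("notARealPackage", quietly = TRUE))
has_stats  # TRUE
ok         # FALSE
```

A common idiom built on this is installing a package only when loading it fails, though for a script you share with others an explicit install.packages() call followed by library() is often clearer.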
Disclaimer: Many packages come with “dependencies” – other packages or functions that they use to get their job done. Luckily, most packages specify their dependencies, and by default these are also installed.
Why do we want to build our own functions: R is a very flexible and open programming language, so you can easily build your own functions and packages and share them with others. We won’t go into building full-scale packages here, but one of the most important skills in R is being able to create your own functions.
If you have done any programming elsewhere, creating a function in R will look familiar. Here is an example of a simple function:
# define the function named "doubleit"
doubleit <- function(x){
  doubled <- x*2
  return(doubled)
}
# now use the function with an input of 7 and save it as sevendoubled
sevendoubled <- doubleit(7)
sevendoubled
[1] 14
A function can have many inputs (a vector, or a set of scalars and vectors and lists, or anything at all) and can return a more complex output such as a list of R objects – just as we saw when examining the Help files for pre-existing functions.
For instance, here is a function that takes two numbers and calculates both their sum and difference:
sumdiff <- function(a,b){
  nsum <- a + b
  ndif <- a - b
  return(c(nsum,ndif))
}
sdoutput <- sumdiff(5,3)
sdoutput
[1] 8 2
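When a function returns several values, a named list is often clearer than a bare vector, since each output can then be retrieved by name rather than by position. A sketch (sumdiff2 is an illustrative name):

```r
# Same computation as sumdiff, but the outputs are labeled,
# so callers do not need to remember which position is which.
sumdiff2 <- function(a, b) {
  list(sum = a + b, diff = a - b)
}
out <- sumdiff2(5, 3)
out$sum   # 8
out$diff  # 2
```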
Write a function that:
An important aspect of how functions work is how they handle missing data (NA). For example, suppose we try to calculate the mean and some of the values in the data are missing. Should the function calculate the mean after eliminating those values, display an error, or simply return NA?
v2 <- c(1,2,3,4,NA,6,8,NA)
v2
[1] 1 2 3 4 NA 6 8 NA
What is the mean of v2?
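By default, mean() propagates missing values; its na.rm argument (listed in the help file’s Usage section) drops them first:

```r
v2 <- c(1, 2, 3, 4, NA, 6, 8, NA)
mean(v2)                # NA: any NA in the input makes the result NA
mean(v2, na.rm = TRUE)  # 4: the two NAs are dropped before averaging
```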
Missing data requires special care, and the way we handle it has major implications for your findings and their interpretation. While it is too early in the semester to get started with some advanced ways to deal with missing data, there are already a few considerations that you should keep in mind at all times when dealing with any dataset:
There are several types of missing data:
Questions we should be asking:
Here we are applying some of the built-in functions and subsetting techniques we learned last week.
Using the built-in function is.na() you can identify missing data in most objects in R.
df <- data.frame(A=c(1, NA, 8, NA),
B=c(3, NA, 88, 23),
C=c(2, 45, 3, 1))
is.na(df)
## A B C
## [1,] FALSE FALSE FALSE
## [2,] TRUE TRUE FALSE
## [3,] FALSE FALSE FALSE
## [4,] TRUE FALSE FALSE
You can combine is.na() with the logical function any() to test whether there is any missing data at all in the database.
any(is.na(df))
## [1] TRUE
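Base R also ships a shortcut, anyNA(), that is equivalent to any(is.na(x)):

```r
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))
anyNA(df)   # TRUE: same answer as any(is.na(df))
```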
The summary() function produces a table of quartiles for each vector in a database. It also shows the number of NA values in each vector.
summary(df)
## A B C
## Min. :1.00 Min. : 3.0 Min. : 1.00
## 1st Qu.:2.75 1st Qu.:13.0 1st Qu.: 1.75
## Median :4.50 Median :23.0 Median : 2.50
## Mean :4.50 Mean :38.0 Mean :12.75
## 3rd Qu.:6.25 3rd Qu.:55.5 3rd Qu.:13.50
## Max. :8.00 Max. :88.0 Max. :45.00
## NA's :2 NA's :1
You can also use the function sum() in combination with is.na() to count the number of missing values:
sum(is.na(df))
## [1] 3
length(is.na(df)) # Be aware of the differences!
## [1] 12
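Since TRUE counts as 1, the same logical matrix also gives per-column counts and the overall proportion of missing cells:

```r
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))
colSums(is.na(df))   # per-column NA counts: A = 2, B = 1, C = 0
mean(is.na(df))      # 3 missing out of 12 cells = 0.25
```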
You can also be positive: instead of focusing on what’s missing, look only at the items that have no missing values. For instance, you can use complete.cases() to find the rows with no missing values:
complete.cases(df)
## [1] TRUE FALSE TRUE FALSE
You can also find the rows with missing values. Note that which(is.na(df)) would return cell positions rather than row numbers, so we negate complete.cases() instead:
missing <- which(!complete.cases(df))  # row indices with at least one NA
df[missing, ]
## A B C
## 2 NA NA 45
## 4 NA 23 1
Subset data, keeping only complete cases
df[complete.cases(df), ]
## A B C
## 1 1 3 2
## 3 8 88 3
#Or:
na.omit(df)
## A B C
## 1 1 3 2
## 3 8 88 3
Sometimes missing values are stored as an empty text string. We should generally replace those (empty) values with NA so that R understands the data is missing:
df <- data.frame(A=c(1, "" , 8, ""),
B=c(3, "", 88, 23),
C=c(2, 45, 3, 1))
df[df == ""] <- NA
df
## A B C
## 1 1 3 2
## 2 <NA> <NA> 45
## 3 8 88 3
## 4 <NA> 23 1
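One caveat with this approach: because the frame was built with empty strings, the affected columns are stored as text, so after the replacement they typically still need to be converted back to numeric. A sketch (with stringsAsFactors = FALSE made explicit so the columns are character rather than factors):

```r
df <- data.frame(A = c(1, "", 8, ""),
                 B = c(3, "", 88, 23),
                 C = c(2, 45, 3, 1),
                 stringsAsFactors = FALSE)
df[df == ""] <- NA          # blanks become NA, but A and B are still character
df$A <- as.numeric(df$A)    # NA entries stay NA; the rest become numbers
df$B <- as.numeric(df$B)
mean(df$A, na.rm = TRUE)    # 4.5: the column behaves as numeric again
```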
Missing data is not always a major statistical problem. As long as the rate at which data are missing is completely random, i.e., not correlated with any variable, you should not worry too much about it. If it is not random, problems arise: for instance, if you are conducting a survey and men generally choose not to answer a particular question, then your sample will be unbalanced for that question, over-representing the answers of women.
There is some specific terminology dedicated to describing the distribution of missing data:
Consider these as suggestions rather than strict guidelines when dealing with missing observations.
Analysis of attrition: Compare whether observations with complete values differ from observations with incomplete values.
If there are no (or few) systematic differences: you should be fine working with missing observations. If a function requires your data to be complete, then eliminating those observations should also be fine, as long as the missing data is not a very large percentage of your sample.
Multiple imputation: plausible values can be ‘guessed’ from values on a series of other observed variables (package ‘mice’ in R). This is generally a bad practice, but it can be used when your sample size is not very large and eliminating an observation or a set of observations is very costly.
In summary: (1) if there are no systematic differences between your missing and non-missing data, you can (a) eliminate the missing data if you have a large sample or (b) impute the missing data if you have a small sample. (2) If there are systematic differences, then your statistical analysis will probably be biased unless you find a way to improve your sample.