In this lesson we learn how functions and packages power R.
After completing this lesson, students should be able to:
This is a great source for a more advanced and formal treatment of functions in R: https://www.stat.berkeley.edu/~statcur/Workshop2/Presentations/functions.pdf
Functions are pre-written pieces of code that we can use to make our work more efficient. A function has a name and, most of the time, a series of arguments or inputs. If a function is stored in memory, typing its name tells R to execute the code associated with it, using any arguments that you choose to pass along.
Let’s start with a quick example. Imagine that I have a database with a string variable, \(y\) (e.g. customers’ first and last names, product model, diagnosis, etc.), and a numeric variable, \(x\) (e.g. age, price, temperature, etc.). My goal is to filter \(y\) according to a specific range of \(x\). This is very simple to do in R. Say we want the values of \(y\) when \(x\) is greater than 10 and less than or equal to 30. This is what that code looks like:
# First I'm generating the y and x variables
y <- c("Anna", "Bertha", "Cecilia", "Diana",
       "Elizabeth", "Fran", "Gaby", "Helen")
x <- c(5, 10, 15, 20, 25, 30, 35, 40)
# Let's make the database now
db <- data.frame(y,x, stringsAsFactors = F)
db
## y x
## 1 Anna 5
## 2 Bertha 10
## 3 Cecilia 15
## 4 Diana 20
## 5 Elizabeth 25
## 6 Fran 30
## 7 Gaby 35
## 8 Helen 40
# Now let's filter the values of y according to x>10 and x<=30
db$y[x > 10 & x <= 30]
## [1] "Cecilia" "Diana" "Elizabeth" "Fran"
While working on the project I notice that I have to create several different ranges of \(x\) for different purposes. Writing db$y[x > 10 & x <= 30]
is not very long, but it is not the most readable piece of code, and I can certainly make mistakes/typos while copy-pasting this line all over a project. To avoid that, I’m going to write a function instead:
myFilter <- function(db, varY, varX, minX, maxX) {
  x <- db[, varX]
  return(db[x > minX & x <= maxX, varY])
}
myFilter(db, "y", "x", 10, 30)
## [1] "Cecilia" "Diana" "Elizabeth" "Fran"
# Now let's test it with different ranges
myFilter(db, "y", "x", 0, 20)
## [1] "Anna" "Bertha" "Cecilia" "Diana"
myFilter(db, "y", "x", 20, 35)
## [1] "Elizabeth" "Fran" "Gaby"
As you can see, this is easier to read and manage. I would probably not call it myFilter in an actual application, as that’s a very ambiguous name, but you get the point.
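One refinement worth knowing: arguments can be given default values, so a bound can be omitted entirely. Here is a sketch (filterRange and its defaults are illustrative, not part of the lesson):

```r
# Defaults make arguments optional: minX and maxX fall back to -Inf and
# Inf, so omitting a bound leaves that side of the range unrestricted.
# (filterRange is an illustrative name, not from the lesson.)
filterRange <- function(db, varY, varX, minX = -Inf, maxX = Inf) {
  x <- db[, varX]
  db[x > minX & x <= maxX, varY]
}

db <- data.frame(y = c("Anna", "Bertha", "Cecilia", "Diana"),
                 x = c(5, 10, 15, 20),
                 stringsAsFactors = FALSE)
filterRange(db, "y", "x", maxX = 10)  # only the lower range: "Anna" "Bertha"
filterRange(db, "y", "x")             # no bounds given: all four names
```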
R comes with a large set of functions pre-installed, although their naming is sufficiently idiosyncratic that you are unlikely to guess the name of a function (except for something as simple as mean) on your own; it is much better to look it up online.
v <- c(2,4,1,5)
mv <- mean(v)
mv
[1] 3
mean() is a function that takes a vector as its input and outputs the mean (a scalar).
To get a list of the functions in R’s base install, you can write
library(help = "base")
To get help on a given function (e.g., mean), write
help(mean)
# or
?mean
In RStudio that should automatically open the Help pane with the definitions for the given function. In the help file, the Arguments are the function’s inputs, and the Values are its outputs. In addition to the data taken as input, there are usually options for how the function operates. The Usage portion of the Help shows both how the function is used, and what the default inputs are for the various options.
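For example, the Usage entry for round() reads (approximately) round(x, digits = 0), so digits is optional; naming it in the call makes it clear which default you are overriding:

```r
# digits defaults to 0 per the Usage section of help(round),
# so a bare call rounds to a whole number.
round(3.14159)              # default digits = 0 gives 3
round(3.14159, digits = 2)  # override the default by name: 3.14
```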
The reason R has become one of the dominant statistical software tools, though, is not its built-in functions, which are mainly similar to those in many other statistics programs, but the fact that R is open-source and benefits from thousands of user-contributed functions across every domain of statistics and, increasingly, machine learning and computer science more generally.
Where to find packages:
There are two steps:
Install and load existing packages through the Packages pane: To load an existing package, check the box next to the package name in the Packages pane. To install a new package, click on “Install” in the upper-left corner of the Packages pane; this of course requires knowing the package name.
Install and load existing packages through script (recommended): It is usually better to load packages using commands in your R script rather than through the GUI, although this matters less for installing packages, which ideally only happens once. The command for installing a package (here, the ggplot2 package) is:
install.packages("ggplot2")
To load a package in order to be able to use the functions in it, do:
library(ggplot2)
require() also works, and you will see that in many scripts.
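The difference matters when a package might be absent: library() stops with an error, while require() returns FALSE (with a warning), so it can be used inside a condition. A minimal sketch using only base packages, so nothing needs to be installed:

```r
# library() errors out if the package is missing; require() instead
# returns FALSE with a warning, which makes it testable in code.
has_stats <- require("stats", quietly = TRUE)  # base package, always available
ok <- suppressWarnings(require("notARealPackage", quietly = TRUE))
has_stats  # TRUE
ok         # FALSE
```

A common idiom built on this is installing a package only when loading it fails, though for a script you share with others an explicit install.packages() call followed by library() is often clearer.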
Disclaimer: Many packages come with “dependencies” – other packages or functions that they use to get their job done. Luckily, most packages specify their dependencies, and by default these are also installed.
Why do we want to build our own functions: R is a very flexible and open programming language, so you can easily build your own functions and packages and share them with others. We won’t go into building full-scale packages here, but one of the most important skills in R is being able to create your own functions.
If you have done any programming elsewhere, creating a function in R will look familiar. Here is an example of a simple function:
# define the function named "doubleit"
doubleit <- function(x){
  doubled <- x*2
  return(doubled)
}
# now use the function with an input of 7 and save it as sevendoubled
sevendoubled <- doubleit(7)
sevendoubled
[1] 14
A function can have many inputs (a vector, or a set of scalars and vectors and lists, or anything at all) and can return a more complex output such as a list of R objects – just as we saw when examining the Help files for pre-existing functions.
For instance, here is a function that takes two numbers and calculates both their sum and difference:
sumdiff <- function(a,b){
  nsum <- a + b
  ndif <- a - b
  return(c(nsum,ndif))
}
sdoutput <- sumdiff(5,3)
sdoutput
[1] 8 2
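When a function returns several values, a named list is often clearer than a bare vector, since each output can then be retrieved by name rather than by position. A sketch (sumdiff2 is an illustrative name):

```r
# Same computation as sumdiff, but the outputs are labeled,
# so callers do not need to remember which position is which.
sumdiff2 <- function(a, b) {
  list(sum = a + b, diff = a - b)
}
out <- sumdiff2(5, 3)
out$sum   # 8
out$diff  # 2
```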
Write a function that:
An important aspect of how functions work is how they handle missing data (NA). For example, suppose we try to calculate the mean and some of the values in the data are missing. Should the function calculate the mean after eliminating those values, display an error, or simply return NA?
v2 <- c(1,2,3,4,NA,6,8,NA)
v2
[1] 1 2 3 4 NA 6 8 NA
What is the mean of v2?
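By default, mean() propagates missing values; its na.rm argument (listed in the help file’s Usage section) drops them first:

```r
v2 <- c(1, 2, 3, 4, NA, 6, 8, NA)
mean(v2)                # NA: any NA in the input makes the result NA
mean(v2, na.rm = TRUE)  # 4: the two NAs are dropped before averaging
```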
Missing data requires special care, and the way we handle it has major implications for your findings and their interpretation. While it is too early in the semester to get started with some advanced ways to deal with missing data, there are already a few considerations that you should keep in mind at all times when dealing with any dataset:
There are several types of missing data:
Questions we should be asking:
Here we are applying some of the built-in functions and subsetting techniques we learned last week.
Using the built-in function is.na() you can identify missing data in most objects in R.
df <- data.frame(A=c(1, NA, 8, NA),
B=c(3, NA, 88, 23),
C=c(2, 45, 3, 1))
is.na(df)
## A B C
## [1,] FALSE FALSE FALSE
## [2,] TRUE TRUE FALSE
## [3,] FALSE FALSE FALSE
## [4,] TRUE FALSE FALSE
You can combine is.na() with the logical function any() to test whether there is any missing data at all in the database.
any(is.na(df))
## [1] TRUE
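Base R also ships a shortcut, anyNA(), that is equivalent to any(is.na(x)):

```r
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))
anyNA(df)   # TRUE: same answer as any(is.na(df))
```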
The summary() function produces a table of quartiles for each vector in a database. It also shows the number of NA values in each vector.
summary(df)
## A B C
## Min. :1.00 Min. : 3.0 Min. : 1.00
## 1st Qu.:2.75 1st Qu.:13.0 1st Qu.: 1.75
## Median :4.50 Median :23.0 Median : 2.50
## Mean :4.50 Mean :38.0 Mean :12.75
## 3rd Qu.:6.25 3rd Qu.:55.5 3rd Qu.:13.50
## Max. :8.00 Max. :88.0 Max. :45.00
## NA's :2 NA's :1
You can also use the function sum() in combination with is.na() to count the number of missing values:
sum(is.na(df))
## [1] 3
length(is.na(df)) # Be aware of the differences!
## [1] 12
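Since TRUE counts as 1, the same logical matrix also gives per-column counts and the overall proportion of missing cells:

```r
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))
colSums(is.na(df))   # per-column NA counts: A = 2, B = 1, C = 0
mean(is.na(df))      # 3 missing out of 12 cells = 0.25
```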
You can also be positive: instead of focusing on what’s missing, look only at the items that have no missing values. For instance, you can use complete.cases() to find the rows with no missing values:
complete.cases(df)
## [1] TRUE FALSE TRUE FALSE
You can also find the rows with missing values. Note that which(is.na(df)) would return cell positions rather than row numbers, so we negate complete.cases() instead:
missing <- which(!complete.cases(df))  # row indices with at least one NA
df[missing, ]
## A B C
## 2 NA NA 45
## 4 NA 23 1
Subset data, keeping only complete cases
df[complete.cases(df), ]
## A B C
## 1 1 3 2
## 3 8 88 3
#Or:
na.omit(df)
## A B C
## 1 1 3 2
## 3 8 88 3
Sometimes missing values are stored as an empty text string. We should generally replace those (empty) values with NA so that R understands the data is missing:
df <- data.frame(A=c(1, "" , 8, ""),
B=c(3, "", 88, 23),
C=c(2, 45, 3, 1))
df[df == ""] <- NA
df
## A B C
## 1 1 3 2
## 2 <NA> <NA> 45
## 3 8 88 3
## 4 <NA> 23 1
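One caveat with this approach: because the frame was built with empty strings, the affected columns are stored as text, so after the replacement they typically still need to be converted back to numeric. A sketch (with stringsAsFactors = FALSE made explicit so the columns are character rather than factors):

```r
df <- data.frame(A = c(1, "", 8, ""),
                 B = c(3, "", 88, 23),
                 C = c(2, 45, 3, 1),
                 stringsAsFactors = FALSE)
df[df == ""] <- NA          # blanks become NA, but A and B are still character
df$A <- as.numeric(df$A)    # NA entries stay NA; the rest become numbers
df$B <- as.numeric(df$B)
mean(df$A, na.rm = TRUE)    # 4.5: the column behaves as numeric again
```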
Missing data is not always a major statistical problem. As long as the rate at which data are missing is completely random, i.e., not correlated with any variable, you should not worry too much about it. If it is not random, problems arise: for instance, if you are conducting a survey and men generally choose not to answer a particular question, then your sample will be unbalanced for that question, over-representing the answers of women.
There is some specific terminology dedicated to describing the distribution of missing data:
Consider these as suggestions rather than strict guidelines when dealing with missing observations.
Analysis of attrition: Compare whether observations with complete values differ from observations with incomplete values.
If there are no (or few) systematic differences: you should be fine working with missing observations. If a function requires your data to be complete, then eliminating those observations should also be fine, as long as the missing data is not a very large percentage of your sample.
Multiple imputation: plausible values can be ‘guessed’ from values on a series of other observed variables (package ‘mice’ in R). This is generally a bad practice, but it can be used when your sample size is not very large and eliminating an observation or a set of observations is very costly.
In summary: (1) if there are no systematic differences between your missing and non-missing data, you can (a) eliminate the missing data if you have a large sample or (b) impute the missing data if you have a small sample. (2) If there are systematic differences, then your statistical analysis will probably be biased unless you find a way to improve your sample.