Vectors are a specific type of object (in the first lab we discussed the object types integer, numeric, character and logical) with dimension \(n\times1\), i.e. a place to store \(n\) observations of a single variable. Vector elements are indexed by positive integers. Many mathematical functions in R are “vectorized”, i.e. we can perform an operation on every element of a vector at the same time (e.g. add 3 to each element of a vector). Other functions aggregate the elements of a vector (e.g. give the mean of the vector).
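To make the two ideas concrete, here is a minimal sketch using a small made-up vector `v` (the object names are only for illustration and are not used elsewhere in the lab). The first operation is vectorized, the second aggregates.

```r
# Hypothetical example vector (not part of the lab's data)
v <- c(2, 4, 6, 8)

# Vectorized: 3 is added to every element at once
v_plus3 <- v + 3
v_plus3

# Aggregating: a single number that summarizes the whole vector
v_mean <- mean(v)
v_mean
```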
rep()
We can create a vector with repeated numbers.
help(rep)
z <- rep(1,5)
z
## [1] 1 1 1 1 1
q <- rep(c(1, 2), 5)
q
##  [1] 1 2 1 2 1 2 1 2 1 2
t <- rep(c(1,2), c(5, 5))
t
##  [1] 1 1 1 1 1 2 2 2 2 2
seq()
We can create a vector with a sequence of numbers.
help(seq)
a <- 1:5
a
## [1] 1 2 3 4 5
b <- seq(1,5)
b
## [1] 1 2 3 4 5
c <- seq(5,1)
c
## [1] 5 4 3 2 1
d <- seq(from=0, to=15, by=3)
d
## [1]  0  3  6  9 12 15
e <- seq(0, 15, 3)
e
## [1]  0  3  6  9 12 15
c()
To create a vector with specific numbers or objects inside, we need to use the c() function:
f <- (0, 5, 10)  #This doesn't work
f <- c(0, 5, 10) #This works!!!
f
## [1]  0  5 10
rnorm()
We can also create a vector from randomly generated numbers. Let’s draw 10 random numbers from the standard normal distribution (mean of 0, standard deviation of 1). Note: you can draw from any normal distribution of your choosing by changing the mean= and sd= arguments.
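Because rnorm() draws at random, the numbers you get will differ from run to run. If you need reproducible draws, one common approach is to call set.seed() first. The sketch below assumes an arbitrary seed of 123, and the object names are made up for illustration.

```r
# Setting the same seed before each draw reproduces the same numbers
set.seed(123)                          # 123 is an arbitrary choice
draw1 <- rnorm(10, mean = 0, sd = 1)

set.seed(123)
draw2 <- rnorm(10, mean = 0, sd = 1)

identical(draw1, draw2)                # the two draws match exactly

# Drawing from a different normal distribution: change mean= and sd=
shifted <- rnorm(5, mean = 100, sd = 15)
shifted
```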
g <- rnorm(10, mean=0, sd=1)
g
##  [1] -1.3271844 -1.4665512  0.4433269 -0.6511719 -2.6024570  1.3078938
##  [7]  1.5704528 -2.1911601 -0.5491848  0.6541069
Sometimes you will want to clear every object you’ve stored in your environment. This is done using the rm() function.
rm(list=ls())
Sometimes we will need to access only certain elements of a vector. Let’s create a new vector x that we will work with to subset:
x <- seq(60, 70, 1)
x
##  [1] 60 61 62 63 64 65 66 67 68 69 70
Often subsetting in R involves using brackets [ ].
x[1]
## [1] 60
x[2]
## [1] 61
x[c(1, 3)]
x[c(2, 3, 6)]
x[c(6, 3, 2)]
Often we will use Boolean logical operators in subsetting.
Some operators:
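As a quick reference, the sketch below applies R's standard comparison and logical operators to a small made-up vector `v` (not part of the lab's data). Each one works element by element and returns a logical vector.

```r
v <- c(1, 5, 10)   # hypothetical example vector

v < 5            # less than
v <= 5           # less than or equal to
v > 5            # greater than
v >= 5           # greater than or equal to
v == 5           # equal to (note the double equals sign)
v != 5           # not equal to
v > 1 & v < 10   # and: both conditions must hold
v < 5 | v > 5    # or: at least one condition must hold
!(v == 5)        # not: flips TRUE and FALSE
```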
x<60
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x<65
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Note: without using [ ], R simply returns a vector of class logical containing elements of either TRUE or FALSE.
x[x < 65]
## [1] 60 61 62 63 64
x > 60 & x < 65
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x[x > 60 & x < 65]
## [1] 61 62 63 64
x[x < 60 | x > 65]
## [1] 66 67 68 69 70
x[x == 60]
## [1] 60
Finding the length of a vector.
n = length(x)
n
## [1] 11
What would the following code do?
x[1:length(x)]
x[3:length(x)]
x[1:n]
x[3:n]
Is an object a vector?
is.vector(x)
## [1] TRUE
Let’s convert an object to a vector. Note: for every class of object that has an “is” function, there is also an “as” function.
vec.x2<-as.vector(x)
is.vector(vec.x2)
## [1] TRUE
Many of R’s statistics functions take vectors as arguments.
mean(x)
## [1] 65
sd(x)
## [1] 3.316625
There are many ways to use the matrix() function to create a matrix.
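One option worth knowing about is the byrow argument. By default matrix() fills the matrix column by column; setting byrow = TRUE fills it row by row instead. A minimal sketch, with made-up object names by_col and by_row:

```r
# Default: values fill down the first column, then the second, and so on
by_col <- matrix(1:6, nrow = 2)
by_col

# byrow = TRUE: values fill across the first row, then the second
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row

# Same dimensions, different arrangement of the same six values
dim(by_col)
dim(by_row)
```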
?matrix
x <- matrix(1:12, nrow=3)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
Alternatively:
y <- matrix(1:12, ncol=4)
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
Alternatively, we can coerce a vector into a matrix:
z <- 1:10
matrix.z <- matrix(z, ncol=5)
matrix.z
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
Let’s create a matrix of zeroes and add 1’s to the diagonal.
matrix.zero <- matrix(0, nrow=5, ncol=5)
matrix.zero
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0
diag(matrix.zero) = 1
matrix.zero
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1
Similarly to finding the length of a vector, we often want to find the dimensions of a matrix.
dim(x)
## [1] 3 4
To isolate or look at parts of a matrix we will still use brackets [ ], but now we have two dimensions.
z <- 1:30
matrix.z <- matrix(z, ncol=5)
matrix.z
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    7   13   19   25
## [2,]    2    8   14   20   26
## [3,]    3    9   15   21   27
## [4,]    4   10   16   22   28
## [5,]    5   11   17   23   29
## [6,]    6   12   18   24   30
#Display only the fifth row of matrix.z
matrix.z[5, ]
## [1]  5 11 17 23 29
#Display only the third column
matrix.z[ , 3]
## [1] 13 14 15 16 17 18
#Display the third and fourth columns
matrix.z[ , 3:4]
##      [,1] [,2]
## [1,]   13   19
## [2,]   14   20
## [3,]   15   21
## [4,]   16   22
## [5,]   17   23
## [6,]   18   24
#Display the second and fourth columns
matrix.z[ , c(2,4)]
##      [,1] [,2]
## [1,]    7   19
## [2,]    8   20
## [3,]    9   21
## [4,]   10   22
## [5,]   11   23
## [6,]   12   24
#Display the first and fifth rows
matrix.z[c(1,5), ]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    7   13   19   25
## [2,]    5   11   17   23   29
#Change the value/s of an element or elements in the matrix
#Change all of column 1 to zeros
matrix.z[ , 1] = 0
matrix.z
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    7   13   19   25
## [2,]    0    8   14   20   26
## [3,]    0    9   15   21   27
## [4,]    0   10   16   22   28
## [5,]    0   11   17   23   29
## [6,]    0   12   18   24   30
#Change all of row 3 to 50
matrix.z[3, ] = 50
matrix.z
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    7   13   19   25
## [2,]    0    8   14   20   26
## [3,]   50   50   50   50   50
## [4,]    0   10   16   22   28
## [5,]    0   11   17   23   29
## [6,]    0   12   18   24   30
#Change row 1, column 4 to 999
matrix.z[1, 4] = 999
matrix.z
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    7   13  999   25
## [2,]    0    8   14   20   26
## [3,]   50   50   50   50   50
## [4,]    0   10   16   22   28
## [5,]    0   11   17   23   29
## [6,]    0   12   18   24   30
We can also create a new matrix by combining columns or rows from pre-existing matrices. The command cbind combines columns, and the command rbind combines rows.
Note: The matrices need to have the same number of rows to use cbind, and the same number of columns to use rbind.
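The dimension requirement is easy to check on small matrices: cbind places matrices side by side, so their row counts must agree, while rbind stacks them, so their column counts must agree. The sketch below uses made-up matrices m1, m2 and m3, and wraps the failing call in tryCatch() so that the script keeps running.

```r
m1 <- matrix(1:6,   nrow = 2)   # 2 x 3
m2 <- matrix(7:10,  nrow = 2)   # 2 x 2
m3 <- matrix(11:16, ncol = 3)   # 2 x 3

# cbind: both have 2 rows, so the columns are placed side by side
wide <- cbind(m1, m2)
dim(wide)

# rbind: both have 3 columns, so the rows are stacked
tall <- rbind(m1, m3)
dim(tall)

# rbind fails when the column counts differ (3 versus 2)
res <- tryCatch(rbind(m1, m2),
                error = function(e) "error: column counts differ")
res
```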
matrix.a <- matrix(1:25, nrow=5)
matrix.a
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25
matrix.b <- matrix(50:74, nrow=5)
matrix.b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   50   55   60   65   70
## [2,]   51   56   61   66   71
## [3,]   52   57   62   67   72
## [4,]   53   58   63   68   73
## [5,]   54   59   64   69   74
#Combine matrix a and b by column.
matrix.c <- cbind(matrix.a, matrix.b)
matrix.c
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    6   11   16   21   50   55   60   65    70
## [2,]    2    7   12   17   22   51   56   61   66    71
## [3,]    3    8   13   18   23   52   57   62   67    72
## [4,]    4    9   14   19   24   53   58   63   68    73
## [5,]    5   10   15   20   25   54   59   64   69    74
#Combine matrix a and b by row.
matrix.d <- rbind(matrix.a, matrix.b)
matrix.d
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    6   11   16   21
##  [2,]    2    7   12   17   22
##  [3,]    3    8   13   18   23
##  [4,]    4    9   14   19   24
##  [5,]    5   10   15   20   25
##  [6,]   50   55   60   65   70
##  [7,]   51   56   61   66   71
##  [8,]   52   57   62   67   72
##  [9,]   53   58   63   68   73
## [10,]   54   59   64   69   74
#Combine column 1 in matrix a with column 1 of matrix b.
matrix.col1 <- cbind(matrix.a[,c(1)],
                     matrix.b[,c(1)])
matrix.col1
##      [,1] [,2]
## [1,]    1   50
## [2,]    2   51
## [3,]    3   52
## [4,]    4   53
## [5,]    5   54
#Combine row 5 in matrix a with row 3 in matrix b.
matrix.row <- rbind(matrix.a[c(5),], 
                    matrix.b[c(3),])
matrix.row
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   10   15   20   25
## [2,]   52   57   62   67   72
First we will go over addition and subtraction. Recall, we can only add and subtract matrices with the same dimensions.
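To see what the dimension requirement means in practice, here is a small sketch with made-up matrices a2, b2 and c3. Addition works element by element when the dimensions match, and R stops with an error when they do not; tryCatch() captures the error message so the script can continue.

```r
a2 <- matrix(1:4, nrow = 2)            # 2 x 2
b2 <- matrix(10, nrow = 2, ncol = 2)   # 2 x 2
c3 <- matrix(1, nrow = 3, ncol = 3)    # 3 x 3

# Same dimensions: each element of a2 is added to the matching element of b2
sum_ok <- a2 + b2
sum_ok

# Different dimensions: R refuses to add the matrices
res <- tryCatch(a2 + c3, error = function(e) conditionMessage(e))
res
```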
matrix.a <- matrix(1, ncol=5, nrow=5)
matrix.a
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    1    1    1    1    1
## [3,]    1    1    1    1    1
## [4,]    1    1    1    1    1
## [5,]    1    1    1    1    1
matrix.b <- matrix(5, ncol=5, nrow=5)
matrix.b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5    5    5    5    5
## [2,]    5    5    5    5    5
## [3,]    5    5    5    5    5
## [4,]    5    5    5    5    5
## [5,]    5    5    5    5    5
matrix.a - matrix.b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   -4   -4   -4   -4   -4
## [2,]   -4   -4   -4   -4   -4
## [3,]   -4   -4   -4   -4   -4
## [4,]   -4   -4   -4   -4   -4
## [5,]   -4   -4   -4   -4   -4
matrix.b - matrix.a
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    4    4    4    4
## [2,]    4    4    4    4    4
## [3,]    4    4    4    4    4
## [4,]    4    4    4    4    4
## [5,]    4    4    4    4    4
matrix.a + matrix.b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    6    6    6    6    6
## [2,]    6    6    6    6    6
## [3,]    6    6    6    6    6
## [4,]    6    6    6    6    6
## [5,]    6    6    6    6    6
To multiply matrices we need the left matrix to have the same number of columns as the number of rows in the right matrix. Instead of * we use %*% to multiply matrices.
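A minimal sketch of this rule, using made-up matrices A, B and S: we check conformability with ncol() and nrow() before multiplying, and we compare * (element by element) with %*% (matrix product) on a small square matrix.

```r
A <- matrix(1:6,  nrow = 2)   # 2 x 3
B <- matrix(1:12, nrow = 3)   # 3 x 4

# Columns of the left matrix must equal rows of the right matrix
ncol(A) == nrow(B)

AB <- A %*% B                 # (2 x 3) times (3 x 4) gives 2 x 4
dim(AB)

# Reversing the order fails: B has 4 columns but A has 2 rows
BA <- tryCatch(B %*% A, error = function(e) "non-conformable")
BA

# * and %*% are different operations
S <- matrix(1:4, nrow = 2)
elementwise <- S * S          # squares each element
product     <- S %*% S        # rows of S times columns of S
elementwise
product
```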
Take note of what happens when we try to do the following operations:
matrix.c %*% matrix.d
matrix.d %*% matrix.c
matrix.c <- matrix(3, ncol=4, nrow=5)
matrix.c
##      [,1] [,2] [,3] [,4]
## [1,]    3    3    3    3
## [2,]    3    3    3    3
## [3,]    3    3    3    3
## [4,]    3    3    3    3
## [5,]    3    3    3    3
dim(matrix.c)
## [1] 5 4
matrix.d <- matrix(7, ncol=5, nrow=3)
matrix.d
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    7    7    7    7    7
## [2,]    7    7    7    7    7
## [3,]    7    7    7    7    7
dim(matrix.d)
## [1] 3 5
matrix.c %*% matrix.d
matrix.d %*% matrix.c
To open files in R we need to specify the directory our data files are stored in. There are two ways to do this: using code or via the dropdown menus (this will vary by Windows or Mac).
setwd("~/Dropbox/MathCamp/2020/Lecture2/Lab2/")
To set the working directory in your .Rmd document, you will need to include the following line of code:
knitr::opts_knit$set(root.dir = '~/Dropbox/MathCamp/2020/Lecture2/Lab2')
There are many ways to read in files to R, depending on the file type.
Read and Write .csv
?read.csv
data <- read.csv("Seattle_Pet_Licenses.csv")
write.csv(data, file = 'Seattle_Pets_copy.csv',
          row.names = FALSE)
data_copy <- data
save(data_copy, file = 'SeattlePets.rda')
rm(data_copy)
What is the name of the data set that is loaded by the line of code below?
load('SeattlePets.rda')
To load Stata data files we need to use either the package foreign or the package haven.
library(foreign)
write.dta(data_copy, file = 'SeattlePets.dta') #Save as a stata data frame
rm(data_copy)
data_copy <- read.dta('SeattlePets.dta')
Recall if you have not installed the foreign package you can do so using the following line of code.
install.packages('foreign', dependencies = TRUE)
Try Googling! Chances are there’s a package for the file type of your choice. I’ve loaded ASCII files, .txt, .xls, .xlsx files among others.
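Some file types need no extra package at all. As a minimal sketch, the code below writes a small made-up data frame to a tab-delimited .txt file in a temporary location and reads it back with base R's read.delim(). The objects toy, tmp and toy_back are hypothetical and are not part of the lab's data.

```r
# A small made-up data frame to write out
toy <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Write it as a tab-delimited text file in a temporary location
tmp <- tempfile(fileext = ".txt")
write.table(toy, file = tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# read.delim() expects a header row and tab separators by default
toy_back <- read.delim(tmp, stringsAsFactors = FALSE)
toy_back

# Clean up the temporary file
unlink(tmp)
```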
data.frame
A data.frame is a type of R object used for storing data. It can store non-numeric data as well.
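To illustrate the point about non-numeric data, here is a minimal sketch that builds a small data frame by hand. The data frame pets and its values are made up for illustration; each column has its own class, which a matrix would not allow.

```r
# A hypothetical data frame mixing character, numeric and logical columns
pets <- data.frame(
  name     = c("Rex", "Tom", "Bea"),
  species  = c("Dog", "Cat", "Dog"),
  age      = c(3, 7, 1),
  licensed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# str() shows the class of each column alongside the first few values
str(pets)

# The class of every column, one per variable
col_classes <- sapply(pets, class)
col_classes
```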
Let’s go through some commands for exploring and viewing data frames.
Test whether an object is a data.frame object.
is.data.frame(data)
## [1] TRUE
is.data.frame(data_copy)
## [1] TRUE
rm(data_copy)
View variable names.
# VARIABLE NAMES
names(data)
## [1] "License.Issue.Date" "License.Number"     "Animal.s.Name"     
## [4] "Species"            "Primary.Breed"      "Secondary.Breed"   
## [7] "ZIP.Code"
colnames(data)
## [1] "License.Issue.Date" "License.Number"     "Animal.s.Name"     
## [4] "Species"            "Primary.Breed"      "Secondary.Breed"   
## [7] "ZIP.Code"
rownames(data)
Find dimensions.
dim(data) # this gives rows and then columns (n X p)
## [1] 51754     7
nrow(data)
## [1] 51754
ncol(data)
## [1] 7
length(data) # NOT ADVISED TO USE WITH MATRICES OR DATA FRAMES
## [1] 7
Like we did with vectors and matrices, we may want to select or view partial data frames.
Whether you go down the base R or tidyr/dplyr path is up to you, but I want you to have some familiarity with both.
R vs tidyverse: tidyr, dplyr cheat sheet: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
In base R, we select a column in one of two ways:
library(dplyr)
library(tidyr)
data$Species
data[ , c("Species")]
In the Hadleyverse we use the select function:
select(data, Species)
In base R, we subset data using Boolean logic tests. Here is a new data.frame of all the observations whose species is “Cat”.
head(data)
##   License.Issue.Date License.Number Animal.s.Name Species      Primary.Breed
## 1      April 19 2003         200097   Tinkerdelle     Cat Domestic Shorthair
## 2   February 07 2006          75432        Pepper     Cat               Manx
## 3        May 21 2014         727943        Ashley     Cat Domestic Shorthair
## 4        May 08 2015         833836          Lulu     Cat             LaPerm
## 5        May 13 2015         361031        My Boy     Cat       Russian Blue
## 6       July 21 2015         203480        Rocket     Cat Domestic Shorthair
##   Secondary.Breed ZIP.Code
## 1                    98116
## 2             Mix    98103
## 3                    98115
## 4                    98136
## 5                    98121
## 6                    98144
table(data$Species)
## 
##   Cat   Dog  Goat   Pig 
## 16829 34882    38     5
cat.base <- data[data$Species == "Cat", ]
dim(cat.base)
## [1] 16829     7
head(cat.base)
##   License.Issue.Date License.Number Animal.s.Name Species      Primary.Breed
## 1      April 19 2003         200097   Tinkerdelle     Cat Domestic Shorthair
## 2   February 07 2006          75432        Pepper     Cat               Manx
## 3        May 21 2014         727943        Ashley     Cat Domestic Shorthair
## 4        May 08 2015         833836          Lulu     Cat             LaPerm
## 5        May 13 2015         361031        My Boy     Cat       Russian Blue
## 6       July 21 2015         203480        Rocket     Cat Domestic Shorthair
##   Secondary.Breed ZIP.Code
## 1                    98116
## 2             Mix    98103
## 3                    98115
## 4                    98136
## 5                    98121
## 6                    98144
In the Hadleyverse one would use the filter function.
cat.tidy <- filter(data, Species == "Cat")
dim(cat.tidy)
head(cat.tidy)
We can also use what is called “the pipeline” to do the same operation:
cat.tidy2 <- data %>% filter(Species == "Cat" )
dim(cat.tidy2)
head(cat.tidy2)
or to do multiple sequential operations.
Note: you must link the sequential functions by %>%. To make your code clean you probably want to use multiple lines, but the %>% must come at the end of a line or R will end your operation. What happens if you run this chunk of code?
data %>% filter(Species == "Cat" ) %>%
  select(Species)
data %>% filter(Species == "Cat" ) %>%
  select(Species)
caf.data <- read.csv('caffeine.csv', header = TRUE)
head(data)
##   License.Issue.Date License.Number Animal.s.Name Species      Primary.Breed
## 1      April 19 2003         200097   Tinkerdelle     Cat Domestic Shorthair
## 2   February 07 2006          75432        Pepper     Cat               Manx
## 3        May 21 2014         727943        Ashley     Cat Domestic Shorthair
## 4        May 08 2015         833836          Lulu     Cat             LaPerm
## 5        May 13 2015         361031        My Boy     Cat       Russian Blue
## 6       July 21 2015         203480        Rocket     Cat Domestic Shorthair
##   Secondary.Breed ZIP.Code
## 1                    98116
## 2             Mix    98103
## 3                    98115
## 4                    98136
## 5                    98121
## 6                    98144
caf.data$CaffKg <- caf.data$Caffeine/1000
# dplyr
caf.dplyr <- mutate(caf.data,
                     CaffKg = Caffeine/1000)
head(caf.dplyr)
caf.dplyr <- caf.dplyr %>% 
  mutate(CaffKg = Caffeine/1000)
caf.sum.base <- data.frame(CaffMean = mean(caf.data$Caffeine),
                           CaffSd = sd(caf.data$Caffeine))
head(caf.sum.base)
##   CaffMean   CaffSd
## 1 39.32504 5.517254
# Use the "aggregate" function
## Column names might have to be changed afterwards
# The formula is passed by position: this argument was renamed from 'formula' to 'x' in R 4.2.0
caf.sum.base <- aggregate(Caffeine ~ 1, 
                           data = caf.data,
                           FUN = function(x) c(mean = mean(x), sd = sd(x)))
head(caf.sum.base)
##   Caffeine.mean Caffeine.sd
## 1     39.325042    5.517254
caf.sum.dplyr <- summarise(caf.data, 
                            CaffMean = mean(Caffeine),
                            CaffSD  = sd(Caffeine))
head(caf.sum.dplyr)
caf.sum.dplyr <- summarise_at(caf.data, 
                               .vars = c("Caffeine"), 
                               .funs = c("mean", "sd"))
names(caf.sum.dplyr)
In many cases it’s inconsequential whether you use base R or the tidyverse. Often tidyr and dplyr functions are a bit faster than base R, but I find the summarise function in the tidyverse to be MUCH, MUCH slower than aggregate in base R.
data(mtcars)
mtcars.sum.by <- aggregate(cbind(mpg, wt) ~ cyl + gear, 
          data = mtcars, 
          FUN = function(x){
            c(mean = mean(x), sd = sd(x))
          },
          drop = T)
mtcars.sum.by
##   cyl gear   mpg.mean     mpg.sd   wt.mean     wt.sd
## 1   4    3 21.5000000         NA 2.4650000        NA
## 2   6    3 19.7500000  2.3334524 3.3375000 0.1732412
## 3   8    3 15.0500000  2.7743959 4.1040833 0.7683069
## 4   4    4 26.9250000  4.8073604 2.3781250 0.6006243
## 5   6    4 19.7500000  1.5524175 3.0937500 0.4131460
## 6   4    5 28.2000000  3.1112698 1.8265000 0.4433560
## 7   6    5 19.7000000         NA 2.7700000        NA
## 8   8    5 15.4000000  0.5656854 3.3700000 0.2828427
mtcars$total <- 1
mtcars.sum.by2 <- aggregate(total ~ cyl + gear, 
          data = mtcars, 
          FUN = sum, drop = TRUE)
mtcars.sum.by2
##   cyl gear total
## 1   4    3     1
## 2   6    3     2
## 3   8    3    12
## 4   4    4     8
## 5   6    4     4
## 6   4    5     2
## 7   6    5     1
## 8   8    5     2
table(mtcars$cyl, mtcars$gear)
##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
mtcars.sum.dplyr <- mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(mpg.mean = mean(mpg),
            mpg.sd = sd(mpg),
            wt.mean = mean(wt),
            wt.sd = sd(wt),
            total = n()) %>% 
  ungroup()
mtcars.sum.dplyr
summary()
The summary() function will summarize variables in a data set, based on their class.
summary(data)
##         License.Issue.Date License.Number  Animal.s.Name   Species     
##  July 24 2018    :  346    21091  :    2   Lucy   :  434   Cat :16829  
##  November 07 2017:  291    S100636:    2   Luna   :  395   Dog :34882  
##  January 16 2018 :  286    S102467:    2   Charlie:  376   Goat:   38  
##  August 07 2018  :  276    S104231:    2   Bella  :  327   Pig :    5  
##  December 05 2017:  239    S104449:    2          :  294               
##  March 20 2018   :  237    S104953:    2   Daisy  :  264               
##  (Other)         :50079    (Other):51742   (Other):49664               
##                Primary.Breed                Secondary.Breed     ZIP.Code    
##  Domestic Shorthair   : 9819                        :26842   98115  : 4537  
##  Retriever, Labrador  : 4636   Mix                  :13511   98103  : 4394  
##  Domestic Medium Hair : 2030   Poodle, Standard     : 1149   98117  : 3804  
##  Retriever, Golden    : 1872   Poodle, Miniature    :  909   98125  : 2798  
##  Chihuahua, Short Coat: 1859   Retriever, Labrador  :  885   98122  : 2480  
##  Domestic Longhair    : 1317   Chihuahua, Short Coat:  423   98107  : 2426  
##  (Other)              :30221   (Other)              : 8035   (Other):31315
min(caf.data$Caffeine)
## [1] 28.43
max(caf.data$Caffeine)
## [1] 52.54
mean(caf.data$Caffeine)
## [1] 39.32504
sd(caf.data$Caffeine)
## [1] 5.517254
var(caf.data$Caffeine)
## [1] 30.44009
sqrt(var(caf.data$Caffeine)) # same as the sd
## [1] 5.517254
median(caf.data$Caffeine)
## [1] 38.78
quantile(caf.data$Caffeine,0.5)
##   50% 
## 38.78
quantile(caf.data$Caffeine,0.25)
##    25% 
## 34.865
quantile(caf.data$Caffeine,0.75)
##    75% 
## 43.815
quantile(caf.data$Caffeine,c(0.25,0.5,0.75))
##    25%    50%    75% 
## 34.865 38.780 43.815
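These functions are closely related to one another. As a minimal sketch on the built-in mtcars data (the object names mpg, q and iqr_manual are made up for illustration), we can check that the 50% quantile is the median, that the gap between the 75% and 25% quantiles is what IQR() reports, and that the square root of the variance is the standard deviation.

```r
mpg <- mtcars$mpg   # miles per gallon from the built-in mtcars data

# The three quartiles in one call
q <- quantile(mpg, c(0.25, 0.5, 0.75))
q

# The 50% quantile is the median
all.equal(unname(q[2]), median(mpg))

# The interquartile range is the 75% quantile minus the 25% quantile
iqr_manual <- unname(q[3] - q[1])
all.equal(iqr_manual, IQR(mpg))

# The standard deviation is the square root of the variance
all.equal(sqrt(var(mpg)), sd(mpg))

# summary() reports the minimum, quartiles, mean and maximum together
summary(mpg)
```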