R
Vectors are a specific type of object (in the first lab we discussed object types of integer, numeric, character and logical) with dimension \(n\times1\), i.e. a place to store \(n\) observations of a single variable. Vector elements are indexed by positive integers. Many mathematical functions in R are “vectorized”, i.e. we can perform operations on every element of a vector at the same time (e.g. add 3 to each element of a vector). Other functions aggregate the members of a vector (e.g. give the mean of the vector).
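A small sketch of both ideas, using a made-up vector v that is not part of the lab data: arithmetic is applied element by element, while mean() collapses the whole vector to one number.

```r
# Hypothetical example vector (not from the lab data)
v <- c(2, 4, 6, 8)

# Vectorized: 3 is added to every element at once
v_plus3 <- v + 3
v_plus3
## [1]  5  7  9 11

# Aggregating: one number summarizes the whole vector
v_mean <- mean(v)
v_mean
## [1] 5
```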
rep()
We can create a vector with repeated numbers.
help(rep)
z <- rep(1,5)
z
## [1] 1 1 1 1 1
q <- rep(c(1, 2), 5)
q
## [1] 1 2 1 2 1 2 1 2 1 2
t <- rep(c(1,2), c(5, 5))
t
## [1] 1 1 1 1 1 2 2 2 2 2
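The same patterns can also be written with named arguments, which some readers find easier to follow. This is only a sketch; the object names q2, t2 and r5 are made up for illustration.

```r
# times = repeats the whole vector; each = repeats every element in turn
q2 <- rep(c(1, 2), times = 5)   # same pattern as rep(c(1, 2), 5)
t2 <- rep(c(1, 2), each = 5)    # same pattern as rep(c(1, 2), c(5, 5))

# length.out = keeps repeating until the result has the requested length
r5 <- rep(c(1, 2), length.out = 5)
r5
## [1] 1 2 1 2 1
```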
seq()
We can create a vector with a sequence of numbers.
help(seq)
a <- 1:5
a
## [1] 1 2 3 4 5
b <- seq(1,5)
b
## [1] 1 2 3 4 5
c <- seq(5,1)
c
## [1] 5 4 3 2 1
d <- seq(from=0, to=15, by=3)
d
## [1] 0 3 6 9 12 15
e <- seq(0, 15, 3)
e
## [1] 0 3 6 9 12 15
c()
To create a vector with specific numbers or objects inside, we need to use the c() function:
# f <- (0, 5, 10) #This doesn't work: R stops with a syntax error at the first comma
f <- c(0, 5, 10) #This works!!!
f
## [1] 0 5 10
rnorm()
We can also create a vector from randomly generated numbers. Let’s draw 10 random numbers from the standard normal distribution (mean of 0, standard deviation of 1). Note: you can draw from any normal distribution of your choosing by changing the mean= and sd= arguments.
g <- rnorm(10, mean=0, sd=1)
g
## [1] -1.3271844 -1.4665512 0.4433269 -0.6511719 -2.6024570 1.3078938
## [7] 1.5704528 -2.1911601 -0.5491848 0.6541069
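Because the draws are random, your numbers will differ from the ones printed above. If you want the same draws every time, one option is to fix the seed first. A minimal sketch, where the seed value 123 and the names g1, g2 and h are arbitrary choices for illustration:

```r
# Setting the same seed before each call reproduces the same draws
set.seed(123)
g1 <- rnorm(10, mean = 0, sd = 1)
set.seed(123)
g2 <- rnorm(10, mean = 0, sd = 1)
identical(g1, g2)
## [1] TRUE

# A different normal distribution: mean 50, standard deviation 10
h <- rnorm(10, mean = 50, sd = 10)
length(h)
## [1] 10
```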
Sometimes you will want to clear every object you’ve stored in your environment. This is done using the rm() function.
rm(list=ls())
Sometimes we will need to access only certain elements of a vector. Let’s create a new vector x that we will work with to subset:
x <- seq(60, 70, 1)
x
## [1] 60 61 62 63 64 65 66 67 68 69 70
Often subsetting in R involves using brackets [ ].
x[1]
## [1] 60
x[2]
## [1] 61
x[c(1, 3)]
x[c(2, 3, 6)]
x[c(6, 3, 2)]
Often we will use Boolean logical operators in subsetting.
Some operators:
x<60
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x<65
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Note: without using [ ], R simply returns a vector of class logical containing elements of either TRUE or FALSE.
x[x < 65]
## [1] 60 61 62 63 64
x > 60 & x < 65
## [1] FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x[x > 60 & x < 65]
## [1] 61 62 63 64
x[x < 60 | x > 65]
## [1] 66 67 68 69 70
x[x == 60]
## [1] 60
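A logical vector can be used in a few other ways besides subsetting. The sketch below rebuilds the same x as above and shows which() for positions, sum() for counting, and ! for negating a test. The names pos, n_below and not_below are made up for illustration.

```r
x <- seq(60, 70, 1)

# which() returns the positions where the test is TRUE
pos <- which(x < 65)
pos
## [1] 1 2 3 4 5

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_below <- sum(x < 65)
n_below
## [1] 5

# ! flips the test: keep everything that is NOT below 65
not_below <- x[!(x < 65)]
not_below
## [1] 65 66 67 68 69 70
```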
Finding the length of a vector.
n = length(x)
n
## [1] 11
What would the following code do?
x[1:length(x)]
x[3:length(x)]
x[1:n]
x[3:n]
Is an object a vector?
is.vector(x)
## [1] TRUE
Let’s convert an object to a vector. Note: for every class of object that has an “is” function, there is also an “as” function.
vec.x2<-as.vector(x)
is.vector(vec.x2)
## [1] TRUE
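To illustrate the “is”/“as” pairing with another class, here is a short sketch that converts a numeric vector to character and back. The vector num is a made-up example.

```r
# Hypothetical numeric vector
num <- c(1.5, 2, 3)

# Convert to character, then test the class with the matching "is" function
chr <- as.character(num)
is.character(chr)
## [1] TRUE

# Convert back to numeric and check that nothing was lost
back <- as.numeric(chr)
identical(back, num)
## [1] TRUE
```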
Many of R’s statistics functions take vectors as arguments.
mean(x)
## [1] 65
sd(x)
## [1] 3.316625
R
There are many ways to use the matrix() function to create a matrix.
?matrix
x <- matrix(1:12, nrow=3)
x
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Alternatively:
y <- matrix(1:12, ncol=4)
y
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
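Both calls above fill the matrix column by column, which is the default. If you would rather fill row by row, matrix() has a byrow argument. A short sketch, with m_row as a made-up name:

```r
# Fill the same 12 numbers across rows instead of down columns
m_row <- matrix(1:12, nrow = 3, byrow = TRUE)
m_row
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
```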
Alternatively, we can coerce a vector into a matrix:
z <- 1:10
matrix.z <- matrix(z, ncol=5)
matrix.z
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
Let’s create a matrix of zeroes and add 1’s to the diagonal.
matrix.zero <- matrix(0, nrow=5, ncol=5)
matrix.zero
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
diag(matrix.zero) = 1
matrix.zero
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
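The result is the 5 x 5 identity matrix. As a shortcut, calling diag() with a single positive integer should build the same thing in one step. A sketch for comparison, where I5 is a made-up name:

```r
# diag(n) with a single integer n returns the n x n identity matrix
I5 <- diag(5)

# Rebuild the matrix from the steps above and compare
matrix.zero <- matrix(0, nrow = 5, ncol = 5)
diag(matrix.zero) <- 1
same <- isTRUE(all.equal(I5, matrix.zero))
same
## [1] TRUE
```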
Similarly to finding the length of a vector, we often want to find the dimensions of a matrix:
dim(x)
## [1] 3 4
To isolate or look at parts of a matrix we will still use brackets [ ], but now we have two dimensions.
z <- 1:30
matrix.z <- matrix(z, ncol=5)
matrix.z
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 7 13 19 25
## [2,] 2 8 14 20 26
## [3,] 3 9 15 21 27
## [4,] 4 10 16 22 28
## [5,] 5 11 17 23 29
## [6,] 6 12 18 24 30
#Display only the fifth row of matrix.z
matrix.z[5, ]
## [1] 5 11 17 23 29
#Display only the third column
matrix.z[ , 3]
## [1] 13 14 15 16 17 18
#Display the third and fourth columns
matrix.z[ , 3:4]
## [,1] [,2]
## [1,] 13 19
## [2,] 14 20
## [3,] 15 21
## [4,] 16 22
## [5,] 17 23
## [6,] 18 24
#Display the second and fourth columns
matrix.z[ , c(2,4)]
## [,1] [,2]
## [1,] 7 19
## [2,] 8 20
## [3,] 9 21
## [4,] 10 22
## [5,] 11 23
## [6,] 12 24
#Display the first and fifth rows
matrix.z[c(1,5), ]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 7 13 19 25
## [2,] 5 11 17 23 29
#Change the value/s of an element or elements in the matrix
#Change all of column 1 to zeros
matrix.z[ , 1] = 0
matrix.z
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 7 13 19 25
## [2,] 0 8 14 20 26
## [3,] 0 9 15 21 27
## [4,] 0 10 16 22 28
## [5,] 0 11 17 23 29
## [6,] 0 12 18 24 30
#Change all of row 3 to 50
matrix.z[3, ] = 50
matrix.z
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 7 13 19 25
## [2,] 0 8 14 20 26
## [3,] 50 50 50 50 50
## [4,] 0 10 16 22 28
## [5,] 0 11 17 23 29
## [6,] 0 12 18 24 30
#Change row 1, column 4 to 999
matrix.z[1, 4] = 999
matrix.z
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 7 13 999 25
## [2,] 0 8 14 20 26
## [3,] 50 50 50 50 50
## [4,] 0 10 16 22 28
## [5,] 0 11 17 23 29
## [6,] 0 12 18 24 30
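One detail worth knowing: when you pull out a single row or column, R returns a plain vector rather than a matrix (compare the output of matrix.z[ , 3] above). If you need to keep the matrix shape, the drop = FALSE argument does that. A sketch using a fresh matrix m, a name made up for illustration:

```r
m <- matrix(1:30, ncol = 5)

# Default: a single column is simplified to a vector (no dimensions)
col3_vec <- m[, 3]
dim(col3_vec)
## NULL

# drop = FALSE keeps the result as a 6 x 1 matrix
col3_mat <- m[, 3, drop = FALSE]
dim(col3_mat)
## [1] 6 1
```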
We can also create a new matrix by combining columns or rows from a pre-existing matrix. The command cbind combines columns, and the command rbind combines rows.
Note: the matrices need to have the same number of rows to use cbind, and the same number of columns to use rbind.
matrix.a <- matrix(1:25, nrow=5)
matrix.a
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
matrix.b <- matrix(50:74, nrow=5)
matrix.b
## [,1] [,2] [,3] [,4] [,5]
## [1,] 50 55 60 65 70
## [2,] 51 56 61 66 71
## [3,] 52 57 62 67 72
## [4,] 53 58 63 68 73
## [5,] 54 59 64 69 74
#Combine matrix a and b by column.
matrix.c <- cbind(matrix.a, matrix.b)
matrix.c
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 6 11 16 21 50 55 60 65 70
## [2,] 2 7 12 17 22 51 56 61 66 71
## [3,] 3 8 13 18 23 52 57 62 67 72
## [4,] 4 9 14 19 24 53 58 63 68 73
## [5,] 5 10 15 20 25 54 59 64 69 74
#Combine matrix a and b by row.
matrix.d <- rbind(matrix.a, matrix.b)
matrix.d
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
## [6,] 50 55 60 65 70
## [7,] 51 56 61 66 71
## [8,] 52 57 62 67 72
## [9,] 53 58 63 68 73
## [10,] 54 59 64 69 74
#Combine column 1 in matrix a with column 1 of matrix b.
matrix.col1 <- cbind(matrix.a[,c(1)],
matrix.b[,c(1)])
matrix.col1
## [,1] [,2]
## [1,] 1 50
## [2,] 2 51
## [3,] 3 52
## [4,] 4 53
## [5,] 5 54
#Combine row 5 in matrix a with row 3 in matrix b.
matrix.row <- rbind(matrix.a[c(5),],
matrix.b[c(3),])
matrix.row
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 10 15 20 25
## [2,] 52 57 62 67 72
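To see the dimension requirement in action, here is a sketch with two small made-up matrices that share the number of rows but not the number of columns. cbind works, while rbind stops with an error, which we catch with tryCatch() so the script keeps running.

```r
a <- matrix(1:6, nrow = 2)   # 2 x 3
b <- matrix(1:4, nrow = 2)   # 2 x 2

# Same number of rows, so the columns can be placed side by side
ab <- cbind(a, b)
dim(ab)
## [1] 2 5

# Different number of columns, so stacking the rows fails;
# tryCatch() stores the error message instead of stopping
rbind_result <- tryCatch(rbind(a, b),
                         error = function(e) conditionMessage(e))
is.character(rbind_result)
## [1] TRUE
```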
First we will go over addition and subtraction. Recall that we can only add and subtract matrices with the same dimensions.
matrix.a <- matrix(1, ncol=5, nrow=5)
matrix.a
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 1 1 1 1 1
## [3,] 1 1 1 1 1
## [4,] 1 1 1 1 1
## [5,] 1 1 1 1 1
matrix.b <- matrix(5, ncol=5, nrow=5)
matrix.b
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 5 5 5 5
## [2,] 5 5 5 5 5
## [3,] 5 5 5 5 5
## [4,] 5 5 5 5 5
## [5,] 5 5 5 5 5
matrix.a - matrix.b
## [,1] [,2] [,3] [,4] [,5]
## [1,] -4 -4 -4 -4 -4
## [2,] -4 -4 -4 -4 -4
## [3,] -4 -4 -4 -4 -4
## [4,] -4 -4 -4 -4 -4
## [5,] -4 -4 -4 -4 -4
matrix.b - matrix.a
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 4 4 4 4
## [2,] 4 4 4 4 4
## [3,] 4 4 4 4 4
## [4,] 4 4 4 4 4
## [5,] 4 4 4 4 4
matrix.a + matrix.b
## [,1] [,2] [,3] [,4] [,5]
## [1,] 6 6 6 6 6
## [2,] 6 6 6 6 6
## [3,] 6 6 6 6 6
## [4,] 6 6 6 6 6
## [5,] 6 6 6 6 6
To multiply matrices we need the left matrix to have the same number of columns as the number of rows in the right matrix. Instead of * we use %*% to multiply matrices.
Take note of what happens when we try to do the following operations:
matrix.c %*% matrix.d
matrix.d %*% matrix.c
matrix.c <- matrix(3, ncol=4, nrow=5)
matrix.c
## [,1] [,2] [,3] [,4]
## [1,] 3 3 3 3
## [2,] 3 3 3 3
## [3,] 3 3 3 3
## [4,] 3 3 3 3
## [5,] 3 3 3 3
dim(matrix.c)
## [1] 5 4
matrix.d <- matrix(7, ncol=5, nrow=3)
matrix.d
## [,1] [,2] [,3] [,4] [,5]
## [1,] 7 7 7 7 7
## [2,] 7 7 7 7 7
## [3,] 7 7 7 7 7
dim(matrix.d)
## [1] 3 5
matrix.c %*% matrix.d
matrix.d %*% matrix.c
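If you want to check your reading of the two products, the sketch below rebuilds the same two matrices under the made-up names C and D, and uses tryCatch() to capture the error from the product that does not conform.

```r
C <- matrix(3, nrow = 5, ncol = 4)   # 5 x 4, like matrix.c
D <- matrix(7, nrow = 3, ncol = 5)   # 3 x 5, like matrix.d

# (3 x 5) %*% (5 x 4): inner dimensions match, the result is 3 x 4
DC <- D %*% C
dim(DC)
## [1] 3 4
DC[1, 1]   # each entry sums five products of 7 * 3
## [1] 105

# (5 x 4) %*% (3 x 5): inner dimensions 4 and 3 do not match
CD <- tryCatch(C %*% D, error = function(e) conditionMessage(e))
CD
## [1] "non-conformable arguments"
```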
To open files in R we need to specify the directory our data files are stored in. There are two ways to do this: using code or via the dropdown menus (these vary between Windows and Mac).
setwd("~/Dropbox/MathCamp/2020/Lecture2/Lab2/")
To set the working directory in your .Rmd document, you will need to include the following line of code:
knitr::opts_knit$set(root.dir = '~/Dropbox/MathCamp/2020/Lecture2/Lab2')
There are many ways to read files into R, depending on the file type.
Read and Write .csv
?read.csv
data <- read.csv("Seattle_Pet_Licenses.csv")
write.csv(data, file = 'Seattle_Pets_copy.csv',
          row.names = FALSE)
data_copy <- data
save(data_copy, file = 'SeattlePets.rda')
rm(data_copy)
What is the name of the data set that is loaded by the line of code below?
load('SeattlePets.rda')
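If you do not have the pet license file handy, the same read/write and save/load round trip can be tried on a tiny made-up data frame. Everything in this sketch (the toy data frame, the file names, the use of tempdir()) is an assumption for illustration, not part of the lab files.

```r
# Hypothetical toy data, written to a temporary folder
toy <- data.frame(id = 1:3, name = c("a", "b", "c"))
csv_path <- file.path(tempdir(), "toy.csv")
rda_path <- file.path(tempdir(), "toy.rda")

# .csv round trip
write.csv(toy, file = csv_path, row.names = FALSE)
toy_csv <- read.csv(csv_path)

# .rda round trip: save() stores the object together with its name
save(toy, file = rda_path)
rm(toy)
loaded_names <- load(rda_path)   # load() returns the names it restored
loaded_names
## [1] "toy"

# Remove the temporary files again
unlink(c(csv_path, rda_path))
```

Note that load() puts the object back under the name it had when it was saved, so you do not assign its result to a new data frame.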
To load Stata data files we either need to use the package foreign or haven.
library(foreign)
write.dta(data_copy, file = 'SeattlePets.dta') #Save as a stata data frame
rm(data_copy)
data_copy <- read.dta('SeattlePets.dta')
Recall that if you have not installed the foreign package, you can do so using the following line of code.
install.packages('foreign', dependencies = TRUE)
Try Googling! Chances are there’s a package for the file type of your choice. I’ve loaded ASCII, .txt, .xls and .xlsx files, among others.
data.frame
A data.frame is a type of R object used for storing data. It can store non-numeric data as well.
Let’s go through some commands for exploring and viewing data frames.
Test whether an object is a data.frame object.
is.data.frame(data)
## [1] TRUE
is.data.frame(data_copy)
## [1] TRUE
rm(data_copy)
View variable names.
# VARIABLE NAMES
names(data)
## [1] "License.Issue.Date" "License.Number" "Animal.s.Name"
## [4] "Species" "Primary.Breed" "Secondary.Breed"
## [7] "ZIP.Code"
colnames(data)
## [1] "License.Issue.Date" "License.Number" "Animal.s.Name"
## [4] "Species" "Primary.Breed" "Secondary.Breed"
## [7] "ZIP.Code"
rownames(data)
Find dimensions.
dim(data) # this gives rows and then columns (n X p)
## [1] 51754 7
nrow(data)
## [1] 51754
ncol(data)
## [1] 7
length(data) # NOT ADVISED TO USE WITH MATRICES OR DATA FRAMES
## [1] 7
Like we did with vectors and matrices, we may want to select or view partial data frames.
Whether you go down the base R or tidyr/dplyr path is up to you, but I want you to have some familiarity with both.
R vs tidyverse: tidyr, dplyr cheat sheet: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
In base R, we select a column in one of two ways:
library(dplyr)
library(tidyr)
data$Species
data[ , c("Species")]
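Both base R forms return the column as a plain vector. The sketch below shows this on a small made-up data frame called pets, along with two related forms: double brackets, and drop = FALSE if you want a one-column data frame back instead of a vector.

```r
# Hypothetical toy data frame (not the Seattle data)
pets <- data.frame(Species = c("Cat", "Dog", "Cat"),
                   Age = c(2, 5, 7))

s1 <- pets$Species                      # vector
s2 <- pets[, "Species"]                 # same vector
s3 <- pets[["Species"]]                 # same vector again
s4 <- pets[, "Species", drop = FALSE]   # stays a data frame

identical(s1, s2)
## [1] TRUE
is.data.frame(s4)
## [1] TRUE
```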
In the Hadleyverse we use the select function:
select(data, Species)
In base R, we subset data using Boolean logic tests. Here is a new data.frame of all the observations whose species is “Cat”.
head(data)
## License.Issue.Date License.Number Animal.s.Name Species Primary.Breed
## 1 April 19 2003 200097 Tinkerdelle Cat Domestic Shorthair
## 2 February 07 2006 75432 Pepper Cat Manx
## 3 May 21 2014 727943 Ashley Cat Domestic Shorthair
## 4 May 08 2015 833836 Lulu Cat LaPerm
## 5 May 13 2015 361031 My Boy Cat Russian Blue
## 6 July 21 2015 203480 Rocket Cat Domestic Shorthair
## Secondary.Breed ZIP.Code
## 1 98116
## 2 Mix 98103
## 3 98115
## 4 98136
## 5 98121
## 6 98144
table(data$Species)
##
## Cat Dog Goat Pig
## 16829 34882 38 5
cat.base <- data[data$Species == "Cat", ]
dim(cat.base)
## [1] 16829 7
head(cat.base)
## License.Issue.Date License.Number Animal.s.Name Species Primary.Breed
## 1 April 19 2003 200097 Tinkerdelle Cat Domestic Shorthair
## 2 February 07 2006 75432 Pepper Cat Manx
## 3 May 21 2014 727943 Ashley Cat Domestic Shorthair
## 4 May 08 2015 833836 Lulu Cat LaPerm
## 5 May 13 2015 361031 My Boy Cat Russian Blue
## 6 July 21 2015 203480 Rocket Cat Domestic Shorthair
## Secondary.Breed ZIP.Code
## 1 98116
## 2 Mix 98103
## 3 98115
## 4 98136
## 5 98121
## 6 98144
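The same bracket pattern extends to more than one condition, and base R also offers subset() as a shorthand. A sketch on a made-up data frame (pets and its values are hypothetical, not drawn from the Seattle data):

```r
# Hypothetical toy data frame
pets <- data.frame(Species = c("Cat", "Dog", "Cat", "Goat"),
                   ZIP = c(98115, 98103, 98103, 98115))

# Logical test inside [ , ]: rows where the test is TRUE, all columns
cats <- pets[pets$Species == "Cat", ]

# subset() evaluates the condition inside the data frame
cats2 <- subset(pets, Species == "Cat")

# Two conditions combined with &
cats_98103 <- pets[pets$Species == "Cat" & pets$ZIP == 98103, ]

nrow(cats)
## [1] 2
nrow(cats_98103)
## [1] 1
```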
In the Hadleyverse one would use the filter function.
cat.tidy <- filter(data, Species == "Cat")
dim(cat.tidy)
head(cat.tidy)
We can also use what is called “the pipeline” (the %>% operator) to do the same operation:
cat.tidy2 <- data %>% filter(Species == "Cat" )
dim(cat.tidy2)
head(cat.tidy2)
or to do multiple sequential operations.
Note: you must link the sequential functions with %>%. To keep your code clean you will probably want to use multiple lines, but the %>% must come at the end of a line, or R will treat the expression as finished there. What happens if you run this chunk of code?
data %>% filter(Species == "Cat" ) %>%
select(Species)
data %>% filter(Species == "Cat" ) %>%
select(Species)
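One way to see why the position of %>% matters, without needing dplyr or the data loaded, is to ask R only to parse the code as text. This is a sketch: the helper parses() is made up for illustration, and the code strings are never evaluated, only checked for syntax.

```r
# Pipe at the END of the first line: R knows more code is coming
good <- 'data %>% filter(Species == "Cat") %>%\n  select(Species)'

# Pipe at the START of the second line: the first line is already
# a complete expression, so the second line cannot be parsed
bad <- 'data %>% filter(Species == "Cat")\n  %>% select(Species)'

# Hypothetical helper: TRUE if the text is syntactically valid R
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

parses(good)
## [1] TRUE
parses(bad)
## [1] FALSE
```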
caf.data <- read.csv('caffeine.csv', header = T)
head(data)
## License.Issue.Date License.Number Animal.s.Name Species Primary.Breed
## 1 April 19 2003 200097 Tinkerdelle Cat Domestic Shorthair
## 2 February 07 2006 75432 Pepper Cat Manx
## 3 May 21 2014 727943 Ashley Cat Domestic Shorthair
## 4 May 08 2015 833836 Lulu Cat LaPerm
## 5 May 13 2015 361031 My Boy Cat Russian Blue
## 6 July 21 2015 203480 Rocket Cat Domestic Shorthair
## Secondary.Breed ZIP.Code
## 1 98116
## 2 Mix 98103
## 3 98115
## 4 98136
## 5 98121
## 6 98144
caf.data$CaffKg <- caf.data$Caffeine/1000
# dplyr
caf.dplyr <- mutate(caf.data,
CaffKg = Caffeine/1000)
head(caf.dplyr)
caf.dplyr <- caf.dplyr %>%
mutate(CaffKg = Caffeine/1000)
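For comparison, base R also has transform(), which reads much like mutate(). The sketch below uses a small made-up data frame, caf.toy, so it runs without caffeine.csv. The numbers are hypothetical.

```r
# Hypothetical stand-in for the caffeine data
caf.toy <- data.frame(Caffeine = c(30, 40, 50))

# Dollar-sign assignment, as in the base R line above
a <- caf.toy
a$CaffKg <- a$Caffeine / 1000

# transform() adds the new column and returns a new data frame
b <- transform(caf.toy, CaffKg = Caffeine / 1000)

isTRUE(all.equal(a$CaffKg, b$CaffKg))
## [1] TRUE
```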
caf.sum.base <- data.frame(CaffMean = mean(caf.data$Caffeine),
CaffSd = sd(caf.data$Caffeine))
head(caf.sum.base)
## CaffMean CaffSd
## 1 39.32504 5.517254
# Use the "aggregate" function
## Column names might have to be changed afterwards
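## The formula is passed unnamed: the first argument of aggregate's
## formula method is called "formula" before R 4.2.0 and "x" from 4.2.0 on
caf.sum.base <- aggregate(Caffeine ~ 1,
                          data = caf.data,
                          FUN = function(x) c(mean = mean(x), sd = sd(x)))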
head(caf.sum.base)
## Caffeine.mean Caffeine.sd
## 1 39.325042 5.517254
# caf.data is the data frame read in from caffeine.csv above
caf.sum.dplyr <- summarise(caf.data,
                           CaffMean = mean(Caffeine),
                           CaffSD = sd(Caffeine))
head(caf.sum.dplyr)
caf.sum.dplyr <- summarise_at(caf.data,
                              .vars = c("Caffeine"),
                              .funs = c("mean", "sd"))
names(caf.sum.dplyr)
In many cases it’s inconsequential whether you use base R or the tidyverse. Often tidyr and dplyr functions are a bit faster than base R, but I find the summarise function in the tidyverse to be MUCH, MUCH slower than aggregate in base R.
data(mtcars)
mtcars.sum.by <- aggregate(cbind(mpg, wt) ~ cyl + gear,
                           data = mtcars,
                           FUN = function(x){
                             c(mean = mean(x), sd = sd(x))
                           },
                           drop = TRUE)
mtcars.sum.by
## cyl gear mpg.mean mpg.sd wt.mean wt.sd
## 1 4 3 21.5000000 NA 2.4650000 NA
## 2 6 3 19.7500000 2.3334524 3.3375000 0.1732412
## 3 8 3 15.0500000 2.7743959 4.1040833 0.7683069
## 4 4 4 26.9250000 4.8073604 2.3781250 0.6006243
## 5 6 4 19.7500000 1.5524175 3.0937500 0.4131460
## 6 4 5 28.2000000 3.1112698 1.8265000 0.4433560
## 7 6 5 19.7000000 NA 2.7700000 NA
## 8 8 5 15.4000000 0.5656854 3.3700000 0.2828427
mtcars$total <- 1
mtcars.sum.by2 <- aggregate(total ~ cyl + gear,
                            data = mtcars,
                            FUN = sum, drop = TRUE)
mtcars.sum.by2
## cyl gear total
## 1 4 3 1
## 2 6 3 2
## 3 8 3 12
## 4 4 4 8
## 5 6 4 4
## 6 4 5 2
## 7 6 5 1
## 8 8 5 2
table(mtcars$cyl, mtcars$gear)
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
mtcars.sum.dplyr <- mtcars %>%
group_by(cyl, gear) %>%
summarise(mpg.mean = mean(mpg),
mpg.sd = sd(mpg),
wt.mean = mean(wt),
wt.sd = sd(wt),
total = n()) %>%
ungroup()
mtcars.sum.dplyr
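As an aside, the group counts can also be obtained in base R without first adding a column of ones, by aggregating any complete column with length. This is a sketch of one alternative, not a replacement for the approach above. The name counts is made up.

```r
data(mtcars)

# Count the rows in each cyl x gear group by taking the length of mpg
counts <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = length)
names(counts)[3] <- "total"   # the count column is named after mpg by default

# 8 non-empty groups, and the counts add up to all 32 cars
nrow(counts)
## [1] 8
sum(counts$total)
## [1] 32
```

Groups with no observations (8 cylinders and 4 gears) do not appear here, whereas table() reports them as 0.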
summary()
The summary() function will summarize variables in a data set, based on their class.
summary(data)
## License.Issue.Date License.Number Animal.s.Name Species
## July 24 2018 : 346 21091 : 2 Lucy : 434 Cat :16829
## November 07 2017: 291 S100636: 2 Luna : 395 Dog :34882
## January 16 2018 : 286 S102467: 2 Charlie: 376 Goat: 38
## August 07 2018 : 276 S104231: 2 Bella : 327 Pig : 5
## December 05 2017: 239 S104449: 2 : 294
## March 20 2018 : 237 S104953: 2 Daisy : 264
## (Other) :50079 (Other):51742 (Other):49664
## Primary.Breed Secondary.Breed ZIP.Code
## Domestic Shorthair : 9819 :26842 98115 : 4537
## Retriever, Labrador : 4636 Mix :13511 98103 : 4394
## Domestic Medium Hair : 2030 Poodle, Standard : 1149 98117 : 3804
## Retriever, Golden : 1872 Poodle, Miniature : 909 98125 : 2798
## Chihuahua, Short Coat: 1859 Retriever, Labrador : 885 98122 : 2480
## Domestic Longhair : 1317 Chihuahua, Short Coat: 423 98107 : 2426
## (Other) :30221 (Other) : 8035 (Other):31315
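To see how the class drives the result, here is a sketch that calls summary() on three columns of a small made-up data frame: one numeric, one factor and one character. The data frame toy and its values are hypothetical.

```r
# Hypothetical toy data with one column of each class
toy <- data.frame(weight = c(4.2, 5.1, 30.5, 12),
                  species = factor(c("Cat", "Cat", "Dog", "Dog")),
                  name = c("Lucy", "Luna", "Charlie", "Bella"),
                  stringsAsFactors = FALSE)

num_sum <- summary(toy$weight)    # numeric: min, quartiles, mean, max
fac_sum <- summary(toy$species)   # factor: counts per level
chr_sum <- summary(toy$name)      # character: length, class and mode only

names(num_sum)
## [1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."
fac_sum
## Cat Dog 
##   2   2 
```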
min(caf.data$Caffeine)
## [1] 28.43
max(caf.data$Caffeine)
## [1] 52.54
mean(caf.data$Caffeine)
## [1] 39.32504
sd(caf.data$Caffeine)
## [1] 5.517254
var(caf.data$Caffeine)
## [1] 30.44009
sqrt(var(caf.data$Caffeine)) # same as the sd
## [1] 5.517254
median(caf.data$Caffeine)
## [1] 38.78
quantile(caf.data$Caffeine,0.5)
## 50%
## 38.78
quantile(caf.data$Caffeine,0.25)
## 25%
## 34.865
quantile(caf.data$Caffeine,0.75)
## 75%
## 43.815
quantile(caf.data$Caffeine,c(0.25,0.5,0.75))
## 25% 50% 75%
## 34.865 38.780 43.815
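These functions are related to one another, which the sketch below checks on a made-up vector so that it runs without caffeine.csv. The values in caf.toy are hypothetical.

```r
# Hypothetical caffeine-like measurements
caf.toy <- c(28, 31, 35, 38, 40, 44, 52)

# The interquartile range is the 75% quantile minus the 25% quantile
q <- quantile(caf.toy, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
isTRUE(all.equal(iqr_manual, IQR(caf.toy)))
## [1] TRUE

# The median is the 50% quantile
isTRUE(all.equal(unname(quantile(caf.toy, 0.5)), median(caf.toy)))
## [1] TRUE

# The standard deviation is the square root of the variance
isTRUE(all.equal(sqrt(var(caf.toy)), sd(caf.toy)))
## [1] TRUE
```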