Office of Advanced Research Computing (OARC)
Statistical Methods and
Data Analytics
Interpreted Execution: Unlike languages like C++ or FORTRAN, R doesn’t require compilation. It operates as an interpreted language, executing code line by line.
Versatility Beyond Statistics: While R serves as statistical software like SAS, Stata, or SPSS, it stands out by offering more. R isn’t limited to statistics; it’s a dynamic programming language that enables users to craft customized routines and functions tailored to their unique requirements.
Key Points:
R’s popularity rests on its adaptability to data analysis and statistical tasks.
R’s interpreter allows interactive coding and rapid testing.
R bridges the gap between statistical analysis and programming.
Efficient Reusability: Avoid code duplication by reusing program segments instead of copying and pasting them. Copy-pasting leads to lengthy, error-prone code that’s difficult to navigate.
Method Evaluation: Simulate experiments to assess method performance accurately.
Tailored Data Manipulation: While R offers powerful data tools, custom data modifications for specific situations are sometimes necessary.
Code Comprehension: Understand and work with provided R code efficiently.
Method Development: Create new methods or adapt existing ones to suit your needs.
Code Aesthetics: Enhance code readability and appearance.
Closing Thought:
Mastering R empowers you to streamline processes, customize analyses, and contribute to a community of data-driven professionals.
Data Storage: In R, information is stored in memory as objects. Each object has a specific structure type defining its behavior and allowed operations.
Naming and Value: Creating an object involves storing a value or values in memory and assigning a name as a reference to access these values.
Visibility: Objects created in R become visible in the RStudio environment pane, aiding in managing and interacting with them.
Temporary Workspace: During an R session, all objects reside in a temporary workspace, allowing for efficient data manipulation.
Workspace Persistence: To retain objects between sessions, the workspace can be saved, enabling loading in future R sessions.
Key Takeaway:
Objects are the building blocks of R programming, enabling efficient data storage, manipulation, and analysis.
Names must not begin with a number.
Avoid special symbols like ^, !, $, @, +, -, /, :, or * in names; most of them are operators in R, and leaving them out keeps code readable.
R object names are case sensitive, distinguishing between lowercase and uppercase characters.
You can’t use any of the reserved words, like TRUE, NULL, if, and function (see the complete list in ?Reserved).
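As a rough check of these naming rules, the sketch below uses base R's make.names(), which rewrites a string into a syntactically valid name; a candidate is treated as valid when make.names() leaves it unchanged. The helper is_valid_name and the candidate strings are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: a string is a valid name if make.names() leaves it unchanged
is_valid_name <- function(x) make.names(x) == x

# One valid name, one starting with a number, one with a special symbol, one reserved word
candidates <- c("my.number", "2die", "my-number", "TRUE")
valid <- is_valid_name(candidates)
print(valid)

# Names are case sensitive: 'a' and 'A' are two different objects
a <- 1
A <- 2
print(a == A)
```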
When we pass the value of an object to a new object, we are not generating a new value in the computer memory; rather, we are establishing a new reference to the same object.
By typing an object’s name, we print its values in the RStudio console.
## [1] 1
# Create an object named 'my.number' with value 7
my.number <- 7
# Assign the value of 'my.number' to 'x'
x <- my.number
# Print the values of 'x' and 'my.number'
# c() combines the values into one vector, which is then printed
c(x, my.number)
## [1] 7 7
## [1] 5 6
## [1] 5
Create an object named die to store all possible values of a die. Pass the value of die to a new object, call it new.die. Print die and new.die in the RStudio console.
#################### Exercise 1 ######################
#creating a die vector from 1 to 6
die <- c(1,2,3,4,5,6)
#assign it to new object name new.die
new.die <- die
#print
die
## [1] 1 2 3 4 5 6
## [1] 1 2 3 4 5 6
The simplest type of object in R is an atomic vector: a vector whose elements all have the same data type.
is.vector: tests whether an object is a vector.
length: returns the length of a vector.
typeof: identifies the type of a vector.
Using the ‘c’ Function: The ‘c’ function combines vector values into a new vector.
There are six basic types of vectors: double, integer, character, logical, complex, and raw.
double: stores a numerical value with decimals.
integer: stores an integer without decimals.
character: stores text. If a character string is present in an atomic vector, R will convert everything else to character strings.
logical: stores Boolean data as TRUE and FALSE. If a vector contains both logicals and numbers, R will convert every TRUE to 1 and every FALSE to 0.
complex: stores complex numbers.
raw: stores raw bytes.
Here are some examples of vector types.
## [1] TRUE
## [1] "double"
## [1] 1 -2
## [1] "integer"
#If in mathematical calculations of integers we include a double the result is double.
typeof(a * die)
## [1] "double"
## [1] "john" "mary" "2"
## [1] "character"
## [1] "logical"
## [1] TRUE FALSE FALSE
## [1] FALSE
## [1] "complex"
## [1] 00 00 00
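The calls that produced the output above are not shown, so the following is only a sketch of how each vector type and the coercion rules can be checked. The object names (int_vec, txt, lgl, mixed, cplx, bytes) are assumptions chosen for illustration.

```r
die <- c(1, 2, 3, 4, 5, 6)
is.vector(die)                # TRUE
typeof(die)                   # "double"

int_vec <- c(1L, -2L)         # the L suffix creates integers
typeof(int_vec)               # "integer"
typeof(int_vec * 1.5)         # integer times double gives "double"

txt <- c("john", "mary", 2)   # the number 2 is coerced to the string "2"
typeof(txt)                   # "character"

lgl <- c(TRUE, FALSE, FALSE)
typeof(lgl)                   # "logical"
mixed <- c(TRUE, FALSE, 5)    # logicals are coerced to 1 and 0
typeof(mixed)                 # "double"

cplx <- 1 + 2i
typeof(cplx)                  # "complex"

bytes <- raw(3)               # three zero bytes
typeof(bytes)                 # "raw"
```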
Attributes are used to store additional properties beyond the main data of an object.
names: Names are an attribute that provides labels for each element in the object.
dim: Matrices and arrays have a dimension attribute (dim) that specifies the number of rows and columns. We talk about matrices and arrays next.
class: The class attribute indicates the type or class of the object. It is used to determine how R should treat and handle the object. For example, the class of a data frame is “data.frame”.
levels: For factors, the levels attribute defines the possible values of the factor. Factors are used to represent categorical data.
Missing Values: Missing values in objects like vectors or data frames are not stored as attributes. NA is a special value that occupies an element’s position in the data itself, and is.na() detects it.
Key Takeaway:
Attributes are used to store additional information beyond the main data of an object. You can access attributes using functions like attr(), names(), dim(), class(), and so on. Understanding and utilizing attributes is crucial for effective data manipulation and analysis in R.
## NULL
# Assign names to the 'die' vector
names(die) <- c("one","two", "three", "four", "five", "six")
#Print the values and attributes of 'die'
die
## one two three four five six
## 1 2 3 4 5 6
## $names
## [1] "one" "two" "three" "four" "five" "six"
## [1] 1 2 3 4 5 6
## NULL
# Changing the dim attribute reshapes 'die' into a 2 by 3 matrix
dim(die) <- c(2,3)
# Print the reshaped 'die' matrix
die
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## $dim
## [1] 2 3
## , , 1
##
## [,1] [,2]
## [1,] 1 2
##
## , , 2
##
## [,1] [,2]
## [1,] 3 4
##
## , , 3
##
## [,1] [,2]
## [1,] 5 6
## [1] 1 2 3 4 5 6
A class is basically an attribute of an object that defines the structure of the object in R. It is used to determine how R (functions) should treat and handle the object.
Notice that changing the dimensions of a vector does not change its type, but it does change its class.
For a plain atomic vector, the class is the same as its type, with one exception: the class of a double vector is numeric.
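The difference between type and class can be checked directly. This is a small sketch using a fresh vector v (a name assumed here) so that it does not depend on earlier objects.

```r
v <- c(1, 2, 3, 4, 5, 6)
type_before <- typeof(v)    # "double"
class_before <- class(v)    # "numeric"

# Adding a dim attribute turns the vector into a matrix
dim(v) <- c(2, 3)
type_after <- typeof(v)     # still "double"
class_after <- class(v)     # contains "matrix" (c("matrix", "array") in R 4.0 and later)

print(c(type_before, type_after))
print(class_after)
```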
Key Takeaway:
Classes are crucial because they define how objects respond to various operations and functions. Different classes may have different behavior even if they store similar underlying data. Understanding the class of an object is important for effective programming and data analysis in R.
A factor is the way R stores categorical data.
To make a factor, we pass an atomic vector into the factor() function.
R stores the factor data as integers and adds a levels attribute.
The levels hold the label for each integer code; R uses them to display the factor values.
# Create a factor variable 'gender' with two levels: "male" and "female"
gender <- factor(c("male", "female", "female", "male"))
# Print the values and attributes of 'gender'
gender
## [1] male female female male
## Levels: female male
## [1] male female female male
## Levels: male female
## [1] "integer"
## $levels
## [1] "female" "male"
##
## $class
## [1] "factor"
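By default the levels are sorted alphabetically, which is why "female" comes first in the output above. The sketch below shows the integer codes behind a factor and how the levels argument of factor() sets a different order; gender2 is a name assumed for illustration.

```r
gender <- factor(c("male", "female", "female", "male"))
levels(gender)        # "female" "male" (alphabetical order)
as.integer(gender)    # 2 1 1 2: each value is stored as the position of its level

# Setting the order of the levels explicitly
gender2 <- factor(c("male", "female", "female", "male"),
                  levels = c("male", "female"))
levels(gender2)       # "male" "female"
as.integer(gender2)   # 1 2 2 1
```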
Definition: In R, a matrix is a two-dimensional data structure, organized into rows and columns, designed to hold elements of the same data type.
Creation: Matrices can be created using the matrix() function. Specify data, number of rows (nrow), and number of columns (ncol).
Operations: Matrices enable fundamental linear algebra operations, like addition and multiplication.
Usage: Matrices are utilized for tasks such as data organization, statistical computations, and solving systems of linear equations.
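A brief sketch of these operations, using a small matrix A and vector b that are made up for illustration: + and * work element by element, %*% is matrix multiplication, and solve() solves a system of linear equations.

```r
# A 2 by 2 matrix (filled by column) and a right-hand side vector
A <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)
b <- c(3, 5)

A + A        # element-wise addition
A * A        # element-wise multiplication
A %*% A      # matrix multiplication
t(A)         # transpose

# Solve the linear system A x = b
x <- solve(A, b)
x            # 0.8 1.4
```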
# Create a 2 by 3 matrix 'm' using 'die'
m <- matrix(die, nrow = 2, ncol = 3)
# Print the matrix 'm'
m
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
# Create a 2 by 3 matrix 'm' using 'die', filling by rows
m <- matrix(die, nrow = 2, ncol = 3, byrow = TRUE)
# Print the matrix 'm'
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
Definition: An array in R is a multi-dimensional data structure that extends the concept of matrices to more than two dimensions. It can store elements of the same data type.
Structure: Arrays have dimensions defined by their rows, columns, and additional dimensions. The number of elements in each dimension can vary.
Creation: Arrays can be created using the array() function. Specify data, dimensions (dim), and optionally dimension names (dimnames).
Operations: Arrays support similar operations as matrices, including mathematical computations and indexing along multiple dimensions.
Usage: Arrays are valuable for tasks requiring multi-dimensional data representation, like image data, time-series data, and numerical simulations.
# Create a 2 by 2 by 3 array 'ar' with values 1 to 12
ar <- array(1:12, dim = c(2, 2, 3))
# Print the array 'ar'
ar
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 9 11
## [2,] 10 12
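Indexing along the three dimensions of this array can be sketched as follows. The dimension names (r1, c1, k1, and so on) are assumptions added only to show how dimnames works.

```r
ar <- array(1:12, dim = c(2, 2, 3))
dim(ar)            # 2 2 3

# One element: row 2, column 1, layer 3
ar[2, 1, 3]        # 10

# A blank index keeps the whole dimension: the second layer as a 2 by 2 matrix
layer2 <- ar[, , 2]

# Optional dimension names allow indexing by name
dimnames(ar) <- list(c("r1", "r2"), c("c1", "c2"), c("k1", "k2", "k3"))
ar["r2", "c1", "k3"]   # 10
```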
To store multiple vectors of different types but the same length, we use a data frame (data.frame). Like matrices, data frames are two-dimensional objects, so we extract an element the same way as in a matrix.
Because columns can hold different types of data, data.frame is the most useful structure for storing data in data analysis.
In data analysis, the columns of a data frame are different variables and each row represents a unit of observation.
For example, to represent 3 students of a class with their gender, enrollment status, and GPA score, we use a data.frame:
# Create a data frame 'gpa.data' with various variables
gpa.data <- data.frame(student_number = c(1, 2, 3), Gender = c("F", "F", "M"),
GPA = c(3.5, 3.7, 3.6), Enroll = c(TRUE, TRUE, FALSE))
# Print the data frame 'gpa.data'
gpa.data
## student_number Gender GPA Enroll
## 1 1 F 3.5 TRUE
## 2 2 F 3.7 TRUE
## 3 3 M 3.6 FALSE
Now we are ready to create a data frame that represents a deck of playing cards. This data frame has 52 rows, each representing a single card, and 3 columns: one for the card’s suit, one for its face, and one for its value from 1 to 13.
A deck of playing cards has 4 different suits and 13 faces for each suit, with values from 1 to 13.
The function rep(x, times, each) is helpful here, so we do not have to type a value many times.
To get help for a function in R, we put ? before the function name, for example: ?rep.
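The difference between the each and times arguments of rep() is easiest to see on a short vector. This sketch uses a two-element vector named two_suits, which is assumed here purely for illustration.

```r
two_suits <- c("spades", "hearts")

# each: repeat every element before moving on to the next one
by_each <- rep(two_suits, each = 3)
by_each    # "spades" "spades" "spades" "hearts" "hearts" "hearts"

# times: repeat the whole vector
by_times <- rep(two_suits, times = 3)
by_times   # "spades" "hearts" "spades" "hearts" "spades" "hearts"
```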
#create a vector of 4 suits
suit <- c("spades", "heart", "clubs", "dimonds")
#create a vector of 13 faces
face <- c("king", "queen", "jack", "ten", "nine", "eight", "seven", "six",
"five", "four", "three", "two", "ace")
#create a vector of values 13 to 1
value <- 13:1
#putting together all variables
deck <- data.frame( suit = rep(suit, each = 13), face = rep(face, times = 4), value = rep(value, times = 4))
deck
## suit face value
## 1 spades king 13
## 2 spades queen 12
## 3 spades jack 11
## 4 spades ten 10
## 5 spades nine 9
## 6 spades eight 8
## 7 spades seven 7
## 8 spades six 6
## 9 spades five 5
## 10 spades four 4
## 11 spades three 3
## 12 spades two 2
## 13 spades ace 1
## 14 heart king 13
## 15 heart queen 12
## 16 heart jack 11
## 17 heart ten 10
## 18 heart nine 9
## 19 heart eight 8
## 20 heart seven 7
## 21 heart six 6
## 22 heart five 5
## 23 heart four 4
## 24 heart three 3
## 25 heart two 2
## 26 heart ace 1
## 27 clubs king 13
## 28 clubs queen 12
## 29 clubs jack 11
## 30 clubs ten 10
## 31 clubs nine 9
## 32 clubs eight 8
## 33 clubs seven 7
## 34 clubs six 6
## 35 clubs five 5
## 36 clubs four 4
## 37 clubs three 3
## 38 clubs two 2
## 39 clubs ace 1
## 40 dimonds king 13
## 41 dimonds queen 12
## 42 dimonds jack 11
## 43 dimonds ten 10
## 44 dimonds nine 9
## 45 dimonds eight 8
## 46 dimonds seven 7
## 47 dimonds six 6
## 48 dimonds five 5
## 49 dimonds four 4
## 50 dimonds three 3
## 51 dimonds two 2
## 52 dimonds ace 1
Lists group objects of the same or different classes into a one-dimensional set.
Lists are very flexible because each element can have a different type and its own attributes, such as dimensions.
# Create a list 'l' containing various objects
l <- list(die, ar, m, gpa.data, gender, list("a", "b"))
# Print the list 'l'
l
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 9 11
## [2,] 10 12
##
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
##
## [[4]]
## student_number Gender GPA Enroll
## 1 1 F 3.5 TRUE
## 2 2 F 3.7 TRUE
## 3 3 M 3.6 FALSE
##
## [[5]]
## [1] male female female male
## Levels: female male
##
## [[6]]
## [[6]][[1]]
## [1] "a"
##
## [[6]][[2]]
## [1] "b"
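Elements of a list can be pulled out in more than one way. The following sketch uses a small list with element names (numbers, label, nested) that are assumptions for illustration: single brackets return a shorter list, while double brackets and $ return the element itself.

```r
small_list <- list(numbers = c(1, 2, 3), label = "die", nested = list("a", "b"))

sub <- small_list[1]        # single brackets: a list of length 1
elem <- small_list[[1]]     # double brackets: the numeric vector itself
small_list$label            # $ extracts by name: "die"

# Double brackets can be chained to reach into a nested list
inner <- small_list[["nested"]][[2]]
inner                       # "b"
```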
To extract values from a data object, we use brackets; the indices within the brackets, separated by commas, specify which values to extract.
Data Frame Example: Consider a data frame named deck. To extract values from it, you can use deck[ , ], with the row indices before the comma and the column indices after it.
Index Types:
Indices are usually integer vectors, but they can also be:
Negative Integers
Blank Spaces
Logical Values
Names
vector:
Syntax: vector[index]
Example: my_vector[3]
matrix:
Syntax: matrix[row_index, col_index]
Example: my_matrix[2, 1]
array:
Syntax: array[index1, index2, index3]
Example: my_array[1, 3, 2]
data.frame:
Syntax: data_frame[row_index, col_index]
Example: my_data_frame[4, "Age"]
list:
Syntax: list_element[[index]]
Example: my_list[[2]]
## [1] "spades"
## suit face value
## 2 spades queen 12
## 1 spades king 13
## suit face value
## 3 spades jack 11
## suit face value
## 3 spades jack 11
## suit face value
## 3 spades jack 11
# Extract values in row 3 and columns 1 and 2 of 'deck' using logical indices
deck[3, c(TRUE, TRUE, FALSE)]
## suit face
## 3 spades jack
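The index types listed earlier (positive integers, negative integers, blank spaces, logical values, and names) can all be tried on one small data frame. This is a sketch on a three-row data frame called cards, which is assumed here so the example runs on its own.

```r
cards <- data.frame(face = c("king", "queen", "jack"),
                    value = c(13, 12, 11),
                    stringsAsFactors = FALSE)

cards[2, 1]                  # positive integers: row 2, column 1 ("queen")
cards[-1, ]                  # negative integer: every row except row 1
cards[3, ]                   # blank column index: all columns of row 3
cards[cards$value > 11, ]    # logical values: rows where the test is TRUE
cards[3, "face"]             # names: column selected by name ("jack")
```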
Now we will see how logical indices can be useful.
If we want to extract only the cards with a value of 7 or more, how can we do it?
First we test which values in deck are greater than or equal to 7. Then we extract the rows for which the test is TRUE.
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE
# Create a new 'deck7' with values greater than or equal to 7
deck7 <- deck[deck$value >= 7, ]
deck7
## suit face value
## 1 spades king 13
## 2 spades queen 12
## 3 spades jack 11
## 4 spades ten 10
## 5 spades nine 9
## 6 spades eight 8
## 7 spades seven 7
## 14 heart king 13
## 15 heart queen 12
## 16 heart jack 11
## 17 heart ten 10
## 18 heart nine 9
## 19 heart eight 8
## 20 heart seven 7
## 27 clubs king 13
## 28 clubs queen 12
## 29 clubs jack 11
## 30 clubs ten 10
## 31 clubs nine 9
## 32 clubs eight 8
## 33 clubs seven 7
## 40 dimonds king 13
## 41 dimonds queen 12
## 42 dimonds jack 11
## 43 dimonds ten 10
## 44 dimonds nine 9
## 45 dimonds eight 8
## 46 dimonds seven 7
In R, you can easily modify values within objects, such as vectors, matrices, and data frames.
Use the assignment operator (<- or =) to replace values.
Modify multiple elements simultaneously.
Apply conditions for selective modifications.
%in% Operator: Tests whether elements in one vector are present in another.
Syntax: element %in% vector_to_check
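A short sketch of %in% combined with a conditional modification. The vectors faces and values are made up for illustration.

```r
faces <- c("king", "two", "ace", "seven")
values <- c(13, 2, 1, 7)

# TRUE where the face is one of the listed cards
is_special <- faces %in% c("king", "queen", "jack", "ace")
is_special                   # TRUE FALSE TRUE FALSE

# Replace only the selected values
values[is_special] <- 10
values                       # 10 2 10 7
```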
In gpa.data, we add 0.1 to the GPA of every student whose Enroll value is TRUE.
## student_number Gender GPA Enroll
## 1 1 F 3.5 TRUE
## 2 2 F 3.7 TRUE
## 3 3 M 3.6 FALSE
# Identify rows where 'Enroll' is TRUE
ind.enroll <- gpa.data[, 4] == TRUE
# Add 0.1 to 'GPA' for rows where 'Enroll' is TRUE
gpa.data[ind.enroll, "GPA"] <- gpa.data[ind.enroll, "GPA"] + 0.1
# Print the modified 'gpa.data'
gpa.data
## student_number Gender GPA Enroll
## 1 1 F 3.6 TRUE
## 2 2 F 3.8 TRUE
## 3 3 M 3.6 FALSE
Shuffle the deck of playing cards randomly and select the first five cards. Name the remaining cards dealer, and refer to the five selected cards as the player.
Hint: This task may seem a bit tricky, but it’s not overly difficult. Try using the sample function without replacement, choosing from numbers 1 to 52, and using these as new indices for the rows.
Modify the deck of playing cards to suit the game of Blackjack. To achieve this, replace the values for king, queen, jack, and ace with a value of 10.
Shuffle the modified deck of playing cards randomly and deal the first card.
# Sample without replacement from 1 to 52 and extract rows from 'deck'
shuffled.deck <- deck[sample(1:52), ]
# Print the shuffled 'deck'
shuffled.deck
## suit face value
## 42 dimonds jack 11
## 33 clubs seven 7
## 8 spades six 6
## 5 spades nine 9
## 4 spades ten 10
## 51 dimonds two 2
## 15 heart queen 12
## 18 heart nine 9
## 19 heart eight 8
## 28 clubs queen 12
## 34 clubs six 6
## 20 heart seven 7
## 24 heart three 3
## 9 spades five 5
## 48 dimonds five 5
## 3 spades jack 11
## 17 heart ten 10
## 25 heart two 2
## 27 clubs king 13
## 50 dimonds three 3
## 35 clubs five 5
## 7 spades seven 7
## 46 dimonds seven 7
## 41 dimonds queen 12
## 47 dimonds six 6
## 37 clubs three 3
## 14 heart king 13
## 49 dimonds four 4
## 52 dimonds ace 1
## 1 spades king 13
## 22 heart five 5
## 45 dimonds eight 8
## 13 spades ace 1
## 32 clubs eight 8
## 30 clubs ten 10
## 12 spades two 2
## 16 heart jack 11
## 39 clubs ace 1
## 43 dimonds ten 10
## 21 heart six 6
## 31 clubs nine 9
## 29 clubs jack 11
## 6 spades eight 8
## 2 spades queen 12
## 38 clubs two 2
## 10 spades four 4
## 11 spades three 3
## 44 dimonds nine 9
## 40 dimonds king 13
## 36 clubs four 4
## 26 heart ace 1
## 23 heart four 4
# Select the first five cards as 'player'
player <- shuffled.deck[1:5,]
# Print the first five cards for 'player'
head(player)
## suit face value
## 42 dimonds jack 11
## 33 clubs seven 7
## 8 spades six 6
## 5 spades nine 9
## 4 spades ten 10
# Keep the remaining cards as 'dealer'
dealer <- shuffled.deck[-(1:5),]
# Print the first six of the remaining cards for 'dealer'
head(dealer)
## suit face value
## 51 dimonds two 2
## 15 heart queen 12
## 18 heart nine 9
## 19 heart eight 8
## 28 clubs queen 12
## 34 clubs six 6
# Create a copy of 'deck' named 'deck2'
deck2 <- deck
# Identify cards that are king, queen, jack, or ace and set their values to 10
ind <- deck2$face == "king" | deck2$face == "queen" | deck2$face == "jack" | deck2$face == "ace"
deck2$value[ind] <- 10
# Shorter approach using '%in%'
deck2$value[deck2$face %in% c("king", "queen", "jack", "ace")] <- 10
# Shuffle 'deck2'
shuffled.deck2 <- deck2[sample(1:52), ]
# Extract the first card from shuffled 'deck2'
shuffled.deck2[1, ]
## suit face value
## 33 clubs seven 7
Introduction: In R, we use NA to represent missing values in our data.
Detecting Missing Values:
The is.na()
function checks for missing values.
It returns a logical vector indicating if elements are missing.
Replacing Missing Values:
You can replace NA with a specified value using assignment.
Example:
data[data_column_name][is.na(data[data_column_name])] <- new_value
NA values can be removed with the na.omit() function; for a data frame, it drops every row that contains at least one NA.
Example:
clean_data <- na.omit(data)
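The steps above can be sketched on a small data frame. The data frame scores and its columns are assumptions made up for this example. The attributes() call also shows that NA is stored as a value in the data, not as an attribute.

```r
scores <- data.frame(id = 1:3, gpa = c(3.5, NA, 3.6))

is.na(scores$gpa)                  # FALSE TRUE FALSE
attributes(scores$gpa)             # NULL: NA is a value, not an attribute

mean(scores$gpa)                   # NA, because one value is missing
mean(scores$gpa, na.rm = TRUE)     # 3.55, ignoring the missing value

# Remove every row that contains an NA
clean <- na.omit(scores)

# Or replace the missing value with 0 in a copy
filled <- scores
filled$gpa[is.na(filled$gpa)] <- 0
```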
R comes with many functions that you can use to perform sophisticated tasks.
For example, you can round a number with the round function or calculate its factorial with the factorial function. Using a function is straightforward: write the function’s name followed by the data you want the function to operate on in parentheses.
We’ve already used the function c to concatenate vectors.
## [1] 3
## [1] 3.14
## [1] 6
## [1] 3.5
## [1] 4
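The calls behind the output above were not preserved. The following is one plausible set of calls that gives the same five results; it is an assumption, not necessarily the original code.

```r
round(3.1415)               # 3
round(3.1415, digits = 2)   # 3.14
factorial(3)                # 6
mean(1:6)                   # 3.5
round(mean(1:6))            # 4
```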
To roll the die randomly, you can use the sample function.
To access the help for the sample function, use ?sample.
You can specify which data should be assigned to each argument by setting a name equal to the data.
If you don’t explicitly name your arguments, R will match your values to the function’s arguments based on order.
Some function arguments have default values. If not set, these arguments use the default value.
The default value for the replace argument in the sample function is FALSE.
## [1] 6 2 3 5 4 1
# Sample values from 'die' six times with replacement (simulating dice rolling)
sample(die, size = 6, replace = TRUE)
## [1] 1 2 3 6 4 5
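Argument matching by name, by position, and through defaults can be compared directly. This sketch calls set.seed() before each draw only so that the three calls are comparable; the seed value is arbitrary.

```r
die <- 1:6

# Arguments matched by name
set.seed(1)
by_name <- sample(x = die, size = 2)

# Arguments matched by position
set.seed(1)
by_order <- sample(die, 2)

# Named arguments may appear in any order
set.seed(1)
swapped <- sample(size = 2, x = die)

# With the defaults (size = length of x, replace = FALSE) the result is a permutation
no_repeat <- sample(die)
```

All three seeded calls return the same two values, because they are the same call written in different ways.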
Introduction: Generic functions in R are special functions designed to work with different data types, providing a consistent interface for diverse objects.
Purpose: Generic functions enable users to perform similar operations on various types of data without needing to know the specific type beforehand.
Usage: You call the same function name on different objects, and R figures out the appropriate method to use based on the object’s class.
Advantages:
Customizes behavior of generic functions for specialized tasks.
Promotes efficient and appropriate handling of different data types.
Example: summary() is a generic function that provides summaries for different types of data, such as vectors, data frames, and models.
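To illustrate how a generic picks a method from the class of its argument, here is a minimal sketch of a home-made generic. The function describe and its methods are hypothetical; they are not part of R and are defined only for this example. UseMethod() is the base R mechanism that performs the dispatch.

```r
# A hypothetical generic function: UseMethod() dispatches on the class of x
describe <- function(x, ...) UseMethod("describe")

# Fallback method, used when no class-specific method exists
describe.default <- function(x, ...) "an object with no specific method"

# Method for numeric vectors
describe.numeric <- function(x, ...) paste("a numeric vector of length", length(x))

# Method for factors
describe.factor <- function(x, ...) paste("a factor with", nlevels(x), "levels")

describe(c(1.5, 2.5))
describe(factor(c("a", "b", "a")))
describe("text")
```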
Introduction: In R, you can make your own functions that do specific tasks. A function is like a mini-program.
Function Structure: A function includes the following three elements:
1. Name: The name of the function.
2. Arguments: The input information given to the function.
3. Function body: The core code that processes the inputs and produces the output.
Return Value: Use return() to provide results. However, if we omit the return() statement, the value of the last evaluated expression becomes the return value of the function.
Key Takeaway:
Creating your own functions in R helps you build step-by-step instructions to solve problems and keeps your code neat and tidy.
Example: a function called average_two_number that adds two numbers and divides the sum by 2.
# Define a function 'average_two_number' to calculate the average of two numbers
average_two_number <- function(a, b){
x <- (a + b) / 2
print(x)
}
# Calculate and print the average of 2 and 4
average_two_number(2, 4)
## [1] 3
# Define a function 'C_to_F' to convert Celsius to Fahrenheit
C_to_F <- function(c = 0){
f <- c * 9/5 + 32
return(f)
}
# Convert 30 degrees Celsius to Fahrenheit
f <- C_to_F(30)
# Print the converted temperature
f
## [1] 86
## [1] 32
## [1] "function"
## function(c = 0){
## f <- c * 9/5 + 32
## return(f)
## }
## <bytecode: 0x00000281005fbee8>
a- Create a function called roll_die that rolls a die and returns its value.
b- Create a function that rolls two six-sided dice and returns the sum of the two dice; call it roll_die_2.
c- Modify your function to roll a die with any given number of sides a given number of times and return the sum of the dice; call it roll_die_k.
# Define a function 'roll_die' to simulate rolling a single die
roll_die <- function(){
#Create a vector of 1 to 6 for each side of die
die <- 1:6
#Sample one number from 1 to 6
die <- sample(die, size = 1, replace = TRUE)
#return the result
return(die)
}
# Roll the die using the defined function
roll_die()
## [1] 3
# Define a function 'roll_die_2' to simulate rolling two dice and getting their sum
roll_die_2 <- function(){
#create die
die <- 1:6
#sample from 1 to 6 with replacement
dice <- sample(die, size = 2, replace = TRUE)
# sum; return() is not used, so the value of the last line is returned
sum(dice)
}
# Roll two dice and calculate their sum using the defined function
roll_die_2()
## [1] 8
# Define a function 'roll_die_k' to simulate rolling a die 'k' times
roll_die_k <- function(side = 6, k = 1){
#create a die with side number of sides
die <- 1:side
#sample die k times
dice <- sample(die, size = k, replace = TRUE)
#sum
sum(dice)
}
# Roll a 4-sided die 3 times using the defined function
roll_die_k(side = 4, k = 3)
## [1] 7
Before introducing the control-flow constructs of the R language, we review how R enables logical comparisons.
A logical comparison returns TRUE or FALSE.
Common Boolean operators are:
== : Equal to
!= : Not equal to
< : Less than
> : Greater than
<= : Less than or equal to
>= : Greater than or equal to
Logical Operators:
& (And): Returns TRUE if both conditions are TRUE.
| (Or): Returns TRUE if at least one condition is TRUE.
! (Not): Negates a logical value.
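These operators work element by element on vectors. A small sketch with a made-up vector x:

```r
x <- c(1, 5, 10)

both <- x > 2 & x < 8       # FALSE TRUE FALSE
either <- x < 2 | x > 8     # TRUE FALSE TRUE
negated <- !(x == 5)        # TRUE FALSE TRUE

print(both)
print(either)
print(negated)
```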
if Statement in R
The if statement directs the R program to perform a task only if a given condition is TRUE.
Two Parts of if:
1- Condition: A logical expression that evaluates to a single TRUE or FALSE.
2- Body: The instructions to execute if the condition is TRUE.
Functionality:
If the condition is TRUE, the specified code inside the curly braces will run.
If the condition is FALSE, the code within the braces is skipped.
Single-Line or Multi-Line:
The body can contain a single line or multiple lines of code. If the body is a single line, the curly braces are optional.
The following code checks whether a number is even and prints “The number is even.” if it is.
# Set the initial value of 'number' to 6
number <- 6
# Use an if statement to check if 'number' is even
if (number %% 2 == 0) {
print("The number is even.")
}
## [1] "The number is even."
# Update 'number' to 7 and check again
number <- 7
if (number %% 2 == 0) {
print("The number is even.")
}
Now we create a function to take a number and print “The number is even” if it is even and “The number is odd” if it is odd.
# Define a function 'even_odd' to determine if a number is even or odd
even_odd <- function(a){
#is the remainder zero
if (a %% 2 == 0) {
print("The number is even")
}
# if the remainder is not zero
if (a %% 2 != 0) {
print("The number is odd")
}
}
# Check if 5 is even or odd using the function
even_odd(5)
## [1] "The number is odd"
else Statement in R
The else statement complements the if statement, allowing the user to specify an alternative course of action when the condition is not met.
else: While the if statement handles the case when the condition is TRUE, the else statement handles what to do when the condition is FALSE.
Nested if and else:
Nesting an if inside an else allows for multiple levels of conditions.
Syntax: else follows the closing brace of the if body, on the same line: if (condition) { ... } else { ... }
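The if/else structure, including an if nested in an else, can be sketched as follows. The function sign_label is a hypothetical example written for this illustration.

```r
# Hypothetical example: label a number as negative, zero, or positive
sign_label <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    # this 'if' is nested inside the first 'else'
    "zero"
  } else {
    "positive"
  }
}

sign_label(-3)
sign_label(0)
sign_label(8)
```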
We create a function that takes a number and prints “The number is even” if it is even and “The number is odd” otherwise.
# Define a function 'even_odd' to determine if a number is even or odd
even_odd <- function(a){
#is the remainder zero
if (a %% 2 == 0) {
print("The number is even")
} else {
print("The number is odd")
}
}
# Check if 6 is even or odd using the function
even_odd(6)
## [1] "The number is even"
## [1] "The number is odd"
ifelse Function in R
Introduction: The ifelse function is a versatile tool in R that helps you make decisions and apply them element-wise across data structures. We will talk more about element-wise operations in R later.
Syntax:
ifelse(test, yes, no)
1- test: A logical test or condition.
2- yes: Value to return when the condition is TRUE.
3- no: Value to return when the condition is FALSE.
ifelse evaluates the test condition for each element, returning elements from yes or no based on the corresponding TRUE or FALSE result.
Flexibility:
The test doesn’t need to be a single logical value; it can be a vector of conditions. ifelse works well with vectors and data frame columns for element-wise operations.
Example: ifelse
# Given grades for 4 students
grades <- c(85, 72, 94, 60)
# Determine whether each student passed or failed
pass_fail <- ifelse(grades >= 70, "Pass", "Fail")
# Print the results
pass_fail
## [1] "Pass" "Pass" "Pass" "Fail"
# Simulate rolling a die 10 times and determine win/loss for each roll
dice <- sample(die, size = 10, replace = TRUE)
win_loss <- ifelse(dice == 6, "win", "loss")
# Print the results
win_loss
## [1] "loss" "win" "loss" "loss" "loss" "win" "loss" "win" "win" "loss"
When we need to repeat the same task multiple times, we use loops in programming.
R offers three control flow commands for loops: for, while, and repeat.
for Loop in R
A for loop is a valuable tool for automating repetitive tasks by executing a specific block of code multiple times.
It’s particularly useful when you know the exact number of iterations needed.
Benefits:
Simplifies repetitive tasks, reducing code duplication.
Efficiently handles a known number of iterations.
Syntax:
The structure of a for loop in R is: for (variable in sequence) { body }. The body runs once for each element of the sequence.
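A minimal sketch of the for loop structure: the loop variable takes each value of a sequence in turn, and the body runs once per value. The vectors numbers and squares are assumptions for illustration; the result vector is created before the loop so that each iteration only fills in one position.

```r
numbers <- c(2, 4, 6)

# Create the result vector up front, one slot per iteration
squares <- numeric(length(numbers))

# seq_along(numbers) gives the positions 1, 2, 3
for (i in seq_along(numbers)) {
  squares[i] <- numbers[i]^2
}

squares   # 4 16 36
```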
Example: for loop
# Use a for loop to calculate the square of numbers from 1 to 10
for(i in 1:10) {
x1 <- i^2
print(x1)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
# Calculate factorial using a for loop
calculate_factorial <- function(n) {
  # Factorial is only defined for non-negative integers
  if (n < 0) {
    warning("Factorial can only be calculated for non-negative integers")
    return(NA)
  }
  # Initialize the result to 1 (the factorial of 0 is 1)
  result <- 1
  if (n > 0) {
    for (i in 1:n) {
      result <- result * i
    }
  }
  return(result)
}
# Calculate factorial of 3
calculate_factorial(3)
## [1] 6
## [1] 1
while Loop in R
A while loop is a control structure that allows you to repeatedly execute a block of code as long as a specified condition is true.
The loop continues to run as long as the condition remains true, and it stops once the condition becomes false.
The condition is a logical expression that is checked before each iteration of the loop.
Somewhere in the body of the loop, there should be a mechanism to change the condition.
Be careful with while loops to ensure that the loop termination condition will eventually be met, or else you could end up in an infinite loop, causing the program to run indefinitely.
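One common safeguard against an accidental infinite loop is an upper limit on the number of iterations. The sketch below halves a number until it drops below 1; the names value, steps, and max_steps are assumptions, and the limit of 1000 is an arbitrary choice.

```r
value <- 100
steps <- 0
max_steps <- 1000   # safety limit: the loop stops here even if the condition never becomes FALSE

while (value >= 1 && steps < max_steps) {
  # The body changes 'value', so the condition eventually becomes FALSE
  value <- value / 2
  steps <- steps + 1
}

steps   # 7
value   # 0.78125
```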
Example: while loop
We want to find the smallest power of 2 greater than 1000.
# Find the smallest power of 2 greater than 1000 using a while loop
# Current number (initialized to 2^0)
number <- 1
# Current power (initialized to 0)
power <- 0
# Keep going while the current number is not yet greater than 1000
while (number <= 1000) {
  # Increment the power
  power <- power + 1
  # Calculate the next power of 2
  number <- 2 ^ power
}
# paste() combines the text with the value of 'number'
print(paste("The smallest power of 2 greater than 1000 is:", number))
## [1] "The smallest power of 2 greater than 1000 is: 1024"
repeat Loop in R
A repeat loop is similar to a while loop, but it checks the condition in the body of the code.
The repeat loop is an unconditional loop, meaning it will keep executing the code block within it repeatedly until a break statement is encountered.
The repeat loop doesn’t have a built-in condition check like the while loop. Instead, the condition is typically checked within the loop body using an if statement, and when the desired condition is met, a break statement is used to exit the loop.
Example: repeat loop
We want to find the smallest power of 2 greater than 1000, this time using repeat.
# Find the smallest power of 2 greater than 1000
# Initialize variables
number <- 1 # Current number (initialized to 2^0)
power <- 0 # Current power (initialized to 0)
# Start a repeat loop
repeat {
# Calculate the next power of 2
number <- 2^power
# Check if the number is greater than 1000
if (number > 1000) {
break # Exit the loop if condition is met
}
# Increment the power
power <- power + 1
}
# Print the result
print(paste("The smallest power of 2 greater than 1000 is:", number))
## [1] "The smallest power of 2 greater than 1000 is: 1024"
Nested loops in R involve placing one loop inside another, creating a powerful way to solve complex problems by breaking them down into smaller steps.
In a nested loop, the inner loop runs completely for each iteration of the outer loop.
Benefits:
Tackles complex tasks by breaking them into manageable parts.
Effective for tasks that require multiple levels of iteration.
Considerations:
Be cautious of potential performance impacts with deeply nested loops.
Example: Let’s say you want to print a multiplication table. You can achieve this using nested loops.
# Multiply every combination of i in 1 to 5 and j in 1 to 5
for (i in 1:5) {
  for (j in 1:5) {
    result <- i * j
    # Print the current multiplication
    cat(i, "x", j, "=", result, "\t")
  }
  # Move to the next line after each row
  cat("\n")
}
## 1 x 1 = 1 1 x 2 = 2 1 x 3 = 3 1 x 4 = 4 1 x 5 = 5
## 2 x 1 = 2 2 x 2 = 4 2 x 3 = 6 2 x 4 = 8 2 x 5 = 10
## 3 x 1 = 3 3 x 2 = 6 3 x 3 = 9 3 x 4 = 12 3 x 5 = 15
## 4 x 1 = 4 4 x 2 = 8 4 x 3 = 12 4 x 4 = 16 4 x 5 = 20
## 5 x 1 = 5 5 x 2 = 10 5 x 3 = 15 5 x 4 = 20 5 x 5 = 25
a) Run the die_roll_2 function for the sum of two dice 1000 times and calculate the long-run average of the sum. Additionally, compute the standard deviation of the 1000 sums using the sd() function.
b) Run the die_roll_2 function repeatedly until you roll a total of 12. Determine how many rolls it takes to get a sum of 12.
c) Repeat part b 1000 times and calculate the long-run average number of rolls needed to obtain a sum of 12.
# Define the function to roll two dice and get the sum
die_roll_2 <- function() {
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  return(die1 + die2)
}
# a) Calculate average and standard deviation of sum of two dice
# Set the total runs to 1000
total_runs <- 1000
# Initialize sums to save the result of each iteration
sums <- rep(NA, total_runs)
for (i in 1:total_runs) {
  sums[i] <- die_roll_2()
}
average_sum <- mean(sums)
std_dev <- sd(sums)
cat("a) Long-run average of sum of two dice:", average_sum, "\n")
cat("Standard deviation of sum of two dice:", std_dev, "\n")
## a) Long-run average of sum of two dice: 7.086
## Standard deviation of sum of two dice: 2.362665
# b) Find number of rolls to get sum of 12
#Initialize the while loop
rolls_to_get_12 <- 0
sum_result <- 0
while (sum_result != 12) {
  sum_result <- die_roll_2()
  rolls_to_get_12 <- rolls_to_get_12 + 1
}
cat("b) Number of rolls to get sum of 12:", rolls_to_get_12, "\n")
## b) Number of rolls to get sum of 12: 69
# c) Calculate average number of rolls to get sum of 12 over 1000 runs
# set the total iterations to 1000
total_runs_c <- 1000
# Create a vector of size total_runs_c to keep result of each iteration
rolls_to_get_12_c <- rep(NA, total_runs_c)
#run for total_runs_c
for (i in 1:total_runs_c) {
  # Initialize the while loop
  rolls <- 0
  sum_result <- 0
  # Start while loop
  while (sum_result != 12) {
    sum_result <- die_roll_2()
    rolls <- rolls + 1
  }
  # Save the result of each iteration
  rolls_to_get_12_c[i] <- rolls
}
average_rolls_c <- mean(rolls_to_get_12_c)
cat("c) Average number of rolls to get sum of 12 over 1000 runs:", average_rolls_c, "\n")
## c) Average number of rolls to get sum of 12 over 1000 runs: 35.489
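As a sanity check on the simulated values above, the exact answers can be computed by listing all 36 equally likely outcomes of two fair dice. This is only a sketch; the object names (all_sums, exact_mean, exact_sd, p_12, expected_rolls) are introduced here for illustration and are not part of the exercise.

```r
# All 36 equally likely sums of two fair six-sided dice
all_sums <- as.vector(outer(1:6, 1:6, "+"))

# Exact mean and standard deviation of the sum
# (population formula, dividing by 36 rather than 35 as sd() would)
exact_mean <- mean(all_sums)
exact_sd <- sqrt(mean((all_sums - exact_mean)^2))

# Probability of a sum of 12, and the expected number of rolls
# until the first 12 (mean of a geometric distribution, 1 / p)
p_12 <- mean(all_sums == 12)
expected_rolls <- 1 / p_12

cat("Exact mean of the sum:", exact_mean, "\n")
cat("Exact standard deviation of the sum:", exact_sd, "\n")
cat("Expected number of rolls to get 12:", expected_rolls, "\n")
```

The simulated averages in parts a and c should land close to these exact values, and get closer as the number of runs grows.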
break
As we have already seen with the repeat loop, the loop itself does not check a condition, so a break statement is needed to exit it.
The break statement can also be used in other loops when we want to interrupt the loop immediately and exit.
In nested loops, break exits only the innermost loop that contains it.
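A small sketch of how break behaves in nested loops. The matrix visited is a hypothetical helper used only to record which iterations actually ran.

```r
# Track which (i, j) combinations were reached
visited <- matrix(FALSE, nrow = 3, ncol = 3)

for (i in 1:3) {
  for (j in 1:3) {
    if (j == 3) {
      break  # leaves the inner loop only; the outer loop moves on to the next i
    }
    visited[i, j] <- TRUE
  }
}

print(visited)
```

Every row has columns 1 and 2 marked, which shows that the outer loop kept running after each break.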
next
If we need to skip one or more iterations of a loop, we use next.
When R encounters the next statement in the body of a loop, it skips the rest of the current iteration and jumps to the evaluation of the loop condition, that is, to the start of the next iteration.
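A minimal sketch of next, adding up only the odd numbers from 1 to 10. The variable name total is chosen here for illustration.

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    next  # skip even numbers and go straight to the next value of i
  }
  total <- total + i
}
cat("Sum of odd numbers from 1 to 10:", total, "\n")
```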
Loops in R, while powerful, can be time-consuming operations. It’s crucial to consider their performance implications, especially when dealing with large datasets.
Factors Affecting Efficiency:
Vectorization: Whenever possible, replace explicit loops with vectorized operations. R’s built-in vectorized functions are optimized for speed.
Preallocation: For loops that involve appending or modifying elements, preallocate memory for the final result. This minimizes the need for reallocation, which can slow down loops.
Reduce Redundant Computations: Avoid repeating the same calculations within a loop. Instead, compute the value once and reuse it.
Parallelization: In some cases, it’s possible to parallelize loops to utilize multiple CPU cores, improving performance significantly.
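To illustrate the point about redundant computations, here is a sketch that standardizes a small vector. The data and the names (x, z_slow, z_fast, z_vec) are made up for this example.

```r
x <- c(4, 8, 15, 16, 23, 42)

# Redundant: mean(x) and sd(x) are recomputed in every iteration
z_slow <- numeric(length(x))
for (i in seq_along(x)) {
  z_slow[i] <- (x[i] - mean(x)) / sd(x)
}

# Better: compute the values that do not change once, then reuse them
x_mean <- mean(x)
x_sd <- sd(x)
z_fast <- numeric(length(x))
for (i in seq_along(x)) {
  z_fast[i] <- (x[i] - x_mean) / x_sd
}

# The vectorized form avoids the loop altogether
z_vec <- (x - x_mean) / x_sd

all.equal(z_slow, z_fast)
all.equal(z_fast, z_vec)
```

All three versions give the same result; the difference in cost only becomes noticeable for long vectors or expensive computations.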
for Loop
## user system elapsed
## 0.17 0.03 0.22
system.time({
  # Loop with preallocation
  output <- rep(NA, 1000000)
  for (i in 1:1000000) {
    output[i] <- i + 1
  }
})
## user system elapsed
## 0.03 0.00 0.05
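The first timing above seems to belong to a version of the loop whose code is not shown. As a sketch of that kind of comparison, assuming the slower version grows the vector one element at a time, the two approaches can be timed side by side. The size n and the object names are chosen here for illustration, and the timings will differ from machine to machine.

```r
n <- 10000  # smaller than the 1,000,000 used above so the slow version finishes quickly

# Without preallocation: the vector is copied and extended in every iteration
time_grow <- system.time({
  output_grow <- c()
  for (i in 1:n) {
    output_grow <- c(output_grow, i + 1)
  }
})

# With preallocation: the vector is created once at its final length
time_prealloc <- system.time({
  output_prealloc <- rep(NA, n)
  for (i in 1:n) {
    output_prealloc[i] <- i + 1
  }
})

print(time_grow)
print(time_prealloc)

# Both loops produce the same values
all.equal(output_grow, output_prealloc)
```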
Vectorization is a fundamental concept in R that simplifies performing operations on multiple data points (vectors or arrays) at once.
Vectorized operations eliminate the need for explicit loops, making code concise, efficient, and easier to read.
Improved Performance: Vectorized operations use R’s built-in capabilities for faster execution.
Simplicity: Code becomes more compact, like using simple math.
Readability: Vectorized code is easier to understand, as it processes entire data structures.
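A short sketch comparing an explicit loop with its vectorized equivalent. The names x, squares_loop, and squares_vec are illustrative.

```r
x <- 1:10

# Explicit loop: square each element one at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized: the ^ operator works on the whole vector at once
squares_vec <- x^2

all.equal(squares_loop, squares_vec)
```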
Introduction: Performing mathematical operations in R is straightforward. R utilizes object contents when the object’s name is used in a command, enabling a wide range of mathematical operations.
Basic Operations:
When working with vectors, R employs element-wise execution for standard mathematical operations.
Matrix Multiplication: R supports matrix multiplication using the %*% operator.
Handling Different Vector Lengths:
If you provide R with two vectors of unequal lengths, it will repeat the shorter vector until it matches the length of the longer vector before performing the math.
## [1] 0 1 2 3 4 5
## [1] 2 4 6 8 10 12
# Element-by-element operation on two vectors of unequal length
# die[1:2] is recycled: returns 1 + 1, 2 + 2, 3 + 1, 4 + 2, 5 + 1, 6 + 2
die + die[1:2]
## [1] 2 4 4 6 6 8
## Warning in die * die[1:4]: longer object length is not a multiple of shorter
## object length
## [1] 1 4 9 16 5 12
## [,1]
## [1,] 91
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 2 3 4 5 6
## [2,] 2 4 6 8 10 12
## [3,] 3 6 9 12 15 18
## [4,] 4 8 12 16 20 24
## [5,] 5 10 15 20 25 30
## [6,] 6 12 18 24 30 36
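The outputs above are consistent with a vector die holding the values 1 to 6. Under that assumption, the following sketch shows commands that produce this kind of result; the exact commands used originally are not shown, so these are a plausible reconstruction rather than the original code.

```r
# Assumed definition: a six-sided die
die <- 1:6

# Element-wise arithmetic with a single number
die - 1
die * 2

# Recycling: the shorter vector is repeated to match the longer one
die + die[1:2]

# Inner product with %*%: sum of the squares of 1 to 6
inner <- die %*% die
inner

# Outer product with %o%: a 6 x 6 multiplication table
outer_prod <- die %o% die
outer_prod
```

When the longer length is not a multiple of the shorter one, as in die * die[1:4], R still recycles but issues the warning shown above.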
apply Family of Functions in R
The apply family of functions in R provides a convenient way to apply a function to the rows or columns of matrices and arrays, or to the elements of vectors and lists.
Instead of writing explicit loops, apply functions express these operations compactly, enhancing code readability and, in some cases, performance.
apply(): Apply a function to rows or columns of a matrix or array.
sapply(): Apply a function to each element of a vector or list and simplify the result of lapply() into a vector or array.
lapply(): Apply a function to each element of a list and return a list.
tapply(): Apply a function to subsets of a vector based on factors.
Compact and readable code.
Avoids explicit loops.
Compatible with various data structures.
replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
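A brief sketch of both functions. The seed, the sample sizes, and the object names are arbitrary choices made for this illustration.

```r
set.seed(123)  # arbitrary seed so the random draws are reproducible

# replicate: evaluate the same random expression 5 times
# (here, the mean of 10 rolls of a six-sided die)
roll_means <- replicate(5, mean(sample(1:6, 10, replace = TRUE)))
roll_means

# vapply: like sapply, but FUN.VALUE declares that each result
# must be a single integer; anything else raises an error
lengths_checked <- vapply(list(1:3, letters, 1:10), length, FUN.VALUE = integer(1))
lengths_checked
```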
The apply family offers an elegant solution for applying functions across rows, columns, and elements, streamlining code and enhancing performance in R.
apply
# Calculate the mean of each column of a matrix using 'apply'
matrix_data <- matrix(1:12, nrow = 3)
col_means <- apply(matrix_data, 2, mean)
# Calculate the square root of each element in a vector using 'sapply'
sqrt_12 <- sapply(1:12, sqrt)
sqrt_12
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278 3.316625 3.464102
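The example above covers apply and sapply. A similar sketch for lapply and tapply follows; the data and names are invented for illustration.

```r
# lapply: apply a function to each element of a list and get a list back
number_list <- list(a = 1:3, b = 4:6)
list_sums <- lapply(number_list, sum)
list_sums

# tapply: apply a function to subsets of a vector defined by a factor
values <- c(10, 20, 30, 40)
groups <- factor(c("x", "y", "x", "y"))
group_means <- tapply(values, groups, mean)
group_means
```

Here list_sums keeps the names a and b of the input list, and group_means has one entry per level of groups.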
replicate
Use the function replicate or sapply to solve Exercise 4 from the loops section.
replicate
# Define the function to roll two dice and get the sum
die_roll_2 <- function() {
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  return(die1 + die2)
}
# a) Calculate average and standard deviation of sum of two dice
total_runs <- 1000
sums <- replicate(total_runs, die_roll_2())
# sapply passes each element of its input to the function as an argument,
# so the call below fails unless die_roll_2 accepts one
# (for example, the number of sides of the die)
#sums <- sapply(rep(6, total_runs), FUN = die_roll_2)
average_sum <- mean(sums)
std_dev <- sd(sums)
cat("a) Long-run average of sum of two dice:", average_sum, "\n")
cat("Standard deviation of sum of two dice:", std_dev, "\n")
## a) Long-run average of sum of two dice: 7.048
## Standard deviation of sum of two dice: 2.412782
# b) Find number of rolls to get sum of 12
get_rolls_to_get_12 <- function() {
  # Initialize the while loop
  rolls_to_get_12 <- 0
  sum_result <- 0
  # Start while
  while (sum_result != 12) {
    sum_result <- die_roll_2()
    rolls_to_get_12 <- rolls_to_get_12 + 1
  }
  return(rolls_to_get_12)
}
rolls_to_get_12 <- get_rolls_to_get_12()
cat("b) Number of rolls to get sum of 12:", rolls_to_get_12, "\n")
## b) Number of rolls to get sum of 12: 34
# c) Calculate average number of rolls to get sum of 12 over 1000 runs
total_runs_c <- 1000
rolls_to_get_12_c <- replicate(total_runs_c, get_rolls_to_get_12())
average_rolls_c <- mean(rolls_to_get_12_c)
cat("c) Average number of rolls to get sum of 12 over 1000 runs:", average_rolls_c, "\n")
## c) Average number of rolls to get sum of 12 over 1000 runs: 35.267
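The commented-out sapply call in part a needs a function that takes an argument. One way this could look is sketched below, using a hypothetical variant die_roll_2_sides (not part of the original solution) that accepts the number of sides.

```r
# Hypothetical variant of die_roll_2 that takes the number of sides
die_roll_2_sides <- function(sides) {
  # Two independent rolls of a die with the given number of sides
  sum(sample(1:sides, 2, replace = TRUE))
}

set.seed(2024)  # arbitrary seed for reproducibility
total_runs <- 1000

# sapply hands each 6 in rep(6, total_runs) to the function as `sides`
sums_sapply <- sapply(rep(6, total_runs), FUN = die_roll_2_sides)

cat("Long-run average of sum of two dice:", mean(sums_sapply), "\n")
cat("Standard deviation of sum of two dice:", sd(sums_sapply), "\n")
```

When the function takes no arguments, as die_roll_2 does, replicate is the more natural choice, which is why the solution above uses it.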