The R program for the code on this page is available as a text file.
To see the functions echoed back in the console as they are processed, rather than just the results of the computations, use the echo=TRUE option of the source function when running the program.
One of the main methods for improving the efficiency of a function is to avoid explicit loops, which are often slow in R. On this page we show a number of ways to avoid loops by vectorizing the functions. We will cover the following topics:
The apply function
The lapply function
The sapply function
The tapply function
The sweep function
The column functions
The row functions
Miscellaneous
The apply function
Applies a function to sections of an array and returns the results in an array.
apply(array, margin, function, ...)
Note that an array in R is a very generic data type; it is a general structure that can have any number of dimensions. For specific dimensions there are special names for the structures. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix.
The margin argument is used to specify which margin we want to apply the function to and which margin we wish to keep. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows) or 2 (apply the function to the columns). The function can be any function that is built in or user defined. The ... after the function refers to any other arguments that are passed to the function being used.
Note that in R the apply function internally uses a loop, so one of the other apply functions may be a better choice if time and efficiency are very important.
mat1 <- matrix(rep(seq(4), 4), ncol = 4)
mat1
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3
[4,]    4    4    4    4

#row sums of mat1
apply(mat1, 1, sum)
[1]  4  8 12 16

#column sums of mat1
apply(mat1, 2, sum)
[1] 10 10 10 10
#using a user defined function
sum.plus.2 <- function(x){
  sum(x) + 2
}

#using the sum.plus.2 function on the rows of mat1
apply(mat1, 1, sum.plus.2)
[1]  6 10 14 18

#the function can be defined inside the apply function
#note the lack of curly brackets
apply(mat1, 1, function(x) sum(x) + 2)
[1]  6 10 14 18
#generalizing the function to add any number to the sum
#add 3 to the row sums
apply(mat1, 1, function(x, y) sum(x) + y, y=3)
[1]  7 11 15 19

#add 5 to the column sums
apply(mat1, 2, function(x, y) sum(x) + y, y=5)
[1] 15 15 15 15
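The examples above all use a function that returns a single number and a margin of 1 or 2. The following sketch, using small made-up objects (m and arr are hypothetical names, not part of the original program), illustrates two other cases: a function that returns several values per row, and margins on a three dimensional array.

```r
# a small matrix: 2 rows, 3 columns; the columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

# range() returns two values per row, so apply() returns a matrix
# with one COLUMN for each row of m
row.ranges <- apply(m, 1, range)
dim(row.ranges)

# a three dimensional array: 2 rows, 3 columns, 2 layers
arr <- array(1:12, dim = c(2, 3, 2))

# margin = 3 keeps the third dimension: one sum per layer
layer.sums <- apply(arr, 3, sum)

# margin = c(1, 2) keeps rows and columns: each cell is summed across layers
cell.sums <- apply(arr, c(1, 2), sum)
dim(cell.sums)
```

Note that when the function returns a vector, the results for each row are stored as columns, so the result may need to be transposed with t() to line up with the input.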
The lapply function
Applies a function to elements in a list or a vector and returns the results in a list.
lapply(list, function, ...)
The lapply function becomes especially useful when dealing with data frames. In R the data frame is considered a list and the variables in the data frame are the elements of the list. We can therefore apply a function to all the variables in a data frame by using the lapply function.
Note that unlike in the apply function there is no margin argument since we are just applying the function to each component of the list.
#creating a data frame using mat1
mat1.df <- data.frame(mat1)
mat1.df
  X1 X2 X3 X4
1  1  1  1  1
2  2  2  2  2
3  3  3  3  3
4  4  4  4  4

#in the data frame mat1.df the variables X1 - X4 are elements of the list mat1.df
#these variables can thus be accessed by lapply
is.list(mat1.df)
[1] TRUE

#obtaining the sum of each variable in mat1.df
lapply(mat1.df, sum)
$X1
[1] 10

$X2
[1] 10

$X3
[1] 10

$X4
[1] 10
The next example verifies that the results are stored in a list, obtains the names of the elements in the result list, and displays the first element of the result list.
#storing the results of the lapply function in the list y
y <- lapply(mat1.df, sum)

#verifying that y is a list
is.list(y)
[1] TRUE

#names of the elements in y
names(y)
[1] "X1" "X2" "X3" "X4"

#displaying the first element
y[[1]]
[1] 10
y$X1
[1] 10
Just like in the apply function we can use any built in or user defined function and we can define the function to be used inside the lapply function.
#user defined function with multiple arguments
#function defined inside the lapply function
#displaying the first two results in the list
y1 <- lapply(mat1.df, function(x, y) sum(x) + y, y = 5)
y1[1:2]
$X1
[1] 15

$X2
[1] 15
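Because a data frame is a list of variables, lapply is also convenient for inspecting or selecting variables by type. The following is a minimal sketch using a made-up data frame (df and its variables are hypothetical, chosen only for illustration).

```r
# a small data frame with variables of different types
df <- data.frame(id = 1:3,
                 score = c(2.5, 3.5, 4.0),
                 name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# the class of each variable, returned as a list
col.classes <- lapply(df, class)

# keep only the numeric variables, then take the mean of each
numeric.cols <- df[unlist(lapply(df, is.numeric))]
col.means <- lapply(numeric.cols, mean)
names(col.means)
```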
Another useful application of the lapply function is with a "dummy sequence". The list argument is the dummy sequence, and it is only used to specify how many times we would like the function to be executed. When the lapply function is used in this way it can replace a for loop very easily.
#using the lapply function instead of the for loop
unlist(lapply(1:5, function(i) print(i) ))
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1 2 3 4 5

#using the for loop
for(i in 1:5) print(i)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
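In practice a loop usually collects results rather than printing them. The following sketch (the variable names are hypothetical) compares a for loop that fills a preallocated vector with the equivalent lapply call over a dummy sequence.

```r
n <- 5

# for loop: preallocate the result vector, then fill it element by element
squares.loop <- numeric(n)
for (i in seq_len(n)) squares.loop[i] <- i^2

# lapply over the dummy sequence 1..n, then flatten the list to a vector
squares.lapply <- unlist(lapply(seq_len(n), function(i) i^2))

# both approaches give the same values
identical(squares.loop, squares.lapply)
```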
The sapply function
Applies a function to elements in a list and returns the results in a vector, matrix or a list.
sapply(list, function, ..., simplify)
When simplify=FALSE, the sapply function returns the results in a list, just like the lapply function. When simplify=TRUE, which is the default, the sapply function returns the results in a simplified form if at all possible. If the results are all scalars then sapply returns a vector. If the results are all of the same length (greater than one) then sapply returns a matrix with a column for each element in list to which function was applied. If the results have different lengths then sapply returns a list.
y2 <- sapply(mat1.df, function(x, y) sum(x) + y, y = 5)
y2
X1 X2 X3 X4 
15 15 15 15 
is.vector(y2)
[1] TRUE
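The example above only shows the vector case. The following sketch, using a small made-up list (lst is a hypothetical name), illustrates the other return types described above.

```r
lst <- list(a = 1:3, b = 4:6)

# every result is a single number: a named vector
s.vec <- sapply(lst, sum)

# every result has length 2: a matrix with one column per list element
s.mat <- sapply(lst, range)
dim(s.mat)

# simplify = FALSE: a list, as with lapply
s.list <- sapply(lst, sum, simplify = FALSE)

# results of different lengths cannot be simplified: a list
s.ragged <- sapply(1:3, seq_len)
is.list(s.ragged)
```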
The tapply function
Applies a function to each cell of a ragged array.
tapply(array, indices, function, ..., simplify)
The function is applied to each of the cells, which are defined by the categorical variables listed in the argument indices. If the result of applying function to each cell is a single number, then the results are returned in a multi-way array which has as many dimensions as there are components in the argument indices. For example, if indices = list(gender, employed), then the result will be a 2 by 2 matrix with rows defined by gender (male, female) and columns defined by employment status (employed, unemployed). If the results are not single values, then they are returned in a list with a dim attribute, which means that it prints like a list but the user accesses the components by using subscripts as in an array.
#creating the data set with two categorical variables
x1 <- runif(16)
x1
 [1] 0.9574315 0.1163076 0.6661923 0.8265729 0.6701039 0.1478860 0.8537499
 [8] 0.9993158 0.4189768 0.8830733 0.6114867 0.3111015 0.8834808 0.3606836
[15] 0.7056246 0.8052925
cat1 <- rep(1:4, 4)
cat1
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
cat2 <- c(rep(1, 8), rep(2, 8))
cat2
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2

mat2.df <- data.frame(x1)
names(mat2.df) <- c("x1")
mat2.df$cat1 <- cat1
mat2.df$cat2 <- cat2
mat2.df
          x1 cat1 cat2
1  0.9574315    1    1
2  0.1163076    2    1
3  0.6661923    3    1
4  0.8265729    4    1
5  0.6701039    1    1
6  0.1478860    2    1
7  0.8537499    3    1
8  0.9993158    4    1
9  0.4189768    1    2
10 0.8830733    2    2
11 0.6114867    3    2
12 0.3111015    4    2
13 0.8834808    1    2
14 0.3606836    2    2
15 0.7056246    3    2
16 0.8052925    4    2

tapply(mat2.df$x1, mat2.df$cat1, mean)
        1         2         3         4 
0.7324982 0.3769876 0.7092634 0.7355707 

tapply(mat2.df$x1, list(mat2.df$cat1, mat2.df$cat2), mean)
          1         2
1 0.8137677 0.6512288
2 0.1320968 0.6218785
3 0.7599711 0.6585556
4 0.9129443 0.5581970
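The gender by employment case described above, and the case where the function returns more than one value per cell, can be sketched as follows. The data are made up for illustration (income, gender and employed are hypothetical variables).

```r
# made-up data: 8 observations, 2 in each gender by employment cell
income   <- c(10, 20, 30, 40, 50, 60, 70, 80)
gender   <- c("male", "male", "female", "female",
              "male", "male", "female", "female")
employed <- c("yes", "no", "yes", "no", "yes", "no", "yes", "no")

# a single number per cell: the result is a 2 by 2 matrix
cell.means <- tapply(income, list(gender, employed), mean)
cell.means["male", "yes"]

# two numbers per cell: the result is a list with a dim attribute
cell.ranges <- tapply(income, list(gender, employed), range)
dim(cell.ranges)

# components are accessed with array style subscripts
cell.ranges[["male", "yes"]]
```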
The sweep function
The sweep function returns an array like the input array with stats swept out.
sweep(array, margin, stats, function, ...)
The input array can have any number of dimensions. The stats argument is a vector containing the summary statistics of array which are to be "swept" out. The argument margin specifies which dimensions of array correspond to the summary statistics in stats. If array is a matrix then margin=1 refers to the rows and stats has to contain row summary statistics; margin=2 refers to the columns and stats then has to contain column summary statistics. The function argument specifies which function is to be used in the "sweep" operation; most often this is either "/" or "-".
#creating the data set
a <- matrix(runif(100, 1, 2),20)
a.df <- data.frame(a)
#subtract column means from each column
#centering each column around mean
colMeans(a)
[1] 1.470437 1.497412 1.553592 1.454150 1.613789
a1 <- sweep(a, 2, colMeans(a), "-")
a1[1:5, ]
           [,1]        [,2]       [,3]        [,4]        [,5]
[1,] -0.4285240 -0.42115156 -0.1069188 -0.07067987 -0.46239745
[2,] -0.4087171 -0.26986445  0.1232513 -0.38962010  0.35919248
[3,]  0.5000310 -0.42625368 -0.3284257 -0.42179545  0.06797352
[4,] -0.4324198 -0.22340951  0.2689937  0.01781444 -0.01896476
[5,]  0.5256995  0.06979928 -0.3596666 -0.03256511 -0.25716789
colMeans(a1)
[1] -4.440892e-017  4.440892e-017  8.881784e-017  8.881784e-017  8.881784e-017

#dividing each column by sum
a2 <- sweep(a, 2, colSums(a), "/")
a2[1:5, ]
           [,1]       [,2]       [,3]       [,4]       [,5]
[1,] 0.03542869 0.03593735 0.04655898 0.04756972 0.03567355
[2,] 0.03610219 0.04098897 0.05396666 0.03660317 0.06112886
[3,] 0.06700280 0.03576699 0.03943012 0.03549684 0.05210602
[4,] 0.03529621 0.04254014 0.05865716 0.05061254 0.04941242
[5,] 0.06787562 0.05233066 0.03842467 0.04888027 0.04203217

#centering each row around the mean of the row
rowMeans(a)[1:5]
[1] 1.219942 1.400724 1.396182 1.440279 1.507096
a3 <- sweep(a, 1, rowMeans(a), "-")
a3[1:5, ]
           [,1]        [,2]       [,3]        [,4]        [,5]
[1,] -0.1780286 -0.14368139  0.2267312  0.16352895 -0.06855023
[2,] -0.3390045 -0.17317704  0.2761186 -0.33619404  0.57225694
[3,]  0.5742861 -0.32502378 -0.1710159 -0.36382689  0.28558047
[4,] -0.4022616 -0.16627648  0.3823066  0.03168612  0.15454532
[5,]  0.4890407  0.06011528 -0.3131707 -0.08551046 -0.15047484
rowMeans(a3)[1:5]
[1]  0.000000e+000 -4.440892e-017 -8.881784e-017  4.440892e-017 -4.440892e-017
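Because the data above are random, the effect of sweep can be easier to follow on a small matrix whose results can be checked by hand. The following is a minimal sketch with a made-up matrix (m is a hypothetical name).

```r
# 3 rows, 2 columns; the columns are (1,2,3) and (4,5,6)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)

# the column means are 2 and 5; sweeping them out with "-" centers each column
centered <- sweep(m, 2, colMeans(m), "-")
centered

# the row sums are 5, 7 and 9; sweeping them out with "/" turns each row
# into proportions that sum to 1
proportions <- sweep(m, 1, rowSums(m), "/")
rowSums(proportions)
```

Here the length of stats matches the margin being swept: two column means for margin=2 and three row sums for margin=1.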
The column functions
There is a suite of functions whose sole purpose is to compute summary statistics over the columns of matrices, arrays and data frames. These functions include:
colMeans
colSums
#creating the data set
a <- matrix(runif(100, 1, 2), 20)
a.df <- data.frame(a)
a.df[1:5, ]
        X1       X2       X3       X4       X5
1 1.533694 1.058162 1.739173 1.539331 1.523406
2 1.234076 1.305300 1.621082 1.274907 1.518986
3 1.628392 1.589093 1.067717 1.168978 1.538356
4 1.987724 1.900699 1.271701 1.022540 1.381527
5 1.252216 1.155357 1.441486 1.274234 1.550317

#Get column means using the column functions
#input is the matrix a, results are in a vector
col.means <- colMeans(a)
col.means
[1] 1.461788 1.470676 1.451375 1.378107 1.555699
is.vector(col.means)
[1] TRUE
The row functions
There is a suite of functions whose sole purpose is to compute summary statistics over the rows of matrices, arrays and data frames. These functions include:
rowMeans
rowSums
#Get row means using row functions
#input is the matrix a, results are in a vector
row.means <- rowMeans(a)
row.means[1:5]
[1] 1.478753 1.390870 1.398507 1.512838 1.334722
is.vector(row.means)
[1] TRUE
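The column and row functions also accept an na.rm argument for handling missing values, and they work directly on data frames with numeric variables. The following sketch uses a small made-up matrix (m is a hypothetical name).

```r
# 2 rows, 3 columns; the columns are (1,NA), (3,4), (5,6)
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

# by default a missing value makes the summary for that column missing
colSums(m)

# na.rm = TRUE drops missing values before summing or averaging
col.sums <- colSums(m, na.rm = TRUE)
row.means <- rowMeans(m, na.rm = TRUE)

# the same functions accept a data frame with numeric variables
df.sums <- colSums(data.frame(m), na.rm = TRUE)
names(df.sums)
```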
Miscellaneous
It is important to realize that there are usually many different ways of obtaining the same results and that these methods differ in efficiency and other details. The following example shows three different methods for obtaining the column means and how the form of the results differs.
#Get column means using the column functions
#input is the matrix a, results are in a vector
col.means1 <- colMeans(a)
col.means1
[1] 1.470437 1.497412 1.553592 1.454150 1.613789
is.vector(col.means1)
[1] TRUE

#get column means using apply
#input is a matrix, results are in a vector
col.means2 <- apply(a, 2, mean)
col.means2
[1] 1.470437 1.497412 1.553592 1.454150 1.613789
is.vector(col.means2)
[1] TRUE

#get column means using lapply
#input is the data frame which is a list, results are in a list
col.means3 <- lapply(a.df, mean)
col.means3
$X1
[1] 1.470437

$X2
[1] 1.497412

$X3
[1] 1.553592

$X4
[1] 1.45415

$X5
[1] 1.613789

is.list(col.means3)
[1] TRUE
The following example shows three different methods for obtaining the row means and how the form of the results differs.
#Get row means using row functions
#input is the matrix a, results are in a vector
row.means1 <- rowMeans(a)
row.means1[1:5]
[1] 1.219942 1.400724 1.396182 1.440279 1.507096
is.vector(row.means1)
[1] TRUE

#using apply, input is a matrix, results are in a vector
row.means2 <- apply(a, 1, mean)
row.means2[1:5]
[1] 1.219942 1.400724 1.396182 1.440279 1.507096
is.vector(row.means2)
[1] TRUE

#we can transpose the data frame and create a new data frame
ta.df <- data.frame(t(a.df))

#use lapply on the data frame since it is a list
#results are in a list
row.means3 <- lapply(ta.df, mean)
row.means3[1:5]
$X1
[1] 1.219942

$X2
[1] 1.400724

$X3
[1] 1.396182

$X4
[1] 1.440279

$X5
[1] 1.507096

is.list(row.means3)
[1] TRUE
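The difference in efficiency between these methods can be checked with the system.time function. The following is only a sketch: the matrix size is an arbitrary assumption (big is a hypothetical name), and the timings depend on the machine and the version of R, so no timing output is shown.

```r
# a larger matrix of constant values (the size is arbitrary)
big <- matrix(1, nrow = 2000, ncol = 200)

# time the row means computed with apply and with rowMeans
t.apply    <- system.time(r1 <- apply(big, 1, mean))
t.rowmeans <- system.time(r2 <- rowMeans(big))

# the two methods give the same values
all.equal(r1, r2)

# compare the elapsed times
t.apply["elapsed"]
t.rowmeans["elapsed"]
```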
Any of the functions mentioned above can be used inside a user defined function. In the following example the function f1 multiplies the sequence 1:x by y, using the lapply function instead of a for loop.
f1 <- function(x, y) {
  return(lapply(1:x, function(a, b) b*a, b=y ))
}

#multiplying the sequence 1:3 by 2
f1(3, 2)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

#multiplying the sequence 1:4 by 10
f1(4, 10)
[[1]]
[1] 10

[[2]]
[1] 20

[[3]]
[1] 30

[[4]]
[1] 40
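The lapply function can be used in many clever ways. In the following example, applying runif to the sequence 1:6 creates a list of random vectors of lengths 1 through 6, and applying a function to the index 1:6 adds two such lists element by element.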
list1 <- lapply(1:6, runif)
list1
[[1]]
[1] 0.796063

[[2]]
[1] 0.5456884 0.8709621

[[3]]
[1] 0.6957483 0.2939853 0.3849384

[[4]]
[1] 0.77135125 0.14607271 0.58522428 0.09303452

[[5]]
[1] 0.6610859 0.2150228 0.1366291 0.0921625 0.6901002

[[6]]
[1] 0.1368554 0.2575741 0.1218799 0.6293610 0.4628676 0.8303309

list2 <- lapply(1:6, runif)
list2
[[1]]
[1] 0.444944

[[2]]
[1] 0.1236448 0.1709105

[[3]]
[1] 0.01409603 0.76272480 0.65504591

[[4]]
[1] 0.7724182 0.7856118 0.2360862 0.2319794

[[5]]
[1] 0.1785266 0.5621904 0.3170615 0.2320846 0.6087983

[[6]]
[1] 0.67940923 0.99266570 0.05010323 0.50740777 0.11782769 0.71910324

lapply(1:6, function(i, x, y) x[[i]] + y[[i]], x = list1, y = list2)
[[1]]
[1] 1.241007

[[2]]
[1] 0.6693333 1.0418726

[[3]]
[1] 0.7098443 1.0567101 1.0399843

[[4]]
[1] 1.5437694 0.9316845 0.8213105 0.3250139

[[5]]
[1] 0.8396126 0.7772132 0.4536906 0.3242471 1.2988985

[[6]]
[1] 0.8162646 1.2502398 0.1719831 1.1367687 0.5806953 1.5494342
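As an alternative that is not part of the original program, base R also provides the Map function, which applies a function to the corresponding elements of several lists without an explicit index. The following sketch uses small made-up lists (x and y are hypothetical names) so the results can be checked by hand.

```r
# two lists whose elements have matching lengths
x <- list(1, c(1, 2), c(1, 2, 3))
y <- list(10, c(10, 20), c(10, 20, 30))

# lapply over an index, as in the example above
by.index <- lapply(seq_along(x), function(i) x[[i]] + y[[i]])

# Map walks over x and y in parallel and returns a list
by.map <- Map(`+`, x, y)

# both approaches give the same element by element sums
all.equal(by.index, by.map)
```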