SPLUS Library: Advanced functions

One of the main methods for improving the efficiency of a function is to avoid using loops, which are very slow and inefficient. On this page we will show a number of ways to avoid using loops by vectorizing the functions. We will cover the following topics:

- The apply function
- The lapply function
- The sapply function
- The tapply function
- The sweep function
- The column functions
- The row functions
- Miscellaneous

## The apply function

Applies a function to sections of an array and returns the results in an array.

    apply(array, margin, function, ...)

Note that an array in SPLUS is a very generic data type; it is a general structure of up to eight dimensions. For specific dimensions there are special names for the structures: a zero-dimensional array is a scalar or a point, a one-dimensional array is a vector, and a two-dimensional array is a matrix.

The `margin` argument specifies which margin we want to apply the function to and which margin we wish to keep. If the array we are using is a matrix then we can specify the margin to be either 1 (apply the function to the rows) or 2 (apply the function to the columns). The `function` can be any function that is built in or user defined. The `...` after the function refers to any other arguments that are passed to the function being used. Note that in R the `apply` function internally uses a loop, so one of the other apply functions may be a better choice if time and efficiency are very important.

    mat1
         [,1] [,2] [,3] [,4]
    [1,]    1    1    1    1
    [2,]    2    2    2    2
    [3,]    3    3    3    3
    [4,]    4    4    4    4

    #row sums of mat1
    apply(mat1, 1, sum)
    [1]  4  8 12 16

    #column sums of mat1
    apply(mat1, 2, sum)
    [1] 10 10 10 10

    #using a user defined function
    sum.plus.2 <- function(x) { sum(x) + 2 }

    #using the sum.plus.2 function on the rows of mat1
    apply(mat1, 1, sum.plus.2)
    [1]  6 10 14 18

    #the function can be defined inside the apply function
    #note the lack of curly brackets
    apply(mat1, 1, function(x) sum(x) + 2)
    [1]  6 10 14 18

    #generalizing the function to add any number to the sum
    #add 3 to the row sums
    apply(mat1, 1, function(x, y) sum(x) + y, y=3)
    [1]  7 11 15 19

    #add 5 to the column sums
    apply(mat1, 2, function(x, y) sum(x) + y, y=5)
    [1] 15 15 15 15
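The examples above can be reproduced with a short, self-contained R sketch. The way `mat1` is built here is an assumption (the page prints the matrix but not its construction), and the three-dimensional array `arr` is a hypothetical addition that illustrates a margin beyond rows and columns.

```r
# Assumed construction: four columns, each holding 1:4.
mat1 <- matrix(rep(1:4, 4), ncol = 4)

row.sums <- apply(mat1, 1, sum)   # margin 1: one result per row
col.sums <- apply(mat1, 2, sum)   # margin 2: one result per column

# Arguments after the function are passed through to it.
row.sums.plus.3 <- apply(mat1, 1, function(x, y) sum(x) + y, y = 3)

# Hypothetical 2 x 3 x 4 array: margin 3 applies sum to each 2 x 3 slice.
arr <- array(1:24, dim = c(2, 3, 4))
slice.sums <- apply(arr, 3, sum)

print(row.sums)
print(col.sums)
print(row.sums.plus.3)
print(slice.sums)
```

The margin names the dimension that is kept, so the result has one entry per row, per column, or per slice respectively.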

## The lapply function

Applies a function to elements in a list and returns the results in a list.

    lapply(list, function, ...)

The `lapply` function becomes especially useful when dealing with data frames. In SPLUS and R the data frame is considered a list and the variables in the data frame are the elements of the list. We can therefore apply a function to all the variables in a data frame by using the `lapply` function. Note that unlike in the `apply` function there is no margin argument, since we are just applying the `function` to each component of the `list`.

Note: One of the differences between SPLUS and R is in the default naming convention for variables in a data frame created from a matrix. In SPLUS the default names are created using the name of the matrix, whereas in R the variables are always named X1, X2, etc.

    #creating a data frame using mat1
    mat1.df <- data.frame(mat1)
    mat1.df
    #output from SPLUS
      mat1.1 mat1.2 mat1.3 mat1.4
    1      1      1      1      1
    2      2      2      2      2
    3      3      3      3      3
    4      4      4      4      4

    #in the data frame mat1.df the variables mat1.1 - mat1.4 are elements of the list mat1.df
    #these variables can thus be accessed by lapply
    is.list(mat1.df)
    [1] T

    #obtaining the sum of each variable in mat1.df
    lapply(mat1.df, sum)
    $mat1.1:
    [1] 10

    $mat1.2:
    [1] 10

    $mat1.3:
    [1] 10

    $mat1.4:
    [1] 10

Next we verify that the results are stored in a list, obtain the names of the elements in the result list, and display the first element of the result list.

    #storing the results of the lapply function in the list y
    y <- lapply(mat1.df, sum)

    #verifying that y is a list
    is.list(y)
    [1] T

    #names of the elements in y
    names(y)
    [1] "mat1.1" "mat1.2" "mat1.3" "mat1.4"

    #displaying the first element
    y[[1]]
    [1] 10
    y$mat1.1
    [1] 10

Just like in the `apply` function, we can use any built-in or user-defined function, and we can define the function to be used inside the `lapply` function.

    #user defined function with multiple arguments
    #function defined inside the lapply function
    #displaying the first two results in the list
    y1
    $mat1.1:
    [1] 15

    $mat1.2:
    [1] 15
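A minimal R sketch of the `lapply` steps above follows. The construction of `mat1` and the call that produces `y1` are assumptions chosen to match the printed results (column sums of 10, then 15); the argument name `offset` is hypothetical.

```r
# Assumed data: a data frame built from a 4 x 4 matrix whose columns are 1:4.
mat1 <- matrix(rep(1:4, 4), ncol = 4)
mat1.df <- data.frame(mat1)

# A data frame is a list of its variables, so lapply visits each column.
y <- lapply(mat1.df, sum)

# Hypothetical call: an inline function with an extra named argument.
y1 <- lapply(mat1.df, function(x, offset) sum(x) + offset, offset = 5)

print(is.list(y))
print(names(y))    # R names the variables X1, X2, X3, X4
print(y1[1:2])     # single brackets keep the result a list
```

Single brackets (`y1[1:2]`) return a shorter list, while double brackets (`y1[[1]]`) extract the stored value itself.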

## The sapply function

Applies a function to elements in a list and returns the results in a vector, matrix or a list.

    sapply(list, function, ..., simplify)

When the argument `simplify=F`, the `sapply` function returns the results in a list just like the `lapply` function. However, when the argument `simplify=T`, the default, the `sapply` function returns the results in a simplified form if at all possible. If the results are all scalars then `sapply` returns a vector. If the results are all of the same length then `sapply` will return a matrix with a column for each element in `list` to which `function` was applied.

    y2
    mat1.1 mat1.2 mat1.3 mat1.4
        15     15     15     15
    is.vector(y2)
    [1] F

    #is.vector returns F because the vector has names associated
    #with each component, so we will remove the names
    names(y2) <- NULL
    is.vector(y2)
    [1] T
    y2
    [1] 15 15 15 15
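The three return shapes described above can be seen side by side in this R sketch. The data frame is an assumed stand-in for `mat1.df`, and `range` is used only as a convenient function that returns two values.

```r
# Assumed example data: four variables, each holding 1:4.
mat1.df <- data.frame(matrix(rep(1:4, 4), ncol = 4))

# Every result is a scalar, so sapply simplifies to a named vector.
sums <- sapply(mat1.df, sum)

# Every result has length 2, so sapply returns a 2 x 4 matrix,
# with one column per variable.
ranges <- sapply(mat1.df, range)

# simplify = FALSE keeps the list form, as lapply does.
sums.list <- sapply(mat1.df, sum, simplify = FALSE)

print(sums)
print(dim(ranges))
print(is.list(sums.list))
```

In R, `is.vector(sums)` is TRUE even though the vector carries names, so the name-stripping step shown in the SPLUS output is only needed there.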

## The tapply function

Applies a function to each cell of a ragged array.

    tapply(array, indices, function, ..., simplify)

The `function` is applied to each of the cells, which are defined by the categorical variables listed in the argument `indices`. If the result of applying `function` to each cell is a single number, then the results are returned in a multi-way array that has as many dimensions as there are components in the argument `indices`. For example, if the argument `indices = list(gender, employed)` then the result will be a 2 by 2 matrix with rows defined by male, female and columns defined by employed, unemployed. If the results are not single values, then they are returned in a list with a `dim` attribute, which means that it prints like a list but the user accesses the components by using subscripts as in an array.

    #creating the data set with two categorical variables
    x1
     [1] 0.83189832 0.93558412 0.59623797 0.71544196 0.79925238 0.44859140
     [7] 0.03347409 0.62955913 0.97507841 0.71243195 0.58639700 0.43562781
    [13] 0.23623549 0.97273216 0.72284040 0.25412129
    cat1
     [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
    cat2
     [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
    mat2.df
              x1 cat1 cat2
     1 0.9574315    1    1
     2 0.1163076    2    1
     3 0.6661923    3    1
     4 0.8265729    4    1
     5 0.6701039    1    1
     6 0.1478860    2    1
     7 0.8537499    3    1
     8 0.9993158    4    1
     9 0.4189768    1    2
    10 0.8830733    2    2
    11 0.6114867    3    2
    12 0.3111015    4    2
    13 0.8834808    1    2
    14 0.3606836    2    2
    15 0.7056246    3    2
    16 0.8052925    4    2

    tapply(mat2.df$x1, mat2.df$cat1, mean)
            1         2         3         4
    0.7324982 0.3769876 0.7092634 0.7355707

    tapply(mat2.df$x1, list(mat2.df$cat1, mat2.df$cat2), mean)
              1         2
    1 0.8137677 0.6512288
    2 0.1320968 0.6218785
    3 0.7599711 0.6585556
    4 0.9129443 0.5581970
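The gender and employment example from the text can be sketched in R as follows. The data values and the variable names `x`, `gender` and `employed` are hypothetical, chosen so that the cell means are easy to check by hand.

```r
# Hypothetical data: eight values with two grouping variables.
x        <- c(10, 20, 30, 40, 50, 60, 70, 80)
gender   <- c("f", "m", "f", "m", "f", "m", "f", "m")
employed <- c("no", "no", "yes", "yes", "no", "no", "yes", "yes")

# One grouping variable: a one-dimensional array of group means.
by.gender <- tapply(x, gender, mean)

# Two grouping variables, passed as a list: a 2 x 2 matrix of cell means.
by.both <- tapply(x, list(gender, employed), mean)

# Non-scalar results: a list with a dim attribute, subscripted like an array.
cell.ranges <- tapply(x, list(gender, employed), range)

print(by.gender)
print(by.both)
print(cell.ranges[["f", "yes"]])
```

The grouping variables have to be wrapped in `list()`; combining them with `c()` would produce one long vector rather than two classifications.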

## The sweep function

The sweep function returns an array like the input `array` with `stats` swept out.

    sweep(array, margin, stats, function, ...)

The input `array` can be an array of any dimension. The `stats` argument is a vector containing the summary statistics of `array` which are to be "swept" out. The argument `margin` specifies which dimensions of `array` correspond to the summary statistics in `stats`. If `array` is a matrix then `margin=1` refers to the rows and `stats` has to contain row summary statistics; `margin=2` refers to the columns and `stats` then has to contain column summary statistics. The `function` argument specifies which function is to be used in the "sweep" operation; most often this is either "/" or "-".

    #subtract column means from each column
    #centering each column around mean
    colMeans(a)
    [1] 1.470437 1.497412 1.553592 1.454150 1.613789
    a1
               [,1]        [,2]       [,3]        [,4]        [,5]
    [1,] -0.4285240 -0.42115156 -0.1069188 -0.07067987 -0.46239745
    [2,] -0.4087171 -0.26986445  0.1232513 -0.38962010  0.35919248
    [3,]  0.5000310 -0.42625368 -0.3284257 -0.42179545  0.06797352
    [4,] -0.4324198 -0.22340951  0.2689937  0.01781444 -0.01896476
    [5,]  0.5256995  0.06979928 -0.3596666 -0.03256511 -0.25716789
    colMeans(a1)
    [1] -4.440892e-017  4.440892e-017  8.881784e-017  8.881784e-017  8.881784e-017

    #dividing each column by sum
    a2
               [,1]       [,2]       [,3]       [,4]       [,5]
    [1,] 0.03542869 0.03593735 0.04655898 0.04756972 0.03567355
    [2,] 0.03610219 0.04098897 0.05396666 0.03660317 0.06112886
    [3,] 0.06700280 0.03576699 0.03943012 0.03549684 0.05210602
    [4,] 0.03529621 0.04254014 0.05865716 0.05061254 0.04941242
    [5,] 0.06787562 0.05233066 0.03842467 0.04888027 0.04203217

    #centering each row around the mean of the row
    rowMeans(a)[1:5]
    [1] 1.219942 1.400724 1.396182 1.440279 1.507096
    a3
               [,1]        [,2]       [,3]        [,4]        [,5]
    [1,] -0.1780286 -0.14368139  0.2267312  0.16352895 -0.06855023
    [2,] -0.3390045 -0.17317704  0.2761186 -0.33619404  0.57225694
    [3,]  0.5742861 -0.32502378 -0.1710159 -0.36382689  0.28558047
    [4,] -0.4022616 -0.16627648  0.3823066  0.03168612  0.15454532
    [5,]  0.4890407  0.06011528 -0.3131707 -0.08551046 -0.15047484
    rowMeans(a3)[1:5]
    [1]  0.000000e+000 -4.440892e-017 -8.881784e-017  4.440892e-017 -4.440892e-017
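The three sweep operations shown in the output (centering columns, dividing columns by their sums, centering rows) might be written as below in R. The small matrix `a` is hypothetical and stands in for the page's data set, whose construction is not shown.

```r
# Hypothetical 3 x 2 matrix standing in for the data set a.
a <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# Center each column: subtract the column means ("-" is the default in R).
a1 <- sweep(a, 2, colMeans(a))

# Divide each column by its sum, so every column adds up to 1.
a2 <- sweep(a, 2, colSums(a), "/")

# Center each row around its own mean.
a3 <- sweep(a, 1, rowMeans(a))

print(colMeans(a1))   # zero up to rounding error
print(colSums(a2))
print(rowMeans(a3))   # zero up to rounding error
```

The tiny nonzero values such as `4.440892e-017` in the SPLUS output are floating-point rounding error, which is why a tolerance is the safer way to check that the means are zero.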

## The column functions

There is a suite of functions whose sole purpose is to compute summary statistics over columns of vectors, matrices, arrays and data frames. These functions include:

- colMeans
- colSums
- colVars
- colStdevs

    #creating the data set
    a
           a.1      a.2      a.3      a.4      a.5
    1 1.041913 1.076260 1.446673 1.383471 1.151391
    2 1.061720 1.227547 1.676843 1.064530 1.972981
    3 1.970468 1.071158 1.225166 1.032355 1.681762
    4 1.038017 1.274002 1.822585 1.471965 1.594824
    5 1.996136 1.567211 1.193925 1.421585 1.356621

    #Get column means using column functions
    #input is a matrix, results in a vector
    col.means1
    [1] 1.470437 1.497412 1.553592 1.454150 1.613789
    is.vector(col.means1)
    [1] T

## The row functions

There is a suite of functions whose sole purpose is to compute summary statistics over rows of vectors, matrices, arrays and data frames. These functions include:

- rowMeans
- rowSums
- rowVars
- rowStdevs

    #Get row means using row functions
    #input is the matrix a, results are in a vector
    row.means1
    [1] 1.373179 1.550485 1.489238 1.599728 1.522427
    is.vector(row.means1)
    [1] T
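For readers working in R rather than SPLUS, the sketch below shows the column and row functions on a small hypothetical matrix. `colMeans`, `colSums`, `rowMeans` and `rowSums` exist in base R; `colVars` and `colStdevs` are SPLUS functions, so the `apply` calls here are assumed stand-ins rather than the SPLUS implementations.

```r
# Hypothetical 3 x 2 matrix.
a <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

col.means <- colMeans(a)
col.sums  <- colSums(a)
row.means <- rowMeans(a)
row.sums  <- rowSums(a)

# Assumed base R stand-ins for the SPLUS colVars and colStdevs.
col.vars   <- apply(a, 2, var)
col.stdevs <- apply(a, 2, sd)

print(col.means)
print(row.means)
print(col.vars)
print(col.stdevs)
```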

## Miscellaneous

It is important to realize that there are usually many different ways of obtaining the same results and that these methods differ in efficiency and other details. The following examples show three different methods for obtaining the column means and how the form of the results differs.

    #Get column means using column functions
    #input is the matrix a, results in a vector
    col.means1
    [1] 1.470437 1.497412 1.553592 1.454150 1.613789
    is.vector(col.means1)
    [1] T

    #get column means using apply
    #input is a matrix, results in a vector
    col.means2
    [1] 1.470437 1.497412 1.553592 1.454150 1.613789
    is.vector(col.means2)
    [1] T

    #get column means using lapply
    #input is the data frame which is a list, results are in a list
    col.means3
    $a.1:
    [1] 1.470437

    $a.2:
    [1] 1.497412

    $a.3:
    [1] 1.553592

    $a.4:
    [1] 1.45415

    $a.5:
    [1] 1.613789
    is.list(col.means3)
    [1] T

The following examples show three different methods for obtaining the row means and how the form of the results differs.

    #Get row means using row functions
    #input is the matrix a, results are in a vector
    row.means1
    [1] 1.373179 1.550485 1.489238 1.599728 1.522427
    is.vector(row.means1)
    [1] T

    #using apply, input is a matrix, results are in a vector
    row.means2
    [1] 1.606348 1.350039 1.601302 1.631221 1.616117
    is.vector(row.means2)
    [1] T

    #we can transpose the data frame and create a new data frame
    ta.df
    #use lapply on the data frame since it is a list
    #results are in a list
    row.means3
    $X1:
    [1] 1.219942

    $X2:
    [1] 1.400724

    $X3:
    [1] 1.396182

    $X4:
    [1] 1.440279

    $X5:
    [1] 1.507096
    is.list(row.means3)
    [1] T
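The comparison can be sketched in R on a small hypothetical matrix. The names follow the output above, but the right-hand sides are assumptions inferred from the comments, since the original assignments are not shown.

```r
# Hypothetical data and its data frame version.
a <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
a.df <- data.frame(a)

col.means1 <- colMeans(a)          # numeric vector
col.means2 <- apply(a, 2, mean)    # numeric vector
col.means3 <- lapply(a.df, mean)   # list, one element per variable

# Row means via lapply: transpose first so that rows become variables.
ta.df <- data.frame(t(a))
row.means3 <- lapply(ta.df, mean)

print(col.means1)
print(col.means2)
print(is.list(col.means3))
print(unlist(row.means3))
```

All three approaches give the same numbers; what changes is whether the result arrives as a vector or as a list.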

Any of the functions mentioned above can be used inside a user-defined function. In the following example the function `f1` multiplies the sequence 1 to x by y by using the `lapply` function instead of a `for` loop.

    f1
    #multiplying the sequence 1:3 by 2
    f1(3, 2)
    [[1]]:
    [1] 2

    [[2]]:
    [1] 4

    [[3]]:
    [1] 6

    #multiplying the sequence 1:4 by 10
    f1(4, 10)
    [[1]]:
    [1] 10

    [[2]]:
    [1] 20

    [[3]]:
    [1] 30

    [[4]]:
    [1] 40
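The body of `f1` is not shown on the page. One possible definition that reproduces the printed results is sketched here in R; it is an assumption, not the original code.

```r
# One possible definition (assumed): apply i * y to each i in 1..x.
f1 <- function(x, y) {
  lapply(seq_len(x), function(i, y) i * y, y = y)
}

print(f1(3, 2))

# When a plain vector is enough, the multiplication is already vectorized.
print(seq_len(4) * 10)
```

The `lapply` version returns a list, which matters when each result could have a different length; for scalar results the plain vector form is simpler.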

A cool use of the `lapply` function, which can be used in many clever ways: adding the corresponding elements of two lists by applying a function over their index.

    list1
    [[1]]
    [1] 0.796063

    [[2]]
    [1] 0.5456884 0.8709621

    [[3]]
    [1] 0.6957483 0.2939853 0.3849384

    [[4]]
    [1] 0.77135125 0.14607271 0.58522428 0.09303452

    [[5]]
    [1] 0.6610859 0.2150228 0.1366291 0.0921625 0.6901002

    [[6]]
    [1] 0.1368554 0.2575741 0.1218799 0.6293610 0.4628676 0.8303309

    list2
    [[1]]
    [1] 0.444944

    [[2]]
    [1] 0.1236448 0.1709105

    [[3]]
    [1] 0.01409603 0.76272480 0.65504591

    [[4]]
    [1] 0.7724182 0.7856118 0.2360862 0.2319794

    [[5]]
    [1] 0.1785266 0.5621904 0.3170615 0.2320846 0.6087983

    [[6]]
    [1] 0.67940923 0.99266570 0.05010323 0.50740777 0.11782769 0.71910324

    lapply(1:6, function(i, x, y) x[[i]] + y[[i]], x = list1, y = list2)
    [[1]]
    [1] 1.241007

    [[2]]
    [1] 0.6693333 1.0418726

    [[3]]
    [1] 0.7098443 1.0567101 1.0399843

    [[4]]
    [1] 1.5437694 0.9316845 0.8213105 0.3250139

    [[5]]
    [1] 0.8396126 0.7772132 0.4536906 0.3242471 1.2988985

    [[6]]
    [1] 0.8162646 1.2502398 0.1719831 1.1367687 0.5806953 1.5494342
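The index trick above can be tried on two small hypothetical lists in R. The `Map` call is an R alternative added for comparison; it is not part of the original page and may not be available in SPLUS.

```r
# Hypothetical ragged lists with matching element lengths.
list1 <- list(1, c(1, 2), c(1, 2, 3))
list2 <- list(10, c(10, 20), c(10, 20, 30))

# Iterate over the indices and pass both lists as extra arguments.
sums <- lapply(seq_along(list1), function(i, x, y) x[[i]] + y[[i]],
               x = list1, y = list2)

# In R, Map pairs the elements up directly and gives the same values.
sums.map <- Map(function(x, y) x + y, list1, list2)

print(sums)
```

Iterating over the index rather than the elements is what lets a single `lapply` call walk two lists in step.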