In survival analysis and discrete-time event modeling, one compact way to store data is the “person-level” (or, more generally, “observation-level”) format. For example, you could have three variables: one identifying the observation, one giving the time period in which the event occurred or the last follow-up period, and one indicating whether the observation was censored. The book Applied Longitudinal Data Analysis has an example of this sort of data in chapter 10: teachers are observed in a school district for a 12-year period, with the event being whether or not they left their job.
Another way to store the same data is the “person-period” format. Here, there is a separate row for each person for each period they were observed. The event occurs in the last period they were observed, unless the observation was censored. The question then is: how can a dataset be converted from one format to the other? For demonstration, we will use the teachers dataset from the book mentioned above.
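To make the two layouts concrete, here is a small sketch using made-up data. The values and the object names pl and pp are hypothetical and are not taken from the teachers dataset. It assumes the same coding as the examples on this page: censor is 1 when the observation was censored, and event is 1 in the period where the event occurred.

```r
## Hypothetical person-level data: one row per person
pl <- data.frame(id = c(1, 2), t = c(2, 3), censor = c(0, 1))

## The same information in person-period form: one row per person per period
## (person 1 has the event in period 2; person 2 is censored, so no event)
pp <- data.frame(
  id     = c(1, 1, 2, 2, 2),
  period = c(1, 2, 1, 2, 3),
  event  = c(0, 1, 0, 0, 0)
)

## The number of person-period rows equals the total number of periods observed
nrow(pp) == sum(pl$t)
```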
Basics
We define a custom function that will convert back and forth for us. Note that this is just one of many possible ways to do this in R. In this section we describe how to use the function and give examples. The next section (“Details”) explains how it works and what happens at each step of the way.
The PLPP function takes five arguments. The first, data, is the data set to be converted. The second, id, is the name of the variable containing the identifier for each observation. The third, period, is the name of the variable that indicates how many periods the person or observation was observed for (in person-level data) or which period a row refers to (in person-period data). The fourth, event, is the name of the variable that indicates whether the observation was censored or whether the event occurred, depending on the direction of the conversion. The fifth, direction, indicates whether the function should go from person-level to person-period or from person-period to person-level. There are two options: “period” to go to person-period (the default) or “level” to go to person-level. Now let’s try it out. For the examples that follow to work, you need to source the function into R.
## Person-Level Person-Period Converter Function
PLPP <- function(data, id, period, event, direction = c("period", "level")) {
  ## Data Checking and Verification Steps
  stopifnot(is.matrix(data) || is.data.frame(data))
  stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))

  if (any(is.na(data[, c(id, period, event)]))) {
    stop("PLPP cannot currently handle missing data in the id, period, or event variables")
  }

  ## Do the conversion
  switch(match.arg(direction),
    period = {
      index <- rep(1:nrow(data), data[, period])
      idmax <- cumsum(data[, period])
      reve <- !data[, event]
      dat <- data[index, ]
      dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
      dat[, event] <- 0
      dat[idmax, event] <- reve},
    level = {
      tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
      index <- as.vector(by(tmp, tmp[, id],
        FUN = function(x) x[which.max(x[, period]), "i"]))
      dat <- data[index, ]
      dat[, event] <- as.integer(!dat[, event])
    })

  rownames(dat) <- NULL
  return(dat)
}
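One detail of the definition worth a quick illustration is how the direction argument is resolved with match.arg(). The sketch below uses a hypothetical helper, pick_direction, that exists only for this demonstration and is not part of PLPP. It shows that the first choice is used when the argument is omitted and that unambiguous abbreviations are accepted.

```r
## Hypothetical helper that resolves its argument the same way PLPP does
pick_direction <- function(direction = c("period", "level")) {
  match.arg(direction)
}

pick_direction()          # argument omitted: the first choice, "period"
pick_direction("level")   # exact match
pick_direction("lev")     # partial match, resolves to "level"

## Anything that matches neither choice raises an error
res <- tryCatch(pick_direction("wide"), error = function(e) "error")
res
```

This is why calling PLPP without a direction converts to person-period form.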
Now we can use the function to convert between person-level and person-period datasets.
## Read in the person-level dataset
teachers <- read.csv("https://stats.idre.ucla.edu/stat/examples/alda/teachers.csv")

## Look at a subset of the cases
subset(teachers, id %in% c(20, 126, 129))
     id  t censor
19   20  3      0
125 126 12      0
128 129 12      1

## Uses PLPP to convert to person-period and store in object, 'tpp'
tpp <- PLPP(data = teachers, id = "id", period = "t", event = "censor", direction = "period")

## Look at a subset of the cases
subset(tpp, id %in% c(20, 126, 129))
     id  t censor
95   20  1      0
96   20  2      0
97   20  3      1
740 126  1      0
741 126  2      0
742 126  3      0
743 126  4      0
744 126  5      0
745 126  6      0
746 126  7      0
747 126  8      0
748 126  9      0
749 126 10      0
750 126 11      0
751 126 12      1
760 129  1      0
761 129  2      0
762 129  3      0
763 129  4      0
764 129  5      0
765 129  6      0
766 129  7      0
767 129  8      0
768 129  9      0
769 129 10      0
770 129 11      0
771 129 12      0
Because the book also provides an example dataset in person-period form, we can read that dataset in and compare it with our results.
## Read in person-period dataset
teachers.pp <- read.csv("https://stats.idre.ucla.edu/stat/examples/alda/teachers_pp.csv")

## Look at a subset of the cases
subset(teachers.pp, id %in% c(20, 126, 129))
     id period event
95   20      1     0
96   20      2     0
97   20      3     1
740 126      1     0
741 126      2     0
742 126      3     0
743 126      4     0
744 126      5     0
745 126      6     0
746 126      7     0
747 126      8     0
748 126      9     0
749 126     10     0
750 126     11     0
751 126     12     1
760 129      1     0
761 129      2     0
762 129      3     0
763 129      4     0
764 129      5     0
765 129      6     0
766 129      7     0
767 129      8     0
768 129      9     0
769 129     10     0
770 129     11     0
771 129     12     0
Aside from the different column names, they look identical. In fact, if we change the names, we can test whether the two datasets are equal using the all.equal function.
colnames(tpp) <- colnames(teachers.pp)
all.equal(tpp, teachers.pp)
[1] TRUE
The PLPP function can also convert from a person-period to a person-level dataset. Here is an example going the opposite direction. Note that the names passed to the period and event arguments changed from before because we overwrote the column names in tpp with those in teachers.pp.
## Convert from person-period to person-level
t2 <- PLPP(data = tpp, id = "id", period = "period", event = "event", direction = "level")

## Look at a subset of the cases
subset(t2, id %in% c(20, 126, 129))
     id period event
19   20      3     0
125 126     12     0
128 129     12     1
As before, we can test that t2 is equal to teachers; all we need to do is change the column names.
colnames(t2) <- colnames(teachers)
all.equal(t2, teachers)
[1] TRUE

## We could go one step further and check that converting back and forth
## returns the original data. We convert teachers to period and back to level
## and compare with unconverted teachers
all.equal(PLPP(data = PLPP(teachers, "id", "t", "censor", "period"),
               "id", "t", "censor", "level"), teachers)
[1] TRUE
Details
The basic concept of how the conversion works is straightforward, but the code to do it is a bit tricky. First, we will look at a list of what needs to be accomplished to go from person-level to person-period.
- If each row contains one observation, then each row needs to be replicated once for each period in which that observation was observed.
- Each row for each ID in the new data needs to have an indicator of what period it is for. Because we assume discrete time, this can just count from 1 to however many total periods that ID was in.
- The last period that an ID is in should be marked as the one where the event occurred, unless it is censored.
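The three steps above can be sketched on a tiny made-up dataset. This is a minimal illustration of the idea rather than a replacement for PLPP: the data values and the object names pl and pp are hypothetical, and it assumes one row per ID with censor coded 1 for censored and 0 otherwise.

```r
## Hypothetical person-level data (censor: 1 = censored, 0 = event observed)
pl <- data.frame(id = c(1, 2, 3), t = c(1, 2, 3), censor = c(0, 1, 0))

## Step 1: replicate each row once per period
pp <- pl[rep(seq_len(nrow(pl)), pl$t), ]

## Step 2: number the periods 1, 2, ... within each id
pp$t <- ave(pp$t, pp$id, FUN = seq_along)

## Step 3: mark the event in each id's last row unless it was censored;
## cumsum() of the original period counts gives the last row of each id
pp$event <- 0
pp$event[cumsum(pl$t)] <- as.integer(!pl$censor)

rownames(pp) <- NULL
pp
```

Here ID 2 was censored, so none of its rows gets an event, while IDs 1 and 3 have the event in their final period.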
To illustrate each step of the function, we will walk through the log from an interactive debugging session, which lets us step through the function line by line and inspect what is there at each step.
debug(PLPP)
test <- PLPP(teachers, "id", "t", "censor", "period")
debugging in: PLPP(teachers, "id", "t", "censor", "period")
debug: {
    stopifnot(is.matrix(data) || is.data.frame(data))
    stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
    if (any(is.na(data[, c(id, period, event)]))) {
        stop("PLPP cannot currently handle missing data in the id, period, or event variables")
    }
    switch(match.arg(direction), period = {
        index <- rep(1:nrow(data), data[, period])
        idmax <- cumsum(data[, period])
        reve <- !data[, event]
        dat <- data[index, ]
        dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
        dat[, event] <- 0
        dat[idmax, event] <- reve
    }, level = {
        tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
        index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[, period]), "i"]))
        dat <- data[index, ]
        dat[, event] <- as.integer(!dat[, event])
    })
    rownames(dat) <- NULL
    return(dat)
}
Browse[2]> ## The first thing we see is an echo of the parsed function definition
Browse[2]> ## note that in the parsed version, comments and spaces are removed
Browse[2]> ## now R will proceed through the function line by line
Browse[2]> ## the first part of the function is just a series of logical checks
Browse[2]> ## to make sure that the data is in an acceptable format and all the
Browse[2]> ## arguments are appropriate for the function
Browse[2]> ## it is not foolproof, but it helps to ensure that the results are
Browse[2]> ## accurate and trustworthy. We can also list everything in the function
Browse[2]> ls()
[1] "data"      "direction" "event"     "id"        "period"
Browse[2]> ## This shows us all the objects currently defined inside the function
Browse[2]> ## environment. These are recognizable as all the arguments
Browse[2]> class(data)
[1] "data.frame"
Browse[2]> direction
[1] "period"
Browse[2]> event
[1] "censor"
Browse[2]> id
[1] "id"
Browse[2]> period
[1] "t"
Browse[2]> ## So the data is a data frame and we can see that all the argument variables
Browse[2]> ## contain the values that we set them to (as they should)
Browse[2]>
debug: stopifnot(is.matrix(data) || is.data.frame(data))
Browse[2]> ## this is the first check, data must be a matrix or data frame
Browse[2]>
debug: stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
Browse[2]> ## the second check, the values passed to id, period, and event
Browse[2]> ## must be in the column names of the data or in the column indices
Browse[2]>
debug: if (any(is.na(data[, c(id, period, event)]))) {
    stop("PLPP cannot currently handle missing data in the id, period, or event variables")
}
Browse[2]> ## this checks whether any of id, period, or event contain missing values
Browse[2]> ## if there were missing values, the function would return the error message
Browse[2]> ## written above
Browse[2]> ## is.na() returns TRUE/FALSE for each cell, any() says, are there *any* TRUE
Browse[2]>
debug: NULL
Browse[2]> ## NULL is returned because the condition for the if() statement was not met
Browse[2]> ## this means that R will skip over the expression inside the braces { }
Browse[2]> ## and continue evaluation
Browse[2]>
debug: switch(match.arg(direction), period = {
    index <- rep(1:nrow(data), data[, period])
    idmax <- cumsum(data[, period])
    reve <- !data[, event]
    dat <- data[index, ]
    dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
    dat[, event] <- 0
    dat[idmax, event] <- reve
}, level = {
    tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
    index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[, period]), "i"]))
    dat <- data[index, ]
    dat[, event] <- as.integer(!dat[, event])
})
Browse[2]> ## Now the debugger is showing us the next line of code that is going to
Browse[2]> ## be debugged. On the surface, this may seem like more than one line
Browse[2]> ## but it is all one call to the function, switch() so it is one line to R
Browse[2]> ## in order to examine each step more closely, we will need to turn on
Browse[2]> ## debugging for the switch function too (or everything in there would fly by)
Browse[2]> debug(switch)
Browse[2]> ## now we can continue
Browse[2]>
debug: index <- rep(1:nrow(data), data[, period])
Browse[2]> ## notice we are now debugging inside of switch()
Browse[2]> ## so the first thing that happened is based on the argument, direction
Browse[2]> ## the switch function jumps to the code after period or the code after level
Browse[2]> ## it will only pick one, and what is not chosen, is not evaluated
Browse[2]> direction
[1] "period"
Browse[2]> ## since direction == "period", we are in the 'period' track
Browse[2]> ## the first line (above) of the period track created an index
Browse[2]> ## this is used to replicate each row of the original dataset as many times as
Browse[2]> ## necessary.
Browse[2]> ## for 1 through n rows of the dataset, it repeats the row number
Browse[2]> ## the number of times in the variable period
Browse[2]> ## so for example
Browse[2]> data[1:5, period]
[1]  1  2  1  1 12
Browse[2]> ## the first row will be repeated 1 time, the second 2 times, the third 1 time,
Browse[2]> ## the fourth 1 time and the fifth 12 times
Browse[2]> ## on a small scale this looks like:
Browse[2]> rep(1:5, data[1:5, period])
[1] 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5
Browse[2]> ## These values are then stored in the object, index
Browse[2]> ls()
[1] "data"      "direction" "event"     "id"        "period"
Browse[2]>
debug: idmax <- cumsum(data[, period])
Browse[2]> index[1:20]
[1] 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5 6 7 7
Browse[2]> ## note that the first 20 elements of index match the first few from our example above
Browse[2]> ## next we create an object called 'idmax'
Browse[2]> ## this contains the cumulative sum of the period variable in the original data
Browse[2]> ## this will serve as an index of the row numbers in the person-period data
Browse[2]> ## where the event should occur (unless it is censored)
Browse[2]> ## again just for the first 5 rows of the person-level data
Browse[2]> cumsum(data[1:5, period])
[1]  1  3  4  5 17
Browse[2]> ## compare this with the index
Browse[2]> index[1:20]
[1] 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5 6 7 7
Browse[2]> ## the first row is repeated once, so its event occurs in row 1 of the person-period data
Browse[2]> ## the second row is repeated twice, 1 + 2 = 3 so the event for row 2 of the person-level
Browse[2]> ## appears in row 3 of the person-period, and so on and so forth for all the rows
Browse[2]> ## Now for the event, we want it to be 1 if the event occurs or 0 otherwise
Browse[2]> ## However, if the observation is censored, (1 for censored, 0 for not censored)
Browse[2]> ## then the event should not occur
Browse[2]> ## in short, for a given observation, the last period should contain the event
Browse[2]> ## if it is not censored
Browse[2]> ## R stores TRUE/FALSE internally as 1/0, so if we want an event (1) if censored is FALSE (0)
Browse[2]> ## we can just negate the value in the censored variable to use as the value for the event
Browse[2]> ## variable. This is what the next line of code does
Browse[2]>
debug: reve <- !data[, event]
Browse[2]> ## now we create the person-period data, 'dat', using the index we created (replicates rows)
Browse[2]>
debug: dat <- data[index, ]
Browse[2]> ## the next line of code creates a sequence (seq_along) of period from the
Browse[2]> ## person-period data, _for each ID_
Browse[2]> ## For example, for the first row, it would just be 1 (since that was only repeated once)
Browse[2]> ## for the second row, it would be 1 2 (since that row was repeated twice)
Browse[2]> ## in particular, the variable will be an integer class starting at 1 going to whatever the
Browse[2]> ## number of times a particular ID was repeated in the person-period data
Browse[2]> ## (this should always be the same as the value in the period variable from the
Browse[2]> ## person level data, because that is what _determined_ how many times the rows were repeated
Browse[2]> ## in the person-period dataset)
Browse[2]>
debug: dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
Browse[2]> ## now since events only occur (if they occur at all) in the last
Browse[2]> ## period of observation for a given ID, we will set all events equal to 0
Browse[2]>
debug: dat[, event] <- 0
Browse[2]> ## now using the idmax index (which contains the rows of the last period for each ID)
Browse[2]> ## and the reve object which is just the negation of whether an ID was censored
Browse[2]> ## we can overwrite the relevant rows of the event variable in the new person-period
Browse[2]> ## dataset with the appropriate values (TRUE/FALSE)
Browse[2]> ## however, since we set everything to 0, it is currently a numeric variable
Browse[2]> ## and since we are only replacing a few rows of it, the TRUE/FALSE values will
Browse[2]> ## be upgraded from logical (a lower class) to numeric (a higher class)
Browse[2]> ## which means they will be converted to their underlying 1/0 values
Browse[2]> ## basically R is quietly doing something like
Browse[2]> reve[1:10]
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
Browse[2]> as.integer(reve[1:10])
[1] 1 1 1 1 0 1 0 1 1 1
Browse[2]>
debug: dat[idmax, event] <- reve
Browse[2]> ## once that is done the 'dat' variable contains the finalized person-period data
Browse[2]> ## the only thing left to do is since rows were repeated
Browse[2]> ## the rownames will be rather strange (orig.number_of_repeats)
Browse[2]> ## for example 1.1
Browse[2]> ## 2.1 and 2.2, etc.
Browse[2]> ## we will simply set them to NULL so that R will use the default
Browse[2]> ## which is just an index from 1 to the total number of rows
Browse[2]>
debug: rownames(dat) <- NULL
Browse[2]> ## (note that we jumped out of the switch debugging because it was done)
Browse[2]> ## now we can just return the object 'dat' which is the converted dataset
Browse[2]>
debug: return(dat)
Browse[2]> ## and we are done
Browse[2]>
exiting from: PLPP(teachers, "id", "t", "censor", "period")
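The coercion behaviour relied on in the last assignment of the log can be checked in isolation. The following is a small sketch with made-up vectors (ev, flags, and whole are hypothetical names, not objects from PLPP). It shows that replacing only some elements of a vector of zeros coerces logical values to numbers, whereas replacing a whole object keeps it logical.

```r
## A vector of zeros, standing in for the event column after it is set to 0
ev <- rep(0, 5)

## Hypothetical logical values standing in for 'reve'
flags <- c(TRUE, FALSE)

## Replacing only some elements keeps the vector numeric;
## TRUE/FALSE are coerced to 1/0
ev[c(2, 5)] <- flags
ev
class(ev)

## A freshly negated vector is logical, so as.integer() is needed
## when 1/0 values are wanted for a whole column
whole <- !c(0, 1, 0)
class(whole)
as.integer(whole)
```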
Now we will do an abbreviated version of the same, going from person-period to person-level. Only the differences will be highlighted during the debugging.
test2 <- PLPP(teachers.pp, "id", "period", "event", "level")
debugging in: PLPP(teachers.pp, "id", "period", "event", "level")
debug: {
    stopifnot(is.matrix(data) || is.data.frame(data))
    stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
    if (any(is.na(data[, c(id, period, event)]))) {
        stop("PLPP cannot currently handle missing data in the id, period, or event variables")
    }
    switch(match.arg(direction), period = {
        index <- rep(1:nrow(data), data[, period])
        idmax <- cumsum(data[, period])
        reve <- !data[, event]
        dat <- data[index, ]
        dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
        dat[, event] <- 0
        dat[idmax, event] <- reve
    }, level = {
        tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
        index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[, period]), "i"]))
        dat <- data[index, ]
        dat[, event] <- as.integer(!dat[, event])
    })
    rownames(dat) <- NULL
    return(dat)
}
Browse[2]> ## Same as before
Browse[2]>
debug: stopifnot(is.matrix(data) || is.data.frame(data))
Browse[2]>
debug: stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
Browse[2]>
debug: if (any(is.na(data[, c(id, period, event)]))) {
    stop("PLPP cannot currently handle missing data in the id, period, or event variables")
}
Browse[2]>
debug: NULL
Browse[2]>
debug: switch(match.arg(direction), period = {
    index <- rep(1:nrow(data), data[, period])
    idmax <- cumsum(data[, period])
    reve <- !data[, event]
    dat <- data[index, ]
    dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
    dat[, event] <- 0
    dat[idmax, event] <- reve
}, level = {
    tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
    index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[, period]), "i"]))
    dat <- data[index, ]
    dat[, event] <- as.integer(!dat[, event])
})
Browse[2]> ## Now we are entering switch() again, this time it will go to the
Browse[2]> ## level track because
Browse[2]> direction
[1] "level"
Browse[2]> ## the argument direction matches a different value
Browse[2]> ## the first thing we do is create a small temporary dataset
Browse[2]> ## it just contains the columns necessary for the conversion (i.e., period and id)
Browse[2]> ## the reason is that in theory, the data could have had 100s or 1000s of columns
Browse[2]> ## containing other variables measured along with the period and event
Browse[2]> ## we are only going to extract one row for each ID, the final period it was observed
Browse[2]> ## but to do this, we need to subset the dataset by ID
Browse[2]> ## rather than subset a potentially enormous dataset
Browse[2]> ## we make a small copy, then once the row indices are known
Browse[2]> ## the final data returned is just the appropriate rows from the full data
Browse[2]>
debug: tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
Browse[2]> ## Now that we have our small working data set, we need to find the row indices
Browse[2]> ## that contain the highest value of period for each ID
Browse[2]> ## we use the by() function which applies the function we specify
Browse[2]> ## to the tmp data, subset by each unique id
Browse[2]> ## the final results (which are in a special class called 'by' that
Browse[2]> ## is not unlike a list, but is not quite a list) are converted to a vector
Browse[2]> ## using the as.vector() function
Browse[2]>
debug: index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[, period]), "i"]))
Browse[2]> ## now we can create the person-level data set by just extracting the relevant rows
Browse[2]> ## using our index variable
Browse[2]>
debug: dat <- data[index, ]
Browse[2]> index[1:20]
 [1]  1  3  4  5 17 18 30 31 33 35 42 54 55 67 79 81 93 94 97 99
Browse[2]> ## in fact this should look quite familiar to us
Browse[2]> ## it is the same as idmax from the other direction created with cumsum
Browse[2]> ## the final step is to convert from whether or not an event occurred in the final
Browse[2]> ## observed period to whether the observation is censored
Browse[2]> ## just like we negated censoring to create the event, we can negate the event to create
Browse[2]> ## the censoring
Browse[2]> ## the only new trick is that because we are replacing the entire column
Browse[2]> ## if we want 1/0s instead of TRUE/FALSE, we will need to be explicit
Browse[2]> ## this is done using the as.integer() function
Browse[2]>
debug: dat[, event] <- as.integer(!dat[, event])
Browse[2]> ## now as before we will reset the rownames to their default by setting to NULL
Browse[2]>
debug: rownames(dat) <- NULL
Browse[2]> ## and we can return 'dat' which is the person-level dataset created from
Browse[2]> ## the person-period data
Browse[2]>
debug: return(dat)
Browse[2]>
exiting from: PLPP(teachers.pp, "id", "period", "event", "level")
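The same person-period to person-level logic can be tried on a few rows of made-up data. This is only a sketch of the technique used in the level track (by() with which.max()); the data values and the names pp, tmp, last, and pl are hypothetical.

```r
## Hypothetical person-period data
pp <- data.frame(
  id     = c(1, 1, 2, 2, 2),
  period = c(1, 2, 1, 2, 3),
  event  = c(0, 1, 0, 0, 0)
)

## Small working copy holding only id, period, and the original row number
tmp <- data.frame(id = pp$id, period = pp$period, i = seq_len(nrow(pp)))

## For each id, find the row number of its largest period
last <- as.vector(by(tmp, tmp$id, FUN = function(x) x[which.max(x$period), "i"]))
last

## Keep those rows and turn the event indicator into a censoring indicator
pl <- pp[last, ]
pl$event <- as.integer(!pl$event)
rownames(pl) <- NULL
pl
```

After the last step the event column holds the censoring indicator: ID 1 had the event (0, not censored) and ID 2 did not (1, censored).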
Finally, to avoid entering the debugger every time you use the function (until you close and restart R), use the undebug function.
undebug(PLPP)
undebug(switch)
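If you are unsure whether a function is still flagged for debugging, base R's isdebugged() reports the flag. The sketch below uses a hypothetical function f purely for illustration, so that it can be run without PLPP being loaded.

```r
## A hypothetical function used only for illustration
f <- function(x) x + 1

debug(f)
flag_on <- isdebugged(f)    # TRUE: calling f() would now open the browser

undebug(f)
flag_off <- isdebugged(f)   # FALSE: f() runs normally again

c(flag_on, flag_off)
```

For a one-off session, debugonce(f) is an alternative: it sets the flag for a single call, so no undebug() is needed afterwards.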