In survival analysis and discrete-time event modeling, one compact way to store data is in what may be called “person-level” or, more generally, “observation-level” format. For example, you could have three variables: one identifying the observation, one giving the time period in which the event occurred or the last follow-up period, and one indicating whether the observation was censored. The book Applied Longitudinal Data Analysis has an example of this sort of data in chapter 10. Teachers are observed in a school district for a 12-year period, with the event being whether or not they left their job.
Another way to store the same data is in what could be called “person-period” format. Here, there is a separate row for each person for each period they were observed. The event occurs in the last period they were observed unless the observation was censored. The question then is, how can a dataset be converted from one format to the other? For demonstration, we will use the dataset on teachers from the book mentioned above.
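To make the two layouts concrete, here is a minimal sketch using a made-up three-person dataset (the values and the object names pl and pp are hypothetical, not taken from the book). The column names mirror the teachers data used later, and the person-period version is written out by hand.

```r
## Hypothetical toy data: three people in person-level format
## t = last period observed, censor = 1 if censored, 0 if the event occurred
pl <- data.frame(id = c(1, 2, 3), t = c(2, 3, 1), censor = c(0, 1, 0))

## The same information in person-period format, written out by hand:
## one row per person per period, event = 1 only in the period the event occurred
pp <- data.frame(
  id     = c(1, 1, 2, 2, 2, 3),
  period = c(1, 2, 1, 2, 3, 1),
  event  = c(0, 1, 0, 0, 0, 1)
)

## The person-period data has one row for every period each person was observed
nrow(pp) == sum(pl$t)
```

Person 2 is censored, so none of that person's three rows carries an event.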
Basics
We define a custom function that will convert back and forth for us. Note that this is just one of many possible ways to do this in R. In this section we describe how to use the function and give examples. The next section (“Details”) explains how it works and what happens at each step of the way.
The PLPP function takes five arguments. The first, data, is the data set to be converted. The second, id, is the name of the variable containing the identifier for each observation. The third, period, is the name of the variable that indicates how many periods the person or observation was in. The fourth, event, is the name of the variable that indicates whether the event occurred or whether the observation was censored (depending on which direction you are converting). The fifth, direction, indicates whether the function should go from person-level to person-period or from person-period to person-level. There are two options: “period” to go to person-period (the default) or “level” to go to person-level. Now let’s try it out. For the examples that follow to work, you need to source the function into R.
## Person-Level Person-Period Converter Function
PLPP <- function(data, id, period, event, direction = c("period", "level")) {
  ## Data Checking and Verification Steps
  stopifnot(is.matrix(data) || is.data.frame(data))
  stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
  if (any(is.na(data[, c(id, period, event)]))) {
    stop("PLPP cannot currently handle missing data in the id, period, or event variables")
  }
  ## Do the conversion
  switch(match.arg(direction),
    period = {
      index <- rep(1:nrow(data), data[, period])
      idmax <- cumsum(data[, period])
      reve <- !data[, event]
      dat <- data[index, ]
      dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
      dat[, event] <- 0
      dat[idmax, event] <- reve},
    level = {
      tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
      index <- as.vector(by(tmp, tmp[, id],
        FUN = function(x) x[which.max(x[, period]), "i"]))
      dat <- data[index, ]
      dat[, event] <- as.integer(!dat[, event])
    })
  rownames(dat) <- NULL
  return(dat)
}
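Because the function stops when the id, period, or event variables contain missing values, such rows have to be dealt with before converting. One possible approach, sketched here on made-up data, is to keep only the rows that are complete on those three variables using complete.cases. Whether dropping rows is appropriate depends on the analysis; the toy data and object names are assumptions for illustration only.

```r
## Hypothetical toy data with missing values in the period and censoring columns
toy <- data.frame(id = 1:4, t = c(2, NA, 3, 1), censor = c(0, 1, NA, 0))

## Flag rows that are complete on the three variables the converter checks
keep <- complete.cases(toy[, c("id", "t", "censor")])

## Keep only the complete rows; these would pass the missing-data check
toy_complete <- toy[keep, ]
toy_complete
```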
Now we can use the function to convert between person-level and person-period datasets.
## Read in the person-level dataset
teachers <- read.csv("https://stats.idre.ucla.edu/stat/examples/alda/teachers.csv")
## Look at a subset of the cases
subset(teachers, id %in% c(20, 126, 129))
     id  t censor
19   20  3      0
125 126 12      0
128 129 12      1
## Use PLPP to convert to person-period and store in object, 'tpp'
tpp <- PLPP(data = teachers, id = "id", period = "t",
event = "censor", direction = "period")
## Look at a subset of the cases
subset(tpp, id %in% c(20, 126, 129))
     id  t censor
95   20  1      0
96   20  2      0
97   20  3      1
740 126  1      0
741 126  2      0
742 126  3      0
743 126  4      0
744 126  5      0
745 126  6      0
746 126  7      0
747 126  8      0
748 126  9      0
749 126 10      0
750 126 11      0
751 126 12      1
760 129  1      0
761 129  2      0
762 129  3      0
763 129  4      0
764 129  5      0
765 129  6      0
766 129  7      0
767 129  8      0
768 129  9      0
769 129 10      0
770 129 11      0
771 129 12      0
Because there is an example dataset in person-period form provided from the book, we can read that dataset in and compare our results with that one.
## Read in person-period dataset
teachers.pp <- read.csv("https://stats.idre.ucla.edu/stat/examples/alda/teachers_pp.csv")
## Look at a subset of the cases
subset(teachers.pp, id %in% c(20, 126, 129))
     id period event
95   20      1     0
96   20      2     0
97   20      3     1
740 126      1     0
741 126      2     0
742 126      3     0
743 126      4     0
744 126      5     0
745 126      6     0
746 126      7     0
747 126      8     0
748 126      9     0
749 126     10     0
750 126     11     0
751 126     12     1
760 129      1     0
761 129      2     0
762 129      3     0
763 129      4     0
764 129      5     0
765 129      6     0
766 129      7     0
767 129      8     0
768 129      9     0
769 129     10     0
770 129     11     0
771 129     12     0
Aside from different column names, they look identical. In fact, if we change the names, we can test whether the two datasets are equal using the all.equal function.
colnames(tpp) <- colnames(teachers.pp)
all.equal(tpp, teachers.pp)
[1] TRUE
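A note on why all.equal is a reasonable choice for this comparison: it tests for near equality and does not distinguish integer from double storage, whereas identical requires an exact match. The sketch below uses two made-up data frames (a and b are hypothetical names) that differ only in how one column is stored.

```r
## Two hypothetical data frames with the same values;
## 'event' is stored as integer in a and as double in b
a <- data.frame(id = c(1L, 2L), event = c(0L, 1L))
b <- data.frame(id = c(1L, 2L), event = c(0, 1))

## all.equal() compares values within a tolerance and ignores integer vs double
same_values <- isTRUE(all.equal(a, b))

## identical() also compares storage type, so it reports a difference
same_exactly <- identical(a, b)

c(same_values = same_values, same_exactly = same_exactly)
```

Wrapping the call in isTRUE is useful in scripts, because all.equal returns a character description of the differences rather than FALSE when the objects differ.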
The PLPP function can also convert from a person-period to a person-level dataset. Here is an example going the opposite direction. Note that the names passed to the period and event arguments changed from before because we overwrote the column names in tpp with those in teachers.pp.
## Convert from person-period to person-level
t2 <- PLPP(data = tpp, id = "id", period = "period",
event = "event", direction = "level")
## Look at a subset of the cases
subset(t2, id %in% c(20, 126, 129))
     id period event
19   20      3     0
125 126     12     0
128 129     12     1
As before, we can test that t2 is equal to teachers. All we need to do is change the column names.
colnames(t2) <- colnames(teachers)
all.equal(t2, teachers)
[1] TRUE
## We could go one step further and check that converting back and forth
## recovers the original data. We convert teachers to period and back to level
## and compare with unconverted teachers
all.equal(PLPP(data = PLPP(teachers, "id", "t", "censor", "period"),
               "id", "t", "censor", "level"), teachers)
[1] TRUE
Details
The basic concept of how the conversion works is straightforward, but the code to do it is a bit tricky. First, we will look at a list of what needs to be accomplished to go from person-level to person-period.
- If each row contains one observation, then each row needs to be replicated the number of periods that it fell into.
- Each row for each ID in the new data needs to have an indicator of what period it is for. Because we assume discrete time, this can just count from 1 to however many total periods that ID was in.
- The last period that an ID is in should be marked as the one where the event occurred, unless it is censored.
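The three steps in the list above can be sketched outside the function on a small made-up dataset. The data and the object names toy, idx, and long are hypothetical; the calls (rep, ave with seq_along, cumsum) are the same ones the function uses.

```r
## Hypothetical person-level toy data: t = periods observed, censor = 1 if censored
toy <- data.frame(id = c(1, 2, 3), t = c(2, 3, 1), censor = c(0, 1, 0))

## Step 1: replicate each row as many times as the periods it was observed
idx <- rep(seq_len(nrow(toy)), toy$t)
long <- toy[idx, ]

## Step 2: within each id, count the periods from 1 upward
long$t <- ave(long$t, long$id, FUN = seq_along)

## Step 3: set every event to 0, then mark the last row of each id
## with 1 unless that id was censored
long$censor <- 0
long$censor[cumsum(toy$t)] <- as.numeric(!toy$censor)

## Reset the row names left over from replicating rows
rownames(long) <- NULL
long
```

After these steps the censor column holds the event indicator, just as in the function's output before the columns are renamed.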
To show each step of the function, I will show the log from an interactive debugging session (this allows me to step through the function line by line playing with what is there at each step).
debug(PLPP)
test <- PLPP(teachers, "id", "t", "censor", "period")
debugging in: PLPP(teachers, "id", "t", "censor", "period")
debug: {
stopifnot(is.matrix(data) || is.data.frame(data))
stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
if (any(is.na(data[, c(id, period, event)]))) {
stop("PLPP cannot currently handle missing data in the id, period, or event variables")
}
switch(match.arg(direction), period = {
index <- rep(1:nrow(data), data[, period])
idmax <- cumsum(data[, period])
reve <- !data[, event]
dat <- data[index, ]
dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
dat[, event] <- 0
dat[idmax, event] <- reve
}, level = {
tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[,
period]), "i"]))
dat <- data[index, ]
dat[, event] <- as.integer(!dat[, event])
})
rownames(dat) <- NULL
return(dat)
}
Browse[2]> ## The first thing we see is an echo of the parsed function definition
Browse[2]> ## note that in the parsed version, comments and spaces are removed
Browse[2]> ## now R will proceed through the function line by line
Browse[2]> ## the first part of the function is just a series of logical checks
Browse[2]> ## to make sure that the data is in an acceptable format and all the
Browse[2]> ## arguments are appropriate for the function
Browse[2]> ## it is not fool proof, but it helps to ensure that the results are
Browse[2]> ## accurate and trustworthy. We can also list everything in the function
Browse[2]> ls()
[1] "data" "direction" "event" "id" "period"
Browse[2]> ## This shows us all the objects currently defined inside the function
Browse[2]> ## environment. These are recognizable as all the arguments
Browse[2]> class(data)
[1] "data.frame"
Browse[2]> direction
[1] "period"
Browse[2]> event
[1] "censor"
Browse[2]> id
[1] "id"
Browse[2]> period
[1] "t"
Browse[2]> ## So the data is a data frame and we can see that all the argument variables
Browse[2]> ## contain the values that we set them to (as they should)
Browse[2]>
debug: stopifnot(is.matrix(data) || is.data.frame(data))
Browse[2]> ## this is the first check, data must be a matrix or data frame
Browse[2]>
debug: stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
Browse[2]> ## the second check, the values passed to id, period, and event
Browse[2]> ## must be in the column names of the data or in the column indices
Browse[2]>
debug: if (any(is.na(data[, c(id, period, event)]))) {
stop("PLPP cannot currently handle missing data in the id, period, or event variables")
}
Browse[2]> ## this checks whether any of id, period, or event contain missing values
Browse[2]> ## if there were missing values, the function would return the error message
Browse[2]> ## written above
Browse[2]> ## is.na() returns TRUE/FALSE for each cell, any() says, are there *any* TRUE
Browse[2]>
debug: NULL
Browse[2]> ## NULL is returned because the condition for the if() statement was not met
Browse[2]> ## this means that R will skip over the expression inside the braces { }
Browse[2]> ## and continue evaluation
Browse[2]>
debug: switch(match.arg(direction), period = {
index <- rep(1:nrow(data), data[, period])
idmax <- cumsum(data[, period])
reve <- !data[, event]
dat <- data[index, ]
dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
dat[, event] <- 0
dat[idmax, event] <- reve
}, level = {
tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[,
period]), "i"]))
dat <- data[index, ]
dat[, event] <- as.integer(!dat[, event])
})
Browse[2]> ## Now the debugger is showing us the next line of code that is going to
Browse[2]> ## be debugged. On the surface, this may seem like more than one line
Browse[2]> ## but it is all one call to the function, switch() so it is one line to R
Browse[2]> ## in order to examine each step more closely, we will need to turn on
Browse[2]> ## debugging for the switch function too (or everything in there would fly by)
Browse[2]> debug(switch)
Browse[2]> ## now we can continue
Browse[2]>
debug: index <- rep(1:nrow(data), data[, period])
Browse[2]> ## notice we are now debugging inside of switch()
Browse[2]> ## so the first thing that happened is based on the argument, direction
Browse[2]> ## the switch function jumps to the code after period or the code after level
Browse[2]> ## it will only pick one, and what is not chosen, is not evaluated
Browse[2]> direction
[1] "period"
Browse[2]> ## since direction == "period", we are in the 'period' track
Browse[2]> ## the first line (above) of the period track created an index
Browse[2]> ## this is used to replicate each row of the original dataset as many times as
Browse[2]> ## necessary.
Browse[2]> ## for 1 through n rows of the dataset, it repeats the row number
Browse[2]> ## the number of times in the variable period
Browse[2]> ## so for example
Browse[2]> data[1:5, period]
[1] 1 2 1 1 12
Browse[2]> ## the first row will be repeated 1 time, the second 2 times the third 1 time
Browse[2]> ## the fourth 1 time and the fifth 12 times
Browse[2]> ## on a small scale this looks like:
Browse[2]> rep(1:5, data[1:5, period])
[1] 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5
Browse[2]> ## These values are then stored in the object, index
Browse[2]> ls()
[1] "data" "direction" "event" "id" "period"
Browse[2]>
debug: idmax <- cumsum(data[, period])
Browse[2]> index[1:20]
[1] 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5 6 7 7
Browse[2]> ## note that the first 20 elements of index match the first few from our example above
Browse[2]> ## next we create an object called 'idmax'
Browse[2]> ## this contains the cumulative sum of the period variable in the original data
Browse[2]> ## this will serve as an index of the row numbers in the person-period data
Browse[2]> ## where the event should occur (unless it is censored)
Browse[2]> ## again just for the first 5 rows of the person-level data
Browse[2]> cumsum(data[1:5, period])
[1] 1 3 4 5 17
Browse[2]> ## compare this with the index
Browse[2]> index[1:20]
[1] 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5 6 7 7
Browse[2]> ## the first row is repeated once, so its event occurs in row 1 of the person-period data
Browse[2]> ## the second row is repeated twice, 1 + 2 = 3 so the event for row 2 of the person-level
Browse[2]> ## appears in row 3 of the person-period, and so on and so forth for all the rows
Browse[2]> ## Now for the event, we want it to be 1 if the event occurs or 0 otherwise
Browse[2]> ## However, if the observation is censored, (1 for censored, 0 for not censored)
Browse[2]> ## then the event should not occur
Browse[2]> ## in short, for a given observation, the last period should contain the event
Browse[2]> ## if it is not censored
Browse[2]> ## R stores TRUE/FALSE internally as 1/0, so if we want an event (1) if censored is FALSE (0)
Browse[2]> ## we can just negate the value in the censored variable to use as the value for the event
Browse[2]> ## variable. This is what the next line of code does
Browse[2]>
debug: reve <- !data[, event]
Browse[2]> ## now we create the person-period data, 'dat', using the index we created (replicates rows)
Browse[2]>
debug: dat <- data[index, ]
Browse[2]> ## the next line of code creates a sequence (seq_along) of period from the
Browse[2]> ## person-period data, _for each ID_
Browse[2]> ## For example, for the first row, it would just be 1 (since that was only repeated once)
Browse[2]> ## for the second row, it would be 1 2 (since that row was repeated twice)
Browse[2]> ## in particular, the variable will be an integer class starting at 1 going to whatever the
Browse[2]> ## number of times a particular ID was repeated in the person-period data
Browse[2]> ## (this should always be the same as the value in the period variable from the
Browse[2]> ## person level data, because that is what _determined_ how many times the rows were repeated
Browse[2]> ## in the person-period dataset)
Browse[2]>
debug: dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
Browse[2]> ## now since events only occur (if they occur at all) in the last
Browse[2]> ## period of observation for a given ID, we will set all events equal to 0
Browse[2]>
debug: dat[, event] <- 0
Browse[2]> ## now using the idmax index (which contains the rows of the last period for each ID)
Browse[2]> ## and the reve object which is just the negation of whether an ID was censored
Browse[2]> ## we can overwrite the relevant rows of the event variable in the new person-period
Browse[2]> ## dataset with the appropriate values (TRUE/FALSE)
Browse[2]> ## however, since we set everything to 0, it is currently a numeric variable
Browse[2]> ## and since we are only replacing a few rows of it, the TRUE/FALSE values will
Browse[2]> ## be upgraded from logical (a lower class) to numeric (a higher class)
Browse[2]> ## which means they will be converted to their underlying 1/0 values
Browse[2]> ## basically R is quietly doing
Browse[2]> reve[1:10]
[1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
Browse[2]> as.integer(reve[1:10])
[1] 1 1 1 1 0 1 0 1 1 1
Browse[2]>
debug: dat[idmax, event] <- reve
Browse[2]> ## once that is done the 'dat' variable contains the finalized person-period data
Browse[2]> ## the only thing left to do is since rows were repeated
Browse[2]> ## the rownames will be rather strange (orig.number_of_repeats)
Browse[2]> ## for example 1.1
Browse[2]> ## 2.1 and 2.2, etc.
Browse[2]> ## we will simply set them to NULL so that R will use the default
Browse[2]> ## which is just an index from 1 to the total number of rows
Browse[2]>
debug: rownames(dat) <- NULL
Browse[2]> ## (note that we jumped out of the switch debugging because it was done)
Browse[2]> ## now we can just return the object 'dat' which is the converted dataset
Browse[2]>
debug: return(dat)
Browse[2]> ## and we are done
Browse[2]>
exiting from: PLPP(teachers, "id", "t", "censor", "period")
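The coercion described in the log can be seen in isolation with a short sketch. The vectors x and y below are made up for illustration: assigning logical values into part of a numeric vector silently converts them to 1/0, while negating a 0/1 vector produces logical values.

```r
## A numeric vector of zeros, like the event column after it is set to 0
x <- rep(0, 5)

## Assigning TRUE/FALSE into a few positions coerces them to 1/0;
## the vector itself stays numeric
x[c(2, 5)] <- c(TRUE, FALSE)
class(x)
x

## Negating a 0/1 vector gives logical values: !0 is TRUE and !1 is FALSE
y <- !c(0, 1)
y
```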
Now we will do an abbreviated version of the same, going from person-period to person-level. Only the differences will be highlighted during the debugging.
test2 <- PLPP(teachers.pp, "id", "period", "event", "level")
debugging in: PLPP(teachers.pp, "id", "period", "event", "level")
debug: {
stopifnot(is.matrix(data) || is.data.frame(data))
stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
if (any(is.na(data[, c(id, period, event)]))) {
stop("PLPP cannot currently handle missing data in the id, period, or event variables")
}
switch(match.arg(direction), period = {
index <- rep(1:nrow(data), data[, period])
idmax <- cumsum(data[, period])
reve <- !data[, event]
dat <- data[index, ]
dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
dat[, event] <- 0
dat[idmax, event] <- reve
}, level = {
tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[,
period]), "i"]))
dat <- data[index, ]
dat[, event] <- as.integer(!dat[, event])
})
rownames(dat) <- NULL
return(dat)
}
Browse[2]> ## Same as before
Browse[2]>
debug: stopifnot(is.matrix(data) || is.data.frame(data))
Browse[2]>
debug: stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))
Browse[2]>
debug: if (any(is.na(data[, c(id, period, event)]))) {
stop("PLPP cannot currently handle missing data in the id, period, or event variables")
}
Browse[2]>
debug: NULL
Browse[2]>
debug: switch(match.arg(direction), period = {
index <- rep(1:nrow(data), data[, period])
idmax <- cumsum(data[, period])
reve <- !data[, event]
dat <- data[index, ]
dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
dat[, event] <- 0
dat[idmax, event] <- reve
}, level = {
tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[,
period]), "i"]))
dat <- data[index, ]
dat[, event] <- as.integer(!dat[, event])
})
Browse[2]> ## Now we are entering switch() again, this time it will go to the
Browse[2]> ## level track because
Browse[2]> direction
[1] "level"
Browse[2]> ## the argument direction matches a different value
Browse[2]> ## the first thing we do is create a small temporary dataset
Browse[2]> ## it just contains the columns necessary for the conversion (i.e., period and id)
Browse[2]> ## the reason is that in theory, the data could have had 100s or 1000s of columns
Browse[2]> ## containing other variables measured along with the period and event
Browse[2]> ## we are only going to extract one row for each ID, the final period it was observed
Browse[2]> ## but to do this, we need to subset the dataset by ID
Browse[2]> ## rather than subset a potentially enormous dataset
Browse[2]> ## we make a small copy, then once the row indices are known
Browse[2]> ## the final data returned is just the appropriate rows from the full data
Browse[2]>
debug: tmp <- cbind(data[, c(period, id)], i = 1:nrow(data))
Browse[2]> ## Now that we have our small working data set, we need to find the row indices
Browse[2]> ## that contain the highest value of period for each ID
Browse[2]> ## we use the by() function which applies the function we specify
Browse[2]> ## to the tmp data, subset by each unique id
Browse[2]> ## the final results (which are in a special class called 'by' that
Browse[2]> ## is not unlike a list, but is not quite a list) are converted to a vector
Browse[2]> ## using the as.vector() function
Browse[2]>
debug: index <- as.vector(by(tmp, tmp[, id], FUN = function(x) x[which.max(x[,
period]), "i"]))
Browse[2]> ## now we can create the person-level data set by just extracting the relevant rows
Browse[2]> ## using our index variable
Browse[2]>
debug: dat <- data[index, ]
Browse[2]> index[1:20]
[1] 1 3 4 5 17 18 30 31 33 35 42 54 55 67 79 81 93 94 97 99
Browse[2]> ## in fact this should look quite familiar to us
Browse[2]> ## it is the same as idmax from the other direction created with cumsum
Browse[2]> ## the final step is to convert from whether or not an event occurred in the final
Browse[2]> ## observed period to whether the observation is censored
Browse[2]> ## just like we negated censoring to create the event, we can negate the event to create
Browse[2]> ## the censoring
Browse[2]> ## the only new trick is that because we are replacing the entire column
Browse[2]> ## if we want 1/0s instead of TRUE/FALSE, we will need to be explicit
Browse[2]> ## this is done using the as.integer() function
Browse[2]>
debug: dat[, event] <- as.integer(!dat[, event])
Browse[2]> ## now as before we will reset the rownames to their default by setting to NULL
Browse[2]>
debug: rownames(dat) <- NULL
Browse[2]> ## and we can return 'dat' which is the person-level dataset created from
Browse[2]> ## the person-period data
Browse[2]>
debug: return(dat)
Browse[2]>
exiting from: PLPP(teachers.pp, "id", "period", "event", "level")
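The "level" track can be sketched the same way on a small made-up person-period dataset. The data and the object names pp, tmp, last, and pl are hypothetical; by, which.max, and as.integer are used as in the function.

```r
## Hypothetical person-period toy data
pp <- data.frame(
  id     = c(1, 1, 2, 2, 2, 3),
  period = c(1, 2, 1, 2, 3, 1),
  event  = c(0, 1, 0, 0, 0, 1)
)

## Small working copy holding only period, id, and the original row number
tmp <- cbind(pp[, c("period", "id")], i = seq_len(nrow(pp)))

## For each id, find the row number of its highest period
last <- as.vector(by(tmp, tmp[, "id"],
  FUN = function(x) x[which.max(x[, "period"]), "i"]))

## Keep those rows and turn the event indicator into a censoring indicator
pl <- pp[last, ]
pl$event <- as.integer(!pl$event)
rownames(pl) <- NULL
pl
```

Here last matches cumsum(c(2, 3, 1)), the same relationship between index and idmax noted in the log.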
Finally, to avoid entering the debugger every time you use the function (until you close and restart R), use the undebug function.
undebug(PLPP)
undebug(switch)
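If you are unsure whether a function is still flagged for debugging, base R's isdebugged function reports the current state. The sketch below uses a hypothetical stand-in function, demo, rather than PLPP, so that it runs on its own.

```r
## Hypothetical stand-in for PLPP, used only to illustrate the debug flag
demo <- function(x) x + 1

## Setting the flag does not run the function; it only marks it
debug(demo)
flag_on <- isdebugged(demo)

## Clearing the flag restores normal behavior
undebug(demo)
flag_off <- isdebugged(demo)

c(flag_on = flag_on, flag_off = flag_off)
```

As an alternative, debugonce, also in base R, flags a function for a single call only, so no undebug step is needed afterwards.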
That takes you from person-period to person-level and from person-level to person-period, there and back again.
