How can I read binary data into R?

The code needed to read binary data into R is relatively easy. However, reading the data in correctly requires that you are either already familiar with your data or possess a comprehensive description of the data structure.

In a binary data file, information is stored as groups of binary digits (bits). Each bit is a zero or a one, and eight bits grouped together form a byte. To read binary data successfully, you must know how the pieces of information were encoded into binary. For example, if your data consist of integers, how many bytes represent one integer? If your data contain both positive and negative numbers, how are the two distinguished? How many pieces of information do you expect to find in the file?
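To make these questions concrete, here is a minimal sketch that interprets the same two bytes in several ways. The byte values are made up for illustration and do not come from the example files on this page. readBin accepts a raw vector in place of a connection, which makes small experiments like this easy.

```r
# Two hypothetical bytes, each with every bit set (0xff 0xff).
bytes = as.raw(c(0xff, 0xff))

# Interpret them as two separate 1-byte integers, signed and then unsigned.
one_byte_signed   = readBin(bytes, integer(), n = 2, size = 1, signed = TRUE)
one_byte_unsigned = readBin(bytes, integer(), n = 2, size = 1, signed = FALSE)

# Interpret them as a single 2-byte integer, signed and then unsigned.
two_byte_signed   = readBin(bytes, integer(), n = 1, size = 2, signed = TRUE)
two_byte_unsigned = readBin(bytes, integer(), n = 1, size = 2, signed = FALSE)

print(one_byte_signed)    # -1 -1
print(one_byte_unsigned)  # 255 255
print(two_byte_signed)    # -1
print(two_byte_unsigned)  # 65535
```

The same two bytes yield four different answers, which is why the size and sign conventions of the file must be known or guessed before reading.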

Ideally, you know the answers to these questions before you start reading the binary file. If you do not, you can explore the file using the reading options in R. To get started, we establish a connection to a file and indicate that we will use the connection to read binary data. We do this with the file command, providing first the pathname and then “rb” for “read binary”. For more details, see help(file) in R.

to.read = file("https://stats.idre.ucla.edu/stat/r/faq/bintest.dat", "rb")

Next, we use the readBin command to begin reading. If we think the file contains integers, we can start by reading the first integer and hoping that the default integer size is correct. Different platforms store binary data in different ways: the bytes of a multi-byte value may be ordered with the most significant byte first (big-endian) or the least significant byte first (little-endian), so the same bytes can yield very different values. This characteristic is called “endianness”. The binary files in the examples on this page were written on a PC, which suggests they are little-endian. By default, readBin assumes the byte order of the machine running R, so when a file may have been written on a different platform, specifying the endian argument can be crucial. For example, without endian = "little", running the command below on a big-endian machine (such as an older PowerPC Mac) reads the first integer as 16777216.

readBin(to.read, integer(), endian = "little")
[1] 1
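The value 16777216 mentioned above is simply the integer 1 with its four bytes read in the opposite order. The following sketch reproduces this without the example file by encoding 1 into a raw vector with writeBin and decoding it both ways. The variable names are illustrative only.

```r
# Byte order of the machine running this code: "little" or "big".
print(.Platform$endian)

# Encode the integer 1 as four bytes in little-endian order.
le_bytes = writeBin(1L, raw(), size = 4, endian = "little")
print(le_bytes)   # 01 00 00 00

# Decoding with the matching byte order recovers 1.
as_little = readBin(le_bytes, integer(), size = 4, endian = "little")
print(as_little)  # 1

# Decoding the same bytes as big-endian gives 16777216, which is 2^24.
as_big = readBin(le_bytes, integer(), size = 4, endian = "big")
print(as_big)     # 16777216
```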

Thus, it looks like the first integer in the file is 1. Each readBin call picks up where the previous one stopped, so repeated calls work through the binary file until they reach its end. We can read multiple integers at once by adding the n option to our command. If the n you specify is greater than the number of integers remaining in the file, readBin returns as many as are available, so there is no danger in guessing too large an n. Since we have already read the first integer, this command begins at the second.

readBin(to.read, integer(), n = 4, endian = "little")
[1] 2 3 4 5
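The behavior of successive reads and of an overly large n can be checked on a small temporary file. This is a sketch under the assumption that the file holds nothing but 4-byte integers. The temporary file stands in for a real data file, and dividing the file size by the integer size is one way to estimate how many values to expect.

```r
# Write five 4-byte integers to a temporary file (a stand-in for real data).
tmp = tempfile(fileext = ".dat")
out = file(tmp, "wb")
writeBin(1:5, out, size = 4, endian = "little")
close(out)

# The file size in bytes divided by 4 gives the number of integers stored.
n_available = file.size(tmp) / 4
print(n_available)  # 5

con = file(tmp, "rb")
# The first call reads one value and advances the connection.
first = readBin(con, integer(), n = 1, size = 4, endian = "little")
# Asking for far more values than remain returns only the four that are left.
rest = readBin(con, integer(), n = 100, size = 4, endian = "little")
# At the end of the file, readBin returns a zero-length vector.
at_end = readBin(con, integer(), n = 1, size = 4, endian = "little")
close(con)
unlink(tmp)

print(first)   # 1
print(rest)    # 2 3 4 5
print(at_end)  # integer(0)
```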

If you have additional information about what is in your file, you should incorporate it into the readBin command. For example, if you know that the integers are stored on 4 bytes each, you can indicate this with the size option:

readBin(to.read, integer(), n = 2, size = 4, endian = "little")
[1] 6 7

Similarly, if you know that your file contains characters, complex numbers, or some other type of information, you would adjust the readBin command accordingly, changing integer() to character() or complex(). See help(readBin) in R for more details.
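As an illustration of reading other types, the sketch below writes a character string followed by two double-precision numbers to a temporary file and reads them back. The file contents and variable names are invented for this example. writeBin stores strings as zero-terminated C-style strings, which is the form readBin expects for character().

```r
# Build a small mixed-type file: one string, then two 8-byte doubles.
tmp = tempfile(fileext = ".dat")
out = file(tmp, "wb")
writeBin("score", out)                                     # zero-terminated string
writeBin(c(1.5, -2.25), out, size = 8, endian = "little")  # two doubles
close(out)

# Read the pieces back in the same order, changing the `what` argument.
con = file(tmp, "rb")
label  = readBin(con, character(), n = 1)
values = readBin(con, numeric(), n = 2, size = 8, endian = "little")
close(con)
unlink(tmp)

print(label)   # "score"
print(values)  # 1.50 -2.25
```

The reads must follow the order and types in which the data were written. Reading the doubles as integer() here would run without error but return meaningless values.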

Since you will likely want to do more than just look at what is contained in the binary file, you will need some strategies for formatting data as you read it in. For example, suppose you are given a binary file with the following description: three numeric variables collected from 200 subjects; the three variable names appear first in the file; the numeric values are integers stored on four bytes each; and all of the values for the first variable are followed by all of the values for the second and then all of the values for the third (as if they had been written as columns, not rows). First, open a connection to the data.

newdata = file("https://stats.idre.ucla.edu/stat/r/faq/bindata.dat", "rb")

Next, let’s read in the variable names and save them to a vector in R.

varnames = readBin(newdata, character(), n=3)
varnames
[1] "read"  "write" "math"

To read in the integer values, we can opt to read all 600 (three variables of 200 values each) into one vector, and then separate it out into the three variables.

datavals = readBin(newdata, integer(), size = 4, n = 600, endian = "little")
readvals = datavals[1:200]
writevals = datavals[201:400]
mathvals = datavals[401:600]

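Or, instead of the single read above, we can read in each variable’s values with a separate readBin command. Use one approach or the other: once the values have been read, the connection is at the end of the file and further reads return nothing.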

readvals = readBin(newdata, integer(), size = 4, n = 200, endian = "little")
writevals = readBin(newdata, integer(), size = 4, n = 200, endian = "little")
mathvals = readBin(newdata, integer(), size = 4, n = 200, endian = "little")

Then, we can combine our three value vectors into one matrix with the variable names as our column names. If you need a data frame rather than a matrix, as.data.frame(rdata) will convert it.

rdata = cbind(readvals, writevals, mathvals)
colnames(rdata) = varnames
rdata[1:5,]

     read write math
[1,]   57    52   41
[2,]   68    59   53
[3,]   44    33   54
[4,]   63    44   47
[5,]   47    52   57
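The whole workflow can be tried end to end without the remote file. The sketch below builds a small stand-in file with the same layout (names first, then values column by column), reusing the first four rows printed above as its contents, and reads it back. Because matrix() fills column by column, it can split the single vector of values into variables in one step. All object names here are illustrative.

```r
# Stand-in data: 3 variables on 4 subjects, stored one variable after another.
varnames_in = c("read", "write", "math")
values_in = c(57L, 68L, 44L, 63L,   # read
              52L, 59L, 33L, 44L,   # write
              41L, 53L, 54L, 47L)   # math

# Write the names, then the values as 4-byte little-endian integers.
tmp = tempfile(fileext = ".dat")
out = file(tmp, "wb")
writeBin(varnames_in, out)
writeBin(values_in, out, size = 4, endian = "little")
close(out)

# Read the file back in the same order it was written.
con = file(tmp, "rb")
varnames_out = readBin(con, character(), n = 3)
datavals = readBin(con, integer(), size = 4, n = 3 * 4, endian = "little")
close(con)
unlink(tmp)

# matrix() fills columns first, which matches the layout of the file.
m = matrix(datavals, ncol = 3, dimnames = list(NULL, varnames_out))
df = as.data.frame(m)
print(df)
```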

Lastly, since we have finished reading data from the binary file, we can close the connection.

close(newdata)
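The to.read connection opened earlier can be closed the same way, with close(to.read). When the reading code lives inside a function, one common pattern is to register the close with on.exit so that the connection is released even if an error occurs partway through. The helper below, read_ints, is a hypothetical name used for illustration, and it assumes a file of 4-byte little-endian integers.

```r
# Hypothetical helper: read up to `n` 4-byte little-endian integers from `path`.
read_ints = function(path, n) {
  con = file(path, "rb")
  on.exit(close(con))  # runs when the function exits, even after an error
  readBin(con, integer(), n = n, size = 4, endian = "little")
}

# Try it on a temporary file holding three integers.
tmp = tempfile(fileext = ".dat")
out = file(tmp, "wb")
writeBin(1:3, out, size = 4, endian = "little")
close(out)

vals = read_ints(tmp, n = 10)
unlink(tmp)
print(vals)  # 1 2 3
```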

If you wish to write a binary file from R, see R FAQ: How can I write a binary data file in R?