Please note: This FAQ describes reading files in a UNIX environment; the details may vary from one UNIX system to another.
It can be very efficient to store large raw data files compressed with gzip (as .gz files). Such files are often 20 times smaller than the original raw data file. For example, a raw data file that would take 200 megabytes might compress to as little as 10 megabytes. Let’s illustrate how to read a compressed file with a small example. Consider the data file shown below.
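As a rough illustration of the size savings, you could try something like the following at the shell. The file names are throwaway examples, and the data here is deliberately repetitive, so it compresses far better than the typical 20-to-1 figure; your own data will determine the actual ratio.

```shell
# Build a throwaway raw data file (names are illustrative);
# repetitive data like this compresses far better than typical data
seq 1 100000 | awk '{ print "AMC Concord   22  2930 4099" }' > rawdata.txt

# gzip -c writes the compressed stream to stdout, leaving the original in place
gzip -c rawdata.txt > rawdata.txt.gz

# Compare the sizes in bytes
wc -c rawdata.txt rawdata.txt.gz
```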
AMC Concord   22  2930 4099
AMC Pacer     17  3350 4749
AMC Spirit    22  2640 3799
Buick Century 20  3250 4816
Buick Electra 15  4080 7827
If this were a raw data file called rawdata.txt we could read it using a SAS program like the one shown below.
FILENAME in "rawdata.txt" ;
DATA test;
  INFILE in ;
  INPUT make $ 1-14 mpg 15-18 weight 19-23 price 24-27 ;
RUN;
On most UNIX computers (e.g., Nicco, Aristotle) you could compress rawdata.txt by typing
gzip rawdata.txt &
and this would create a compressed version named rawdata.txt.gz . To read this file into SAS, you would normally first uncompress the file and then read the uncompressed version into SAS. Uncompressing the file can be very time consuming, and it consumes a great deal of disk space. Instead, you can read the compressed file rawdata.txt.gz directly within SAS without having to uncompress it first. SAS can uncompress the file "on the fly" and never creates a separate uncompressed version of the file. On most UNIX computers (e.g., Nicco, Aristotle) you could read the file with a program like this.
FILENAME in PIPE "gzip -dc rawdata.txt.gz" LRECL=80 ;
DATA test;
  INFILE in ;
  INPUT make $ 1-14 mpg 15-18 weight 19-23 price 24-27 ;
RUN;
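To see exactly what the PIPE device hands to SAS, you can run the same gzip command yourself at the shell: -d decompresses and -c sends the result to standard output, so the .gz file on disk is never touched. The file name and data below are only illustrative.

```shell
# Create a small compressed file to demonstrate (illustrative data)
printf 'AMC Concord   22  2930 4099\n' > rawdata.txt
gzip -c rawdata.txt > rawdata.txt.gz
rm rawdata.txt

# -d decompresses, -c writes to standard output;
# no uncompressed copy is ever written to disk
gzip -dc rawdata.txt.gz
```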
In your program, be sure to change LRECL=80 to the width of your raw data file (the length of its longest line of data). If you are unsure how wide the file is, just use a value that is certainly larger than the longest line in your file.
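If you want the exact width rather than a safe guess, one way to find it (a sketch assuming a POSIX shell and awk; the first two lines only build a demo file, and the file names are illustrative) is to pipe the decompressed stream through awk and track the longest line:

```shell
# Create a small compressed example (file names are illustrative)
printf 'AMC Pacer     17  3350 4749\nAMC Concord   22  2930 4099\n' > rawdata.txt
gzip -c rawdata.txt > rawdata.txt.gz

# Print the length of the longest line without ever writing
# an uncompressed copy to disk
gzip -dc rawdata.txt.gz |
  awk '{ if (length($0) > max) max = length($0) } END { print max }'
```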
You would most likely use this technique when you are reading a very large file. You can test your program by reading just a handful of observations with the OBS= option on the INFILE statement; e.g., INFILE in OBS=20; would read just the first 20 observations from your file.
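A similar quick check works at the shell before involving SAS at all: pipe the decompressed stream through head to inspect just the first few lines. The setup lines below only build a small demo file, and the names are illustrative.

```shell
# Build a small compressed demo file (illustrative names)
seq 1 100 > rawdata.txt
gzip -c rawdata.txt > rawdata.txt.gz

# Look at only the first 20 lines; head stops reading after that,
# so a huge file would never be fully decompressed
gzip -dc rawdata.txt.gz | head -20
```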