What kinds of problems can I solve with Unix-like Commands?
This page will show a number of examples using Unix-like commands to perform ASCII data file management tasks. The ASCII data file that these examples are based on can be downloaded here.
- Example 1: Counting number of rows in a data file
- Example 2: Displaying the length of the longest record in a data file
- Example 3: Translating a word to another
- Example 4: Printing the first few lines of a file
- Example 5: Stacking files
- Example 6: Splitting a big file
- Example 7: Searching lines with specific characters
Example 1: Counting number of rows in an ASCII (text) data file
Let’s say you have a huge file on your machine and you are going to input it into a statistical software. You may want to know before hand the number of records that the file has so that you can double check once the data file is read in. The example below shows how to use the wc (for word count) command to show the number of records.
D:\temp>ls -l test.csv -rw-rw-rw- 1 user group 11599 Sep 24 2002 test.csv
D:\temp>wc -l test.csv 407 test.csv
D:\temp>more test.csv MPG,ENGINE,HORSE,WEIGHT,ACCEL,YEAR,ORIGIN,CYLINDER,FILTER_$ 18,307,130,3504,12,70,1,8,0 15,350,165,3693,12,70,1,8,0 18,318,150,3436,11,70,1,8,0 16,304,150,3433,12,70,1,8,0 17,302,140,3449,11,70,1,8,0 15,429,198,4341,10,70,1,8,0 14,454,220,4354,9,70,1,8,0 14,440,215,4312,9,70,1,8,0 14,455,225,4425,10,70,1,8,0 15,390,190,3850,9,70,1,8,0 #NULL!,133,115,3090,18,70,2,4,1 #NULL!,350,165,4142,12,70,1,8,0 #NULL!,351,153,4034,11,70,1,8,0 #NULL!,383,175,4166,11,70,1,8,0 ...............................
ls -l: displaying the file attributes, such as access privilege, file size and the date the file was created;
wc -l: displaying the number of records of the file;
more: displaying the file one page at a time until you hit key q to quit.
Now we know that file test.csv has 407 rows with the first row being variable names.
Example 2: Displaying the length of the longest record in a data file
Very often we need to know the length of the longest record in order to input the data file correctly. Again we can use wc (for word count) to get the information.
D:temp>wc -L test.csv 59 test.csv
Now we know that the length of the longest record of test.csv is 59 characters.
Example 3: Translating a word to another
We have seen from the example above that our data file test.csv has some characters in it. For example, we have “#NULL!” for missing values instead of “.” as most commonly used. We want to change “#NULL!” to simply a dot “.” in the data file. We can use tr command to perform the translation. Command tr takes a few parameters and has a few options.
D:\temp>tr -s '#NULL!' . <test.csv > test_dot.csv
D:\temp>more test_dot.csv MPG,E.GI.E,HORSE,WEIGHT,ACCE.,YEAR,ORIGI.,CY.I.DER,FI.TER_$ 18,307,130,3504,12,70,1,8,0 15,350,165,3693,12,70,1,8,0 18,318,150,3436,11,70,1,8,0 16,304,150,3433,12,70,1,8,0 ........................... .,133,115,3090,18,70,2,4,1 .,350,165,4142,12,70,1,8,0 .,351,153,4034,11,70,1,8,0 .,383,175,4166,11,70,1,8,0 .,360,175,3850,11,70,1,8,0 15,383,170,3563,10,70,1,8,0 14,340,160,3609,8,70,1,8,0 .,302,140,3353,8,70,1,8,0 ..........................
tr:
- “< test.csv”: to take the input file test.csv
- “>test_doct.csv”: to output to test_dot.csv
- “-s”: an option to squeeze out repeated output characters
- “‘#NULL!”‘: the first string to be substituted from
- “.”: the second string to be substituted with
more: displaying the file one page at a time until you hit key q to quit.
From the output of command more, we can see that the character string “#NULL!” has been replaced by “.” in the output data file test_dot.csv.
Example 4: Printing the first few lines of a file
Sometimes, it is handy to work with a small subset of the original file. It can save time and it is easy to debug problems with a small data file. The command that we are going to use is head.
D:\temp>head test.csv MPG,ENGINE,HORSE,WEIGHT,ACCEL,YEAR,ORIGIN,CYLINDER,FILTER_$ 18,307,130,3504,12,70,1,8,0 15,350,165,3693,12,70,1,8,0 18,318,150,3436,11,70,1,8,0 16,304,150,3433,12,70,1,8,0 17,302,140,3449,11,70,1,8,0 15,429,198,4341,10,70,1,8,0 14,454,220,4354,9,70,1,8,0 14,440,215,4312,9,70,1,8,0 14,455,225,4425,10,70,1,8,0
head: will print out the first 10 lines.
D:temp>head -20 test.csv > test20.csv D:temp>wc -l test20.csv 20 test20.csv
head: with option “-20” will print out the first 20 lines, but we now redirect the printout to a file called test20.csv.
wc: “-l” is an option to print out the total number of lines in the file.
So we see that in the file test20.csv created by command head has 20 lines.
Stacking files on top of each other is a very common task in data management. You want to make sure that the files that are going to be stacked have same number of variables and the variables are positioned the same in each of the files.
In this example, we have two data files, test.csv and test2.csv. We use command cat (for concatenate) to combine the two files. One thing that we need to pay attention is that data file test.csv has a header row (variable names), while test2.csv only has data rows. Therefore, when the two are stacked together, test.csv has to be on top.
D:\temp>head test.csv MPG,ENGINE,HORSE,WEIGHT,ACCEL,YEAR,ORIGIN,CYLINDER,FILTER_$ 18,307,130,3504,12,70,1,8,0 15,350,165,3693,12,70,1,8,0 18,318,150,3436,11,70,1,8,0 16,304,150,3433,12,70,1,8,0 17,302,140,3449,11,70,1,8,0 15,429,198,4341,10,70,1,8,0 14,454,220,4354,9,70,1,8,0 14,440,215,4312,9,70,1,8,0 14,455,225,4425,10,70,1,8,0 D:temp>head test2.csv 22,250,105,3353,15,76,1,6,1 24,200,81,3012,18,76,1,6,1 23,232,90,3085,18,76,1,6,1 29,85,52,2035,22,76,1,4,1 25,98,60,2164,22,76,1,4,1 29,90,70,1937,14,76,2,4,1 33,91,53,1795,17,76,3,4,1 20,225,100,3651,18,76,1,6,1 18,250,78,3574,21,76,1,6,1 19,250,110,3645,16,76,1,6,1 D:\temp>cat test.csv test2.csv > testall.csv D:\temp>wc -l testall.csv 607 testall.csv D:\temp>wc -l test.csv 407 test.csv D:\temp>wc -l test2.csv 200 test2.csv
head: displaying the first ten lines of a file. We can see that test2.csv does not have a header row but test.csv has.
cat: combining test.csv and test2.csv together vertically and output it to a file called testall.csv.
wc: counting the number of lines in each file to see if the stacking makes sense.
Example 6: Splitting a big file
Sometimes, a file is too big to be manipulated, for example to be transferred from one machine to another. One thing we can do is to split the file into several smaller files and re-assemble them back together later. Command split can be used for this task. Let’s say we need to split our test.csv into two files. Since test.csv has total of 407 lines, the two smaller files can have size about 210 each.
D:\temp>split -l210 test.csv sml_file
D:\temp>ls -l sml_file* -rw-rw-rw- 1 user group 6031 Dec 18 12:19 sml_fileaa -rw-rw-rw- 1 user group 5568 Dec 18 12:19 sml_fileab
D:\temp>wc -l sml_fileaa 210 sml_fileaa
D:\temp>wc -l sml_fileab 197 sml_fileab
wc: counting the number of lines in each file
split:
- “-l210” will be put first 210 lines into a file and next 210 in another file and so on;
- “sml_file” is a prefix for the names of these smaller files.
Example 7: Searching lines with specific characters
Data entry errors are very common with a large data file. Sometimes a numeric variable can contain characters and has to be read in as a character variable. It may be useful to know the number of lines of data that has some specific characters, so we can get a good idea on how severe the data entry error problem is. For example, the data set below has some data entry errors, assuming that everything should be numeric.
123456 23?445 334566 456x77 534345 676767 3435?3 123245 3435k4 4668a3 243354 575784
Let’s say this data set is called problem.txt.
D:temp>more problem.txt 123456 23?445 334566 456x77 534345 676767 3435?3 123245 3435k4 4668a3 243354 575784
D:\temp>grep "[a-z?]" problem.txt >pline.txt
D:temp>more pline.txt 23?445 456x77 3435?3 3435k4 4668a3
D:\temp>wc -l pline.txt 5 pline.txt