This module shows common commands for showing descriptive information and descriptive statistics about data files.
Getting an overview of your file
The sysuse command loads a specified Stata-format dataset that was shipped with Stata. Here we will use the auto data file.
sysuse auto
The describe command shows you basic information about a Stata data file. As you can see, it tells us the number of observations in the file, the number of variables, the names of the variables, and more.
describeContains data from auto.dta obs: 74 vars: 12 17 Feb 1999 10:49 size: 3,108 (99.6% of memory free) ------------------------------------------------------------------------------- 1. make str17 %17s 2. price int %9.0g 3. mpg byte %9.0g 4. rep78 byte %9.0g 5. hdroom float %9.0g 6. trunk byte %9.0g 7. weight int %9.0g 8. length int %9.0g 9. turn byte %9.0g 10. displ int %9.0g 11. gratio float %9.0g 12. foreign byte %9.0g ------------------------------------------------------------------------------- Sorted by:
The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file. Have a look at what it produces below.
codebookmake -------------------------------------------------------------- (unlabeled) type: string (str17) unique values: 74 coded missing: 0 / 74 examples: "Cad. Deville" "Dodge Magnum" "Merc. XR-7" "Pont. Catalina" warning: variable has embedded blanks price ------------------------------------------------------------- (unlabeled) type: numeric (int) range: [3291,15906] units: 1 unique values: 74 coded missing: 0 / 74 mean: 6165.26 std. dev: 2949.5 percentiles: 10% 25% 50% 75% 90% 3895 4195 5006.5 6342 11385 mpg --------------------------------------------------------------- (unlabeled) type: numeric (byte) range: [12,41] units: 1 unique values: 21 coded missing: 0 / 74 mean: 21.2973 std. dev: 5.7855 percentiles: 10% 25% 50% 75% 90% 14 18 20 25 29 rep78 ------------------------------------------------------------- (unlabeled) type: numeric (byte) range: [1,5] units: 1 unique values: 5 coded missing: 5 / 74 tabulation: Freq. Value 2 1 8 2 30 3 18 4 11 5 hdroom ------------------------------------------------------------ (unlabeled) type: numeric (float) range: [1.5,5] units: .1 unique values: 8 coded missing: 0 / 74 tabulation: Freq. Value 4 1.5 13 2 14 2.5 13 3 15 3.5 10 4 4 4.5 1 5 trunk ------------------------------------------------------------- (unlabeled) type: numeric (byte) range: [5,23] units: 1 unique values: 18 coded missing: 0 / 74 mean: 13.7568 std. dev: 4.2774 percentiles: 10% 25% 50% 75% 90% 8 10 14 17 20 weight ------------------------------------------------------------ (unlabeled) type: numeric (int) range: [1760,4840] units: 10 unique values: 64 coded missing: 0 / 74 mean: 3019.46 std. dev: 777.194 percentiles: 10% 25% 50% 75% 90% 2020 2240 3190 3600 4060 length ------------------------------------------------------------ (unlabeled) type: numeric (int) range: [142,233] units: 1 unique values: 47 coded missing: 0 / 74 mean: 187.932 std. dev: 22.2663 percentiles: 10% 25% 50% 75% 90% 157 170 192.5 204 218 turn -------------------------------------------------------------- (unlabeled) type: numeric (byte) range: [31,51] units: 1 unique values: 18 coded missing: 0 / 74 mean: 39.6486 std. dev: 4.39935 percentiles: 10% 25% 50% 75% 90% 34 36 40 43 45 displ ------------------------------------------------------------- (unlabeled) type: numeric (int) range: [79,425] units: 1 unique values: 31 coded missing: 0 / 74 mean: 197.297 std. dev: 91.8372 percentiles: 10% 25% 50% 75% 90% 97 119 196 250 350 gratio ------------------------------------------------------------ (unlabeled) type: numeric (float) range: [2.19,3.89] units: .01 unique values: 36 coded missing: 0 / 74 mean: 3.01486 std. dev: .456287 percentiles: 10% 25% 50% 75% 90% 2.43 2.73 2.955 3.37 3.72 foreign ----------------------------------------------------------- (unlabeled) type: numeric (byte) range: [0,1] units: 1 unique values: 2 coded missing: 0 / 74 tabulation: Freq. Value 52 0 22 1
Another useful command for getting a quick overview of a data file is the inspect command. Here is what the inspect command produces for the auto data file.
inspectmake: Number of Observations ------- Non- Total Integers Integers | Negative - - - | Zero - - - | Positive - - - | ----- ----- ----- | Total - - - | Missing 74 +---------------------- ----- -8.99e+307 74(0 unique value) price: Number of Observations -------- Non- Total Integers Integers | # Negative - - - | # Zero - - - | # Positive 74 74 - | # ----- ----- ----- | # Total 74 74 - | # # . . . Missing - +---------------------- ----- 3291 15906 74 (74 unique values) mpg: Number of Observations ------ Non- Total Integers Integers | # Negative - - - | # Zero - - - | # Positive 74 74 - | # # ----- ----- ----- | # # # Total 74 74 - | # # # # . Missing - +---------------------- ----- 12 41 74 (21 unique values) rep78: Number of Observations -------- Non- Total Integers Integers | # Negative - - - | # Zero - - - | # Positive 69 69 - | # # ----- ----- ----- | # # # Total 69 69 - | . # # # # Missing 5 +---------------------- ----- 1 5 74 (5 unique values) hdroom: Number of Observations --------- Non- Total Integers Integers | # Negative - - - | # Zero - - - | # Positive 74 37 37 | # # # ----- ----- ----- | # # # # Total 74 37 37 | # # # # # Missing - +---------------------- ----- 1.5 5 74 (8 unique values) trunk: Number of Observations -------- Non- Total Integers Integers | # Negative - - - | # # Zero - - - | # # Positive 74 74 - | # # # ----- ----- ----- | # # # # # Total 74 74 - | # # # # # Missing - +---------------------- ----- 5 23 74 (18 unique values) weight: Number of Observations --------- Non- Total Integers Integers | # # Negative - - - | # # Zero - - - | # # # # Positive 74 74 - | # # # # ----- ----- ----- | # # # # Total 74 74 - | # # # # # Missing - +---------------------- ----- 1760 4840 74 (64 unique values) length: Number of Observations --------- Non- Total Integers Integers | # Negative - - - | # # Zero - - - | # # Positive 74 74 - | # # # ----- ----- ----- | # # # # # Total 74 74 - | # # # # # Missing - +---------------------- ----- 142 233 74 (47 unique values) turn: Number of Observations ------- Non- Total Integers Integers | # Negative - - - | # # Zero - - - | # # # Positive 74 74 - | # # # ----- ----- ----- | # # # # Total 74 74 - | # # # # . Missing - +---------------------- ----- 31 51 74 (18 unique values) displ: Number of Observations -------- Non- Total Integers Integers | # Negative - - - | # Zero - - - | # Positive 74 74 - | # # ----- ----- ----- | # # # # Total 74 74 - | # # # # . Missing - +---------------------- ----- 79 425 74 (31 unique values) gratio: Number of Observations --------- Non- Total Integers Integers | # Negative - - - | # Zero - - - | # Positive 74 - 74 | # # # # ----- ----- ----- | # # # # Total 74 - 74 | # # # # # Missing - +---------------------- ----- 2.19 3.89 74 (36 unique values) foreign: Number of Observations ---------- Non- Total Integers Integers | # Negative - - - | # Zero 52 52 - | # Positive 22 22 - | # ----- ----- ----- | # # Total 74 74 - | # # Missing - +---------------------- ----- 0 1 74 (2 unique values)
The list command is useful for viewing all or a range of observations. Here we look at make, price, mpg, rep78 and foreign for the first 10 observations.
list make price mpg rep78 foreign in 1/10make price mpg rep78 foreign 1. Dodge Magnum 5886 16 2 0 2. Datsun 510 5079 24 4 1 3. Ford Mustang 4187 21 3 0 4. Linc. Versailles 13466 14 3 0 5. Plym. Sapporo 6486 26 . 0 6. Plym. Arrow 4647 28 3 0 7. Cad. Eldorado 14500 14 2 0 8. AMC Spirit 3799 22 . 0 9. Pont. Catalina 5798 18 4 0 10. Chev. Nova 3955 19 3 0
Creating tables
The tabulate command is useful for obtaining frequency tables. Below, we make a table for rep78 and a table for foreign. The command can also be shortened to tab.
tabulate rep78rep78 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 2.90 2.90 2 | 8 11.59 14.49 3 | 30 43.48 57.97 4 | 18 26.09 84.06 5 | 11 15.94 100.00 ------------+----------------------------------- Total | 69 100.00tabulate foreignforeign | Freq. Percent Cum. ------------+----------------------------------- 0 | 52 70.27 70.27 1 | 22 29.73 100.00 ------------+----------------------------------- Total | 74 100.00
The tab1 command can be used as a shortcut to request tables for a series of variables (instead of typing the tabulate command over and over again for each variable of interest).
tab1 rep78 foreign-> tabulation of rep78 rep78 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 2.90 2.90 2 | 8 11.59 14.49 3 | 30 43.48 57.97 4 | 18 26.09 84.06 5 | 11 15.94 100.00 ------------+----------------------------------- Total | 69 100.00 -> tabulation of foreign foreign | Freq. Percent Cum. ------------+----------------------------------- 0 | 52 70.27 70.27 1 | 22 29.73 100.00 ------------+----------------------------------- Total | 74 100.00
We can use the plot option to make a plot to visually show the tabulated values.
tabulate rep78, plotrep78 | Freq. ------------+------------+----------------------------------------------------- 1 | 2 |** 2 | 8 |******** 3 | 30 |****************************** 4 | 18 |****************** 5 | 11 |*********** ------------+------------+----------------------------------------------------- Total | 69
We can also make crosstabs using tabulate. Let’s look at the repair history broken down by foreign and domestic cars.
tabulate rep78 foreign| foreign rep78 | 0 1 | Total -----------+----------------------+---------- 1 | 2 0 | 2 2 | 8 0 | 8 3 | 27 3 | 30 4 | 9 9 | 18 5 | 2 9 | 11 -----------+----------------------+---------- Total | 48 21 | 69
With the column option, we can request column percentages. Notice that about 86% of the foreign cars received a rating of 4 or 5. Only about 23% of domestic cars were rated that highly.
tabulate rep78 foreign, column| foreign rep78 | 0 1 | Total -----------+----------------------+---------- 1 | 2 0 | 2 | 4.17 0.00 | 2.90 -----------+----------------------+---------- 2 | 8 0 | 8 | 16.67 0.00 | 11.59 -----------+----------------------+---------- 3 | 27 3 | 30 | 56.25 14.29 | 43.48 -----------+----------------------+---------- 4 | 9 9 | 18 | 18.75 42.86 | 26.09 -----------+----------------------+---------- 5 | 2 9 | 11 | 4.17 42.86 | 15.94 -----------+----------------------+---------- Total | 48 21 | 69 | 100.00 100.00 | 100.00
We can use the nofreq option to suppress the frequencies, and just focus on the percentages.
tabulate rep78 foreign, column nofreq| foreign rep78 | 0 1 | Total -----------+----------------------+---------- 1 | 4.17 0.00 | 2.90 2 | 16.67 0.00 | 11.59 3 | 56.25 14.29 | 43.48 4 | 18.75 42.86 | 26.09 5 | 4.17 42.86 | 15.94 -----------+----------------------+---------- Total | 100.00 100.00 | 100.00
Note that the order of the options does not matter. Just remember that the options must come after the comma.
tabulate rep78 foreign, nofreq column| foreign rep78 | 0 1 | Total -----------+----------------------+---------- 1 | 4.17 0.00 | 2.90 2 | 16.67 0.00 | 11.59 3 | 56.25 14.29 | 43.48 4 | 18.75 42.86 | 26.09 5 | 4.17 42.86 | 15.94 -----------+----------------------+---------- Total | 100.00 100.00 | 100.00
Generating summary statistics with summarize
For summary statistics, we can use the summarize command. Let’s generate some summary statistics on mpg.
summarize mpgVariable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- mpg | 74 21.2973 5.785503 12 41
We can use the detail option of the summarize command to get more detailed summary statistics.
summarize mpg, detailmpg ------------------------------------------------------------- Percentiles Smallest 1% 12 12 5% 14 12 10% 14 14 Obs 74 25% 18 14 Sum of Wgt. 74 50% 20 Mean 21.2973 Largest Std. Dev. 5.785503 75% 25 34 90% 29 35 Variance 33.47205 95% 34 35 Skewness .9487176 99% 41 41 Kurtosis 3.975005
To get these values separately for foreign and domestic, we could use the by foreign: prefix as shown below. Note that we first had to sort the data before using by foreign:.
sort foreignby foreign: summarize mpg-> foreign= 0 Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- mpg | 52 19.82692 4.743297 12 34 -> foreign= 1 Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- mpg | 22 24.77273 6.611187 14 41
This is not the most efficient way to do this. Another way, which does not require the data to be sorted, is by using the summarize( ) option as part of the tabulate command.
tabulate foreign, summarize(mpg)| Summary of mpg foreign | Mean Std. Dev. Freq. ------------+------------------------------------ 0 | 19.826923 4.7432972 52 1 | 24.772727 6.6111869 22 ------------+------------------------------------ Total | 21.297297 5.7855032 74
Here is another example, showing the average price of cars for each level of repair history
tabulate rep78, summarize(price)| Summary of price rep78 | Mean Std. Dev. Freq. ------------+------------------------------------ 1 | 4564.5 522.55191 2 2 | 5967.625 3579.3568 8 3 | 6429.2333 3525.1398 30 4 | 6071.5 1709.6083 18 5 | 5913 2615.7628 11 ------------+------------------------------------ Total | 6146.0435 2912.4403 69
Summary
Provide information about the current data file, including the number of variables and observations and a listing of the variables in a data file.
describe
Produce codebook like information for the current data file.
codebook
Provide a quick overview of data file.
inspect
List out the variables make and mpg.
list make mpg
Make a table of mpg.
tabulate mpg
Make a two way table of rep78 by foreign.
tabulate rep78 foreign
Produce summary statistics of mpg and price.
summarize mpg price
Produce summary statistics for mpg separately for foreign and domestic cars.
sort foreign by foreign: summarize(mpg)
Produce summary statistics for mpg by foreign (prior sorting not required).
tabulate foreign, summarize(mpg)