/* Suppose that, in a junior high school, there are a total of 4,000 students
   in grades 7, 8, and 9. You want to know how household income and the number
   of children in a household affect students' average weekly spending for
   ice cream. To answer this question, you draw a sample from the student
   population of the junior high school by simple random sampling. You
   randomly select 40 students and ask them their average weekly expenditure
   for ice cream, their household income, and the number of children in their
   household. The answers from the 40 students are saved as a SAS data set. */

data IceCream;
   input Grade Spending Income Kids @@;
   datalines;
7   7  39  2   7   7  38  1   8  12  47  1
9  10  47  4   7   1  34  4   7  10  43  2
7   3  44  4   8  20  60  3   8  19  57  4
7   2  35  2   7   2  36  1   9  15  51  1
8  16  53  1   7   6  37  4   7   6  41  2
7   6  39  2   9  15  50  4   8  17  57  3
8  14  46  2   9   8  41  2   9   8  41  1
9   7  47  3   7   3  39  3   7  12  50  2
7   4  43  4   9  14  46  3   8  18  58  4
9   9  44  3   7   2  37  1   7   1  37  2
7   4  44  2   7  11  42  2   9   8  41  2
8  10  42  2   8  13  46  1   7   2  40  3
9   6  45  1   9  11  45  4   7   2  36  1
7   9  46  1
;
run;

/* In the data set IceCream, the variable Grade indicates a student's grade.
   The variable Spending contains the dollar amount of each student's average
   weekly spending for ice cream. The variable Income specifies the household
   income, in thousands of dollars. The variable Kids indicates how many
   children are in a student's family.
*/

/* First, try OLS regression, which does not account for the sampling scheme.
   Kids is dummy coded, with Kids=4 as the reference level. */
title1 'Ice Cream Spending Analysis';
title2 'OLS Regression estimates';

data reg;
   set IceCream;
   /* Dummy coding: k1, k2, and k3 flag Kids=1, 2, and 3; Kids=4 is all zeros */
   if kids=1 then do; k1=1; k2=0; k3=0; end;
   if kids=2 then do; k1=0; k2=1; k3=0; end;
   if kids=3 then do; k1=0; k2=0; k3=1; end;
   if kids=4 then do; k1=0; k2=0; k3=0; end;
run;

proc reg data=reg;
   /* k1-k3 is a numbered range list, so it does not depend on variable order */
   model Spending = Income k1-k3;
run;

/* Now fit the same model, treating the data as a simple random sample (SRS)
   from a population of 4,000 students. */
title2 'Simple Random Sampling Design';

proc surveyreg data=IceCream total=4000;
   class Kids;
   model Spending = Income Kids / solution;
run;

/* Now suppose that the student sample was actually drawn by stratified
   sampling (STRS). The strata are the grades in the junior high school: 7th
   grade, 8th grade, and 9th grade. Within each stratum, a simple random
   sample is selected. The StudentTotal data set provides the number of
   students in each grade. */
data StudentTotal;
   input Grade _TOTAL_;
   datalines;
7 1824
8 1025
9 1151
;
run;

/* The variable Grade is the stratification variable, and the variable _TOTAL_
   contains the total number of students in each stratum of the survey
   population. PROC SURVEYREG requires you to use the keyword _TOTAL_ as the
   name of the variable that contains the population totals. The following
   statements fit the linear model while incorporating the sample design
   information (stratification). */
title2 'Stratified Simple Random Sampling Design';

proc surveyreg data=IceCream total=StudentTotal;
   strata Grade / list;
   class Kids;
   model Spending = Income Kids / solution;
run;

/* Compared with the statements for the simple random sampling design, the
   TOTAL=StudentTotal option replaces the previous TOTAL=4000 option. When the
   population totals and sample sizes differ among strata, the population
   totals must be provided by a data set.
   The STRATA statement specifies the stratification variable Grade. The LIST
   option in the STRATA statement requests that the stratification information
   be included in the output. */

---------------------------------------------

Ice Cream Spending Analysis              11:19 Wednesday, July 19, 2000   1
OLS Regression estimates

The REG Procedure
Model: MODEL1
Dependent Variable: Spending

Analysis of Variance

                              Sum of          Mean
Source              DF       Squares        Square    F Value    Pr > F
Model                4     915.30965     228.82741      38.10    <.0001
Error               35     210.19035       6.00544
Corrected Total     39    1125.50000

Root MSE              2.45060    R-Square    0.8132
Dependent Mean        8.75000    Adj R-Sq    0.7919
Coeff Var            28.00685

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1      -26.08468       3.07591      -8.48      <.0001
Income         1        0.77533       0.06431      12.06      <.0001
k1             1        0.89765       1.11649       0.80      0.4268
k2             1        1.49403       1.10259       1.36      0.1841
k3             1       -0.51318       1.23855      -0.41      0.6812


Ice Cream Spending Analysis              11:19 Wednesday, July 19, 2000   2
Simple Random Sampling Design

The SURVEYREG Procedure

Regression Analysis for Dependent Variable Spending

Data Summary

Number of Observations            40
Mean of Spending             8.75000
Sum of Spending            350.00000

Fit Statistics

R-square            0.8132
Root MSE            2.4506
Denominator DF          39

Class Level Information

Class
Variable    Levels    Values
Kids             4    1 2 3 4

ANOVA for Dependent Variable Spending

                              Sum of          Mean
Source              DF       Squares        Square    F Value    Pr > F
Model                4       915.310      228.8274      38.10    <.0001
Error               35       210.190        6.0054
Corrected Total     39      1125.500

Tests of Model Effects

Effect       Num DF    F Value    Pr > F
Model             4     119.15    <.0001
Intercept         1     153.32    <.0001
Income            1     324.45    <.0001
Kids              3       0.92    0.4385

NOTE: The denominator degrees of freedom for the F tests is 39.

Estimated Regression Coefficients

                               Standard
Parameter       Estimate          Error    t Value    Pr > |t|
Intercept     -26.084677     2.46720403     -10.57      <.0001
Income          0.775330     0.04304415      18.01      <.0001
Kids 1          0.897655     1.12352876       0.80      0.4292
Kids 2          1.494032     1.24705263       1.20      0.2381
Kids 3         -0.513181     1.33454891      -0.38      0.7027
Kids 4          0.000000     0.00000000        .           .

NOTE: The denominator degrees of freedom for the t tests is 39.
      Matrix X'X is singular and a generalized inverse was used to solve the
      normal equations. Estimates are not unique.


Ice Cream Spending Analysis              11:19 Wednesday, July 19, 2000   4
Stratified Simple Random Sampling Design

The SURVEYREG Procedure

Regression Analysis for Dependent Variable Spending

Data Summary

Number of Observations            40
Mean of Spending             8.75000
Sum of Spending            350.00000

Design Summary

Number of Strata           3

Fit Statistics

R-square            0.8132
Root MSE            2.4506
Denominator DF          37

Stratum Information

Stratum                       Population    Sampling
  Index    Grade    N Obs          Total        Rate
      1        7       20           1824        0.01
      2        8        9           1025        0.01
      3        9       11           1151        0.01

Class Level Information

Class
Variable    Levels    Values
Kids             4    1 2 3 4

ANOVA for Dependent Variable Spending

                              Sum of          Mean
Source              DF       Squares        Square    F Value    Pr > F
Model                4       915.310      228.8274      38.10    <.0001
Error               35       210.190        6.0054
Corrected Total     39      1125.500

Tests of Model Effects

Effect       Num DF    F Value    Pr > F
Model             4     114.60    <.0001
Intercept         1     150.05    <.0001
Income            1     317.63    <.0001
Kids              3       0.93    0.4355

NOTE: The denominator degrees of freedom for the F tests is 37.

Estimated Regression Coefficients

                               Standard
Parameter       Estimate          Error    t Value    Pr > |t|
Intercept     -26.084677     2.48241893     -10.51      <.0001
Income          0.775330     0.04350401      17.82      <.0001
Kids 1          0.897655     1.11778377       0.80      0.4271
Kids 2          1.494032     1.25209199       1.19      0.2404
Kids 3         -0.513181     1.36853454      -0.37      0.7098
Kids 4          0.000000     0.00000000        .           .

NOTE: The denominator degrees of freedom for the t tests is 37.
      Matrix X'X is singular and a generalized inverse was used to solve the
      normal equations. Estimates are not unique.
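
---------------------------------------------

/* A minimal sketch, not part of the original program. The Stratum Information
   table in the listing prints every sampling rate as 0.01, which hides the
   differences among grades. The steps below recompute the rates from the
   sample counts and the StudentTotal data set, and then pass them to PROC
   SURVEYREG through the RATE= option as an alternative to TOTAL=. The names
   GradeCount, StratumRate, and n are hypothetical, chosen for illustration.
   The IceCream and StudentTotal data sets are assumed to exist as created by
   the program at the top of this file. */

/* Count the sampled students in each grade (the listing reports 20, 9, 11) */
proc freq data=IceCream noprint;
   tables Grade / out=GradeCount(keep=Grade Count rename=(Count=n));
run;

/* Sampling rate per stratum: sample size divided by population total.
   20/1824, 9/1025, and 11/1151 are roughly 0.0110, 0.0088, and 0.0096. */
data StratumRate;
   merge GradeCount StudentTotal;
   by Grade;
   _RATE_ = n / _TOTAL_;
run;

proc print data=StratumRate noobs;
run;

/* RATE= reads the stratum sampling rates from a variable named _RATE_.
   Under the assumptions above, this call should lead to the same finite
   population correction as TOTAL=StudentTotal. */
title2 'Stratified Design Specified by Sampling Rates';

proc surveyreg data=IceCream rate=StratumRate(keep=Grade _RATE_);
   strata Grade / list;
   class Kids;
   model Spending = Income Kids / solution;
run;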