* SPSS Syntax Seminar, Part 2:  Getting things done.

data list list / q1 to q5.
begin data.
3	3	.	.	2
2	2	-9	.	1
3	1	2	.	3
4	1	2 .	-9
-8	1	3	.	2
-8	2	1	.	-9
3	-9	4	.	2
4	4	2	.	3
1	1	1	.	1
2	-9	3	.	2
3	3	2	.	5
3	1	1	.	3
-9	4	4	.	2
.	2	4	.	1
2	3	1	.	4
end data.
dataset name example_data.
missing values q1 to q5 (-8 -9).

* note that the syntax above generates many errors, but the
* data are read in correctly.

* creating a unique identifier.
* with system variable.
compute id = $casenum.
formats q1 to id (f5.0).
exe.
list.

* with create command.
compute id1 = 1.
create id1 = csum(id1).
formats id1 (f5.0).
list.

* taking a simple random sample.
set seed 156323669.
compute ran_num = uniform(100) + 1.
sort cases by ran_num.
list id ran_num.

temporary.
select if $casenum le 6.
list id ran_num.

temporary.
n of cases 6.
list id ran_num.

set seed 822916.
temporary.
sample .4.
list id ran_num.

set seed 822916.
temporary.
sample 6 from 15.
list id ran_num.

* exe uses up the temporary sample, so the list below shows all 15 cases.
set seed 822916.
temporary.
sample 6 from 15.
exe.
list id ran_num.

* creating a random identification variable.
set seed 1856256.
compute ran_num1 = rv.normal(100, 10).
desc var = ran_num1.

sort cases by ran_num1.
list id ran_num ran_num1.

* numbering groups consecutively.
* uses scratch variables.
data list list / a.
begin data.
1
2
3
1
2
3
4
1
2
3
4
5
6
1
2
1
2
3
end data.
dataset name scratch.

compute #x = #x + 1.
if a ne 1 #x = lag(#x).
compute x = #x.
exe.
list.
* a scratch variable is needed because it is initialized to 0 and is not
* reinitialized when a new case is read, so its value carries across cases.
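
* a minimal sketch of the point above, assuming the scratch dataset is still active.
* the names #keep and total_a are illustrative and not part of the seminar.
* #keep starts at 0 and keeps its value from case to case.
* total_a therefore ends up as a running sum of a.
compute #keep = #keep + a.
compute total_a = #keep.
exe.
list a total_a.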

if a ne lag(a) x1 = 0.
compute x1 = x1 + x.
leave x1.
exe.
list.


* creating an index number.

data list list / id famid * kidname (A5)  birth age wt * sex (A1).
begin data
1	1	Beth	1	9	60	f
2	1	Bob	2	6	40	m
3	1	Barb	3	3	20	f
4	2	Andy	1	8	80	m
5	2	Al	2	6	50	m
6	2	Ann	3	2	20	f
7	3	Pete	1	6	60	m
8	3	Pam	2	4	40	f
9	3	Phil	3	2	20	m
end data.
dataset name long_to_wide.
list famid kidname age.

compute index1 = 1.
if famid = lag(famid) index1 = lag(index1) + 1.
exe.
list famid index1 kidname age.

* creating a flag variable for complete cases.
dataset activate example_data.

compute comp_flag = 1.
if missing(q1) or missing(q2) or missing(q3) comp_flag = 0.
value labels comp_flag 0 "has at least one missing value" 1 "complete case" .
exe.
freq var = comp_flag.

list q1 q2 q3 comp_flag.

compute nvalid = nvalid(q1, q2, q3).
compute nmissing = nmissing(q1, q2, q3).
list q1 q2 q3 comp_flag nvalid nmissing.

* finding duplicate cases.
* the string versions are joined with concat to build the pair id, and the
* numeric versions are compared to put each pair in a consistent order.
data list list
/ sid1 (A3) sid2 (A3).
begin data.
110 210
514 856
210 110
210 111
693 246
end data.
dataset name duplicates.

recode sid1 (convert) into nid1.
recode sid2 (convert) into nid2.

string pairid (A6).

if (nid1 lt nid2) pairid = concat(sid1, sid2).
if (nid1 gt nid2) pairid = concat(sid2, sid1).

sort cases by pairid.

compute flag = 0.
if pairid = lag(pairid) flag = 1.
value labels flag 0 "not duplicate" 1 "duplicate".
exe.
list.
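
* a hedged alternative sketch, not part of the original seminar.
* aggregate can count how many cases share each pairid.
* the name n_pair is an illustrative choice.
* cases with n_pair greater than 1 belong to a duplicated pair.
aggregate outfile = * mode = addvariables
 /break pairid
 /n_pair = n.
list sid1 sid2 pairid flag n_pair.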

* comparing data sets.
* both data files must be sorted in ascending order on the id variable.
* command introduced in version 21.
data list list / id var1 var2 var3.
begin data
1 1 2 3
2 4 5 6
3 7 8 9
end data.
dataset name filea.

data list list / id var1 var2 var3.
begin data
1 1 2 10
2 4 5 6
3 7 8 9
end data.
dataset name fileb.

dataset activate filea.
compare datasets
/compdataset fileb
/variables = all
/caseid id.

freq var = casescompare.

data list list / id var1 var2 var3.
begin data
1 1 2 3
2 4 5 6
3 7 8 9
end data.
variable labels var1 "This is the first variable.".
dataset name filec.

data list list / id var1 var2 var3.
begin data
1 1 2 3
2 4 5 6
3 7 8 9
end data.
missing values var1 (99).
dataset name filed.

dataset activate filec.
compare datasets
/compdataset filed
/variables = all
/output varproperties = all
/caseid id.

* updating data sets.
* all datasets must be sorted on the key variables 
* (listed on the by subcommand).
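* update replaces values in the master file with the nonmissing values
* from the transaction file for cases that match on the key.
* transaction cases with no match in the master file are added as new cases.
* think updating mailing lists.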
* master data file.
data list list / id m1 m2 m3 m4 female.
begin data
1 25 36 41 56 0
2 26 31 49 55 1
3 22 33 44 56 0
4 17 37 41 59 1
5 23 35 48 54 0
6 29 39 41 51 11
end data.
dataset name master.

data list list / id m1 m2 m3 m4 female.
begin data
4 27 37 41 59 1
6 29 39 41 51 1
7 28 32 45 52 0
end data.
dataset name new_updates.
dataset activate master.
update file = *
/file = new_updates
/by id.
list.
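
* a hedged sketch, not part of the original seminar.
* the in subcommand creates a 0/1 flag for cases found in the file it follows.
* the name in_updates is illustrative.
* re-running the update is harmless here because the values are already current.
update file = *
 /file = new_updates
 /in = in_updates
 /by id.
list id female in_updates.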

* creating a level-2 variable.
data list list / id score1.
begin data.
1 6
1 2
1 4
2 2
2 2
2 5
3 4
3 5
3 6
4 1
4 2
4 3
end data.
dataset name level2.
list.

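sort cases by id.
* outfile = * mode = addvariables adds the group mean to every case.
aggregate outfile = * mode = addvariables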
/break id
/ave_score = mean(score1).
list.

* collapsing across observations.
dataset activate example_data.
sort cases by comp_flag.
aggregate outfile 'd:\data\new.sav'
 /break comp_flag
 /aveq1 = mean(q1).
get file 'd:\data\new.sav'.
list.

dataset activate example_data.
compute gender = 0.
if (mod($casenum, 2) = 1) gender = 1.
compute bvar = 1.
if $casenum gt 5 bvar = 2.
if $casenum gt 9 bvar = 3.
if $casenum gt 12 bvar = 4.
exe.
sort cases by gender bvar.
aggregate outfile = * mode = addvariables
 /break gender bvar
 /sdq1 "standard deviation of q1" = sd(q1)
 /sumq1 "sum of q1" = sum(q1)
 /missq3 "unweighted number of missing values for q3" = numiss(q3)
 /pinq5 "percent of cases between values 2 and 4 for q5" = pin(q5, 2, 4).
list gender bvar q1 sdq1 sumq1 q3 missq3 q5 pinq5.

* leave - similar to SAS's retain statement.
* running total by group.
data list list /
group total.
begin data
1 100
1 150
1 125
1 100
2 200
2 250
2 225
2 200
end data.
dataset name leave_group.
sort cases by group.
if group ne lag(group) rtotal = 0.
compute rtotal = rtotal + total.
leave rtotal.
exe.
list.

if missing(lag(total)) id = 0.
leave id.
compute id = id + 1.
if missing(lag(total)) cumtotal_leave = 0.
leave cumtotal_leave.
compute cumtotal_leave = cumtotal_leave + total.
list.

* review:  3 ways to make a running total.
dataset activate leave_group.

create method1 = csum(total).

if missing(lag(total)) #method2 = 0.
compute #method2 = #method2 + total.
compute method2 = #method2.

if missing(lag(total)) method3 = 0.
leave method3.
compute method3 = method3 + total.

list method1 method2 method3.

* creating dummy variables.
data list free / race.
begin data
1
2
3
4
5
6
.
-9
end data.
missing values race (-9).
dataset name making_indicators.

* make dummies, method 1 .
compute race1=(race=1).
compute race2=(race=2).
compute race3=(race=3).
compute race4=(race=4).
compute race5=(race=5).
compute race6=(race=6).
list.

* make dummies, method 2 .
do repeat a=v1 to v6 /b=1 to 6.
compute a=(race=b).
end repeat.
list.

* the to keyword follows file order, so var1 to var2 below expands to
* var1 var3 id var2.
data list list / var1 var3 id var2.
begin data
3 3 3 3
2 2 2 2
end data.
do repeat v=var1 to var2 /val = 1 3 5 7.
compute v = val.
end repeat print.
exe.
list.

* user-defined missing to sysmis.
data list list / var1 var2 var3 var4.
begin data
1 2 3 4
-99 -88 4 5
-77 -99 6 2
5 9 6 8
end data.
*missing values var1 to var4 (lowest thru -99).
dataset name missing.
do repeat v = var1 to var4.
*if v le -77 v = $sysmis.
if any(v, -99, -88, -77) v = $sysmis.
end repeat.
exe.
list.

* do if.
data list free / var1 var2.
begin data
1 1 
2 1
end data.
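* separate if commands are all evaluated, so the second can overwrite the first.
if (var1 = 1) newvar1 = 1.
if (var2 = 1) newvar1 = 2.
* in do if, only the first true condition is executed.
do if var1 = 1.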
compute newvar2 = 1.
else if var2 = 1.
compute newvar2 = 2.
end if.
exe.
list.

* need to evaluate missing values first.
data list free (",") /a.
begin data
1, , 1,  ,
end data.
dataset name do_if_missing.
compute b = a.
do if missing(b).
compute b1 = 2.
else if b = 1.
compute b1 = 1.
end if.
exe.
list.
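
* a minimal sketch of why the order matters (b2 is an illustrative name).
* when b is missing, the test b = 1 evaluates to missing.
* control then passes straight to end if, so b2 is left system-missing.
do if b = 1.
compute b2 = 1.
else if missing(b).
compute b2 = 2.
end if.
exe.
list.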

* vectors pages 143-144.
* note that the color coding in the syntax editor does not work
* correctly for this example.
data list free
/first second third fourth fifth.
begin data
1 2 3 4 5
10 9 8 7 6
1 4 4 4 2
end data.
dataset name vectors.
compute maxvalue = max(first to fifth).
compute maxcount = 0.
list.

vector vectorvar = first to fifth.
loop #cnt = 5 to 1 by -1.
do if maxvalue = vectorvar(#cnt).
compute maxvar = #cnt.
compute maxcount = maxcount+1.
end if.
end loop.
exe.
list.

* vector vec(4) creates the new variables vec1 to vec4.
* an active dataset must exist before this will run.
vector vec(4).
loop #cnt = 1 to 4.
compute vec(#cnt) = uniform(1).
end loop.
exe.
list.

* loops.
* casewise data from aggregated data (pages 152-153).
data list free / age female male.
begin data
20 2 2
21 0 0
22 1 4
23 3 0
24 0 1
end data.
dataset name loops.
loop #cnt = 1 to sum(female, male).
compute gender = (#cnt > female).
xsave outfile "D:\data\tempdata.sav"
/keep = age gender.
end loop.
exe.
get file "D:\data\tempdata.sav".
list.

* reshaping data.
* wide to long.
* with convenience command.
data list list / id trial1 trial2 trial3 female.
begin data
1 16 14 15 0
2 17 19 12 1
3 16 15 19 1
4 17 18 19 0
5 11 12 17 0
6 14 19 18 1
end data.
dataset name wide_data.

varstocases
 /make trial from trial1 to trial3
 /index = number
 /id = id1.
list.

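* wide to long manually.
* varstocases above has already restructured the active dataset, so re-run the
* data list for wide_data before running this block.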
vector Atrial = trial1 to trial3.
loop number = 1 to 3.
compute trial = Atrial(number).
xsave outfile 'd:\data\w2lm.sav'
  /drop trial1 trial2 trial3.
end loop.
execute.
get file 'd:\data\w2lm.sav'.
list.

* long to wide.
* with convenience command.

casestovars
/id = id
/index = number.
list.

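* long to wide manually.
* reload the long file, since casestovars above has already made the data wide.
get file 'd:\data\w2lm.sav'.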
sort cases by id number.
vector trial(3).
compute trial(number) = trial.
list id trial trial1 trial2 trial3.

* outfile = * replaces the active dataset with one row per id.
aggregate outfile = *
 /break id
 /trial1 to trial3 = max(trial1 to trial3).
list.

* matrix example (with oms).
* http://www.ats.ucla.edu/stat/spss/examples/asa2/chap4.htm .
* table 4.7, page 102.
oms
/select tables
/if 
   subtypes = ['Correlation Matrix of Regression Coefficients']
/destination format = sav
outfile  = 'd:\data\asa2\corr_table4_7.sav'.
oms
/select tables
/if 
   subtypes =  ['Variables in the Equation']
/destination format = sav
outfile  = 'd:\data\asa2\parms_table4_7.sav'.
coxreg foltime 
/method = enter age2 age3 age4
/status folstatus(1)
/print = corr.
omsend.

matrix.
get corr /file="d:\data\asa2\corr_table4_7.sav"
 /variables=age2 age3
/missing = 0.

compute d=nrow(corr).
compute a = make(d+1, d+1, 0).
loop i = 1 to d.
   loop j = i+1 to d+1.
     compute a(i, j) = corr(j-1, i).
     compute a(j, i) = corr(j-1, i). 
   end loop.
end loop.
compute c = ident(d+1) + a.

get se /file="d:\data\asa2\parms_table4_7.sav" 
/variables = se.
compute sigma=mdiag(se).
compute cov = sigma*c*sigma.
print cov.
end matrix.

* begin program end program (with Python).
* looping through two lists of variables.
begin program.
import spss, spssaux
spssaux.OpenDataFile(r'd:\data\elemapi2.sav')
vdict=spssaux.VariableDict()
dlist=vdict.range(start="api00", end="ell")
ilist=vdict.range(start="grad_sch", end="enroll")
ddim = len(dlist)
idim = len(ilist)

if ddim != idim:
    print("The two sequences of variables don't have the same length.")
else:
    # pair the i-th dependent variable with the i-th independent variable
    for mydvar, myivar in zip(dlist, ilist):
        spss.Submit(r"""
            regression /dependent %s
            /method = enter %s.
            """ % (mydvar, myivar))
end program.