*graphs of calender time and study time.

clear
input subj tp censored str11 datestr
1 1 0 "1 jan 1990"
1 2 0 "1 mar 1991"
2 1 1 "1 feb 1990"
2 2 1 "1 feb 1991"
3 1 1 "1 jun 1990"
3 2 1 "31 dec 1991"
4 1 0 "1 sep 1990"
4 2 0 "1 apr 1991"
end

gen date = date(datestr, "dmy")
format date %dmy
hilite subj date, c(L) hilite(censored==0) s(o x) ylab(1 2 3 4)

gen time =0 if tp==1
replace time= (date-date[_n-1])/30.5 if tp==2
hilite subj time, c(L) hilite(censored==0) s(o x) ylab(1 2 3 4) xlab(0 8 12 14 19 24)

*inputting small version of the uis data
use c:\uis.dta, clear

gen id = ID
drop ID

stset time, failure(censor)

*graph of cumulative hazard curve
sts graph, na noborder

*looking at the first 10 observations of the data
list id time censor age ndrugtx treat site herco in 1/10, nodisplay

*logrank test for treat
sts test treat, logrank

*kaplan-meier curves for levels of treat
sts graph, by(treat) noborder 

*logrank test for site
sts test site, logrank

*kaplan-meier curve for levels of site
sts graph, by(site) noborder

*logrank test for herco
sts test herco

*kaplan-meier curve for levels of herco
sts graph, by(herco) noborder

*cox regression for univariate analysis of ndrugtx and age
stcox ndrugtx, nohr
stcox age, nohr

*first big cox regression model
xi: stcox age ndrugtx treat site i.herco, nohr
test _Iherco_2 _Iherco_3
drop _Iherco_2 _Iherco_3 

*eliminating herco
stcox age ndrugtx treat site, nohr

*testing interactions
*generate all the interaction variables
gen age_drug = age*ndrugtx
gen age_treat = age*treat
gen age_site = age*site
gen drug_treat = ndrugtx*treat
gen drug_site = ndrugtx*site
gen treat_site = treat*site

stcox age ndrugtx treat site age_drug, nohr
stcox age ndrugtx treat site drug_treat, nohr
stcox age ndrugtx treat site drug_site, nohr
stcox age ndrugtx treat site age_treat, nohr
stcox age ndrugtx treat site age_site, nohr
stcox age ndrugtx treat site treat_site, nohr

*final model and checking that including interaction provides a better fitting model
stcox age ndrugtx treat site age_site, nohr
lrtest, saving(0)
stcox age ndrugtx treat site, nohr
lrtest, using(0)

*lrtest is significant to the bigger model does fit the data better
*and that will be the final model

*final model for interpretation of hazard ratios
stcox age ndrugtx treat site age_site

*proportionality assumption

*testing proportionality using time-dependent interactions of predictors.
stcox age ndrugtx treat site age_site, nohr tvc(age ndrugtx treat site) texp(ln(_t))

*testing the proportionality assumption using the schoenfeld  and scaled 
*scaled schoenfeld residuals
*H0: proportionality holds

quietly stcox age ndrugtx treat site age_site, schoenfeld(sch*) scaledsch(sca*)

stphtest, detail
stphtest, plot(age)
stphtest, plot(ndrugtx)
stphtest, plot(treat)
stphtest, plot(site)
stphtest, plot(age_site)
drop sch1-sch5 sca1-sca5

*only treat appears to be a potential problem but it is not enough to warrant changing the model

*testing proportionality assumptions using log-log plots
*parallel curves indicate proportionality is upheld
stphplot, by(treat)
stphplot, by(site)

*solution to non-proportionality: stratifying
sort treat
by treat: stcox age ndrugtx site age_site, nohr

*graphing survival curves from cox regression models

stcox age ndrugtx treat site age_site, nohr basesurv(surv0)
gen surv1 = surv0^exp( (-0.0336943*30+0.0364537*5 - 0.2674113*1))
graph surv1 _t, s(.) c(J) sort ylab(0 .1 to 1) xlab(0 200 to 1200)

gen surv2 = surv0^exp( (-0.0336943*30+0.0364537*5))
label variable surv1 "long treatment"
label variable surv2 "short treatment"
graph surv1 surv2 _t, s(..) c(JJ) sort ylab(0 .1 to 1) xlab(0 200 to 1200) 
drop surv0-surv2

*Model Fit
*assessing goodness of fit by looking at the Cox-Snell residuals
*ideal is for cumulative hazard of Cox-Snell residuals to follow 45 degree line
*i.e. have an exponential distribution with a hazard rate of 1

quietly stcox age ndrugtx treat site age_site, nohr mgale(mg)
predict cs, csnell

stset cs, failure(censor)
sts gen H = na
graph H cs cs, c(ll) s(o.) sort xlab(0 1 to 4) ylab(0 1 to 4)
drop mg cs na

*Model fits well except at large values of time which is common with censored data.