Progress report-2-Luqing Ren

Data Preprocessing

Predictors

Predictors for clinical diagnoses and procedures were coded under the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM: 001-999, plus E and V codes). To generalize, the first three digits of each code were taken and sorted into groups based on the ICD-9 classification of diseases. This gave 19 categories for each diagnosis and procedure predictor. The cause-of-death predictor was coded under the 10th revision of the International Classification of Diseases (ICD-10-CM: A00-Z99) and categorized into 21 groups according to the first three characters of the ICD-10 codes. For the Clinical Classifications Software (CCS) codes, the top 20 diagnosis and procedure codes were kept as separate categories and the remaining CCS codes were grouped into an “other” category.

The predictors “age_at_baseline”, “height_q1”, “weight_q1”, “bmi_q1”, “allex_hrs_q1”, “allex_life_hrs”, “smoke_totyrs”, “cig_day_avg”, “alchl_g_dayrecen”, “diet”, “total_charges_amt”, “smoke years”, and “length_of_stay” were treated as continuous variables and rescaled to the range 0 to 1. The other predictors were treated as binary or categorical and converted to dummy indicators. Covariates with 90% missing data were excluded, as were redundant covariates.
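The grouping, rescaling, and dummy-coding steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used for this report: the helper names `icd9_group` and `rescale01`, the toy codes and BMI values, and the abbreviated breakpoints (only the first three numeric ICD-9 chapters are split out) are all hypothetical.

```r
# Hypothetical helper: assign each ICD-9-CM code to a coarse group
# based on its first three characters.
icd9_group <- function(codes) {
  prefix <- substr(codes, 1, 3)
  group <- rep(NA_character_, length(codes))
  group[substr(prefix, 1, 1) == "E"] <- "E codes"
  group[substr(prefix, 1, 1) == "V"] <- "V codes"
  is_num <- grepl("^[0-9]{3}$", prefix)
  num <- as.integer(prefix[is_num])
  # Illustrative breakpoints only: the first three numeric chapters are
  # split out and all later chapters are lumped together.
  breaks <- c(0, 139, 239, 279, 999)
  labels <- c("001-139", "140-239", "240-279", "280-999 (not split here)")
  group[is_num] <- as.character(cut(num, breaks = breaks, labels = labels))
  group
}

# Hypothetical helper: min-max rescaling of a continuous variable to [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(ifelse(is.na(x), NA_real_, 0))  # constant column
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy inputs (made up for illustration)
codes <- c("25000", "4109", "V5861", "E8859", "0389")
bmi <- c(18.5, 22.0, 30.5, 27.0, 24.0)

groups <- icd9_group(codes)
bmi_scaled <- rescale01(bmi)

# One dummy (indicator) column per group
dummies <- model.matrix(~ factor(groups) - 1)
```

In a real pipeline the minimum and maximum used for rescaling would normally be taken from the training set and reused on the test set, so that the test data does not influence preprocessing.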

Missing Values

Missing values on continuous predictors were imputed with the median of that predictor. Missing values on categorical predictors were coded as “00”, which creates a separate category for each predictor. Some missing values co-occurred within the same person: for instance, non-smokers had missing values on “smoke_totyrs” and “smoke_yrs_quit”. These missing values were therefore imputed with 0 rather than the median.
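The three imputation rules can be illustrated on a toy data frame. This is only a sketch: the values are made up, the `smoker` indicator and the `diag_group` column are hypothetical names, and the order of operations (zero-fill for non-smokers before taking the median) is an assumption rather than something stated in this report.

```r
# Toy data; values are invented for illustration.
df <- data.frame(
  bmi_q1       = c(22.1, NA, 30.4, 27.5),
  smoke_totyrs = c(NA, 12, NA, 30),
  smoker       = c(FALSE, TRUE, TRUE, TRUE),  # hypothetical ever-smoker flag
  diag_group   = c("07", NA, "02", "07"),     # hypothetical categorical predictor
  stringsAsFactors = FALSE
)

# Rule 1: structural missingness. Non-smokers have no smoking duration,
# so the missing value is set to 0 rather than the median.
df$smoke_totyrs[!df$smoker & is.na(df$smoke_totyrs)] <- 0

# Rule 2: remaining gaps in continuous predictors get the median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
df$bmi_q1 <- impute_median(df$bmi_q1)
df$smoke_totyrs <- impute_median(df$smoke_totyrs)

# Rule 3: missing categorical values become their own "00" category.
df$diag_group[is.na(df$diag_group)] <- "00"
df$diag_group <- factor(df$diag_group)
```

Applying the zero-fill first keeps the structural zeros from being overwritten by the median, although it does pull the median itself toward 0; whether that is desirable depends on the analysis.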

Model Performance

The data set contains 132,538 subjects and 87 variables. It was randomly split into a 70% training set and a 30% test set. LASSO logistic regression with feature selection was performed using the “mlr” package; the penalty on the coefficients (lambda) was tuned by 10-fold cross-validation. A random forest was fitted to the training set using the “randomForest” package, with the Gini index as the criterion for selecting predictors at each split, 500 trees, and 86 variables tried at each split. The area under the receiver operating characteristic (ROC) curve (AUC) was used as the evaluation metric.
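The workflow in this paragraph can be sketched on simulated data. Note the assumptions: the data below are random toy values, not the study cohort; `glmnet::cv.glmnet` is called directly instead of through the “mlr” wrapper used for this report; and the random forest uses the package default for the number of variables tried at each split rather than the 86 reported above. The model-fitting parts run only if the respective packages are installed.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the real cohort
n <- 1000
p <- 5
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- rbinom(n, size = 1, prob = plogis(-1 + x[, 1] - 0.5 * x[, 2]))

# Random 70% / 30% split into training and test sets
train_idx <- sample(seq_len(n), size = round(0.7 * n))
x_train <- x[train_idx, , drop = FALSE]
y_train <- y[train_idx]
x_test  <- x[-train_idx, , drop = FALSE]
y_test  <- y[-train_idx]

# LASSO logistic regression; lambda chosen by 10-fold cross-validation
if (requireNamespace("glmnet", quietly = TRUE)) {
  cv_fit <- glmnet::cv.glmnet(x_train, y_train, family = "binomial",
                              alpha = 1, nfolds = 10)
  lasso_prob <- as.numeric(
    predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")
  )
}

# Random forest with 500 trees (default mtry, unlike the report)
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_fit <- randomForest::randomForest(x = x_train, y = factor(y_train),
                                       ntree = 500)
  rf_prob <- predict(rf_fit, newdata = x_test, type = "prob")[, "1"]
}
```

The predicted probabilities on the held-out 30% (`lasso_prob`, `rf_prob`) are what an AUC comparison between the two models would be computed from.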

library(mlr)  # provides getLearnerModel()
total_lasso <- readRDS("total_lasso.rds")
plot(getLearnerModel(total_lasso))  # LASSO: optimal lambda

lassoroc <- readRDS("lassoroc.rds")  # LASSO ROC curve
lassoroc

rf_roc_test <- readRDS("rf_roc_test.rds")
plot(rf_roc_test, col = 1, main = "ROC", xlim = c(1, 0), ylim = c(0, 1))  # random forest ROC curve

metric_table <- readRDS("metric_table.rds")  # metric table (renamed to avoid masking base::table)
metric_table

##                  AUC Sensitivity Specificity
## LASSO logistic 0.904       0.804       0.833
## Random Forest  0.927       0.825       0.870
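For reference, the three metrics in the table can be computed from predicted probabilities and true labels in base R. This is a minimal sketch, not the code that produced the table: the helper names `auc_rank` and `sens_spec` and the toy scores are hypothetical, and the 0.5 classification threshold is an assumption, since the threshold used for the reported sensitivity and specificity is not stated.

```r
# AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen positive case scores higher than a randomly chosen
# negative case. Average ranks are used for ties.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Sensitivity and specificity at a fixed threshold (0.5 is an assumption).
sens_spec <- function(score, label, threshold = 0.5) {
  pred <- as.integer(score >= threshold)
  c(sensitivity = sum(pred == 1 & label == 1) / sum(label == 1),
    specificity = sum(pred == 0 & label == 0) / sum(label == 0))
}

# Toy predictions (made up for illustration)
score <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
label <- c(1, 1, 1, 0, 0, 0)

auc_value <- auc_rank(score, label)
metrics <- sens_spec(score, label)
```

On the toy data, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9; at the 0.5 threshold, sensitivity and specificity are both 2/3.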

Next steps:

Perform Bootstrap model
Assess death prediction over different time windows (short-term, mid-term, long-term); separate models for each window?
Compare variable importance
Model selection