comp.soft-sys.sas - The SAS statistics package.
Has anyone on this list ever used k-fold cross-validation to develop a predictive model? I raised this question before but didn't get a satisfying answer. My question is: you've divided your sample into 10 subsets, and used 9 subsets to train and 1 subset to validate. You repeat the process 10 times and end up with 10 models and 10 validation results. Now how do you consolidate the 10 models (each model may contain different variables)? A related question: it has been suggested to use the average of the 10 validation results as the ultimate model performance figure. Fine, but which one of the 10 models does that average performance measure describe? Thanks
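For what it's worth, the usual resolution is that the k validation results measure the modeling *procedure*, not any one of the 10 fitted models: you average them to estimate how the procedure performs, then refit the procedure once on the full sample to get the single model you actually report. A minimal sketch in Python (not SAS) — the nearest-midpoint "classifier" here is a made-up stand-in for whatever fitting and variable selection the real procedure does:

```python
import random

# Illustration only (Python, not SAS); the cutoff "model" below is a
# made-up stand-in for the real fitting/variable-selection procedure.
random.seed(1)
data = [(random.gauss(0, 1), 'A') for _ in range(50)] + \
       [(random.gauss(3, 1), 'B') for _ in range(50)]
random.shuffle(data)

def fit(train):
    # "model" = cutoff halfway between the two group means
    mean_a = sum(x for x, g in train if g == 'A') / sum(g == 'A' for _, g in train)
    mean_b = sum(x for x, g in train if g == 'B') / sum(g == 'B' for _, g in train)
    return (mean_a + mean_b) / 2

def error_rate(cutoff, test):
    return sum((x > cutoff) != (g == 'B') for x, g in test) / len(test)

k = 10
folds = [data[i::k] for i in range(k)]      # one fixed partition into k folds
cv_errors = [error_rate(fit([obs for j in range(k) if j != i for obs in folds[j]]),
                        folds[i])
             for i in range(k)]

cv_estimate = sum(cv_errors) / k   # estimates the PROCEDURE's error rate
final_model = fit(data)            # the single model you actually report
```

So the average does not "belong" to any one of the 10 models; it is the expected performance of refitting the same procedure on data like yours.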
Is it possible to perform v-fold cross-validation using some standard procedure in SAS to determine the optimal number of clusters? What else could be used to determine the optimal number of clusters in a dataset?
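One generic (non-SAS) alternative is the "elbow" heuristic: cluster the data for several candidate numbers of clusters and watch where the within-cluster sum of squares stops dropping sharply. A pure-Python sketch on made-up 1-D data — the tiny k-means here is purely illustrative, not a SAS procedure:

```python
import random

# Elbow heuristic on made-up 1-D data with three well-separated groups.
random.seed(7)
pts = ([random.gauss(0, 0.3) for _ in range(30)] +
       [random.gauss(5, 0.3) for _ in range(30)] +
       [random.gauss(10, 0.3) for _ in range(30)])

def kmeans_wss(pts, k, iters=20):
    """Run a basic k-means, return the within-cluster sum of squares."""
    centers = random.sample(pts, k)
    groups = [pts]                      # placeholder before the first pass
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            groups[min(range(k), key=lambda j: (p - centers[j]) ** 2)].append(p)
        centers = [sum(g) / len(g) if g else random.choice(pts) for g in groups]
    return sum((p - c) ** 2 for g, c in zip(groups, centers) for p in g)

wss = {k: kmeans_wss(pts, k) for k in range(1, 6)}
# pick the k after which wss stops dropping sharply (the "elbow")
```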
I'm considering using a cross-validation technique for a multivariate logistic regression model. Is there a formula for determining the optimal number of subsamples into which I should divide my sample? I'm curious whether it would be appropriate to use a variation of the optimal-allocation formula used in stratified sampling plans to calculate stratum sizes that minimize the variance of some variable of interest X. Thanks, Robert
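As far as I know there is no closed-form rule: more folds mean larger training sets (less bias) but more computation and more overlap between training sets, and k = 5 or 10 is the conventional compromise. A trivial sketch (Python; the sample size n is hypothetical, not from the post) of how the split sizes move with k:

```python
# n is a hypothetical sample size, purely for illustration.
n = 200
splits = {k: (n - n // k, n // k) for k in (2, 5, 10, 20, n)}  # k = n is leave-one-out
for k, (train_n, valid_n) in splits.items():
    print(f"{k:>3}-fold: train on {train_n}, validate on {valid_n}")
```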
Hi All,

Below is some code from my attempt to incorporate some of David Cassell's suggestions for doing k-fold CV from the article "Don't Be Loopy". However, I believe my train and test datasets from step 3 are not really set up for a true k-fold cross-validation. For example, given my scenario of 20 observations, 10 (group A) and 10 (group B), and 10-fold CV, within each replication a partitioning of the 20 observations should give 10 sets of 2 observations. Thus the classifier will be built with 18 observations and tested on 2. Each of the 10 sets should take a turn being classified, and the average of the error rates is one estimate of the true error rate. The code as it stands is not doing that. In step 3, the samprate option allows me to set up the appropriate groupings; however, the rep option is partitioning a new set of 10 sets of 2 observations each time, rather than giving an indicator that would allow me to perform a true 10-fold CV. Any suggestions would be appreciated.

Thanks, Song

%let nsize = 20;
%let nsim  = 2;
%let nrep  = 3;

/** step 1: create 2 normal populations **/
data population;
   seed1 = 234;
   seed2 = 31;
   do i = 1 to 1000;
      y = rannor(seed1);
      group = 'A';
      output;
      group = 'B';
      y = 3 + rannor(seed2);
      output;
   end;
   keep group y;
run;

proc sort data=population;
   by group;
run;

/** step 2: sampling of size N from the population **/
ods listing close;
proc surveyselect data=population method=srs n=&nsize
                  rep=&nsim seed=1953 out=Sample;
   strata group;
run;
ods listing;

data sample;
   set sample;
   sim = replicate;
   keep sim group y;  /* the duplicate KEEP GROUP Y statement dropped SIM,
                         which would break the BY SIM steps below */
run;

proc sort data=sample;
   by sim;
run;

/* step 3: generate the k-fold cross-validation sample */
%let k = 10;
%let rate = %sysevalf((&k - 1) / &k);

proc surveyselect data=sample out=nsample seed=495857 noprint
                  samprate=&rate outall rep=&nrep;
   strata group;
run;

proc sort data=nsample;
   by replicate group;
run;

data train(where=(selected = 1)) test(where=(selected = 0));
   set nsample;
run;

/*** obtain estimate of misclassification error rate using k-fold CV ***/
ods listing close;
proc discrim data=train outstat=KFOLD method=normal pool=test;
   by sim replicate;
   class group;
   priors prop;
   var y;
run;
ods listing;

ods listing close;
ods output ErrorTestClass=KFCV;
proc discrim data=KFOLD testdata=test;
   by sim replicate;
   class group;
   var y;
run;
ods listing;
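A sketch (in Python, not SAS) of the partition Song is after: each observation gets one stratified fold indicator per replication, and the held-out fold rotates, so all 10 train/test splits reuse the same partition instead of drawing a fresh 90% sample each time. Seeds and the classifier placeholder are illustrative only.

```python
import random

# 20 toy observations: 10 from group A (mean 0), 10 from group B (mean 3).
random.seed(1953)
k = 10
sample = [('A', random.gauss(0, 1)) for _ in range(10)] + \
         [('B', 3 + random.gauss(0, 1)) for _ in range(10)]

# Stratified fold assignment: within each group, shuffle the indices and
# deal them out into the k folds, so every fold gets one A and one B.
fold_of = {}
for grp in ('A', 'B'):
    idx = [i for i, (g, _) in enumerate(sample) if g == grp]
    random.shuffle(idx)
    for pos, i in enumerate(idx):
        fold_of[i] = pos % k

# Rotate the held-out fold over the SAME partition.
for held_out in range(k):
    train = [sample[i] for i in range(len(sample)) if fold_of[i] != held_out]
    test = [sample[i] for i in range(len(sample)) if fold_of[i] == held_out]
    # ... build the classifier on `train` (18 obs), score it on `test`
    # (2 obs: one A, one B), and average the k error rates ...
```

The key difference from drawing SAMPRATE=0.9 samples repeatedly is that the fold indicator fixes one partition, so each observation is held out exactly once per replication.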