sas >> split sample dataset into several folds for cross validation

by dorjetarap » Fri, 08 May 2009 03:17:15 GMT

Hi Gary,

If you're looking to split a file randomly into a specific number of
equally sized files, this approach may work for you. I'm not a
statistician, so I'm not even sure this is a valid way to do it.

It works by dividing the number of observations in your dataset by
the number of datasets wanted and randomly assigning each observation
to a group. A temporary array keeps track of how many records can
still be added to each dataset.

NOTE: Borrowing heavily from, and in some cases blatantly ripping off,
Dorfman techniques.

%let dsetNo = 10 ;

/* Build a larger sample by writing each SASHELP.CLASS record 10 times */
data sample ;
   set sashelp.class ;
   do _n_ = 1 to 10 ;
      output ;
   end ;
run ;

/* View that assigns each record to a random group while keeping group
   sizes equal.  The temporary array holds the remaining quota for each
   group; POKELONG fills the whole array with ceil(obs/&dsetNo) in one call. */
data randV / view=randV ;
   array _ar(&dsetNo) _temporary_ ;
   if _n_ = 1 then call pokelong(
      repeat(put(ceil(divide(obs, &dsetNo)), rb8.), dim(_ar) - 1),
      addrlong(_ar[1])) ;
   set sample nobs=obs end=eof ;
   do while (1) ;
      _group = ceil(ranuni(0) * &dsetNo) ;
      if _ar[_group] then leave ;   /* accept only a group with quota left */
   end ;
   _ar[_group] = _ar[_group] - 1 ;
run ;

proc sort data=randV out=split ;
   by _group ;
run ;

/* Write one dataset per group (SAMPLE1-SAMPLE&dsetNo) with the hash OUTPUT method */
data _null_ ;
   declare hash hh() ;
   hh.definekey('_group', '_n_') ;
   hh.definedata('name', 'age', 'sex') ;
   hh.definedone() ;
   do _n_ = 1 by 1 until (last._group) ;
      set split ;
      by _group ;
      hh.add() ;
   end ;
   hh.output(dataset: cats('SAMPLE', _group)) ;
run ;



2009/5/7 Gary < XXXX@XXXXX.COM >:
> I was wondering if there is any easier way to split the sample dataset
> into 10 folds for cross validation, due to the limited number of events. I am
> using PROC SURVEYSELECT samprate=0.5 to split the dataset into two
> folds, and then do this against each fold to get 4, then again 8. Is
> there any easier way?
>
> Thanks in advance.
>

sas >> split sample dataset into several folds for cross validation

by rhigh » Fri, 08 May 2009 03:38:55 GMT


Gary,

One of many ways is to add a uniformly distributed random number and then rank
the dataset on that number, depending on how many groups you need:

data one;
   set sashelp.class;
   x = ranuni(50709);
run;

proc rank data=one out=two groups=3;
   var x;
   ranks grp;
run;

proc sort data=two;
   by grp;
run;

proc print data=two;
run;

You can also easily add BY variables if you need to group within
classification levels.
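
For example, a quick sketch (not in the code above) that assigns the groups
separately within each Sex level of the same sample:

proc sort data=one;
   by sex;
run;

* groups are formed separately within each level of Sex ;
proc rank data=one out=three groups=3;
   by sex;
   var x;
   ranks grp;
run;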


Robin High
UNMC





On 05/07/2009 10:27 AM, Gary < XXXX@XXXXX.COM > wrote:

I was wondering if there is any easier way to split the sample dataset
into 10 folds for cross validation, due to the limited number of events. I am
using PROC SURVEYSELECT samprate=0.5 to split the dataset into two
folds, and then do this against each fold to get 4, then again 8. Is
there any easier way?

Thanks in advance.

sas >> split sample dataset into several folds for cross validation

by dynamicpanel » Fri, 08 May 2009 08:42:19 GMT

simply add an id variable from ranuni(seed)
ID=round(ranuni(seed)*10);

in your cross-validation,
do i=0 to 9;
subset Data (where=(ID^=i))
end;

sas >> split sample dataset into several folds for cross validation

by NordlDJ » Fri, 08 May 2009 09:06:51 GMT

> -----Original Message-----


Actually, the code above will not do what was requested. It will produce IDs 0 through 10, with the IDs 1-9 having approximately 1/10 of the cases each, and IDs 0 and 10 having approximately 1/20 of the cases each. CEIL() or FLOOR() are better choices:

ID=ceil(ranuni(seed)*10); **-- range 1 to 10;
ID=floor(ranuni(seed)*10); **-- range 0 to 9;
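
A minimal sketch of how such an assignment might drive the full
cross-validation loop; the dataset name MYDATA and the modeling step are
placeholders, not taken from the thread:

%let k = 10;

* assign each observation to one of &k folds ;
data folded;
   set mydata;                         /* placeholder input dataset */
   fold = ceil(ranuni(12345)*&k);      /* any seed; fold numbers 1 to &k */
run;

%macro kfoldcv;
   %do i = 1 %to &k;
      data train test;
         set folded;
         if fold ne &i then output train;   /* k-1 folds for fitting */
         else output test;                  /* held-out fold */
      run;
      /* fit the model of interest on TRAIN and score TEST here,
         accumulating the validation results across the &k passes */
   %end;
%mend kfoldcv;
%kfoldcv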

Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204

sas >> split sample dataset into several folds for cross validation

by HERMANS1 » Sat, 09 May 2009 06:16:18 GMT

Have you tried the SAMPSIZE option instead of SAMPRATE? You could compute that and save it to a SAS Macro variable.
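For instance, a sketch along these lines (untested, reusing the SAMPLEDATA
name and seed from Gary's code below) would put the per-fold size into a
macro variable and hand it to SAMPSIZE=:

%let k = 10;

* count the observations and compute the per-fold size ;
data _null_;
   if 0 then set sampledata nobs=nobs;
   call symputx('foldsize', ceil(nobs/&k));
   stop;
run;

proc surveyselect data=sampledata out=onefold
      seed=1234567 sampsize=&foldsize;
run;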
S

-----Original Message-----
From: SAS(r) Discussion [mailto: XXXX@XXXXX.COM ] On Behalf Of Gary
Sent: Friday, May 08, 2009 8:37 AM
To: XXXX@XXXXX.COM
Subject: Re: split sample dataset into several folds for cross validation in SAS


Thanks for the reply. I tried it, and I can divide the sample into 10 folds, but the problem is that each fold has a different percentage of responses. In my case, the sample is 36,000 and the total number of responses is 240. After I split into 10 folds, one fold could have 38 responses and another could have only 18. I tried SURVEYSELECT as suggested by David Cassell, but I still have the same problem.

%let K=10;
%let rate=%sysevalf((&K-1)/&K);

proc surveyselect data=sampledata
                  out=xv
                  seed=1234567
                  samprate=&RATE
                  outall
                  rep=&K;
   strata response;
run;

Any suggestion?

Similar Threads

1. split sample dataset into several folds for cross validation in SAS

2. K-fold Cross Validation

Has anyone on this list ever used k-fold cross validation to develop a
predictive model? I raised this question before but didn't get a satisfying
answer.

My question is: you've divided your sample into 10 subsets, and used 9
subsets to train and 1 subset to validate. You repeated the process 10 times
and ended up with 10 models and 10 validation results. Now how are you going
to consolidate the 10 models (each model may contain different variables)?

A related question: it has been suggested to use the average of the 10
validation results as the ultimate model performance figure. Fine, but
which one of the 10 models does the average performance measure intend to
measure?

Thanks

3. How N-fold Cross Validation Works

4. v-fold cross validation in Clustering

Is it possible to perform v-fold cross-validation using some standard
procedure in SAS to determine the optimal number of clusters?

What else could be used to determine the optimal number of clusters in
a dataset?

5. Random sample cross Validation in Proc PLS?

6. cross validation - optimal number of samples

I'm considering using a cross-validation technique for a multivariate
logistic regression model. Is there a formula for determining the optimal
number of subsamples into which I should divide my sample?

I'm curious if it would be appropriate to use a variation of the optimal
allocation formula used to calculate optimal stratum sizes that minimize
the variance of some variable of interest X in stratified sampling plans.

Thanks,

Robert

7. Cross validation question

8. cross validation

Hi All 

Below is some code from my attempt to incorporate some of David
Cassell's suggestions for doing k-fold CV from the "Don't Be Loopy"
article. However, I believe my train and test datasets from step 3
are not really set up for a true k-fold cross validation.

For example, given my scenario of 20 observations, 10 (group A) and 10
(group B), and 10-fold CV, within each replication a partitioning of
the 20 observations should produce 10 sets of 2 observations.  Thus the
classifier will be built with 18 observations and tested on 2.  Each
of the 10 sets should take a turn being classified, and the average of
the error rates is one estimate of the true error rate.

The code as it stands is not doing that.  In step 3, the samprate
option allows me to set up the appropriate groupings; however, the rep
option is partitioning a new set of 10 sets of 2 observations each
time rather than giving me an indicator that would let me perform a true
10-fold CV.

Any suggestions would be appreciated.

Thanks,

Song


%let nsize =20;
%let nsim =2;
%let nrep =3;

/** step 1:  create 2 Normal populations **/
data population;
   seed1=234; seed2=31;
   do i = 1 to 1000;
      y = rannor(seed1);
      group = 'A';
      output;
      group = 'B';
      y = 3 + rannor(seed2);
      output;
   end;
   keep group y;
run;

proc sort data=population;
 by group;
run;

/** step 2:  Sampling of size N from the population **/
ods listing close;
proc surveyselect data=population
         method=srs n=&nsize rep=&nsim
         seed=1953 out=Sample;
      strata group;
run;
ods listing;
data sample;
   set sample;
   sim = replicate;
   keep sim group y;
run;

proc sort data = sample;
 by sim;
run;

/* step 3: generate the k-fold cross-validation sample */
%let k = 10;
%let rate = %sysevalf((&K-1)/&K);

proc surveyselect data=sample out=nsample seed=495857 noprint
  samprate=&rate outall rep=&nrep;
  strata group;
run;

proc sort data = nsample;
 by replicate group;
run;

data train(where=(selected = 1))
     test (where=(selected = 0));
  set nsample;
run;

/*** obtain estimate of misclassification error rate using k-fold CV
**/

ods listing close;
proc discrim data=train outstat=KFOLD method=normal pool=test ;
   by sim replicate ;
   class group;
   priors prop;
   var y;
 run;
ods listing;

ods listing close;
ods output ErrorTestClass = KFCV;
proc discrim data=KFOLD testdata=test;
	  by sim replicate;
      class group;
      var y;
   run;
ods listing;
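
A minimal sketch of the kind of explicit fold indicator described above (not
part of the original post), which could replace the REP= resampling in step 3:
within each simulation the observations of each group are numbered off into
folds 1 through &k, so every observation is held out exactly once:

%let k = 10;

proc sort data=sample out=folded0;
   by sim group;
run;

* cycle fold numbers 1..&k within each sim-by-group cell ;
data folded;
   set folded0;
   by sim group;
   if first.group then _seq = 0;
   _seq + 1;
   fold = mod(_seq - 1, &k) + 1;
   drop _seq;
run;

/* training data for fold &i within a sim: where=(fold ne &i);
   test data for that fold: where=(fold eq &i) */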