sas >> Pick random rows from data

by knguyen » Fri, 10 Aug 2007 11:10:17 GMT

Hello,

I have to pick 30.000 random rows from a file of 20 millions rows. The
file is 12GB. I am new at SAS. Any advices, please?

Thanks,

-k


sas >> Pick random rows from data

by Tree Frog » Fri, 10 Aug 2007 11:43:20 GMT


Hi there

I've never done this, but according to the documentation for SAS/STAT,
this should work:

proc surveyselect
data=yourdata
method=srs
n=30000
out=sampledata;
run;

See how it goes. If you don't have STAT licensed, then no doubt there
will be some data step suggestions from others.

TF

sas >> Pick random rows from data

by knguyen » Fri, 10 Aug 2007 11:56:06 GMT

Thank you so much,

Would you mind pointing me to the documentation if it is available
online?

Sincerely,

-k

sas >> Pick random rows from data

by Tree Frog » Fri, 10 Aug 2007 13:40:15 GMT

Sure: from this link: http://support.sas.com/documentation/onlinedoc/index.html
you can get either html or pdf versions of all the SAS documentation
for either v8 or v9.

For the SAS/STAT documentation, use the online documentation =>
contents in the left pane and expand SAS/STAT, then click on PROC
SURVEYSELECT from there.

Good luck - there's a lot to read in there!

Tree Frog

sas >> Pick random rows from data

by nichas » Fri, 10 Aug 2007 17:40:00 GMT


I have a doubt with proc surveyselect. I will suggest to go with it
only if you have sufficient memory for your work directory since it
creates an intermediate utility file in the work and that takes huge
amount of data. Say I had data of about 9 million, and i tried to
create a random sample of 10 observations, it took me about 9GB of
utility file size created in the work directory. So check it, or refer
to other ways of dividing that task if you are having multiple
processors with setting another alternate location in the .cfg
(config) file of SAS installation by using the -utilloc='other path',
so that if workspace is full it would use this other location. But I
dont know if that would work fine.
Multiple processors can help using the inbuilt multi threading
facility with sas9.1
Others can add to it and correct me if I am incorrect.
And if you got through it then I would be happy to know how u did it
as it would help me too.

sas >> Pick random rows from data

by Frank.Boggasch » Fri, 10 Aug 2007 18:17:27 GMT

you may consider an approach like

data all;
do i=1 to 20000000;
output;
end;
run;

data sample;
set all;
retain have need;
if _n_=1 then do;
have=20000000;
need=30000;
end;
ratio=need/have;
if ranuni(0) le ratio then do;
output;
need=need-1;
end;
have=have-1;
drop have need ratio;
run;

-Frank

-----Original Message-----
From: SAS(r) Discussion [mailto: XXXX@XXXXX.COM ]On Behalf Of
knguyen
Sent: Freitag, 10. August 2007 05:10
To: XXXX@XXXXX.COM
Subject: Pick random rows from data


Hello,

I have to pick 30.000 random rows from a file of 20 millions rows. The
file is 12GB. I am new at SAS. Any advices, please?

Thanks,

-k

sas >> Pick random rows from data

by davidlcassell » Wed, 22 Aug 2007 09:52:53 GMT


I would do this using a data step, instead of PROC SURVEYSELECT,
just because you can avoid the utility files which would have to be
maintained for the possible complexities of a survey sample. The
k/n method that I see in your answers will do here.

BUT... why are you doing this? What are the requirements that
you need for these 30K records? Do you really want a pure,
"simple random sample without replacement" method with no
selection priorities and no other features? If you need something
else, you ought to work out the details. You may not be able
to get such a sample easily with a simple data step.

HTH,
David
_________________________________________________________________
Tease your brain--play Clink! Win cool prizes!
http://club.live.com/clink.aspx?icid=clink_hotmailtextlink2

Similar Threads

1. Picking up random numbers

2. How to pick up the random variables in a dataset

In a data set I have 10 variables.just i want to pickup only even varibles
like ex:

data new;
input a1 a2 a3 a4 a5 a6 a7 a8 a9 a10;
cards;
1 2 3 4 5 6 7 8 9 10
;
run;


I need out put like

 a2 a4 a6 a8 a10
 2   4  6  8  10

If u know anyone to solve this problem

3. how to pick the last 65000 rows

4. Reading the data from the excel file into SAS by row by row

5. Pick up highest observations from each variable in a data set

6. picking up duplicate records in a data set

Below is shown a way to identify duplicate records in a data set, using PROC 
SQL. One of the records in DS1 has one duplicate, and the other has two.

I want is to make another data set which contains the actual duplicates. Not 
only a record with their unique value. In the case below, the data set of 
duplicates should have 3 records, not 2.

I have looked in the online documentation for PROC SORT with the NODUPRECS 
option, but the out= data set contains the records *without* the duplicates.

Like this:
proc sort data = ds1 noduprecs out = ds1_without_dups ;
   by
      PartNumber
      Description
      InStock
      ReceivedDate
      Price;
run;
proc print; run;


Is there a way then to make a third data set - ds1_the_dups - that could 
contain all records that are in ds1 but not in ds1_without_dups ?

Rune



data ds1;
   input
      PartNumber        $
      Description       $
      InStock
      @19 ReceivedDate  date9.
      Price;
   format ReceivedDate  date9.;
   datalines;
K89R seal   34    27jul1998 245.00
M4J7 sander 98    20jun1998 45.88
LK43 filter 121   19may1999 10.99
MN21 brace  43    10aug1999 27.87
BC85 clamp  80    16aug1999 9.55
NCF3 valve  198   20mar1999 24.50
NCF3 valve  198   20mar1999 24.50
NCF3 valve  198   20mar1999 24.50
KJ66 cutter 6     18jun1999 19.77
UYN7 rod    211   09sep1999 11.55
UYN7 rod    211   09sep1999 11.55
JD03 switch 383   09jan2000 13.99
BV1E timer  26    03aug2000 34.50
;
proc sql;
   create table ds1_dup as
      select
         PartNumber,
         Description,
         InStock,
         ReceivedDate,
         Price,
         count(*)
      from ds1
      group by
         PartNumber,
         Description,
         InStock,
         ReceivedDate,
         Price
      having count(*) > 1
   ;
quit;
proc print;run; 


7. How to pick records randomly from a data sets

8. FW: picking up duplicate records in a data set

From: Sigurd Hermansen
Sent: Thursday, April 27, 2006 2:14 PM
To: 'Rune Runnest;  XXXX@XXXXX.COM 
Subject: RE: picking up duplicate records in a data set


Rune:
SQL tells you that a distinct row PartNumber,Description,InStock,ReceivedDate,Price occurs n times. It does not keep track of the physical location of rows in a dataset. By design ....

If you store row order as data (as in the example below), SQL will select the rows that have duplicates of the values of interest: data ds1;
   input
      PartNumber        $
      Description       $
      InStock
      @19 ReceivedDate  date9.
      Price;
   format ReceivedDate  date9.;
   rowN=_N_;
   datalines;
K89R seal   34    27jul1998 245.00
M4J7 sander 98    20jun1998 45.88
LK43 filter 121   19may1999 10.99
MN21 brace  43    10aug1999 27.87
BC85 clamp  80    16aug1999 9.55
NCF3 valve  198   20mar1999 24.50
NCF3 valve  198   20mar1999 24.50
NCF3 valve  198   20mar1999 24.50
KJ66 cutter 6     18jun1999 19.77
UYN7 rod    211   09sep1999 11.55
UYN7 rod    211   09sep1999 11.55
JD03 switch 383   09jan2000 13.99
BV1E timer  26    03aug2000 34.50
;
proc sql;
   create table ds1_dup as
      select
             rowN,
         PartNumber,
         Description,
         InStock,
         ReceivedDate,
         Price
      from ds1
      group by
         PartNumber,
         Description,
         InStock,
         ReceivedDate,
         Price
      having count(*)>> 1
   ;
quit;
Sig
-----Original Message-----
From:  XXXX@XXXXX.COM  [mailto: XXXX@XXXXX.COM ] On Behalf Of Rune Runnest
Sent: Thursday, April 27, 2006 3:16 AM
To:  XXXX@XXXXX.COM 
Subject: picking up duplicate records in a data set


Below is shown a way to identify duplicate records in a data set, using PROC SQL. One of the records in DS1 has one duplicate, and the other has two.

I want is to make another data set which contains the actual duplicates. Not only a record with their unique value. In the case below, the data set of duplicates should have 3 records, not 2.

I have looked in the online documentation for PROC SORT with the NODUPRECS option, but the out= data set contains the records *without* the duplicates.

Like this:
proc sort data = ds1 noduprecs out = ds1_without_dups ;
   by
      PartNumber
      Description
      InStock
      ReceivedDate
      Price;
run;
proc print; run;

Note that the query selects one set of three rows and one set of two rows. Sig

Is there a way then to make a third data set - ds1_the_dups - that could contain all records that are in ds1 but not in ds1_without_dups ?

Rune



data ds1;
   input
      PartNumber        $
      Description       $
      InStock
      @19 ReceivedDate  date9.
      Price;
   format ReceivedDate  date9.;
   datalines;
K89R seal   34    27jul1998 245.00
M4J7 sander 98    20jun1998 45.88
LK43 filter 121   19may1999 10.99
MN21 brace  43    10aug1999 27.87
BC85 clamp  80    16aug1999 9.55
NCF3 valve  198   20mar1999 24.50
NCF3 valve  198   20mar1999 24.50
NCF3 valve  198   20mar1999 24.50
KJ66 cutter 6     18jun1999 19.77
UYN7 rod    211   09sep1999 11.55
UYN7 rod    211   09sep1999 11.55
JD03 switch 383   09jan2000 13.99
BV1E timer  26    03aug2000 34.50
;
proc sql;
   create table ds1_dup as
      select
         PartNumber,
         Description,
         InStock,
         ReceivedDate,
         Price,
         count(*)
      from ds1
      group by
         PartNumber,
         Description,
         InStock,
         ReceivedDate,
         Price
      having count(*> > 1
   ;
quit;
proc print;run;