sas >> PROC SORT: nodups versus nodupkey

by kimberly.lebouton » Wed, 23 Jun 2004 07:59:11 GMT

Can anyone explain what nodups does with PROC SORT?

From the V6, V8, and V9 documentation, it is not an option or an alias for
PROC SORT. If I'm wrong, please provide me the name of the documentation
that you found it noted.

I inherited some code with nodups, and was getting strange results with my
PROC SORT. When I changed it to nodupkeys, I got the results that I hoped
for.

Just interested in the history of nodups.

TIA,
Kim LeBouton

sas >> PROC SORT: nodups versus nodupkey

by chang_y_chung » Wed, 23 Jun 2004 08:04:48 GMT


On Tue, 22 Jun 2004 19:59:11 -0400, Kimberly LeBouton



Hi, Kim,

I recommend searching for a sas-l archive with keyword, nodup. There are
plenty of threads discussing this topic.

Cheers,
Chang

sas >> PROC SORT: nodups versus nodupkey

by quentin_mcmullen » Wed, 23 Jun 2004 08:49:57 GMT

On Tue, 22 Jun 2004 19:59:11 -0400, Kimberly LeBouton



Hi Kim,

It looks like nodups does show up in the v8 docs (procedure manual- proc
sort) as an alias for "noduprecs".

As Chang mentioned, there should be plenty of previous threads on nodups,
and problems with it managing to leave dups (see the "TIP" at the bottom).
I think the take-home message is that if you want to use nodups to get rid
of all dups, sort by _all_. I tend to use nodupkey rather than nodups. But
in general I don't find it to be useful. I can never remember whether
nodupkey keeps the first record or the last. And in general if I want to
get a unique dataset, I tend to get it from a data step, where I can choose
whether I want first. or last., and I can make sure the duplicate records
get funneled to a dups dataset rather than dropped.

Kind Regards,
--Quentin

<doc>
NODUPRECS
checks for and eliminates duplicate observations. If you specify this
option, PROC SORT compares all variable values for each observation to those
for the previous observation that was written to the output data set. If an
exact match is found, the observation is not written to the output data set.
Alias : NODUP
Interaction: When you are removing consecutive duplicate observations
in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can
have an effect on which observations are removed.
Interaction: The action of NODUPRECS is directly related to the setting
of the SORTDUP data set option. When SORTDUP= is set to LOGICAL, NODUPRECS
removes only the duplicate variables that are present in the input data set
after a DROP or KEEP operation. Setting SORTDUP=LOGICAL increases the number
of duplicate records that are removed because it eliminates variables before
record comparisons takes place. Also, setting SORTDUP=LOGICAL can improve
performance because dropping variables before sorting reduces the amount of
memory required to perform the sort. When SORTDUP= is set to PHYSICAL,
NODUPRECS removes all duplicate variables in the data set, regardless if
they have been kept or dropped. For more information about the data set
option SORTDUP=, see SAS Language Reference: Dictionary.
Tip: Because NODUPRECS checks only consecutive observations, some
nonconsecutive duplicate observations may remain in the output data set. You
can remove all duplicates with this option by sorting on all variables.
See also: NODUPKEY
</doc>

sas >> PROC SORT: nodups versus nodupkey

by Lou » Wed, 23 Jun 2004 09:27:23 GMT

ODUP is the alias for the NODUPRECS option of PROC SORT.

Click on http://v8doc.sas.com/sashtml/ then in the left hand frame click
successively on Base SAS Software / SAS Procedures Guide / Procedures / The
SORT Procedure. At that point in the right hand frame click on "PROC SORT
Statement" and scroll down to NODUPRECS.

If you have a hardcopy of the Procedures Guide, you'll find it listed as an
option of the PROC SORT statement.

In Version 8.2, click on Help in the menu bar, click on SAS System Help.
Click on the Index tab, type SORT in the keyword to find area. Click on
"SORT procedure", and in the Topics Found dialog that pops up click on "PROC
SORT". In the right hand frame, click on "Syntax", then click on "PROC
SORT". Scroll down to and click on "NODUPRECS" and scroll down to
NODUPRECS.

Alternatively, once you have the Help system invoked, you could click on the
Search tab and type "nodup" in the keyword to find area, click the "List
Topics" button. Click on "PROC SORT Statement: Options" and scroll down to
NODUPRECS. As long as you're in the Help on the Search tab, you might also
peruse the SAS SYSTEM OPTIONS: SORTDUP=Option

In version 6.12, invoke the Help system, click on the "Find" tab, type
"nodup" in the word to find area, click on "nodup" in the Select some
matching words to narrow your search area, and click on "SORT: Procedure
Options in the Click on a topic, then click Display area, then click the
Display button.

NODUPRECS is an option that's been around at least since version 5 -
NODUPKEY didn't make it's appearance until version 6. Back in version 5 it
was called NODUPLICATES - the short form was NODUP. Presumably the name was
changed because people who couldn't be bothered to read the documentation
thought it did what NODUPKEY does.

It's been completely documented in every version of the manual I've ever
seen for at least the past 20 years - maybe longer but my library of manuals
goes back only to 1985.

NODUPRECS/NODUP eliminates successively duplicate records (observations)
from the output data set. The data set is sorted by the variables in the BY
list, and if every variable value in the current observation is identical to
the values of the previous observation output, the current observation is
deleted. If at least one variable value is different, the observation is
output. Note that unless you're sorting by _ALL_ (or you list every
variable in the BY statement) this doesn't guarantee that all but one of a
set of identical observations will be eliminated, just that no observation
will be immediately followed by another that has all variable values the
same IN THE ORIGINAL DATASET if the SORTDUP option is set to physical (the
default - meaning before KEEP and DROP data set options are applied). If
you want no successive identical observations in the output (meaning you
want to take KEEP and DROP data set options into account) set the SORTDUP
system option to LOGICAL.

"Kimberly LeBouton" < XXXX@XXXXX.COM > wrote in message
news: XXXX@XXXXX.COM ...



sas >> PROC SORT: nodups versus nodupkey

by Lou » Thu, 24 Jun 2004 09:27:52 GMT


I feel I must take exception to the tone expressed here. NODUPRECS has
behaved precisely as documented for the last 20 years.

For instance, the code

PROC SORT DATA = FEE OUT = FIE NODUP;
BY _ALL_;
QUIT;

produces exactly the same output as

PROC SQL;
CREATE TABLE FIE AS
SELECT DISTINCT *
FROM FEE;
QUIT;

Yet I'd hesitate to call the DISTINCT operand of the SELECT clause
"unreliable".

sas >> PROC SORT: nodups versus nodupkey

by lpogoda » Mon, 28 Jun 2004 21:03:38 GMT


NODUPRECS works "just fine" (that is, it does just what the
documentation says it will do) whether you sort by _ALL_ or by some
subset of variables.


... that some people can't be bothered to read the documentation,
and/or feel that their personal preferences should set the standard,
and/or assume they know what's going on when they demonstrably do not.


As if that's any different from using a shorter instead of a longer BY
list at any other time.


If an ignorant programmer writes sloppy code then I guess you can't
trust the processing result. You have to know your language. Doesn't
hurt to know something about your data either.


I still don't see the "issue" - you seem to want noduprec to do the
same thing as nodupkey, or BY _ALL_ even when we don't sort by _all_,
or select distinct. We have nodupkey, by _all_, and select distinct -
if that's what you want, use one of them, and if it isn't, don't.



Only if your database manages doesn't know what s/he's doing.


50 year old literary allusions aside, you seem to be saying that a low
proportion of practicing programmers have read the documentation, and
by implication very few know what they're doing. That is as may be,
what doesn't follow is that it's the language that's to blame for the
situation.

sas >> PROC SORT: nodups versus nodupkey

by ghellrieg » Mon, 28 Jun 2004 23:18:56 GMT

that is something I do not understand! Ok, NODUPKEY does something other
than NODUPREC. Where is the problem? There are hundreds of examples where
this will happen! TRANWRD does something different as TRANSLATE, PUT and
INPUT, INTNX and INTCK, ...
How will programmers know, what everything does, without reading (ok: and
UNDERSTANDING) the documentation? Maybe it would be better to use WRZLGMF
instead of NODUPKEY! Nobody would "understand" that, so everybody would
read the manuals!
Maybe for that guys C or FORTRAN, even Assembler would be better? Nobody
can "read" assembler and understand what happens. In SAS some people try
that. But in the end it's the same with SAS: you sometimes only believe
that you understand it!

Similar Threads

1. what is difference UNIQUE and DISTINCT and NODUPS and NODUPKEY

2. Getting PROC SQL to emulate PROC SORT NODUPKEY

Hi,

Sorry if this is in the PROC SQL doc, but I couldn't find it...

Test code:

data one;
   length start label $32;
   input start label;
   cards;
A  A
A  A
A  B
C  D
;
run;
proc sort data=one out=test1 nodupkey;
   by start;
run;
proc sql noprint;
   create view test2 as
      select distinct start, label
         from one
   ;
quit;

How do I create a PROC SQL view to duplicate the PROC SORT NODUPKEY
output?  The DISTINCT keyword considers the entire row (start and label
together), rather than just start.

In the actual application, the SQL views would be permanent views creating
cntlin datasets for PROC FORMAT from DB2 tables that could contain repeats
of the start value.

Thanks,
Scott

3. Difference between PROC SORT NODUPKEY and PROC SQL?

4. PROC SORT NODUPKEY - highest output dataset

Hello List,

I have a dataset with 8 variables and about 600 entries and am trying to get
a table of data without any double entries:

proc sort data=D1 nodupkey;
      by var1;
proc sort data=D1 nodupkey;
      by var2;
proc sort data=D1 nodupkey;
      by var3;
proc sort data=D1 nodupkey;
      by var4;
proc sort data=D1 nodupkey;
      by var5;
proc sort data=D1 nodupkey;
      by var6;
proc sort data=D1 nodupkey;
      by var7;
proc sort data=D1 nodupkey;
      by var8;
run;
Now, obviously, the order in which I do these sorting steps plays a role in
how many datasets there will be in the output table. There are 8! = 40'320
possible orders in which to do this, and I want to find the one that yields
the largest possible output dataset. Is there an elegant way to achieve this
in SAS?

Thanks

5. How to control Proc SORT nodupkey

6. Doing proc sort nodupkey in a data step

Hi All

I was wondering if I can do the same thing that

proc sort data=ps.psmpl out=x1 nodupkey;
by hospid lastnm firstnm;
run;

is doing in a data step (eventually with even less coding).

Thanks Toby

7. sql distinct vs. proc sort nodupkey

8. proc sort nodupkey results

Paul,

If just the dups are needed, you could replace OUT=SORTED with OUT=_null_.

If the data set is large enough where sorting is not feasible/reasonable,
using the hash object would be a way to go.

HTH,
Ken



In a message dated 10/21/2006 9:52:53 AM Eastern Standard Time,
 XXXX@XXXXX.COM  writes:
Paul,

Would the dupout option (e.g., proc sort data=have out=sorted nodupkey
dupout=want) do what you want?

Art