sas >> Parsing Raw Data to remove Carriage Returns

by chriske2 » Mon, 03 Jan 2005 04:40:32 GMT

I have a large XML file that I am trying to convert to a SAS dataset. It is
concatenated and not well formed, so neither XML Map nor the information posted at
http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0412c&L=sas-l&D=1&O=D&m=130735&P=41017
will help. I will instead have to parse it using SAS and abuse it
until it's the way I want it.

I had been using the following code to get it into a SAS dataset:

data test ;
infile "C:\temp\file.XML" scanover ;
input @"<?xml" record $ ;
run ;

Unfortunately there are carriage returns scattered throughout the dataset
that I need to eliminate, or at least disregard in order to read the records
properly. I am very new to SAS so the previous listings on this list-serv
were not entirely helpful.

I was thinking that an IF statement testing substr(_infile_, 1, 3)="0D"X
would work, but I'm not sure how to write it properly (again, due to my
newbie nature). I'm using SAS 9.1.2 on Win 2000.

Thanks in advance,

Kevin Christensen

P.S. Anyone know a good site or book that will help with parsing data?
Something on SAS functions and data steps perhaps?



by tobydunn » Mon, 03 Jan 2005 05:15:02 GMT


Kevin,

Give us some representative data that we can work with and I am more than
sure one of us can come up with a solution.


Remember, by providing data you are helping us help you.


Toby Dunn





by chriske2 » Mon, 03 Jan 2005 06:22:13 GMT

See attached...


by art297 » Mon, 03 Jan 2005 21:57:29 GMT

Kevin,

On Sun, 2 Jan 2005 15:40:32 -0500, Kevin Christensen wrote:


As shown below, Kevin sent me a sample of his data off line.

A search of the SAS-L archives will provide a lot of useful hints and
references to helpful guides.

Unfortunately, getting what you want probably isn't as trivial a problem as
I made it out to be in the following code but, then again, that you will
have to discover for yourself.

Upon reviewing your sample file, it appears to come from a structured
database, thus the problem could end up being trivial. I didn't see carriage
returns as posing any significant problem in accomplishing what you want:

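data parsed;   /* dataset name assumed; the posted code omitted the DATA statement */
  infile "C:\sample.XML" truncover end=lst;
  retain dnum date;   /* carry parsed values across input lines */
  if not(lst) then do;
    input record $255. ;
    if index(record,'<DNUM><PDAT>') > 0 then do;
      /* the value starts 12 characters past the start of the opening tags */
      start=index(record,'<DNUM><PDAT>')+12;
      numchars=index(record,'</PDAT')-start;
      dnum=substr(record,start,numchars);
    end;
    else if index(record,'<DATE><PDAT>') > 0 then do;
      start=index(record,'<DATE><PDAT>')+12;
      numchars=index(record,'</PDAT')-start;
      date=substr(record,start,numchars);
      output;   /* write one observation per DNUM/DATE pair */
      dnum='';
      date='';
    end;
  end;
run;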

Art
---------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PATDOC SYSTEM "ST32-US-Grant-025xml.dtd" [
<!ENTITY USD0484671-20040106-D00000.TIF SYSTEM "USD0484671-20040106-D00000.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00001.TIF SYSTEM "USD0484671-20040106-D00001.TIF" NDATA TIF>
]>
<PATDOC DTD="2.5" STATUS="Build 20030724">
<SDOBI>
<B100>
<B110><DNUM><PDAT>D0484671</PDAT></DNUM></B110>
<B130><PDAT>S1</PDAT></B130>
<B140><DATE><PDAT>20040106</PDAT></DATE></B140>
<B190><PDAT>US</PDAT></B190>
</B100>
<B200>
<B210><DNUM><PDAT>29174009</PDAT></DNUM></B210>
<B211US><PDAT>29</PDAT></B211US>
<B220><DATE><PDAT>20030110</PDAT></DATE></B220>
</B200>
<B400>
<B472>
<B474><PDAT>14</PDAT></B474>
</B472>
</B400>
<B500>
<B510>
<B511><PDAT>0201</PDAT></B511>
<B516><PDAT>7</PDAT></B516>
</B510>
<B520>
<B521><PDAT>D 2712</PDAT></B521>
</B520>
<B540><STEXT><PDAT>Apparel</PDAT></STEXT></B540>
<B560>
<B561>
<PCIT>
<DOC><DNUM><PDAT>1998140</PDAT></DNUM>
<DATE><PDAT>19350400</PDAT></DATE>
<KIND><PDAT>A</PDAT></KIND>
</DOC>
<PARTY-US>
<NAM><SNM><STEXT><PDAT>Loew</PDAT></STEXT></SNM></NAM>
</PARTY-US>
</PCIT><CITED-BY-OTHER/>
</B561>
<B561>
<PCIT>



by Alan Churchill » Mon, 03 Jan 2005 23:36:44 GMT

Kevin,

I noticed that SAS can now handle concatenated XML files:

http://support.sas.com/onlinedoc/913/getDoc/en/engxml.hlp/engxmlwhatsnew900.htm

If you need some more work on the utility I built to help you do the conversion, let me know. Since it works at the basic level, it is trivial to fix it and make it handle what you need.

Thanks,
Alan
Savian
"Bridging SAS and Microsoft Technologies"


by chriske2 » Tue, 04 Jan 2005 05:44:34 GMT

Art, thanks for the code; it sparked a different way for me to look at
things. I now see what you mean about carriage returns not necessarily
impacting what I want to do. I broke out bits and pieces of your code in
order to figure out what was going on. There are a couple of things that
I'm not entirely clear on.

1) As I understand it, the 'end=lst' part of the infile statement is a 0 or
1 identifying the end of a line. I'm not sure why you included it, however,
since wouldn't SAS automatically move to the next record when it reached a
carriage return?

2) What exactly is being saved in the SAS dataset (if anything)? Is it
'dnum'? In the 'else if' statement it would appear that date is being saved
due to the 'output;'. If so, then shouldn't the 'if' statement also have an
'output;' line?


by chriske2 » Tue, 04 Jan 2005 05:51:11 GMT

Alan,

I looked at the concatenation stuff before posting my original message to
the SAS-List. Unfortunately it still does not support DTDs, so
concatenation is only a small part of my XML problem.

The XML utility you wrote worked for the most part (I did get one or two
errors but it still spit out a bunch of files); however, I also have some
SGML stuff to parse through and I thought SAS would be good to use for
both. Plus it's a good opportunity for me to learn the ins and outs of
SAS.

All that being said, if you're so inclined to develop the utility further
I'm sure more than a few people on this list would be in your debt.

Best,

Kevin Christensen



by art297 » Tue, 04 Jan 2005 06:21:16 GMT

Kevin,



END= flags the end of the file, actually, not the end of a line, but it is
not needed in this case.

While you probably want to parse more than those two fields, my example was
based on 'dnum' being the first variable of interest, and 'date' being the
last variable of interest within each set. The retain statement was used
to hold all variables you are trying to parse (in my example just 'dnum'
and 'date'), and then write a new record containing the parsed results once
you had attempted to capture the last variable of interest.

Thus, in my example, each set of records will result in one record, with
those records containing only 'dnum' and 'date'.

Art
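
The RETAIN-then-OUTPUT pattern described above can be tried on a few inline lines. The sketch below is illustrative only: the dataset name DEMO, the $20 lengths, and the choice of five lines copied from the sample file are assumptions, not part of Art's code.

```sas
/* Illustrative sketch: DEMO, the $20 lengths and the inline lines are assumptions */
data demo (keep=dnum date);
    length dnum date $20;
    retain dnum date;                 /* hold values until the set is complete */
    infile datalines truncover;
    input record $char80.;
    if index(record, '<DNUM><PDAT>') > 0 then do;
        start    = index(record, '<DNUM><PDAT>') + 12;
        numchars = index(record, '</PDAT') - start;
        dnum     = substr(record, start, numchars);
    end;
    else if index(record, '<DATE><PDAT>') > 0 then do;
        start    = index(record, '<DATE><PDAT>') + 12;
        numchars = index(record, '</PDAT') - start;
        date     = substr(record, start, numchars);
        output;                       /* DATE is the last field of each set */
        dnum = '';
        date = '';
    end;
    datalines;
<B110><DNUM><PDAT>D0484671</PDAT></DNUM></B110>
<B130><PDAT>S1</PDAT></B130>
<B140><DATE><PDAT>20040106</PDAT></DATE></B140>
<B210><DNUM><PDAT>29174009</PDAT></DNUM></B210>
<B220><DATE><PDAT>20030110</PDAT></DATE></B220>
;
run;

proc print data=demo;
run;
```

With these five lines the step should write two observations, one per DNUM/DATE pair; the <B130> line matches neither tag pair and is skipped. No 'output;' is needed in the first branch because the retained 'dnum' is written together with 'date'.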



by chriske2 » Thu, 06 Jan 2005 09:34:23 GMT

Art,

I've finally had some luck with parsing my data but only partially. When I
used the code you suggested I continually got negative values for the
numchars. It appeared to me that the carriage returns are messing with me
again, making numchars an impossible approach. Instead of reading in 3000+
records I instead read 15576, one for each carriage return in the dataset.
I've gotten around that by doing the following (no making fun of my dataset
names):

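/* Step 1: keep only the lines that start with one of the four tags of interest */
data kevin ;
  infile "C:\temp\pgb040106.XML" truncover end=lst;
  input record $255. ;
  IF substr(record, 1, 6)='<B110>'
  OR substr(record, 1, 6)='<B140>'
  OR substr(record, 1, 6)='<B210>'
  OR substr(record, 1, 6)='<B220>'
  ;
run;

/* Step 2: pull out the 8-character value and number the patents */
data kevin2 ;
  set kevin ;
  record2 = substr(record, 19, 8) ;   /* value follows 18 characters of opening tags */
  ID = _n_ ;
  TYPE = substr(record, 1, 6) ;
  if ID = 1 then patno = 1 ;
  retain patno ;
  if TYPE ^= '<B110>' then patno = patno ;
  else patno = patno + 1 ;            /* each <B110> line starts a new patent */
run ;

/* Step 3: one row per patent, one column per tag */
proc transpose data=kevin2 out=kevin_trans ;
  by patno ;
  id TYPE;
  var record2 ;
run ;


proc print data=kevin_trans (obs=30);
run ;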

The code identifies some of the lines of interest similar to how you were
doing it. Unfortunately that does not completely get me where I need to be.

As I mentioned to Alan offline the xml tags identify datapoints for each
record. Since the carriage returns interfere with how my dataset is read
into sas the tags are now in different rows than the data, making it
impossible to figure out the type of data I'm reading (or parsing). If
there were a way to remove the carriage returns then, to me anyway, I would
avoid the problem of separating the tags from the data. Further, even if I
concatenated long strings of tags and data into one variable I could easily
parse them once they were in a tabular form. Any ideas on how to do that?



by chriske2 » Sun, 09 Jan 2005 03:37:27 GMT

Well, I think I figured out a crude way to remove these characters that were
screwing up my file. The characters were not "truly" carriage returns as I
originally stated. I tried to clear them out using '0d'x but they wouldn't
go away. (Not sure what else they could be but that's well beyond my
knowledge.) They did, however, act as if they were when I tried reading the
file into SAS. Since I couldn't search for these characters directly, I
thought that I would replace each line feed ('0a'x) that does not follow
another line feed with a space, drop any '0d'x, and copy everything else
unchanged, thereby eliminating them. While the code seems to work, it
unfortunately doesn't get me any closer to where I need to be.

I put the code here for the listserv archives. It is a very slight
derivation of the code found at
http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0408d&L=sas-l&D=1&O=D&P=29772


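data _null_;
  length pchar $1;
  /* recfm=N reads the file as a byte stream, ignoring line terminators */
  infile 'C:\temp\pgb040106.XML' recfm=N;
  /* $char1. keeps each byte as-is; $1. would turn a lone period into a blank */
  input pchar $char1. @@;

  file 'C:\temp\pgb040106_2.XML' recfm=N;
  if pchar = '0a'x
  and lag(pchar) ^= '0a'x then
    put ' ' @;               /* line feed not preceded by a line feed: write a space */
  else if pchar ^= '0d'x then
    put pchar $char1. @;     /* copy everything else except carriage returns */
run;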
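
Since the post leaves open what the troublesome characters actually are, one way to find out is to look at the raw bytes. The following is a minimal diagnostic sketch, not part of the original code: it writes the first bytes of the file to the SAS log in hexadecimal. The 200-byte cutoff is an arbitrary assumption.

```sas
/* Diagnostic sketch: show the first 200 bytes of the file in hex in the log.
   The 200-byte limit is an arbitrary assumption. */
data _null_;
    infile 'C:\temp\pgb040106.XML' recfm=N;   /* byte stream, no line handling */
    input pchar $char1. @@;
    put pchar $hex2. +1 @@;                   /* 0D = carriage return, 0A = line feed */
    if _n_ >= 200 then stop;
run;
```

If the log shows 0A without a preceding 0D, the line breaks are bare line feeds, which would be consistent with '0d'x not matching anything.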

