moderated >> Malformed utf-8 not caught by Encode

by StewartL » Sat, 04 Sep 2004 08:00:11 GMT

[I posted this to perl.unicode over a week ago and have received no
replies -sl]

I have an app which accepts text supposedly emitted as utf-8. I use:

$valid_txt = Encode::decode('utf-8', $user_txt, Encode::FB_QUIET);

to truncate at badly-formed characters and set the utf-8 bit as
required.
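
For reference, a minimal sketch of what FB_QUIET does on a malformed byte, assuming a hypothetical input string with a stray continuation byte in the middle. The variable `$rest` is introduced only for illustration; the point is that decode() returns the part it could process and overwrites its argument with the unprocessed remainder.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical input: three ASCII bytes, a stray continuation byte, more ASCII.
my $user_txt = "abc\x80def";

# Work on a copy: with a CHECK value such as FB_QUIET, decode() overwrites
# its argument with whatever it did not process.
my $rest      = $user_txt;
my $valid_txt = Encode::decode('UTF-8', $rest, Encode::FB_QUIET);

printf "decoded %d character(s), %d byte(s) left unprocessed\n",
    length($valid_txt), length($rest);
```

Checking whether `$rest` is non-empty afterwards is one way to tell that the input was truncated rather than fully decoded.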

When my app went live I started to get:

Malformed UTF-8 character (UTF-16 surrogate 0xd87c) in substitution
(s///)...

messages in this and other operations using '$valid_txt' above.

I found the utf-8 sequence x'eda1bc' was one sequence causing a
problem. It's an invalid character according to the chart in
perlunicode.pod, 'Unicode Encodings' which shows:


Code Points        1st Byte  2nd Byte  3rd Byte  4th Byte

...
U+D000..U+D7FF     ED        80..9F    80..BF
...
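
The chart can also be applied directly to the raw bytes, before any decoding. What follows is a minimal sketch, not a tested production filter: it assumes the input is a byte string (utf8 flag off), and the rows other than the ED row are taken from the Unicode standard's table of well-formed UTF-8 byte sequences rather than from the excerpt quoted here. `valid_utf8_prefix` is a hypothetical helper name.

```perl
use strict;
use warnings;

# One well-formed UTF-8 character, expressed as byte ranges.
my $WELL_FORMED_CHAR = qr/
      [\x00-\x7F]                        # U+0000..U+007F
    | [\xC2-\xDF] [\x80-\xBF]            # U+0080..U+07FF
    | \xE0 [\xA0-\xBF] [\x80-\xBF]       # U+0800..U+0FFF
    | [\xE1-\xEC] [\x80-\xBF]{2}         # U+1000..U+CFFF
    | \xED [\x80-\x9F] [\x80-\xBF]       # U+D000..U+D7FF, surrogates excluded
    | [\xEE\xEF] [\x80-\xBF]{2}          # U+E000..U+FFFF
    | \xF0 [\x90-\xBF] [\x80-\xBF]{2}    # U+10000..U+3FFFF
    | [\xF1-\xF3] [\x80-\xBF]{3}         # U+40000..U+FFFFF
    | \xF4 [\x80-\x8F] [\x80-\xBF]{2}    # U+100000..U+10FFFF
/x;

# Return the longest leading run of well-formed characters (still as bytes),
# which mirrors the "truncate at the first bad character" idea.
sub valid_utf8_prefix {
    my ($bytes) = @_;
    my ($prefix) = $bytes =~ /\A((?:$WELL_FORMED_CHAR)*)/;
    return $prefix;
}

my $good = pack('H*', 'ed9fbc');
my $bad  = pack('H*', 'eda1bc');
printf "good: kept %d of %d bytes\n", length(valid_utf8_prefix($good)), length($good);
printf "bad:  kept %d of %d bytes\n", length(valid_utf8_prefix($bad)),  length($bad);
```

The second byte A1 falls outside the 80..9F range allowed after ED, so the pattern stops before the x'eda1bc' sequence.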


Attempting to figure out why this sequence didn't get flagged as an
error, I wrote this:

1 #!/usr/local/bin/perl -w
2
3 use Encode;
4
5 my $good_string = 'ed9fbc';
6 my $bad_string = 'eda1bc';
7
8 my $good_char = pack('H*',$good_string);
9 my $bad_char = pack('H*',$bad_string);
10
11 my $cnv_good_char = decode('utf8',$good_char);
12 my $cnv_bad_char = decode('utf8',$bad_char,Encode::FB_WARN);
13
14 my $goodflag = utf8::valid($cnv_good_char);
15 my $good2flag = Encode::is_utf8($cnv_good_char, 1);
16 print "goodflag=$goodflag\n";
17 print "good2flag=$good2flag\n";
18 print "good_length=",length($cnv_good_char),"\n\n";
19
20 my $badflag = utf8::valid($cnv_bad_char);
21 my $bad2flag = Encode::is_utf8($cnv_bad_char,1 );
22 print "badflag=$badflag\n";
23 print "bad2flag=$bad2flag\n";
24 print "bad_length=",length($cnv_bad_char),"\n\n";
25
26 my $good_lc = lc($cnv_good_char);
27 my $bad_lc = lc($cnv_bad_char);


On my main machine, Red Hat 9 (with the latest version of Encode,
2.01, installed from CPAN) I get:

sgl@linus > perl -v

This is perl, v5.8.4 built for i686-linux

sgl@linus > ./xx
goodflag=1
good2flag=1
good_length=1

badflag=1
bad2flag=1
bad_length=1

Malformed UTF-8 character (UTF-16 surrogate 0xd87c) in lc at ./xx line
27.


Nothing reports an error until I actually try to use the string!

Same results with perl 5.8.3 on Linux and perl 5.8.0 on Solaris, with
unknown versions of Encode.

Can anyone tell me what I'm doing wrong? How do I filter out invalid
utf-8 sequences?
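
One possible direction, sketched under assumptions: the Encode documentation for later releases describes 'utf8' as Perl's lax internal encoding and 'UTF-8' (with the hyphen) as the strict one. Assuming an Encode release with that distinction is installed (which may not be the case for the 2.01 mentioned above), a strict check could look like this. `is_strict_utf8` is a hypothetical helper, not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# Returns 1 if the byte string decodes under the strict 'UTF-8' encoding,
# 0 otherwise. FB_CROAK makes decode() die on the first malformed sequence.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $hex (qw(ed9fbc eda1bc)) {
    my $bytes = pack('H*', $hex);
    printf "%s: %s\n", $hex, is_strict_utf8($bytes) ? 'accepted' : 'rejected';
}
```

The same encoding name can be combined with FB_QUIET instead of FB_CROAK if truncating at the first bad sequence is preferred over rejecting the whole string.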

TIA.

Stewart

Similar Threads

1. Reading Text File Encoding and converting to Perls internal UTF-8 encoding

Need help from Unicode gurus or anybody with some knowledge on the subject.

I may have a text (character) file that I just open, but I don't know the encoding, so I
can't open it with any encoding attribute.

It would appear to me that at the start of the file, there is an encoding mark (or none),
assuming a text file, a sort of BOM sequence of octets that mark what its encoding is.

Given that I might be passed a file descriptor only (I am a module), and I rewind the position
to the start of the file, is there any way I can tell the encoding? If I could, and
it's not utf8, I could decode() the rest of the file as octets, i.e. decode in place in memory,
create a decoded temp file, or possibly re-open it with the proper encoding.

I think that encoding is the usual 8/16/32 bit utf but with many locales (chars).

I am still unsure where to find a list of encoding markers that would give me
this information, and still unsure about the methods available for analysis and transformation.
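
As an illustration of the "encoding markers" in question, here is a minimal sketch that checks the first bytes of a buffer against the standard Unicode byte-order marks. `sniff_bom` is a hypothetical helper name. The sketch assumes the caller has already read the first few bytes in binary mode, and it says nothing about files that carry no BOM, whose encoding cannot be determined this way.

```perl
use strict;
use warnings;

# Byte-order marks, longest first so that UTF-32LE (FF FE 00 00) is not
# mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Returns (encoding name, BOM length in bytes), or nothing when the buffer
# does not start with a known BOM.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return ($name, length $bom)
            if substr($bytes, 0, length $bom) eq $bom;
    }
    return;
}

my ($name, $skip) = sniff_bom("\xEF\xBB\xBFhello");
print "encoding=$name, skip $skip byte(s)\n" if defined $name;
```

The returned length tells the caller how many bytes to skip before decoding the rest with the detected encoding.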

I know Perl has a massive 'use Encode' lib, nevertheless, this is what I need to do to finalize
a module I'm working on.

Thanks for the help.
-sln

2. UTF - SEEK_SET workaround for BOM encoding(utf-16/32) layer Bug - Perl

3. Malformed UTF-8 character

I get a nasty error...
Malformed UTF-8 character (unexpected non-continuation byte 0x02,
immediately after start byte 0xe7) in null operation at
C:/Perl/site/lib/Tk.pm line 406.
This is my code:
my @dbnames;
my $dbhost  = "localhost";
my $dbuser  = "testuser";
my $dbpass  = "logpass";
my($dbh,$sth,$query);
my $dsn = "DBI:mysql:host=$dbhost";
$dbh    = DBI->connect($dsn,$dbuser,$dbpass,{PrintError => 0, RaiseError
=> 1});
$query  = qq^SHOW DATABASES^;
$sth    = $dbh->prepare($query);
$sth->execute();
while(my $ref_db = $sth->fetchrow_array()) {
    print "Found database: $ref_db\n";
    push(@dbnames, $ref_db);
}
$sth->finish();
$dbh->disconnect();
$i=0;
foreach(@dbnames){
  $w=$pane ->Label(-text=>"$_", -relief=>'groove', -width=>30);
  $pane->windowCreate('end', -window=>$w);
  $ch_t = $w->cget(-text);
  print "$ch_t\n";
  $w2 = $pane->Checkbutton(-bg=>'white', -text=>\$dbnames[$i],
      -command=>sub{print "Check $dbnames[$i]\n"})->pack(-anchor=>'nw');
  $pane->windowCreate('end', -window=> $w2);
  $pane->insert('end',"\n");
  $i++;
}
$pane->configure(-state=>'disabled');
-----------------------------------------------------------
I get the error because of this: -text=>\$dbnames[$i]
I've looked at utf8 and use Encode, but that didn't solve this error.

4. Meaning of "Malformed UTF-8 character"? - Perl

5. encoding - utf-8

Hello,

I have a problem, apparently an encoding issue, but can't figure out 
where it comes from. Could someone please help?

I'm reading from an XML file that contains the line

[1]	...Bergson referred as "dur"; the way...

Then I parse the file with XML::DOM::Parser and print it out again.
The line now becomes:

[2]	...Bergson referred as "dur㩥; the way...


Where can this possibly come from? Does "standard" reading and printing 
not produce UTF-8? And does XML::DOM::Parser not read input as UTF-8? 
So, when I print it out, should it not be UTF-8 again?

The file containing the first line was written like this:

	#!/usr/bin/perl
	use strict;
	use warnings;
	use encoding 'utf-8';

	my $infile = "file1.xml";
	open IN, "$infile" or die "\ncannot read specified infile\n";
	my $text = join "", <IN>;
	close IN;

	# some processing...

	my $outfile = "file2.xml";
	open OUT, ">:encoding(utf-8)", $outfile or die "cannot create out file";
	print OUT $text;
	close OUT;

	# alternatively I tried:
	# open IN, "<:encoding(utf-8)", "$infile"; # and
	# open OUT, ">$outfile" or die "cannot create out file";
	# respectively. It makes no difference.


The second script reads/writes like this:

	#!/usr/bin/perl
	use strict;
	use XML::DOM;
	use warnings;

	my $infile = "file2.xml";
	my $dom_parser = new XML::DOM::Parser();
	my $TREE = $dom_parser->parsefile($infile);

	open OUT, ">file3.xml" or die "could not open log file";
	print OUT $TREE->toString();
	close OUT;


Thanks for any comments!

Alois Heuboeck

6. XML::Twig produces double encoded UTF-8

7. LWP and UTF-8 encoding

I have a short perl script that among other things tries to open web
pages using LWP::UserAgent.

here is an excerpt from the code:

my $ua = LWP::UserAgent->new;

$ua->default_headers->push_header('Accept-Charset' =>
"iso-8859-1,iso-8859-2,utf-8");
$ua->default_headers->push_header('Accept' => "text/html, text/plain,
image/*");

$req = new HTTP::Request('GET', $url);
$res = $ua->request($req);

I'm using $res->content to get the content of the retrieved page.
Sometimes the script produces the following warnings with some Web
sites. I don't know why I'm getting these messages.

Parsing of undecoded UTF-8 will give garbage when decoding entities at
/usr/lib/perl5/vendor_perl/5.8.6/LWP/Protocol.pm line 114.
Parsing of undecoded UTF-8 will give garbage when decoding entities at
/usr/lib/perl5/site_perl/5.8.6/i386-linux-thread-multi/HTML/PullParser.pm
line 83.

8. Question about Encode (Windows-1252 to utf-8) - Perl