[I posted this to perl.unicode over a week ago and have received no
replies -sl]
I have an app which accepts text supposedly emitted as utf-8. I use:
$valid_txt = Encode::decode('utf-8', $user_txt, Encode::FB_QUIET);
to truncate at badly-formed characters and set the utf-8 bit as
required.
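As a minimal sketch of what that call is expected to do, assuming an input containing a byte (0xFF) that can never occur in UTF-8: the sample string below is made up for illustration, and the only Encode calls used are decode() and the FB_QUIET constant already shown above.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical input: valid ASCII, then a byte (0xFF) that is never
# legal in UTF-8, then more ASCII.
my $user_txt = "abc\xFFdef";

# FB_QUIET makes decode() stop silently at the first sequence it rejects
# and return what it has processed so far. In this mode decode() also
# overwrites its source argument with the unprocessed remainder.
my $valid_txt = Encode::decode('utf-8', $user_txt, Encode::FB_QUIET);

print "decoded: $valid_txt\n";
printf "left over: %d byte(s)\n", length $user_txt;
```

Because the source argument is modified, a copy should be passed in if the original bytes are still needed afterwards.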
When my app went live I started to get:
Malformed UTF-8 character (UTF-16 surrogate 0xd87c) in substitution (s///)...
messages in this and other operations using '$valid_txt' above.
I found that the byte sequence x'eda1bc' was one of the sequences causing
the problem. It is ill-formed according to the chart in the 'Unicode
Encodings' section of perlunicode.pod, which allows only 80..9F as the
second byte after ED:
Code Points       1st Byte  2nd Byte  3rd Byte  4th Byte
...
U+D000..U+D7FF    ED        80..9F    80..BF
...
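To make the chart concrete, here is a small sketch that decodes a three-byte sequence by hand (bit layout 1110xxxx 10xxxxxx 10xxxxxx) and reports whether the result falls in the surrogate range U+D800..U+DFFF. The helper names three_byte_code_point and is_surrogate are made up for this illustration; they are not part of Encode or of perl.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a three-byte UTF-8 sequence, given as a hex
# string, into its code point using plain bit arithmetic.
sub three_byte_code_point {
    my ($hex) = @_;
    my ($b1, $b2, $b3) = unpack('C3', pack('H*', $hex));
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

# Hypothetical helper: true if the code point is a UTF-16 surrogate.
sub is_surrogate {
    my ($cp) = @_;
    return ($cp >= 0xD800 && $cp <= 0xDFFF) ? 1 : 0;
}

for my $hex ('ed9fbc', 'eda1bc') {
    my $cp = three_byte_code_point($hex);
    printf "%s -> U+%04X%s\n", $hex, $cp,
        is_surrogate($cp) ? ' (surrogate)' : '';
}
# prints:
#   ed9fbc -> U+D7FC
#   eda1bc -> U+D87C (surrogate)
```

The second result matches the 0xd87c reported in the warning, and its second byte (A1) lies outside the 80..9F range the chart allows after ED.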
Attempting to figure out why this sequence didn't get flagged as an
error, I wrote this:
1 #!/usr/local/bin/perl -w
2
3 use Encode;
4
5 my $good_string = 'ed9fbc';
6 my $bad_string = 'eda1bc';
7
8 my $good_char = pack('H*',$good_string);
9 my $bad_char = pack('H*',$bad_string);
10
11 my $cnv_good_char = decode('utf8',$good_char);
12 my $cnv_bad_char = decode('utf8',$bad_char,Encode::FB_WARN);
13
14 my $goodflag = utf8::valid($cnv_good_char);
15 my $good2flag = Encode::is_utf8($cnv_good_char, 1);
16 print "goodflag=$goodflag\n";
17 print "good2flag=$good2flag\n";
18 print "good_length=",length($cnv_good_char),"\n\n";
19
20 my $badflag = utf8::valid($cnv_bad_char);
21 my $bad2flag = Encode::is_utf8($cnv_bad_char,1 );
22 print "badflag=$badflag\n";
23 print "bad2flag=$bad2flag\n";
24 print "bad_length=",length($cnv_bad_char),"\n\n";
25
26 my $good_lc = lc($cnv_good_char);
27 my $bad_lc = lc($cnv_bad_char);
On my main machine, Red Hat 9 (with the latest version of Encode,
2.01, installed from CPAN) I get:
sgl@linus > perl -v
This is perl, v5.8.4 built for i686-linux
sgl@linus > ./xx
goodflag=1
good2flag=1
good_length=1
badflag=1
bad2flag=1
bad_length=1
Malformed UTF-8 character (UTF-16 surrogate 0xd87c) in lc at ./xx line 27.
So nothing reports an error until I actually try to use the string!
I get the same results with perl 5.8.3 on Linux and perl 5.8.0 on
Solaris, each with an unknown version of Encode.
Can anyone tell me what I'm doing wrong? How do I filter out invalid
utf-8 sequences?
TIA.
Stewart
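One possible workaround, offered only as a sketch: check the raw bytes against the well-formed UTF-8 byte ranges before calling decode(), so the result does not depend on what a given Encode version accepts. The helper name well_formed_prefix and the sample input are made up for illustration, and the input is assumed to be a plain byte string (utf-8 flag off). Like FB_QUIET, it truncates at the first ill-formed sequence.

```perl
use strict;
use warnings;
use Encode;

# One well-formed UTF-8 character, byte by byte. The ED row allows only
# 80..9F as its second byte, which excludes the surrogates U+D800..U+DFFF.
my $well_formed_char = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | [\xEE\xEF][\x80-\xBF]{2}
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: return the longest well-formed prefix of a byte
# string. It matches one character at a time from the current position
# and stops at the first position where nothing matches.
sub well_formed_prefix {
    my ($bytes) = @_;
    pos($bytes) = 0;
    1 while $bytes =~ /\G$well_formed_char/gc;
    return substr($bytes, 0, pos($bytes));
}

# Made-up input: ASCII, then the problem sequence, then more ASCII.
my $user_txt  = "abc" . pack('H*', 'eda1bc') . "def";
my $clean     = well_formed_prefix($user_txt);
my $valid_txt = Encode::decode('utf-8', $clean);

print "kept: $valid_txt\n";    # prints: kept: abc
```

This checks well-formedness only. It does not reject noncharacters or unassigned code points, which would need a separate check if they matter to the application.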