XML::Parser's char handler behaves differently with UTF-8 and ISO-8859-1 (bug?)

Post by skake » Wed, 02 Jul 2003 04:20:28


I am looking for an explanation of why this happens (is it a bug?)

As the subject says, with everything else being the same, this
document (where the ellipsis stands for several K of text data):

<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<param>
...
</param>

gets parsed differently than this document:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<param>
...
</param>


The difference is that the char handler gets called after each 1K of
data in the first case (ISO) and just once at the end of the data in
the second case (UTF). The consequence is that if my char
handler were writing (not appending) the data to a file, I would
get only the last chunk in the first case, and all of it in the second
case.

Here is a Perl program that illustrates the problem. Run it with each
of the two sample XML documents passed on standard input (with at least
1K of text between the <param> tags) and you will see exactly what I mean.

use strict;
use warnings;
use XML::Parser;

sub startHandler { return; }
sub endHandler { return; }

# Print each chunk of character data, followed by a marker.
sub charHandler {
    my ($xp, $string) = @_;
    print $string;
    print "\n\n ## end of data ##\n\n";
}

my $parser = XML::Parser->new(
    ErrorContext => 2,
    Style => 'Stream',
    Handlers => {
        Start => \&startHandler,
        End => \&endHandler,
        Char => \&charHandler,
    },
);

my $input = join( "", <STDIN> );          # load $input from STDIN
$input =~ s/\n//g;                        # strip off newlines
eval { $parser->parse( $input ) };        # parse the input...
die "Failed to parse: $@" if $@;          # ...die if can't parse
 
 
 

Post by Klaus Joha » Wed, 02 Jul 2003 23:26:25


The char handler may get called one or more times for a single run of
text. Your program should not expect it to be called exactly once, as
explained in the documentation of XML::Parser: "This event is generated
when non-markup is recognized. The non-markup sequence of characters is in
String. A single non-markup sequence of characters may generate multiple
calls to this handler."
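
One way to cope with this is to append in the Char handler and only use the text once the End handler fires. The following is a minimal sketch, not the only approach: the names $buffer and @texts and the 5000-character sample document are made up for illustration, and it assumes elements are not nested (nested elements would need a stack of buffers).

```perl
use strict;
use warnings;
use XML::Parser;

my $buffer = '';   # text collected for the element currently open (hypothetical name)
my @texts;         # one entry per completed element (hypothetical name)

my $parser = XML::Parser->new(
    Handlers => {
        # Start with an empty buffer whenever an element opens.
        Start => sub { $buffer = ''; },
        # Append, never overwrite: Char may fire several times per run of text.
        Char  => sub { my ($xp, $string) = @_; $buffer .= $string; },
        # The element is complete here, so the buffer holds all of its text.
        End   => sub { push @texts, $buffer; $buffer = ''; },
    },
);

# Made-up sample: 5000 characters of text in an ISO-8859-1 document.
my $body = 'x' x 5000;
$parser->parse(
    qq{<?xml version="1.0" encoding="ISO-8859-1"?><param>$body</param>}
);

print "elements: ", scalar(@texts), ", length: ", length($texts[0]), "\n";
```

Written this way, the result does not depend on how many times the parser calls the Char handler, so the ISO-8859-1 and UTF-8 documents should yield the same text. Writing to a file would then happen in the End handler, or the file would be opened once and appended to from Char.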