Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)

Post by fuzz » Sat, 04 Mar 2006 00:17:12


Hello Usenet.

I'm somewhat confused by XML and UTF-8. I'm working with XML::Simple and
trying to decode some XML containing German umlauts (ISO-8859-1). The
first line of the XML declares the encoding correctly (see the code
below), but I get different results from XML::Simple depending on whether
it uses the default parser, XML::SAX, or a second parser, XML::Parser.
The following code decodes the mini XML file and prints the UTF8 flags
of the resulting strings.

Can someone run this code on his machine and post the results? Thanks.
The results on my machine are this:

(0) cmp ?(0) = -1
?(1) cmp ?(0) = 0

The first line was parsed by XML::Sax and the second line was parsed by
XML::Parser. My conclusions:

1) Line 1 is wrong, line 2 is correct
2) The output should be line 2 two times.
3) There is a bug in XML::Sax

Your opinion?

The code (written in ISO-8859-1 on disc):

#!/usr/bin/perl -w

use strict;
use warnings;

use XML::Simple;
use Encode;

# "\xE4" is the ISO-8859-1 byte for a-umlaut, used here as a stand-in:
# the literal umlaut in the posted code did not survive in this copy.
foreach (1..2)
{
    my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>\xE4</a>");
    my $q2 = "\xE4";

    printf "%s (%d) cmp %s (%d) = %d\n"
        , $q1, Encode::is_utf8($q1)
        , $q2, Encode::is_utf8($q2)
        , $q1 cmp $q2;
    # and again with the non default parser
    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
}
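
The two kinds of result reported above can be reproduced without any XML parser. The following is only a sketch using the core Encode module; the choice of "\xE4" (a-umlaut) and the idea that a parser might hand back undecoded UTF-8 octets are assumptions made for illustration, not a diagnosis of XML::SAX.

```perl
use strict;
use warnings;
use Encode;

# a-umlaut as a single ISO-8859-1 byte; the UTF8 flag is off.
my $bytes = "\xE4";

# The same character, but stored internally as UTF-8; the UTF8 flag is on.
my $chars = $bytes;
utf8::upgrade($chars);

# cmp compares characters, not internal storage, so these are equal.
my $same = $chars cmp $bytes;
printf "upgraded: flag %d, cmp = %d\n",
    (Encode::is_utf8($chars) ? 1 : 0), $same;

# Undecoded UTF-8 octets (0xC3 0xA4): two characters, UTF8 flag off.
my $octets = Encode::encode('UTF-8', $chars);
my $diff   = $octets cmp $bytes;
printf "octets:   flag %d, length %d, cmp = %d\n",
    (Encode::is_utf8($octets) ? 1 : 0), length($octets), $diff;
```

The first line printed has the shape "(1) ... = 0" and the second the shape "(0) ... = -1", so a string with flag 0 that compares as -1 would be consistent with octets that were never decoded.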

PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and
expat-1.95.8.

--
So long... Fuzz
 
 
 

Post by A. Sinan U » Sat, 04 Mar 2006 00:57:53


XXXX@XXXXX.COM (Erik Wasser) wrote in



First off, let me say I don't know much about this stuff. I am on the US
English version of XP. I copied and pasted the code above into Gvim, and
then ran it. I got:


D:\Home\asu1\UseNet\clpmisc> r > results.txt

D:\Home\asu1\UseNet\clpmisc> cat results.txt
?(1) cmp ?(0) = 0
?(1) cmp ?(0) = 0

I would be inclined to look at what changed in XML-SAX between versions
0.12 and 0.13, but then, as I said, I don't know much about encodings
etc.

I have XML-SAX-0.12 and XML-Parser-2.34 and

D:\Home\asu1\UseNet\clpmisc> perl -v

This is perl, v5.8.7 built for MSWin32-x86-multi-thread
(with 14 registered patches, see perl -V for more detail)

Copyright 1987-2005, Larry Wall

Binary build 815 [211909] provided by ActiveState
http://www.yqcomputer.com/
ActiveState is a division of Sophos.
Built Nov 2 2005 08:44:52

Sinan
--
A. Sinan Unur < XXXX@XXXXX.COM >
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.yqcomputer.com/ ~tadmc/clpmisc/clpmisc_guidelines.html

 
 
 

Post by » Mon, 06 Mar 2006 10:30:09


You didn't try to decode in German! You might have changed the "code page"
to German to get different character sets. It doesn't matter. I'm looking at
your character in whatever "code page" is on my machine. UTF-8 is an encoding
of Unicode. It's not discernible unless you have a Unicode-"aware" renderer.
You can't just change the characters on the page via cut & paste and have
them turn into Unicode. If you open or save a Unicode document from a
Unicode-aware editor, the represented character will not be noticeable as
Unicode, so it's not something that can be "cut 'n pasted" into a newsgroup
as code to be tested! UTF-8, even "multi-byte", is transparent to the user
and only known to the renderer. Unicode data that is read from a file into
a parser (or a Perl program that is UTF-8 aware) is treated as Unicode in
its variable representation and in its interaction with other variables.
If a regex is applied to Unicode data from an aware Perl parser, it works
every time.
 
 
 

Post by » Mon, 06 Mar 2006 10:43:12


Just a follow-up: I know your question was about XML, but if you want to use
Unicode "outside" the 0-127 range in a regex, you might want to use the
code points, as in this simple example (which just uses various "ranges"):

@UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
    "\\x{370}-\\x{37D}",
    "\\x{37F}-\\x{1FFF}",
    "\\x{200C}-\\x{200D}",
    "\\x{2070}-\\x{218F}",
    "\\x{2C00}-\\x{2FEF}",
    "\\x{3001}-\\x{D7FF}",
    "\\x{F900}-\\x{FDCF}",
    "\\x{FDF0}-\\x{FFFD}",
    "\\x{10000}-\\x{EFFFF}",
);
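
A list of ranges like the one above could be turned into a usable pattern roughly as follows. This is a minimal sketch: it uses only the first three ranges, and the names $class, $re and starts_in_range are made up for the example.

```perl
use strict;
use warnings;

# A shortened copy of the range list; each entry is a literal
# regex escape such as \x{C0}-\x{D6}.
my @UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
);

# Join the ranges into one bracketed character class and compile it.
my $class = '[' . join('', @UC_Nstart) . ']';
my $re    = qr/^$class/;

# True if the first character of the string falls in one of the ranges.
sub starts_in_range {
    my ($string) = @_;
    return $string =~ $re ? 1 : 0;
}

for my $word ("\x{E4}pfel", "apple", "\x{D7}") {
    printf "U+%04X: %s\n", ord($word),
        starts_in_range($word) ? 'in range' : 'not in range';
}
```

U+00E4 (a-umlaut) falls inside the second range, plain "a" is below all of them, and U+00D7 (the multiplication sign) sits in the gap between the first two ranges.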
 
 
 

Post by fuzz » Mon, 06 Mar 2006 20:49:22


My question was: why are two XML parsers giving different results? It is
the different results that confuse me, not Unicode itself.

--
So long... Fuzz
 
 
 

Post by Peter J. H » Tue, 07 Mar 2006 07:09:03


[XML::Simple gives correct results with XML::Parser, but wrong results
with XML::SAX]


Looks like a bug in XML::SAX or one of the libraries it uses.
However, like Sinan, I cannot reproduce it here on a Debian Sarge
system:

perl, v5.8.4 built for i386-linux-thread-multi
XML::Simple version 2.14
XML::SAX version 0.12
XML::Parser version 2.34
libexpat1 1.95.8-3

So it may be caused by something weird in your environment.
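
To compare environments like the ones listed in this thread, something along these lines could print the installed module versions and the SAX parser that XML::SAX would actually select. It is a rough sketch: module_version is a helper invented here, and the script is written so that it still runs when the XML modules are missing.

```perl
use strict;
use warnings;

# Return a module's $VERSION, or undef if it cannot be loaded.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; $module->VERSION };
}

for my $module (qw(XML::Simple XML::SAX XML::Parser)) {
    my $version = module_version($module);
    printf "%-12s %s\n", $module,
        defined $version ? $version : 'not available';
}

# If XML::SAX loads, ask its factory which parser class it hands out.
if (eval { require XML::SAX::ParserFactory; 1 }) {
    my $parser = eval { XML::SAX::ParserFactory->parser };
    printf "%-12s %s\n", 'SAX parser',
        defined $parser ? ref $parser : 'none available';
}
```

If two machines report different SAX parser classes, they are not really running the same parser even though both go through XML::SAX.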

hp

--
This is not a signature
 
 
 

Post by » Fri, 10 Mar 2006 09:54:10


I'm going to have to agree. Using several parsers at the same time will
cause either slowdowns or indeterminate results.
XML::SAX is not a good parser. Just because it has "SAX" in the title
(Simple API for XML) is no bellwether of its functionality or performance.
After I once used XML::SAX in place of expat, performance fell off
by 800%. If you are going to parse, capture and expand a string
to be converted into a hash later, be careful what you use and
how it's used. Expat and Simple (with the expat directive) are a good
combination. Good cleanup is required. Keep your instantiation,
single operation and closure sub-scoped. If you are doing schema checking
with Xerces, keep that in a different scope, as a preliminary to
data-extraction parsing.

Any ?'s (oh *** my spelling), let me know