Help with XML processing using DOM

Post by kernc » Fri, 29 Jun 2007 18:01:50


I'm trying to use XML files as input to my program. I decided to use
the DOM (DocumentBuilder, Node, etc.) instead of SAX. I have loaded
the file into an instance of Document, call it 'doc', and now I want
to go through and extract the data.

XML file:
<?xml version="1.0"?>

<a>
<b>
<c>foo</c>
<d>foo</d>
<d>foo</d>
</b>
<e>
</e>
</a>

Relevant code:
Node domNode = doc.getDocumentElement();
System.out.println(domNode.getNodeName());
domNode = domNode.getFirstChild();
System.out.println(domNode.getNodeName());

do {
    if (domNode.getNodeName().equals("b")) {
        ...
    } else if ...
        ...
} while ((domNode = domNode.getNextSibling()) != null);


You will notice the println statements I added to help figure out the
problem. The first one prints "a" as expected. I'd then expect the
first child to be "b", but instead it prints "#text". When I changed
the second println to print the node value instead of the name, it
printed a single whitespace character. Why isn't this getting the
Node at "b"?

Thanks,
Colin
 
 
 

Post by Jeff Higgi » Fri, 29 Jun 2007 19:50:28


try Element.getTagName()

 
 
 

Post by kernc » Sat, 30 Jun 2007 06:39:57


When I try to cast the Node to an Element, I get a runtime error:
Exception in thread "main" java.lang.ClassCastException:
com.sun.org.apache.xerces.internal.dom.DeferredTextImpl cannot be cast
to org.w3c.dom.Element
 
 
 

Post by Jeff Higgi » Sat, 30 Jun 2007 07:19:08


You're not on an Element node; you're on a Text node.
As you iterate over the child nodes you'll have to check
what type of node you're currently referencing.
See the table at:
<http://www.yqcomputer.com/>
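
A minimal sketch of that node-type check, assuming the XML is available
as a string. The class and method names (ElementChildren,
elementChildNames) are made up for illustration and are not part of the
original program; only the org.w3c.dom and javax.xml.parsers calls are
real API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ElementChildren {
    // Hypothetical helper: tag names of the root's direct Element children.
    // Text, comment and other non-Element nodes are skipped.
    static List<String> elementChildNames(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        List<String> names = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild();
                n != null; n = n.getNextSibling()) {
            // Check the type before casting; casting a Text node to
            // Element is what raises the ClassCastException.
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element child = (Element) n;
                names.add(child.getTagName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<a>\n<b>\n<c>foo</c>\n<d>foo</d>\n<d>foo</d>\n</b>\n<e>\n</e>\n</a>";
        System.out.println(elementChildNames(xml)); // prints [b, e]
    }
}
```

The same test can go at the top of the do/while loop in the original
code, so that the name comparisons only run for Element nodes.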
 
 
 

Post by Jeff Higgi » Sat, 30 Jun 2007 07:57:39


Also note that every Element node contains a Text node, so
Element <b> is probably the next sibling of the Text node you're
printing out. Also, when you print the value of this Text node, it
is equal to a single whitespace character because the parser collapses
all whitespace to a single space unless you tell it not to.
 
 
 

Post by Joshua Cra » Sun, 01 Jul 2007 05:46:50


Not necessarily. <br /> elements don't contain any text nodes, and
writing <a><b>Foo</b></a> gives <a> only one child: the <b> element.
 
 
 

Post by Jeff Higgi » Sun, 01 Jul 2007 06:16:33


Yes, thanks for correcting that.

<a><b/></a> Element node <a> contains Element node <b>

<a>
<b/>
</a> Element node <a> contains Text node and Element node <b>
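
A small sketch to check the two cases above, assuming a default
(non-validating) DocumentBuilderFactory. ChildCount and rootChildCount
are names invented for this example. Note that in the second case the
newline after <b/> becomes a Text node as well, so <a> ends up with
three children rather than two.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ChildCount {
    // Hypothetical helper: number of direct child nodes (of any type)
    // under the document element.
    static int rootChildCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // No whitespace between the tags: <b> is the only child.
        System.out.println(rootChildCount("<a><b/></a>"));     // prints 1
        // Line breaks around <b/>: Text, Element <b>, Text.
        System.out.println(rootChildCount("<a>\n<b/>\n</a>")); // prints 3
    }
}
```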
 
 
 

Post by Lew » Sun, 01 Jul 2007 07:26:19


I'm not as familiar with DOM as SAX, but isn't the whitespace ignorable?

--
Lew
 
 
 

Post by Jeff Higgi » Sun, 01 Jul 2007 08:45:12


Well, good question. One I haven't considered.
According to the DocumentBuilderFactory Javadoc for the method
setIgnoringElementContentWhitespace(boolean)

|quote|
Specifies that the parsers created by this factory must eliminate
whitespace in element content (sometimes known loosely as 'ignorable
whitespace') when parsing XML documents (see XML Rec 2.10). Note that
only whitespace which is directly contained within element content
that has an element only content model (see XML Rec 3.2.1) will be
eliminated. Due to reliance on the content model this setting requires
the parser to be in validating mode.
By default the value of this is set to false.
|unquote|

So, it looks like yes, if I've specified an "element only content model"
in my DTD or schema for the particular Element in question.

Does this sound right?
Thanks for the jog.
JH
 
 
 

Post by Jeff Higgi » Sun, 01 Jul 2007 21:49:05


import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class TestDomWhitespace
{
    public static void main(String argv[])
    {
        Document document = null;
        // Test instance with an internal DTD; a and b have
        // element-only content models.
        final String instance =
            "<?xml version='1.0' standalone='yes'?>" + "\n" +
            "<!DOCTYPE a [" + "\n" +
            "<!ELEMENT a (b , e)>" + "\n" +
            "<!ELEMENT b (c , d*)>" + "\n" +
            "<!ELEMENT c (#PCDATA)>" + "\n" +
            "<!ELEMENT d (#PCDATA)>" + "\n" +
            "<!ELEMENT e ANY>]>" + "\n" +
            "<a>" + "\n" +
            " <b>" + "\n" +
            " <c>foo</c>" + "\n" +
            " <d>foo</d>" + "\n" +
            " <d>foo</d>" + "\n" +
            " </b>" + "\n" +
            " <e></e>" + "\n" +
            "</a>" + "\n";
        System.out.println(instance);
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        // Passing true to the call below prints abe.
        // Passing false to the call below prints a#textb.
        factory.setIgnoringElementContentWhitespace(false);
        DocumentBuilder builder;
        try
        {
            builder = factory.newDocumentBuilder();
            document = builder.parse(new InputSource(
                new StringReader(instance)));
        }
        catch (ParserConfigurationException e)
        {
            e.printStackTrace();
        }
        catch (SAXException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        // Print the root, its first child, and that child's next sibling.
        Node domNode = document.getDocumentElement();
        System.out.print(domNode.getNodeName());
        domNode = domNode.getFirstChild();
        System.out.print(domNode.getNodeName());
        domNode = domNode.getNextSibling();
        System.out.print(domNode.getNodeName());
    }
}
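
For input that has no DTD (like the XML file in the first post), the
factory setting above cannot be relied on, since per the quoted Javadoc
it depends on the content model. One possible workaround, sketched here
under that assumption, is to remove whitespace-only Text nodes after
parsing. WhitespaceStripper, parse and stripWhitespaceText are
hypothetical names for this example. Be aware that this also drops
whitespace-only text inside mixed content, which may or may not be
acceptable for a given document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WhitespaceStripper {
    // Hypothetical helper: parse a string with a default, non-validating parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Recursively removes Text nodes that contain only whitespace.
    static void stripWhitespaceText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Fetch the sibling first, because removeChild detaches 'child'.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespaceText(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<a>\n<b>\n<c>foo</c>\n<d>foo</d>\n<d>foo</d>\n</b>\n<e>\n</e>\n</a>");
        stripWhitespaceText(doc.getDocumentElement());
        Node root = doc.getDocumentElement();
        Node first = root.getFirstChild();
        Node second = first.getNextSibling();
        // prints abe
        System.out.println(root.getNodeName() + first.getNodeName() + second.getNodeName());
    }
}
```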