I'm trying to parse some HTML returned by an external system with XOM. The HTML looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body>
<div>
Help I am trapped in a fortune cookie factory
</div>
</body>
</html>
(Actually it's significantly messier, but it has this DOCTYPE declaration and these namespace and language declarations, and the HTML above exhibits the same problem as the real HTML.)
What I want to do is extract the content of the <div>, but the namespace declaration seems to be confusing XPath. If I strip out the namespace declaration (by hand, from the file), the following code finds the <div>, no problem:
Document document = ...
Nodes divs = document.query("//div");
But with the namespace, the returned Nodes has a size of 0.
All right, how about if I strip the namespace programmatically?
Element rootElement = document.getRootElement();
rootElement.removeNamespaceDeclaration(rootElement.getNamespacePrefix());
...looks like it should work, but does nothing. From the javadoc:
This method only removes additional namespaces added with
addNamespaceDeclaration.
Okay, I thought, I'll provide the namespace to the query:
XPathContext context =
XPathContext.makeNamespaceContext(document.getRootElement());
Nodes divs = document.query("//div", context);
Size still zero.
How about constructing the namespace context by hand?
XPathContext context = new XPathContext(
rootElement.getNamespacePrefix(), rootElement.getNamespaceURI());
Nodes divs = document.query("//div", context);
The XPathContext constructor blows up with:
nu.xom.NamespaceConflictException:
XPath expressions do not use the default namespace
So, I'm looking for either:
- a way to make this query work, or
- a way to programmatically strip the namespace declarations, or
- an explanation of the correct approach, assuming both of these are wrong.
Update: Based on Lev Levitsky's answer and the Jaxen FAQ I came up with the following hack:
XPathContext context = new XPathContext(
"foo",
document.getRootElement().getNamespaceURI());
Nodes divs = document.query("//foo:div", context);
This still seems a bit demented to me, but I guess it's the way Jaxen wants you to do things.
Update #2: As noted below and all over the Internet, this isn't Jaxen's fault; it's just XPath being XPath.
So, while this hack works, I would still like a way to strip the namespace declaration. Preferably without going as far as XSLT.
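For reference, one XSLT-free way to strip namespaces is to walk the tree and move every element into no namespace. The sketch below is only an illustration under assumptions: it uses the JDK's built-in DOM (Document.renameNode) rather than XOM so that it runs without extra dependencies, the class and method names are invented, and the DOCTYPE is left out of the sample so the parser does not try to fetch the DTD. In XOM the analogous step would presumably be calling setNamespaceURI("") on each Element, which is not tested here.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceStripper {
    // Same shape as the markup in the question, minus the DOCTYPE.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Recursively moves every element under node into no namespace. */
    static void stripNamespaces(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Document owner = node.getOwnerDocument();
            // A null namespace URI plus the bare local name means "no namespace".
            Element element = (Element) owner.renameNode(node, null, node.getLocalName());
            // Drop xmlns / xmlns:* declarations; other attributes (xml:lang, lang) stay.
            NamedNodeMap attributes = element.getAttributes();
            for (int i = attributes.getLength() - 1; i >= 0; i--) {
                Attr attribute = (Attr) attributes.item(i);
                if (XMLConstants.XMLNS_ATTRIBUTE_NS_URI.equals(attribute.getNamespaceURI())) {
                    element.removeAttributeNode(attribute);
                }
            }
            node = element;
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // captured first in case the child is replaced
            stripNamespaces(child);
            child = next;
        }
    }

    static int count(Document doc, String expression) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XHTML);
        stripNamespaces(doc);
        // After stripping, the plain unprefixed query finds the <div>.
        System.out.println(count(doc, "//div"));
    }
}
```

The trade-off is that the document is modified in place, so anything downstream that expects XHTML-namespaced elements will no longer see them.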
You can write:
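One query that fits here ignores namespaces altogether by matching on local-name(). The following is a minimal sketch under assumptions: it uses the JDK's built-in javax.xml.xpath instead of XOM so that it runs without extra dependencies, the class and method names are invented for illustration, and the DOCTYPE is omitted from the sample so the parser does not fetch the DTD. Since XOM evaluates XPath 1.0 expressions, the same expression string should also be usable with document.query(...).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LocalNameQuery {
    // Same shape as the markup in the question, minus the DOCTYPE.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns how many nodes the XPath expression selects in doc. */
    static int count(Document doc, String expression) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XHTML);
        // An unprefixed name only matches elements in no namespace: prints 0.
        System.out.println(count(doc, "//div"));
        // Matching on the local name ignores the namespace: prints 1.
        System.out.println(count(doc, "//*[local-name()='div']"));
    }
}
```

The downside of local-name() is that it matches a div in any namespace, which is usually acceptable for a single-vocabulary document like this one.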
You should either specify the namespace directly or use prefixes that are mapped to the respective namespaces (I guess that is what NamespaceContext is for, but there are no prefixes in your query). Unfortunately, I don't know how it's implemented in Java, but I can provide a Python example if it helps.
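For what it's worth, both options can be sketched in Java roughly as follows. This is an illustration under assumptions, not XOM code: it uses the JDK's javax.xml.xpath and javax.xml.namespace.NamespaceContext so that it runs without extra dependencies, the prefix "x" and the class and method names are arbitrary choices, and the DOCTYPE is omitted from the sample so the parser does not fetch the DTD. The first expression names the namespace directly through namespace-uri(); the second relies on a prefix that the context maps to the XHTML namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixedQuery {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static final String XHTML =
        "<html xmlns=\"" + XHTML_NS + "\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    /** Maps the arbitrarily chosen prefix "x" to the XHTML namespace. */
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "x" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                ? Collections.singletonList("x").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the trimmed text of the first match, or null if nothing matches. */
    static String firstText(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(XHTML_CONTEXT);
        NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XHTML);
        // Option 1: name the namespace directly in the expression.
        System.out.println(firstText(doc,
            "//*[local-name()='div' and namespace-uri()='" + XHTML_NS + "']"));
        // Option 2: use a prefix that the namespace context resolves.
        System.out.println(firstText(doc, "//x:div"));
    }
}
```

The prefix in the query does not have to match anything in the document (which here uses the default namespace with no prefix at all); only the namespace URI it resolves to matters.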