Groovy XmlSlurper issue

Posted 2019-07-25 15:41

I want to use XmlSlurper to parse an HTML document that I read using HTTPBuilder. Initially I tried to do it this way:

def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
def xml = slurper.parse(response)

But it produces an exception:

java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

I read that a workaround is to provide locally cached DTD files, and I found a simple implementation of a class that should help here:

class CachedDTD {
/**
 * Return DTD 'systemId' as InputSource.
 * @param publicId
 * @param systemId
 * @return InputSource for locally cached DTD.
 */
  def static entityResolver = [
          resolveEntity: { publicId, systemId ->
            try {
              String dtd = "dtd/" + systemId.split("/").last()
              Logger.getRootLogger().debug "DTD path: ${dtd}"
              new org.xml.sax.InputSource(CachedDTD.class.getResourceAsStream(dtd))
            } catch (e) {
              //e.printStackTrace()
              Logger.getRootLogger().fatal "Fatal error", e
              null
            }
          }
  ] as org.xml.sax.EntityResolver

}

My package tree looks as shown below:

[Image: package tree]

I also slightly modified the code that parses the response, so it now looks like this:

def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
slurper.setEntityResolver(org.yuri.CachedDTD.entityResolver)
def xml = slurper.parse(response)

But now I'm getting a java.net.MalformedURLException. The DTD path logged by the CachedDTD entityResolver is org/yuri/dtd/xhtml1-transitional.dtd, and I can't get it working...
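
For readers comparing notes, here is a minimal sketch of a resolver that serves DTDs from the classpath and never hands the parser an InputSource wrapping a null stream. It is an illustration under stated assumptions, not a confirmed diagnosis of the exception above: the `dtdResolver` name and the fallback to an empty DTD are hypothetical, the `org/yuri/dtd/` folder is taken from the logged path, and the `groovy.xml.XmlSlurper` import assumes Groovy 3 or later (older versions import `groovy.util.XmlSlurper` by default). The three-argument constructor is used because recent Groovy versions reject DOCTYPE declarations by default.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; older versions: groovy.util.XmlSlurper (default import)
import org.xml.sax.EntityResolver
import org.xml.sax.InputSource

// Hypothetical resolver: serve the DTD from the classpath, or an empty DTD if it is missing.
def dtdResolver = { String publicId, String systemId ->
    String name = systemId.tokenize('/').last()
    // ClassLoader lookups are always relative to the classpath root (no leading slash).
    InputStream stream = Thread.currentThread().contextClassLoader
            .getResourceAsStream("org/yuri/dtd/" + name)
    // getResourceAsStream returns null when nothing is found; fall back to an empty DTD
    // (an assumption of this sketch) so the parser never falls back to the remote URL.
    stream != null ? new InputSource(stream) : new InputSource(new StringReader(''))
} as EntityResolver

def slurper = new XmlSlurper(false, false, true)   // not validating, not namespace aware, DOCTYPE allowed
slurper.setEntityResolver(dtdResolver)

def xml = slurper.parseText('''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Test</title></head><body><p>Hello</p></body></html>''')

println xml.body.p.text()
```

If the fallback branch is taken, entities that are declared only in the real DTD will not be available to the parser.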

2 Answers
贼婆χ
#2 · 2019-07-25 16:06

I was able to solve my parsing issue by using another XmlSlurper constructor:

public XmlSlurper(boolean validating, boolean namespaceAware, boolean allowDocTypeDeclaration)

like this:

def parser = new XmlSlurper(false, false, true)

In my XML case, disabling the validation (1st parameter false) and enabling the DOCTYPE declaration (3rd parameter true) did the trick.
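
A small self-contained sketch of that constructor in use is shown here. One addition is an assumption that goes beyond the answer: the `load-external-dtd` feature is a Xerces setting (also recognized by the parser bundled with the JDK) that tells a non-validating parser not to fetch the external DTD at all. The sample document is made up for illustration, and the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; older versions: groovy.util.XmlSlurper (default import)

// validating = false, namespaceAware = false, allowDocTypeDeclaration = true
def parser = new XmlSlurper(false, false, true)

// Assumption beyond the answer: ask a Xerces-based parser not to download the external DTD.
parser.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false)

// Hypothetical sample document carrying the same DOCTYPE as in the question.
def xml = parser.parseText('''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Test</title></head><body><p>Hello</p></body></html>''')

println xml.head.title.text()
```

With the DTD skipped, entities that are declared only in the DTD may not be resolved, so this suits documents that do not rely on them.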

Note:

女痞
#3 · 2019-07-25 16:07

There is an HTML parser, NekoHTML, that you can use in conjunction with XmlSlurper to address these problems:

http://sourceforge.net/projects/nekohtml/

Sample usage here:

http://groovy.codehaus.org/Testing+Web+Applications
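
As a rough sketch of how the two fit together: XmlSlurper accepts any SAX XMLReader, and NekoHTML provides one that tolerates ordinary, non-well-formed HTML. The Maven coordinates and version in `@Grab` are assumptions (check them against the version you actually use), the sample markup is made up, and the import assumes Groovy 3 or later.

```groovy
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')   // coordinates and version are assumptions
import groovy.xml.XmlSlurper   // Groovy 3+; older versions: groovy.util.XmlSlurper (default import)
import org.cyberneko.html.parsers.SAXParser

// Hand NekoHTML's lenient SAX parser to XmlSlurper instead of the default XML parser.
def slurper = new XmlSlurper(new SAXParser())

// Deliberately sloppy HTML: unclosed <p> and <br>.
def html = slurper.parseText('<html><head><title>Test</title></head><body><p>Hello<br>world</body></html>')

// List the element names as the parser reports them; NekoHTML may change their case.
println html.'**'.collect { it.name() }
println html.text()
```

Because NekoHTML normalizes element names, it is worth printing them once, as above, before writing GPath expressions against the result.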
