Java: parsing HTML elements generated by JS

Posted 2019-01-20 06:20

I'm very new to HTML parsing with Java. I have used Jsoup before to parse simple, static HTML, but I now need to parse a web page with dynamic elements. Below is the code I tried first; it cannot find the elements because they are added after the page has loaded. The page in question uses Google Maps with markers on it, and I'm attempting to scrape the images of these markers.

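    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public static void main(String[] args) {
        Document doc;
        try {
            // Fetches only the static HTML; no scripts are executed
            doc = Jsoup.connect("https://pokevision.com")
                    .userAgent(
                            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36")
                    .get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the request failed
        }

        // Select <img> elements whose src looks like an image file
        Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
        for (Element image : images) {
            System.out.println("src : " + image.attr("src"));
        }
    }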

Since this apparently cannot be done with Jsoup alone, what other libraries can I use to find the image sources? Example of an element I am attempting to select

1 answer
甜甜的少女心
#2 · 2019-01-20 06:46

The problem is that Jsoup retrieves the static source, exactly as the server delivers it to a browser, and does not execute JavaScript. What you want is the DOM after the JavaScript has run. For this, you can use HtmlUnit to load and render the page, then pass the resulting markup to Jsoup for parsing.

    // capture rendered page
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage("https://pokevision.com");
    // assumption: the markers are added asynchronously, so allow up to 5 s
    // for background scripts to finish; tune this value as needed
    webClient.waitForBackgroundJavaScript(5000);

    // convert to jsoup dom
    Document doc = Jsoup.parse(myPage.asXml());

    // extract data using jsoup selectors
    Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
    for (Element image : images) {
        System.out.println("src : " + image.attr("src"));
    }

    // clean up resources
    webClient.close();
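To see what the `img[src~=...]` selector actually keeps, here is a minimal sketch using only `java.util.regex`, with no Jsoup or HtmlUnit involved. As far as I know, Jsoup's `[attr~=regex]` matches when the pattern is found anywhere in the attribute value, so the sketch uses `find()` to mirror that. The class name `ImageSrcFilter`, its methods, and the sample paths are hypothetical and only serve as illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ImageSrcFilter {
    // Same regex as in the selector; (?i) makes the match case-insensitive.
    static final Pattern IMAGE_SRC = Pattern.compile("(?i)\\.(png|jpe?g|gif)");

    // find() succeeds if the pattern occurs anywhere in the value
    // (assumed to mirror how the [attr~=regex] selector behaves).
    static boolean looksLikeImage(String src) {
        return IMAGE_SRC.matcher(src).find();
    }

    // Keep only the src values that pass the check above.
    static List<String> keepImages(List<String> srcs) {
        return srcs.stream()
                .filter(ImageSrcFilter::looksLikeImage)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical src values, for illustration only.
        List<String> srcs = Arrays.asList(
                "/img/marker.PNG",
                "/img/photo.jpeg?v=2",
                "/js/app.js",
                "/docs/file.png.txt");
        System.out.println(keepImages(srcs));
        // prints [/img/marker.PNG, /img/photo.jpeg?v=2, /docs/file.png.txt]
    }
}
```

Note that the pattern is not anchored, so a value such as `/docs/file.png.txt` also passes. If that matters, a stricter pattern that requires the extension at the end of the path may be a better fit.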