Jsoup: parsing a page when the URL is known

Posted 2019-09-01 07:01

I'm stuck on what is, for me, a very big problem. I'm parsing this page, http://multiplayer.it/articoli/, which lists some articles. As you can see, there is some information I can parse: the title, the date of the article, the comments, and a short preview of the article.

THE GOAL: When I click an article in the list I parsed (that part already works; the list shows the information described above), I want to open the article itself and see its content. Example: if I click the first article right now, it takes me to this URL: http://multiplayer.it/notizie/127771-peter-moore-getta-acqua-sul-fuoco-e-descrive-nintendo-come-un-grande-partner-per-ea.html, which has all the content I need to view. The application has to do the same.

THE PROBLEM: I don't know how to do it. By parsing the link of each post, though, I can get the post's absolute URL. I can parse it this way:

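try {
    Document doc = Jsoup.connect(BLOG_URL).get();
    // Select the links inside the <h2> headings of div.col-1-1.
    Elements links = doc.select("div.col-1-1 h2 a[href]");

    for (Element sezione : links) {
        // The "abs:" prefix resolves the href against the page's base URL.
        Log.d("Links", sezione.attr("abs:href"));
    }
} catch (Exception e) {
    // Pass the exception along so its stack trace is logged as well.
    Log.e("ERROR", "Parsing Error", e);
}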

This logs the absolute href of each article.

QUESTION

Knowing the href, is it possible to parse the content of each page (the 'p' tags)? Thanks

OnClick method

lista.setOnItemClickListener(new OnItemClickListener() {

    @Override
    public void onItemClick(AdapterView<?> parent, View view,
            int position, long id) {
        // What here?
    }
});

1 answer
冷血范
#2 · 2019-09-01 07:20

jsoup does not execute JavaScript, so it will not handle dynamic actions on a web page. For those you need a library that can, HtmlUnit being one example.

Let's say you have all the links stored in a Java Collection such as an ArrayList. The method below handles one article; it can be looped over to fetch the contents of every URL on your page at runtime:

Using HtmlUnit

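// Imports for HtmlUnit 2.x; 3.x moved these packages to org.htmlunit.
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;
import java.net.URL;
import java.util.List;

public static void main(String... args)
        throws FailingHttpStatusCodeException, IOException {
    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);

    WebRequest request = new WebRequest(new URL("http://multiplayer.it/articoli/"));

    // Tolerate script errors; allow 10 s for JavaScript and for the connection.
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.setJavaScriptTimeout(10000);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setTimeout(10000);

    HtmlPage page = webClient.getPage(request);
    webClient.waitForBackgroundJavaScript(10000);

    System.out.println("Current page: Articoli videogiochi - Multiplayer.it");

    // Current page:
    // Title=Articoli videogiochi - Multiplayer.it
    // URL=http://multiplayer.it/articoli/

    // Find the first anchor whose text contains the article title.
    List<HtmlAnchor> anchors1 = page.getAnchors();
    HtmlAnchor link2 = null;
    for (HtmlAnchor anchor : anchors1) {
        if (anchor.asText().indexOf("Dead Rising 3: Operation Broken Eagle") > -1) {
            link2 = anchor;
            break;
        }
    }
    if (link2 == null) {
        // Avoid a NullPointerException when the article is no longer listed.
        System.out.println("Anchor not found on the page");
        return;
    }
    // Follow the link; click() returns the page it leads to.
    page = link2.click();

    System.out.println("Current page: Dead Rising 3: Operation Broken Eagle - Recensione - Xbox On...");

    // Current page:
    // Title=Dead Rising 3: Operation Broken Eagle - Recensione - Xbox On...
    // URL=http://multiplayer.it/recensioni/127745-dead-rising-3-operation-broken-eagle-una-delle-storie-di-los-perdidos.html

    webClient.waitForBackgroundJavaScript(10000);

    // Print the text of every <p> element on the article page.
    DomNodeList<DomElement> paras = page.getElementsByTagName("p");
    for (DomElement el : paras) {
        System.out.println(el.asText());
    }
}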

The code above prints the text of every <p> element on the page it lands on. Below is a screenshot of the output:

[image: screenshot of the output]

The code block above loops over all the anchor tags on the web page; I pick one specific anchor and follow it to get the resulting content:

List<HtmlAnchor> anchors1 = page.getAnchors();
HtmlAnchor link2 = null;
for (HtmlAnchor anchor : anchors1) {
    if (anchor.asText().indexOf("Dead Rising 3: Operation Broken Eagle") > -1) {
        link2 = anchor;
        break;
    }
}

You will want to write the appropriate logic to parse all the dynamic links on your page and display their contents.
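One way to structure that logic is sketched below. It is only an illustration under assumptions: ArticleCrawler, collect and the fetchParagraphs function are hypothetical names, not part of HtmlUnit or jsoup. The fetcher stands in for the page-loading code, so a real one would wrap the HtmlUnit calls and rethrow IOException as an unchecked exception. The stub used in main returns placeholder text and never touches the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not part of HtmlUnit or jsoup): visits each article
// URL and records the paragraph texts returned for it.
final class ArticleCrawler {

    private ArticleCrawler() {}

    // Calls fetchParagraphs once per URL, in order, and maps each URL to the
    // paragraphs it returned. A URL whose fetch throws gets an empty list.
    static Map<String, List<String>> collect(
            List<String> urls, Function<String, List<String>> fetchParagraphs) {
        // LinkedHashMap keeps the URLs in the order they were visited.
        Map<String, List<String>> contents = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                contents.put(url, new ArrayList<>(fetchParagraphs.apply(url)));
            } catch (RuntimeException e) {
                // One broken page should not abort the whole run.
                contents.put(url, new ArrayList<String>());
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://multiplayer.it/notizie/127771-peter-moore-getta-acqua-sul-fuoco-e-descrive-nintendo-come-un-grande-partner-per-ea.html",
                "http://multiplayer.it/recensioni/127745-dead-rising-3-operation-broken-eagle-una-delle-storie-di-los-perdidos.html");

        // Stub fetcher (assumption): placeholder text instead of a real page load.
        Function<String, List<String>> stub =
                url -> Arrays.asList("placeholder paragraph for " + url);

        for (Map.Entry<String, List<String>> entry : collect(urls, stub).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Passing the fetcher in as a parameter keeps the loop independent of the library that loads the page, so the same loop can be exercised offline with a stub and later pointed at the real page-loading code.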

EDIT:

You can also try generating these scripts with the htmlunitscripter Firefox plugin and then customize them to your needs.
