How to get only hyperlinks by removing its tags an

Posted 2019-07-10 03:53

Question:

I want to get only:

http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/

and not all these:

<a href="http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/"></a>

I just want to apply this to my loop (section):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewClassssssss {
    public static void main(String[] args) throws IOException {
        Document doc =  Jsoup.connect("http://tamilblog.ishafoundation.org/page/3//").get();

        Elements section = doc.select("section#content");
        Elements article = section.select("article");
        Elements links = doc.select("a[href]");

        for (Element a : section) {
            //   System.out.println("Title : \n" + a.select("a").text());
            System.out.println(a.select("a[href]"));
        }

        System.out.println(links);
    }
}

Answer 1:

There are some problems in the code:

1. Invalid search scope

Elements links = doc.select("a[href]");

The line above selects every link in the whole document, not only the links inside the articles.

2. Invalid node used in loop

for (Element a : section) {
   // ...
}

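The for loop above iterates over the section elements rather than the links, so a.select("a[href]") returns every matching <a> element and println prints their full HTML instead of the URL alone.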
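
To illustrate point 2, here is a minimal sketch that loops over the links themselves and reads only the href attribute. It assumes jsoup is on the classpath; the class name HrefOnly and the inline HTML are made up for the example, so nothing is fetched from the network.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the original question.
public class HrefOnly {

    // Returns the raw href value of every link inside section#content.
    static List<String> hrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element a : doc.select("section#content a[href]")) {
            out.add(a.attr("href")); // attribute value only, no <a> markup
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<section id=\"content\"><article>"
                + "<a href=\"http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/\"></a>"
                + "</article></section>";
        for (String href : hrefs(html)) {
            System.out.println(href);
        }
    }
}
```

Printing an Element or Elements object gives its HTML, which is why the original loop showed whole tags; attr("href") returns just the attribute value.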

3. Repetitive calls to select method

Elements section = doc.select("section#content");
Elements article = section.select("article");
Elements links = doc.select("a[href]");

It's not necessary to perform a selection for each node in the hierarchy. Jsoup can navigate through it for you. Those three lines can be replaced with one line:

Elements links = doc.select("section#content article a");
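
The difference between points 1 and 3 can be checked offline. The sketch below assumes jsoup is on the classpath and uses a made-up page (the example.com URLs and the class name ScopedSelect are illustrative only). It compares a whole-document selection, the chained selection, and the single combined selector.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class; the HTML below is invented for illustration.
public class ScopedSelect {

    // One navigation link outside the content section, one article link inside it.
    static final String HTML =
            "<nav><a href=\"http://example.com/home\">Home</a></nav>"
            + "<section id=\"content\"><article>"
            + "<a href=\"http://example.com/post-1\">Post 1</a>"
            + "</article></section>";

    // Copies the href attribute of each selected element into a list.
    static List<String> hrefs(Elements links) {
        List<String> out = new ArrayList<>();
        for (Element a : links) {
            out.add(a.attr("href"));
        }
        return out;
    }

    // Point 1: selecting on the document matches links everywhere.
    static List<String> wholeDocument() {
        return hrefs(Jsoup.parse(HTML).select("a[href]"));
    }

    // Step-by-step selection through the hierarchy.
    static List<String> chained() {
        Document doc = Jsoup.parse(HTML);
        return hrefs(doc.select("section#content").select("article").select("a[href]"));
    }

    // Point 3: one descendant selector does the same job.
    static List<String> combined() {
        return hrefs(Jsoup.parse(HTML).select("section#content article a[href]"));
    }

    public static void main(String[] args) {
        System.out.println("whole document: " + wholeDocument());
        System.out.println("chained:        " + chained());
        System.out.println("combined:       " + combined());
    }
}
```

On this input the whole-document selection also picks up the navigation link, while the chained and combined forms return only the article link.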

SAMPLE CODE

Here is sample code that combines the three points above:

Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/").get();

for (Element a : doc.select("section#content article a")) {
    System.out.println("Title : \n" + a.text());
    System.out.println(a.absUrl("href")); // absUrl resolves the href against the page's base URI, so relative links are printed as absolute URLs
}

OUTPUT

Title : 

http://tamilblog.ishafoundation.org/kalyana-parisaga-isha-kaattupoo/
Title : 
இதயம் பேசுகிறது
http://tamilblog.ishafoundation.org/isha-pakkam/idhyam-pesugiradhu/
Title : 
வாழ்க்கை
http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/
Title : 
கல்யாணப் பரிசாக ஈஷா காட்டுப்பூ…
http://tamilblog.ishafoundation.org/kalyana-parisaga-isha-kaattupoo/
... (truncated for brevity)
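
The sample code calls absUrl("href") rather than attr("href"). As a rough sketch of the difference, assuming jsoup is on the classpath, the snippet below parses a made-up relative link with an explicit base URI. The class name AbsUrlDemo and the relative href are illustrative and do not come from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class comparing attr("href") with absUrl("href").
public class AbsUrlDemo {

    // Returns { raw href, absolute href } for the first link in the HTML.
    static String[] rawAndAbsolute(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element a = doc.select("a[href]").first();
        return new String[] { a.attr("href"), a.absUrl("href") };
    }

    public static void main(String[] args) {
        String html = "<a href=\"/nalvazhvu/vazhkai/\">link</a>";

        // With a base URI, the relative href is resolved to an absolute URL.
        String[] withBase = rawAndAbsolute(html, "http://tamilblog.ishafoundation.org/page/3/");
        System.out.println(withBase[0]);
        System.out.println(withBase[1]);

        // Without a base URI, absUrl cannot resolve the link and returns an empty string.
        String[] noBase = rawAndAbsolute(html, "");
        System.out.println("[" + noBase[1] + "]");
    }
}
```

When the document comes from Jsoup.connect(...).get(), jsoup sets the base URI to the fetched URL, so absUrl can resolve relative links without extra setup.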