Using Java, how can I extract all the links from a given web page?
You would probably need to use regular expressions on the HTML link tags, matching from <a href=...> to </a>.
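A minimal sketch of that idea (the pattern is illustrative and assumes double-quoted href attributes; it is not robust against arbitrary HTML):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: captures the href value of each anchor tag
Pattern linkPattern = Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
Matcher matcher = linkPattern.matcher(html); // html holds the page source
while (matcher.find()) {
    System.out.println(matcher.group(1)); // the URL inside href="..."
}
```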
You can use the HTML Parser library to achieve this:
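For example, something along these lines (a sketch, assuming the org.htmlparser API):

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public static void printLinks(String url) throws ParserException {
    Parser parser = new Parser(url);
    // Collect every <a> tag in the page
    NodeList anchors = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
    for (int i = 0; i < anchors.size(); i++) {
        LinkTag link = (LinkTag) anchors.elementAt(i);
        System.out.println(link.getLink()); // the href value
    }
}
```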
Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner; both are similar and can parse even malformed HTML 4.0 syntax, and then you can use the popular HTML DOM parsing methods like getElementsByName("a"). In jsoup it is even cooler: you can simply select all links with doc.select("a[href]") and then read the details of each one, as in the sketch below.
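A minimal sketch of that, using the jsoup API:

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static void printLinks(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    for (Element link : doc.select("a[href]")) {   // every <a> element with an href
        System.out.println(link.attr("abs:href")); // resolved absolute URL
        System.out.println(link.text());           // visible link text
    }
}
```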
Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax
The selectors have the same syntax as jQuery; if you know jQuery function chaining, you will certainly love this.

EDIT: In case you want more tutorials, you can try out this one made by mkyong:
http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
Either use a regular expression and the appropriate classes, or use an HTML parser. Which one you want depends on whether you need to handle the whole web, or just a few specific pages whose layout you know and can test against.
A simple regex which would match 99% of pages could be this:
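For example (an illustrative pattern, assuming double-quoted attribute values):

```
<a\s+[^>]*href="([^"]*)"[^>]*>
```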
You can edit it to match more, be more standards-compliant, etc., but in that case you would want a real parser. If you are only interested in the href="..." value and the text in between, you can also use this regex:
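For instance (again illustrative, with the same assumptions):

```
<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>
```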
And access the link part with .group(1) and the text part with .group(2).
This simple example seems to work, using a regex from here
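A sketch of that example, using an illustrative pattern in place of the linked one (group(1) is the href value, group(2) the anchor text):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Illustrative pattern: group(1) = href value, group(2) = anchor text
    private static final Pattern LINK_PATTERN = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK_PATTERN.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}
```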
And if you need it, this seems to work to get the HTML of a URL as well, returning null if it can't be grabbed. It works fine with https URLs too.
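A sketch of such a helper, assuming a plain java.net.URL stream (HTTPS is handled transparently; the method name getHtml is made up here):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Returns the page source, or null if the page can't be grabbed
public static String getHtml(String address) {
    StringBuilder html = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
            new URL(address).openStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
    } catch (Exception e) {
        return null; // malformed URL, connection failure, etc.
    }
    return html.toString();
}
```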