Is there a library for extracting data from an HTML document?

Posted 2019-08-17 18:06

Question:

I would like to extract information from a web page. Unfortunately, the website (4chan) doesn't have a public API, as far as I know.

What is a good library to extract specific data from an HTML document? I prefer a free software library that works on UNIX systems.


Edit: Basically, I want to get posts and images from 4chan. The page isn't valid HTML (and doesn't have a doctype), so the parser shouldn't be too strict.

Answer 1:

What you are looking for is an HTML DOM parser.

This link to a previous question should help you out. Also check out this other question.
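To give a rough idea of what the DOM-parser approach looks like, here is a minimal sketch in Perl using HTML::TreeBuilder, which is forgiving about invalid markup. The URL and the choice to pull out image sources are just assumptions for the example, not details from the question:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# Fetch the page (the URL here is only a placeholder).
my $html = get('http://boards.4chan.org/g/')
    or die "Could not fetch the page\n";

# Build a DOM-like tree; HTML::TreeBuilder tolerates invalid HTML.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Print the source of every <img> tag found in the document.
for my $img ($tree->look_down(_tag => 'img')) {
    print $img->attr('src'), "\n" if defined $img->attr('src');
}

$tree->delete;  # free the tree's memory

Since look_down can match on any tag or attribute, the same call could be narrowed to whatever classes the board markup uses for posts, once those are known.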



Answer 2:

That is correct; there are lots of libraries for parsing HTML data. For example, if you use Perl, you can use HTML::Parse.
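As a small illustration of the event-driven style, here is a sketch using HTML::Parser, the core CPAN module that tree-building modules such as HTML::Parse sit on top of. The file name page.html and the choice to collect link targets are assumptions made only for this example:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Event-driven parsing: the handler fires once for every start tag.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            # Print the target of each link as it is encountered.
            print "$attr->{href}\n" if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

# page.html is a placeholder; it could be a saved copy of the board page.
$parser->parse_file('page.html') or die "Could not parse page.html\n";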

If you just want a quick result and are willing to use a system command, you can use:

lynx -dump http://4chan.org

or

links -dump http://4chan.org