I would like to extract information from a web page. Unfortunately, the website (4chan) doesn't have a public API, as far as I know.
What is a good library to extract specific data from an HTML document? I prefer a free software library that works on UNIX systems.
Edit: Basically, I want to get posts and images from 4chan. The pages aren't valid HTML (and don't have a doctype), so the parser needs to tolerate malformed markup rather than reject it.
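
To make the goal concrete, here is a minimal sketch of the kind of extraction I have in mind, written with Python's standard-library `html.parser`, which does not require a doctype or well-formed markup. The names `ImageCollector` and `extract_image_urls` and the sample markup are made up for illustration; they are not taken from the real site, and pulling out post text would depend on the site's actual tag and class names, which I haven't shown here.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper: records the src of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)


def extract_image_urls(html_text):
    """Return all <img> src values found in html_text, in document order."""
    parser = ImageCollector()
    parser.feed(html_text)
    parser.close()
    return parser.image_urls


# Made-up sample: no doctype, unclosed <p> and <div>, mixed quoting.
sample = '<body><p>first post<img src="/a.jpg"><div>second post<br><img src=\'/b.png\'></body>'
print(extract_image_urls(sample))
```

This only shows that a lenient, event-based parser can cope with sloppy markup. A library that builds a tree and supports queries would probably be more convenient for selecting whole posts.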