I need to scrape a web page using Java and I've read that regex is a pretty inefficient way of doing it and one should put it into a DOM Document to navigate it.
I've tried reading the documentation but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table in to an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample codes.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
If all you are doing is scraping a table into a datafile, regex will be just fine, and may be even better than using a DOM document. DOM documents will use up a lot of memory (especially for really large data tables) so you probably want a SAX parser for large documents.
Here is a working example using JTidy and the Web Page you provided, used to extract all file names from the table.
The result will be
[Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:]
as expected.Another cool tool that you can use is
Web Harvest
. It basically does everything I did above but using an XML file to configure the extraction pipeline.