I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's my plan, with my questions at each step:
1
Add the article list pages to url/seed.txt. Here's one problem: what I actually want indexed is the article pages, not the article list pages. But if I don't allow the list pages to be indexed, Nutch will do nothing, because the list pages are the entry points. So, how can I index only the article pages and not the list pages?
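To make the goal concrete, here is a minimal sketch of the distinction I'm after. It is plain Java, not Nutch code. The class name, the example.com URLs, and the URL pattern are all hypothetical; a real site would need its own pattern. The idea is that list pages are still fetched and followed, but only URLs that look like articles would be passed on to the index.

```java
import java.util.regex.Pattern;

public class ArticleUrlClassifier {
    // Hypothetical pattern: article pages look like /articles/<digits>.html.
    // This is an assumption for illustration, not something Nutch provides.
    private static final Pattern ARTICLE =
        Pattern.compile("^https?://[^/]+/articles/\\d+\\.html$");

    // Returns true only for URLs that match the article pattern.
    public static boolean isArticle(String url) {
        return ARTICLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/articles/list?page=1",  // list page (entry point)
            "http://example.com/articles/12345.html"    // article page
        };
        for (String url : urls) {
            String action = isArticle(url) ? "index" : "follow only";
            System.out.println(url + " -> " + action);
        }
    }
}
```

I assume a check like this would have to live somewhere in the indexing step, but I don't know where Nutch expects it.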
2
Write a plugin to parse the 'author', 'date', 'article body', 'headline', and possibly other information out of the HTML. The 'Parser' plugin interface in Nutch 2.1 is: Parse getParse(String url, WebPage page). The 'WebPage' class has some predefined attributes:
public class WebPage extends PersistentBase {
    // ...
    private Utf8 baseUrl;
    // ...
    private ByteBuffer content; // <== This becomes null in IndexFilter
    // ...
    private Utf8 title;
    private Utf8 text;
    // ...
    private Map<Utf8,Utf8> headers;
    private Map<Utf8,Utf8> outlinks;
    private Map<Utf8,Utf8> inlinks;
    private Map<Utf8,Utf8> markers;
    private Map<Utf8,ByteBuffer> metadata;
    // ...
}
As you can see, there are 5 maps where I could put my custom attributes. But 'headers', 'outlinks', and 'inlinks' don't seem to be meant for this. Maybe I could put that information into 'markers' or 'metadata'. Are they designed for this purpose?
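For illustration, this is roughly how I imagine using a map shaped like 'metadata'. It is only a sketch in plain Java: it uses String keys instead of Avro's Utf8 so that it runs without Nutch on the classpath, and the class name, the helper methods, and the "author" key are my own hypothetical names, not part of the Nutch API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class MetadataSketch {
    // Stores a string value as UTF-8 bytes under the given key.
    public static void put(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Reads the value back as a string, or returns null if the key is absent.
    public static String get(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key);
        if (buf == null) {
            return null;
        }
        // duplicate() keeps the stored buffer's position untouched,
        // so the same entry can be read more than once.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Stand-in for the WebPage 'metadata' map (String keys are an assumption).
        Map<String, ByteBuffer> metadata = new HashMap<>();
        put(metadata, "author", "Jane Doe");       // would be written in the parser
        System.out.println(get(metadata, "author")); // would be read at indexing time
    }
}
```

If 'metadata' is not meant to carry parsed fields from the parser to the indexing step, I'd like to know what the intended mechanism is.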
BTW, the Parser in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.
3
After the articles are indexed into Solr, another application can query them by 'date' and then store the article information in MySQL. My question here is: can Nutch store the articles directly in MySQL? Or can I write a plugin to customize the indexing behavior?
Is Nutch a good choice for my purpose? If not, can you suggest another good-quality framework or library? Thanks for your help.