I need a powerful web scraping library for mining content from the web. It can be paid or free; either is fine for me. Please suggest a library, or a better way to mine the data and store it in my preferred database. I have searched but haven't found a good solution, so I would appreciate suggestions from experts.
Scraping is fairly easy: you just have to parse the content you download and extract all the associated links.
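As a rough illustration of the "parse the content and get the links" step, here is a minimal sketch in Python using only the standard library. The class name `LinkCollector`, the helper `extract_links`, and the example URLs are made up for this illustration; they do not come from any particular scraping library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Hypothetical helper: return all link targets found in `html`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
print(extract_links(page, "http://example.com/index.html"))
# ['http://example.com/about', 'http://example.org/x']
```

A crawler would feed each returned URL back into the download step, keeping a set of already-visited URLs so it does not loop.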
The most important piece though is the part that processes the HTML. Because most browsers don't require the cleanest (or standards-compliant) HTML in order to be rendered, you need an HTML parser that is going to be able to make sense of HTML that is not always well-formed.
I recommend you use the HTML Agility Pack for this purpose. It does very well at handling non-well-formed HTML, and provides an easy interface for you to use XPath queries to get nodes in the resulting document.
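To show why a tolerant parser matters, here is a small sketch that pulls cell text out of a table whose closing tags are missing. It uses Python's standard `html.parser` as a stand-in, not the HTML Agility Pack, and it does not offer XPath; the `CellText` class and `cell_texts` helper are invented for this example.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collects the text of each <td>, even when closing tags are missing."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> implicitly ends the previous one.
            self.cells.append("")
            self._in_cell = True
        elif tag == "tr":
            self._in_cell = False

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(html):
    """Hypothetical helper: return the stripped text of every table cell."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# No </td> or </tr> anywhere, which browsers accept without complaint.
broken = "<table><tr><td>Alice<td>30<tr><td>Bob<td>25</table>"
print(cell_texts(broken))
# ['Alice', '30', 'Bob', '25']
```

A strict XML parser would reject this input outright, which is the problem a lenient HTML parser is meant to solve.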
Beyond that, you just need to pick a data store to hold your processed data (any database technology will do) and a way to download content from the web. For the latter, .NET provides two high-level mechanisms: the WebClient class and the HttpWebRequest/HttpWebResponse classes.
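The download-and-store half can be sketched in a few lines. This is an assumption-laden Python analogue rather than .NET code: `urllib.request.urlopen` plays the role of WebClient, SQLite stands in for "any database", and the `pages` table and the function names are invented for the example.

```python
import sqlite3
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, body):
    # INSERT OR REPLACE so re-scraping a URL updates the stored copy.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
        )


# Demonstrated with literal bodies so the example runs without network access;
# in real use the body would come from fetch(url).
conn = open_store()
save_page(conn, "http://example.com/", "<html>first</html>")
save_page(conn, "http://example.com/", "<html>second</html>")
print(conn.execute("SELECT url, body FROM pages").fetchall())
# [('http://example.com/', '<html>second</html>')]
```

Keying the table on the URL is one possible design choice; it keeps a single, latest copy per page. If you need history, a separate timestamp column and a composite key would be the usual alternative.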
My Advice:
You could look around for an HTML parser and use it to parse information out of sites. Then all you need to do is save that data into your database however you see fit.
I've written my own scraper a few times. It's pretty easy, and it lets you customize the data that is saved.
Data Mining Tools
If you really just want to get a tool to do this then you should have no problem finding some.
For simple websites (plain HTML only), Mechanize works really well and is fast. For sites that use JavaScript, AJAX, or even Flash, you need a real browser solution such as iMacros.