I have been tasked with creating web scraping software, and I don't know where to begin. Any help would be appreciated. Even knowing how this data is organized, or what "type" of data layout the website uses, would help, because I could then search Google for that term.
Basically, I need to extract the "harmonic values" from this website. Specifically, I need the 9 numbers displayed on the second link. The numbers are not in the page's HTML; they just seem to update automatically every few seconds. I need to be able to extract these values in real time as they update. If that is not possible, I still need to show that this kind of web scraping is impossible. I have not been given any APIs for the back end, and I do not know how their site receives the data.
Overall, ANY help would be appreciated, even if it's just some simple search terms to point me in the right direction. I am currently clueless about web scraping and data mining.
Web Scraping
Parsing HTML from a website is also called screen scraping. It is the process of accessing information on an external website (the information must be public data) and processing it as required. For instance, to get the average rating of the Nokia Lumia 1020 across different websites, we can scrape the rating from each site and calculate the average in our code. In general, whatever you can see as public data as an ordinary user, you can scrape easily with HTML Agility Pack.
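To make the ratings example concrete, here is a minimal sketch of the idea. It uses Python's standard-library HTML parser rather than HTML Agility Pack, and the two page snippets and the "rating" class name are made up for illustration; real sites will mark up their ratings differently.

```python
from html.parser import HTMLParser
from statistics import mean


class RatingParser(HTMLParser):
    """Collects the text of every element whose class attribute is "rating"."""

    def __init__(self):
        super().__init__()
        self._in_rating = False
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        # "rating" is a hypothetical class name; inspect the real page to find yours.
        if dict(attrs).get("class") == "rating":
            self._in_rating = True

    def handle_data(self, data):
        # Only keep text that appears inside a rating element.
        if self._in_rating and data.strip():
            self.ratings.append(float(data.strip()))

    def handle_endtag(self, tag):
        self._in_rating = False


def extract_ratings(html):
    parser = RatingParser()
    parser.feed(html)
    return parser.ratings


# Made-up snippets standing in for HTML downloaded from two different sites.
page_a = '<div><span class="rating">4.5</span></div>'
page_b = '<p>Score: <b class="rating">3.5</b> out of 5</p>'

all_ratings = extract_ratings(page_a) + extract_ratings(page_b)
average = mean(all_ratings)
print(average)
```

This only works when the value is actually present in the HTML the server sends. If the page fills the numbers in later with JavaScript, the parser will never see them.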
Try these:
ASP.NET : HTMLAgilityPack (open source library)
Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET
PHP & CURL : WEB SCRAPING WITH PHP & CURL
Node.js : Screen Scraping with Node.js
YQL & Ajax : Screen scraping using YQL and AJAX
The second link pulls its information from an API every few seconds. In Google Chrome you can inspect this with the developer tools: open them and click the "Network" tab. You will then see which requests are sent, and you can easily replicate one by right-clicking the request -> Copy as cURL. This gives you a cURL command that includes all the headers and POST data sent with the request. This is what the second link was calling:
The API returns XML wrapped in JSON.
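Unwrapping a response like that takes two steps: decode the JSON, then parse the XML string found inside it. A rough sketch follows; the field name "d" and the XML tag names are assumptions made up for this example, so check the real response in the Network tab for the actual names.

```python
import json
import xml.etree.ElementTree as ET

# Made-up response body. The "d" field and the <harmonics>/<value> tags
# are placeholders, not the real API's format.
response_body = json.dumps(
    {"d": "<harmonics><value>1.2</value><value>0.8</value><value>0.3</value></harmonics>"}
)


def parse_values(body):
    outer = json.loads(body)          # step 1: decode the JSON envelope
    root = ET.fromstring(outer["d"])  # step 2: parse the XML string inside it
    return [float(element.text) for element in root.iter("value")]


values = parse_values(response_body)
print(values)
```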
You might want to use cURL with PHP, as codeSpy said. You just have to set all the headers and POST data so that you replicate the request properly; otherwise the API won't respond to your request.
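The same replication can be done in any language with an HTTP client. Here is a sketch in Python using only the standard library instead of PHP and cURL. The URL, header values, and POST payload are all placeholders; copy the real ones from the request you see in the Network tab.

```python
import json
import time
import urllib.request

# Placeholder values only; replace them with what the browser actually sends.
API_URL = "https://example.com/api/harmonics"
HEADERS = {
    "Content-Type": "application/json; charset=UTF-8",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0",
}
POST_DATA = {"deviceId": 1}


def build_request(url=API_URL, headers=HEADERS, payload=POST_DATA):
    """Build a POST request carrying the same headers and body as the browser's."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def fetch_once(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def poll(interval_seconds=5, max_polls=3):
    """Fetch the values repeatedly, roughly as often as the page refreshes them."""
    for _ in range(max_polls):
        print(fetch_once(build_request()))
        time.sleep(interval_seconds)


# Building the request does not touch the network; poll() does.
request = build_request()
print(request.get_method(), request.full_url)
```

Call poll() once the placeholders are filled in. If the API still refuses to answer, compare your headers and body against the copied cURL command one by one, since a missing cookie or header is the usual cause.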
Try http://code.google.com/p/crawler4j/. It is very easy to use; you only have to override one class, Controller.java.
You only need to specify the seeds, and it returns the text and the HTML data in two variables for every website crawled.