Web Scraping, data mining, data extraction

Posted 2019-06-03 18:09

I am tasked with creating web scraping software, and I don't know where to even begin. Any help would be appreciated, even just telling me how this data is organized or what "type" of data layout the website is using, so that I can Google that term.

http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/Default/7330_FAC-delta_V2.4.1/7330_FAC-delta_V2.4.1-pq.dgm&node=Buildings.Angus_addition&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952

http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/network.dgm&node=Buildings.AERL&unique_id=75660a13-5145-42d5-b661-a50f328306c7&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952

Basically, I need to extract the "harmonic values" from this website. Specifically, I need the 9 numbers displayed on the second link. The numbers are not present in the HTML source; they just seem to update automatically every few seconds. I need to be able to extract these values in real time as they update. Even if that is not possible, I still need to show that such web scraping is impossible. I have not been given any APIs to the back end, and I do not know how their site receives the data.

Overall, ANY help would be appreciated, even if it's just some simple search terms to point me in the right direction. I am currently clueless when it comes to web scraping and data mining.

3 Answers
Answer by 我只想做你的唯一 · 2019-06-03 18:40

Web Scraping

Parsing HTML from a website is also known as screen scraping. It is the process of accessing an external website's publicly available information and processing it as required. For instance, if we want the average rating of the Nokia Lumia 1020 across different websites, we can scrape the ratings from each site and calculate the average in our own code. In general, whatever you can see as an ordinary "user" is public data, and you can scrape it easily with a library such as HTML Agility Pack.
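As a minimal sketch of that ratings example (shown here in Java with jsoup, an HTML parser comparable to the libraries listed below; the review URL and the span.rating selector are made up for illustration):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class RatingScraper {
        public static void main(String[] args) throws Exception {
            // Hypothetical review page; a real site needs its own URL and selectors.
            Document doc = Jsoup.connect("https://example.com/nokia-lumia-1020/reviews").get();

            double sum = 0;
            int count = 0;
            // Hypothetical selector for the elements that hold the rating numbers.
            for (Element rating : doc.select("span.rating")) {
                sum += Double.parseDouble(rating.text());
                count++;
            }
            if (count > 0) {
                System.out.printf("Average rating across %d reviews: %.2f%n", count, sum / count);
            }
        }
    }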

Try these:

ASP.NET : HTMLAgilityPack (open source library)

Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

PHP & CURL : WEB SCRAPING WITH PHP & CURL

Node.js : Screen Scraping with Node.js

YQL & Ajax : Screen scraping using YQL and AJAX

Answer by Anthone · 2019-06-03 18:46

The second link is pulling information from an API every few seconds. Using Google Chrome, you can inspect this with the developer tools: open the "Network" tab to see which requests are sent, then replicate one by right-clicking the request and choosing "Copy as cURL". That gives you something like the following, a cURL command that includes all the headers and POST data sent by the request. This is what the second link was calling:

curl 'http://utilsub.lbs.ubc.ca/ion/default.aspx/GetRTxmlData' -H 'Cookie: ASP.NET_SessionId=oq0qiwuqbb3g3453jvyysvjx' -H 'Origin: http://utilsub.lbs.ubc.ca' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Host: utilsub.lbs.ubc.ca' -H 'Accept-Language: de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4' -H 'User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36' -H 'Content-Type: application/json; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/network.dgm&node=Buildings.AERL&unique_id=75660a13-5145-42d5-b661-a50f328306c7&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' --data-binary $'{\'dgm\':\'x-pml:/diagrams/ud/network.dgm\',\'id\':\'75660a13-5145-42d5-b661-a50f328306c7\',\'node\':\'\'}' --compressed

The API returns XML wrapped in JSON.

You might want to use cURL with PHP as codeSpy said; you just have to set all the headers and POST data and replicate the request properly, otherwise the API won't respond to your request.
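As a rough sketch of replicating that captured request in code (shown here in Java with the built-in java.net.http.HttpClient, Java 11+, rather than PHP; the session cookie and the id value are copied from the cURL command above and would have to come from a live session, and the single-quoted body has been rewritten as standard JSON):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HarmonicsPoller {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Request body copied from the captured request; the id comes from the page URL.
            String body = "{\"dgm\":\"x-pml:/diagrams/ud/network.dgm\","
                    + "\"id\":\"75660a13-5145-42d5-b661-a50f328306c7\",\"node\":\"\"}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://utilsub.lbs.ubc.ca/ion/default.aspx/GetRTxmlData"))
                    .header("Content-Type", "application/json; charset=UTF-8")
                    .header("Accept", "application/json, text/javascript, */*; q=0.01")
                    .header("X-Requested-With", "XMLHttpRequest")
                    // The ASP.NET session cookie below is the captured (long-expired) one;
                    // a working scraper would first load default.aspx and reuse its cookie.
                    .header("Cookie", "ASP.NET_SessionId=oq0qiwuqbb3g3453jvyysvjx")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // The response is JSON whose payload is an XML string; parse the JSON first,
            // then the embedded XML, to pull out the nine harmonic values.
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }

To get the values "in real time", you would simply repeat this request every few seconds, the same way the page's own JavaScript does.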

Answer by Melony? · 2019-06-03 18:48

Try http://code.google.com/p/crawler4j/. It is very easy to use: you extend a single crawler class and configure it from a controller class (Controller.java in its examples).

You only need to specify the seed URLs, and for every page crawled it gives you back the plain text and the HTML in two separate fields.
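A minimal sketch of that setup (the exact class and method names differ slightly between crawler4j versions; this assumes a 4.x-style API, and the seed URL is just the site from the question):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Stay on the one site instead of following links across the whole web.
            return url.getURL().startsWith("http://utilsub.lbs.ubc.ca/");
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData data = (HtmlParseData) page.getParseData();
                String text = data.getText();  // plain text of the page
                String html = data.getHtml();  // raw HTML of the page
                System.out.println(page.getWebURL().getURL() + ": " + text.length() + " chars of text");
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("http://utilsub.lbs.ubc.ca/ion/default.aspx");
            controller.start(MyCrawler.class, 1);  // 1 crawler thread
        }
    }

Note that a crawler like this only sees the static HTML, so for the values that update every few seconds the AJAX/API approach in the previous answer is the one that applies.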
