Web Scraping, data mining, data extraction

Posted 2019-06-03 18:09

I am tasked with creating web scraping software, and I don't know where to even begin. Any help would be appreciated, even just telling me how this data is organized or what "type" of data layout the website is using, so that I can Google that term.

http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/Default/7330_FAC-delta_V2.4.1/7330_FAC-delta_V2.4.1-pq.dgm&node=Buildings.Angus_addition&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952

http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/network.dgm&node=Buildings.AERL&unique_id=75660a13-5145-42d5-b661-a50f328306c7&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952

Basically, I need to extract the "harmonic values" from this website. Specifically, I need the 9 numbers displayed on the second link. The numbers are not present in the HTML source; they just seem to update automatically every few seconds. I need to be able to extract these values in real time as they update. Even if that is not possible, I still need to show that such web scraping is impossible. I have not been given any APIs to the back end, and I do not know how their site receives the data.

Overall, ANY help would be appreciated, even if it's just some simple search terms to point me in the right direction. I am currently clueless when it comes to web scraping and data mining.

3 Answers
Answer by 我只想做你的唯一 · 2019-06-03 18:40

Web Scraping

Parsing HTML from a website is also known as screen scraping. It is the process of accessing an external website's publicly available information and processing it as required. For instance, if we want the average rating of the Nokia Lumia 1020 across different websites, we can scrape the ratings from each site and calculate the average in our own code. In general, whatever you can see as an ordinary "user" is public data, and you can scrape it easily with a library such as HTML Agility Pack.
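As a minimal sketch of that ratings example (shown here in Java with jsoup, an HTML parser comparable to the libraries listed below; the review URL and the span.rating selector are made up for illustration):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class RatingScraper {
        public static void main(String[] args) throws Exception {
            // Hypothetical review page; a real site needs its own URL and selectors.
            Document doc = Jsoup.connect("https://example.com/nokia-lumia-1020/reviews").get();

            double sum = 0;
            int count = 0;
            // Hypothetical selector for the elements that hold the rating numbers.
            for (Element rating : doc.select("span.rating")) {
                sum += Double.parseDouble(rating.text());
                count++;
            }
            if (count > 0) {
                System.out.printf("Average rating across %d reviews: %.2f%n", count, sum / count);
            }
        }
    }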

Try these:

ASP.NET : HTMLAgilityPack (open source library)

Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

PHP & CURL : WEB SCRAPING WITH PHP & CURL

Node.js : Screen Scraping with Node.js

YQL & Ajax : Screen scraping using YQL and AJAX

Answer by Anthone · 2019-06-03 18:46

The second link is pulling information from an API every few seconds. Using Google Chrome, you can inspect this with the developer tools: open the "Network" tab to see which requests are sent, then replicate one by right-clicking the request and choosing "Copy as cURL". That gives you something like the following, a cURL command that includes all the headers and POST data sent by the request. This is what the second link was calling:

curl 'http://utilsub.lbs.ubc.ca/ion/default.aspx/GetRTxmlData' -H 'Cookie: ASP.NET_SessionId=oq0qiwuqbb3g3453jvyysvjx' -H 'Origin: http://utilsub.lbs.ubc.ca' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Host: utilsub.lbs.ubc.ca' -H 'Accept-Language: de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4' -H 'User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36' -H 'Content-Type: application/json; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/network.dgm&node=Buildings.AERL&unique_id=75660a13-5145-42d5-b661-a50f328306c7&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' --data-binary $'{\'dgm\':\'x-pml:/diagrams/ud/network.dgm\',\'id\':\'75660a13-5145-42d5-b661-a50f328306c7\',\'node\':\'\'}' --compressed

The API returns XML wrapped in JSON.

You might want to use cURL with PHP as codeSpy said; you just have to set all the headers and POST data and replicate the request properly, otherwise the API won't respond to your request.
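As a rough sketch of replicating that captured request in code (shown here in Java with the built-in java.net.http.HttpClient, Java 11+, rather than PHP; the session cookie and the id value are copied from the cURL command above and would have to come from a live session, and the single-quoted body has been rewritten as standard JSON):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HarmonicsPoller {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Request body copied from the captured request; the id comes from the page URL.
            String body = "{\"dgm\":\"x-pml:/diagrams/ud/network.dgm\","
                    + "\"id\":\"75660a13-5145-42d5-b661-a50f328306c7\",\"node\":\"\"}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://utilsub.lbs.ubc.ca/ion/default.aspx/GetRTxmlData"))
                    .header("Content-Type", "application/json; charset=UTF-8")
                    .header("Accept", "application/json, text/javascript, */*; q=0.01")
                    .header("X-Requested-With", "XMLHttpRequest")
                    // The ASP.NET session cookie below is the captured (long-expired) one;
                    // a working scraper would first load default.aspx and reuse its cookie.
                    .header("Cookie", "ASP.NET_SessionId=oq0qiwuqbb3g3453jvyysvjx")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // The response is JSON whose payload is an XML string; parse the JSON first,
            // then the embedded XML, to pull out the nine harmonic values.
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }

To get the values "in real time", you would simply repeat this request every few seconds, the same way the page's own JavaScript does.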

Answer by Melony? · 2019-06-03 18:48

Try http://code.google.com/p/crawler4j/. It is very easy to use: you extend a single crawler class and configure it from a controller class (Controller.java in its examples).

You only need to specify the seed URLs, and for every page crawled it gives you back the plain text and the HTML in two separate fields.
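A minimal sketch of that setup (the exact class and method names differ slightly between crawler4j versions; this assumes a 4.x-style API, and the seed URL is just the site from the question):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Stay on the one site instead of following links across the whole web.
            return url.getURL().startsWith("http://utilsub.lbs.ubc.ca/");
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData data = (HtmlParseData) page.getParseData();
                String text = data.getText();  // plain text of the page
                String html = data.getHtml();  // raw HTML of the page
                System.out.println(page.getWebURL().getURL() + ": " + text.length() + " chars of text");
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("http://utilsub.lbs.ubc.ca/ion/default.aspx");
            controller.start(MyCrawler.class, 1);  // 1 crawler thread
        }
    }

Note that a crawler like this only sees the static HTML, so for the values that update every few seconds the AJAX/API approach in the previous answer is the one that applies.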
