I am stuck with a scraping task in my project.
I want to grab the data from the link in $html: all the table content in the tr and td elements. Here I am trying to grab the links, but it only shows javascript: self.close()
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
foreach($html->find('a') as $element)
echo $element->href . '<br>';
?>
In the case of the second site, you have a table with several TR elements, and you want to catch the first two TD children of each TR.
By inspecting the source code you see something like this:
So you just grab all the TR's
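As a rough sketch of that step: the snippet below collects the first two TD cells of every TR. It uses PHP's bundled DOMDocument and DOMXPath rather than simple_html_dom so that it runs without extra files. The sample markup and the function name firstTwoCells() are assumptions made for illustration; the real page will differ.

```php
<?php
// Hypothetical sample markup standing in for the real page source.
$html = '<table>
  <tr><td>Alpine</td><td>Eastern Time</td><td>extra</td></tr>
  <tr><td>Bergenfield</td><td>Eastern Time</td><td>extra</td></tr>
</table>';

// Returns one [first cell, second cell] pair per table row.
function firstTwoCells(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML rarely validates, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        // Only direct TD children of this TR.
        $cells = $xpath->query('./td', $tr);
        if ($cells->length < 2) {
            continue; // skip header or spacer rows
        }
        $rows[] = [
            $cells->item(0)->textContent,
            $cells->item(1)->textContent,
        ];
    }
    return $rows;
}

// Print each row as a naive comma-separated line.
foreach (firstTwoCells($html) as $row) {
    echo implode(',', $row), "\n";
}
```

With simple_html_dom the same idea would be a loop over $html->find('tr') that reads the first two td children of each row.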
You will notice that the CSV text is "dirty". That is because the actual text is:
So to have "Alpine" and "Eastern Time", you have to replace
with something like
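A minimal sketch of such a cleanup, assuming the "dirt" consists of HTML entities such as &amp;nbsp; around the cell text. The helper name cleanCell() is hypothetical. Note that a decoded non-breaking space is not removed by trim() on its own, so it is converted to a plain space first.

```php
<?php
// Hypothetical helper: turns raw cell text into a clean CSV value.
function cleanCell(string $raw): string
{
    // Decode entities such as &nbsp; and &amp; into UTF-8 characters.
    $text = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // A decoded &nbsp; is U+00A0 (bytes C2 A0 in UTF-8); make it a normal space.
    $text = str_replace("\xC2\xA0", ' ', $text);
    // Now ordinary trimming removes the leading and trailing padding.
    return trim($text);
}

echo cleanCell('&nbsp;&nbsp;Alpine'), "\n";
echo cleanCell('Eastern&nbsp;Time&nbsp;'), "\n";
```

If the page is served in a character set other than UTF-8, the third argument of html_entity_decode() has to match it.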
Check out the PHP man page for html_entity_decode() about character set encoding and entity handling. The above ought to work, and an ought and fifty cents will get you a cup of coffee :-)

Usually, pages like this load a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.
So what you need to do is open that page in Firefox or similar, with a tool such as Firebug, in order to see which requests are actually being made. If you're lucky, you will find the data request directly in the list of XHR requests. As in this case:
Note that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.
Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems to be the case here) with very simple code:
This yields a data object that you can inspect and convert to CSV by simple looping.
You retrieve $data->result->events, use fputcsv() on its items converted to array form, and Bob's your uncle.
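The looping step could look roughly like the sketch below. The JSON string is a hypothetical stand-in for the response body: the real URL and field names have to be taken from the XHR request seen in the browser, and in practice $json would come from something like file_get_contents() on that URL. Only the result->events path is taken from the description above; the function name eventsToCsv() is made up for illustration.

```php
<?php
// Hypothetical sample response; real field names will differ.
$json = '{"result":{"events":['
      . '{"id":1,"name":"Opening","city":"Rome"},'
      . '{"id":2,"name":"Closing","city":"Milan"}]}}';

// Converts the events list of a decoded response into CSV text.
function eventsToCsv(string $json): string
{
    $data = json_decode($json);
    if ($data === null || !isset($data->result->events)) {
        throw new RuntimeException('Unexpected response format');
    }

    // Write to an in-memory stream; use a file path to save to disk instead.
    $out = fopen('php://temp', 'r+');
    $headerWritten = false;
    foreach ($data->result->events as $event) {
        $row = (array) $event; // object converted to array form
        if (!$headerWritten) {
            // Use the keys of the first item as the header line.
            fputcsv($out, array_keys($row), ',', '"', '\\');
            $headerWritten = true;
        }
        fputcsv($out, array_values($row), ',', '"', '\\');
    }

    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);
    return $csv;
}

echo eventsToCsv($json);
```

The sketch assumes every event is a flat object with the same keys; nested values would need to be flattened before being passed to fputcsv().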