I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.
Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.
I created a small class that grabs data from a given URL, then extracts the HTML elements of your choice. The class makes use of cURL and DOMDocument.
php class:
example usage:
example response:
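For readers who want something to start from, here is a minimal sketch of that approach (cURL to fetch, DOMDocument to extract). It is not the class from this answer: the function names fetchHtml and extractElements are made up for illustration, and the timeout and redirect settings are my own assumptions.

```php
<?php
// Hypothetical helper: download a page body with cURL, or return null on failure.
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up on slow servers
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: return the trimmed text of every element with the given tag name.
function extractElements(string $html, string $tagName): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName($tagName) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}
```

Something like extractElements(fetchHtml('http://example.com/') ?? '', 'a') would then give you the text of each link on the page (example.com is only a placeholder).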
It's worth remembering that when crawling external links (I do appreciate that the OP is about the user's own page) you should respect robots.txt. I found the following, which will hopefully help: http://www.the-art-of-web.com/php/parse-robots/.
Thank you @hobodave.
However, I found two weaknesses in your code. Your parsing of the original URL to get the "host" segment stops at the first single slash. This presumes that all relative links start in the root directory, which is only true sometimes.
Fix this by breaking at the last single slash, not the first.
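To illustrate the "last slash" idea, here is a rough sketch. It is not a patch to hobodave's code; resolveLink is a name I made up, and it deliberately ignores "../" segments, protocol-relative links ("//host/...") and query-only links.

```php
<?php
// Hypothetical helper: turn a link found on $pageUrl into an absolute URL.
function resolveLink(string $pageUrl, string $href): string
{
    // Links that carry their own scheme (http:, https:, mailto:, ...) are already absolute.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }

    $parts = parse_url($pageUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $pageUrl");
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Root-relative links ("/about") hang off the host, whatever the current directory is.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Everything else is relative to the page's directory: cut the path at its LAST slash.
    $path = $parts['path'] ?? '/';
    $directory = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $directory . $href;
}
```

With this, a link "intro.html" found on http://example.com/docs/guide/index.html resolves inside /docs/guide/ instead of the site root.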
A second, unrelated bug is that $depth does not really track recursion depth; it tracks the breadth of the first level of recursion. If I believed this page were in active use I might debug this second issue, but I suspect the text I am writing now will never be read by anyone, human or robot, since this question is six years old and I do not even have enough reputation to notify @hobodave directly about these defects by commenting on his code. Thanks anyway, hobodave.
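For what it's worth, one way to make $depth mean real recursion depth is to pass it by value and hand $depth - 1 to each child call, so siblings never eat into each other's budget. The sketch below is my own illustration, not hobodave's code: crawl and the $getLinks callback (takes a URL, returns the URLs linked from it) are hypothetical names.

```php
<?php
/**
 * Visit $url and everything reachable from it within $depth link hops.
 *
 * @param callable $getLinks hypothetical callback: string URL -> array of linked URLs
 * @param array    $seen     visited URL => largest depth budget it was visited with
 * @return array   the list of visited URLs
 */
function crawl(string $url, int $depth, callable $getLinks, array &$seen = []): array
{
    // Skip pages already visited with at least this much depth budget left.
    if (isset($seen[$url]) && $seen[$url] >= $depth) {
        return array_keys($seen);
    }
    $seen[$url] = $depth;

    if ($depth > 0) {
        foreach ($getLinks($url) as $link) {
            // Each child gets its own copy of the reduced budget.
            crawl($link, $depth - 1, $getLinks, $seen);
        }
    }
    return array_keys($seen);
}
```

The $seen array is shared by reference so a page is not fetched over and over, while the depth itself is never shared between branches.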
Why use PHP for this, when you can use wget, e.g.
For how to parse the contents, see Best Methods to parse HTML and use the search function for examples. How to parse HTML has been answered multiple times before.
Here is my implementation, based on the above example/answer.
CRAWL CLASS:
USAGE:
As mentioned, there are crawler frameworks out there ready for customizing, but if what you're doing is as simple as you describe, you could write it from scratch pretty easily.
Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html
Dumping results to a file: http://www.tizag.com/phpT/filewrite.php
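Put together, those two steps could look roughly like this. It is only a sketch under my own assumptions: extractLinks and appendLinesToFile are made-up names, and it works on an HTML string you have already downloaded.

```php
<?php
// Hypothetical helper: collect the distinct href values of all <a> elements.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

// Hypothetical helper: append one line per entry to a local file.
function appendLinesToFile(string $path, array $lines): void
{
    if ($lines === []) {
        return;
    }
    $data = implode(PHP_EOL, $lines) . PHP_EOL;
    // FILE_APPEND keeps earlier results; LOCK_EX avoids interleaved writes.
    if (file_put_contents($path, $data, FILE_APPEND | LOCK_EX) === false) {
        throw new RuntimeException("Could not write to $path");
    }
}
```

The hrefs come back exactly as written in the page, so relative links still need to be resolved against the page URL before you fetch them.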