This may or may not be trivial, but I'm working on a piece of software that will verify the "end of the line" domain for ads displayed through my web application. I have a list of domains I do not want to serve ads from (let's say Norton.com is one of them), but most ad networks serve ads via shortened, cryptic URLs (adsrv.com) that eventually redirect to Norton.com. So the question is: has anyone built, or does anyone have an idea of how to build, a scraper-like tool that will return the final destination URL of an ad?
Initial discovery: Some ads are in Flash, JavaScript, or plain HTML. Emulating a browser is perfectly viable, and would cope with the different ad formats. Not all Flash or JS ads have a noflash or noscript alternative. (A browser may be necessary, but as stated this is perfectly fine, using something like WatiN, WatiR, WatiJ, Selenium, etc.)
Prefer open source so that I could rebuild one myself. Really appreciate help!
EDIT: This script needs to click on the ad, since it might be Flash, JS, or just plain HTML. So cURL is less likely an option, unless cURL can click?
The Mechanize gem is handy for this:
You're talking about following the redirection of the URL until it either times out, gets into a loop or resolves to a final address.
The Net::HTTP library has a Following Redirection example.
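As a rough illustration of that loop, here is a minimal sketch using only Ruby's standard library. The names final_url, DEFAULT_FETCH, max_hops and fetch are my own for this example, not part of Net::HTTP, and the hostnames in the usage note are made up. It stops on a non-redirect response, on a repeated URL (a loop), or after a fixed number of hops. Timeouts are left to Net::HTTP's defaults, which surface as exceptions.

```ruby
require "net/http"
require "uri"
require "set"

# Default fetcher: issue one GET and return [status_code, Location header or nil].
# Net::HTTP.get_response does not follow redirects itself, so we can walk the chain.
DEFAULT_FETCH = lambda do |uri|
  response = Net::HTTP.get_response(uri)
  [response.code.to_i, response["location"]]
end

# Follow server-side (3xx + Location) redirects starting at start_url.
# Returns the final URL as a String; raises on a loop or when max_hops is used up.
# `fetch` is injectable (an assumption of this sketch) so the walk can be tried offline.
def final_url(start_url, max_hops: 10, fetch: DEFAULT_FETCH)
  uri = URI.parse(start_url)
  seen = Set.new

  max_hops.times do
    # Set#add? returns nil when the URL was already visited.
    raise "redirect loop at #{uri}" unless seen.add?(uri.to_s)

    status, location = fetch.call(uri)
    return uri.to_s unless (300..399).cover?(status) && location

    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, location)
  end

  raise "gave up after #{max_hops} requests"
end
```

Calling final_url("http://adsrv.example/c?id=1") would then return whatever address the chain settles on, which you can compare against your blocked-domain list. Like the approaches above, this only sees HTTP-level redirects.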
Also, Ruby's open-uri module will automatically redirect, so I think you can ask it for the ending URL after you retrieve a page and find out where it landed.
Notice that after reading the body the URL was redirected to Google's / dir.
Both cases will only handle server redirects. For meta-redirects you'll have to look at the code, see where they're redirecting you and go there.
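For the meta-redirect case, "looking at the code" can be sketched roughly like this. META_REFRESH and meta_refresh_target are names I made up for the example, and the regex is a deliberate simplification: it assumes http-equiv appears before content inside the tag and does not decode HTML entities. A real HTML parser would be more robust.

```ruby
require "uri"

# Matches tags such as <meta http-equiv="refresh" content="5; url=http://...">.
# Simplification: expects http-equiv before content, and a numeric delay before url=.
META_REFRESH = /<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*content\s*=\s*["']\s*\d+\s*;\s*url\s*=\s*['"]?([^'">\s]+)/i

# Return the absolute URL a page's meta refresh points at, or nil if it has none.
# base_url is the address the HTML was fetched from, used to resolve relative targets.
def meta_refresh_target(html, base_url)
  match = META_REFRESH.match(html)
  return nil unless match

  URI.join(base_url, match[1]).to_s
end
```

When this returns a URL, you would fetch that address and keep going; when it returns nil, the page you are on is the end of the line as far as meta tags are concerned. JavaScript redirects are not covered by this.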
This will get you started:
Sample PHP Implementation:
Which should return something like
https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true
Note: You might need to use curl_setopt() to turn off CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER if you want to reliably follow across HTTPS/SSL.
The solution I ended up using was simulating a browser, loading the ad, and clicking. The click was the key ingredient. Solutions offered by others were good for a given URL but would not handle Flash, JavaScript, etc. Appreciate everyone's help.
--head restricts it to HEAD requests only, so that you don't have to actually download the pages
-L tells curl to keep following redirects
-s gets rid of any progress meters, etc
-o /dev/null tells curl to throw away the headers retrieved (we don't care about them)
-w %{url_effective} tells curl to write out the last fetched url as the result to stdout
The result will be that the effective url is written to stdout, and nothing else.
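If you want to drive that curl invocation from a script, something along these lines should work. This is a sketch that assumes curl is installed and on the PATH. The helper names curl_final_url_args and curl_final_url are hypothetical, and the flags are exactly the ones listed above.

```ruby
require "open3"

# Build the argument list from the flags described above.
# %{url_effective} is curl's own placeholder; Ruby leaves it untouched.
def curl_final_url_args(url)
  ["curl", "--head", "-L", "-s", "-o", "/dev/null", "-w", "%{url_effective}", url]
end

# Run curl and return the effective URL it printed, or nil if curl exited with an error.
# Passing the arguments as separate strings avoids going through a shell.
def curl_final_url(url)
  stdout, status = Open3.capture2(*curl_final_url_args(url))
  status.success? ? stdout.strip : nil
end
```

Note that some servers answer HEAD requests differently from GET, so if the result looks wrong it may be worth dropping --head from the list.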
cURL can retrieve HTTP headers. Keep stepping through the chain until you're no longer getting Location: headers, and the last Location: header you received is the final URL.