This may or may not be trivial, but I'm working on a piece of software that will verify the "end of the line" domain for ads displayed through my web application. I have a list of domains I do not want to serve ads from (let's say Norton.com is one of them), but most ad networks serve ads via shortened, cryptic URLs (adsrv.com) that eventually redirect to Norton.com. So the question is: has anyone built, or does anyone have an idea of how to build, a scraper-like tool that will return the final destination URL of an ad?
Initial discovery: some ads are Flash, JavaScript, or plain HTML. Emulating a browser is perfectly viable and would handle the different ad formats, since not all Flash or JS ads have a noflash or noscript alternative. (A browser may be necessary, but as stated that's perfectly fine; something like WatiN, Watir, WatiJ, Selenium, etc. would work.)
I'd prefer open source so that I could rebuild one myself. Really appreciate the help!
EDIT: This script needs to click on the ad, since the ad might be Flash, JS, or plain HTML. So cURL is less likely an option, unless cURL can click?
Sample PHP Implementation:
$k = curl_init('http://goo.gl');
curl_setopt($k, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($k, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 ' .
'(KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7'); // imitate chrome
curl_setopt($k, CURLOPT_NOBODY, true); // HEAD request only (faster)
curl_setopt($k, CURLOPT_RETURNTRANSFER, true); // don't echo results
curl_exec($k);
$final_url = curl_getinfo($k, CURLINFO_EFFECTIVE_URL); // get last URL followed
curl_close($k);
echo $final_url;
Which should return something like
https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true
Note: you might need to use curl_setopt() to turn off CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER if you want to reliably follow redirects across HTTPS/SSL.
curl --head -L -s -o /dev/null -w %{url_effective} <some-short-url>
--head restricts curl to HEAD requests only, so you don't have to actually download the pages
-L tells curl to keep following redirects
-s gets rid of progress meters and other noise
-o /dev/null tells curl to throw away the headers it retrieves (we don't care about them)
-w %{url_effective} tells curl to write out the last fetched URL to stdout
The result will be that the effective url is written to stdout, and nothing else.
You're talking about following the redirection of the URL until it either times out, gets into a loop, or resolves to a final address.
Ruby's Net::HTTP documentation has a "Following Redirection" example.
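That documentation example boils down to a small fetch-and-follow loop. Here's a minimal sketch of the same idea (my own adaptation, not the docs' code verbatim); it handles HTTP-level redirects only and gives up after a fixed number of hops:

require 'net/http'
require 'uri'

# Follow HTTP-level redirects, give up after `limit` hops to avoid loops,
# and return the final URL reached.
def final_url(uri_str, limit = 10)
  raise 'too many redirects' if limit.zero?

  uri      = URI(uri_str)
  response = Net::HTTP.get_response(uri)

  case response
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URI.
    final_url(uri.merge(response['location']).to_s, limit - 1)
  else
    uri.to_s
  end
end

puts final_url('http://goo.gl')  # the short URL used in the example above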
Also, Ruby's open-uri module follows redirects automatically, so you can ask it for the final URL after you retrieve a page and find out where it landed:
require 'open-uri'
io = open('http://google.com')
body = io.read
io.base_uri.to_s # => "http://www.google.com/"
Notice that after reading the body, base_uri reflects the redirect to Google's "/" page on the www host.
Both cases will only handle server redirects. For meta-redirects you'll have to look at the code, see where they're redirecting you and go there.
This will get you started:
require 'nokogiri'
doc = Nokogiri::HTML('<meta http-equiv="REFRESH" content="0;url=http://www.the-domain-you-want-to-redirect-to.com">')
# Grab the meta-refresh tag and pull the URL out of its content attribute.
redirect_url = doc.at('meta[http-equiv="REFRESH"]')['content'].split('=', 2).last rescue nil
cURL can retrieve the HTTP headers. Keep stepping through the redirect chain until you're no longer getting Location: headers; the last Location: header you received points at the final URL.
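That stepping loop is essentially what the curl one-liner above automates. For completeness, here is a rough sketch of the same chain-walking idea using Ruby's Net::HTTP (HEAD requests only, with a simple loop guard); it records the whole chain, which is handy if you want to check every hop against a blocklist:

require 'net/http'
require 'uri'

def redirect_chain(url, max_hops = 10)
  chain = [url]
  max_hops.times do
    uri      = URI(chain.last)
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.head(uri.request_uri)
    end
    location = response['location']
    break unless location                  # no Location header: we've hit the end
    next_url = uri.merge(location).to_s    # resolve relative redirects
    break if chain.include?(next_url)      # bail out of redirect loops
    chain << next_url
  end
  chain
end

puts redirect_chain('http://goo.gl').last  # hypothetical short URL, as in the earlier example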
The Mechanize gem is handy for this:
require 'mechanize'

agent = Mechanize.new { |a| a.user_agent_alias = 'Windows IE 7' }
page = agent.get(url)      # follows HTTP redirects automatically
final_url = page.uri.to_s  # the URI the page was ultimately served from
The solution I ended up using was simulating a browser, loading the ad, and clicking it. The click was the key ingredient. The solutions offered by others were good for a given URL but would not handle Flash, JavaScript, etc. I appreciate everyone's help.
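A minimal sketch of that kind of browser simulation, using Watir (one of the tools mentioned in the question). The page URL and the ad locator are placeholder assumptions, and real ad markup (iframes, Flash objects, JS-injected links) varies widely, so treat this purely as an illustration:

require 'watir'

browser = Watir::Browser.new :chrome            # any WebDriver-backed browser works
browser.goto 'http://example.com/page-with-ad'  # placeholder: the page serving the ad

ad = browser.iframe(index: 0).link(index: 0)    # placeholder locator for the ad's link
ad.click
sleep 5                                         # crude wait for the redirect chain to settle

browser.windows.last.use                        # ads often open in a new window/tab
puts browser.url                                # the landing page's final URL
browser.close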