Does Facebook know I'm scraping it with Phanto

Posted 2019-03-04 04:01

Question:

So, maybe I'm being paranoid.

I'm scraping my Facebook timeline for a hobby project using PhantomJS. Basically, I wrote a program that finds all of my ads by querying the page for the text Sponsored with XPath inside of phantom's page.evaluate block. The text was being displayed as the innerHTML of HTML a elements.

Things were working great for a few days and it was finding tons of ads.

Then it stopped returning any results.

When I logged into Facebook manually to inspect the elements again, I found that the word Sponsored was now appearing on the page in an ::after pseudo-element with the CSS property content: sponsored. Because that text is generated by CSS and is not part of the DOM, an XPath query for the text no longer yields any results. No joke, Facebook seemed to have changed the way they rendered this word after being scraped for a couple of days.
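To illustrate why the XPath query goes blind here, the following is a minimal sketch of reading CSS-generated text through computed styles instead of the DOM. The function name findByAfterContent and its injectable getStyle parameter are hypothetical, introduced only so the logic can be exercised outside a browser; it assumes the engine's getComputedStyle accepts a pseudo-element argument, which may not hold in every PhantomJS build.

```javascript
// Hypothetical helper (not from the original post): returns the elements whose
// ::after pseudo-element renders the given text, compared case-insensitively.
// `elements` is an array-like of DOM elements. `getStyle` is an assumption added
// for testability; it defaults to the browser's window.getComputedStyle.
function findByAfterContent(elements, text, getStyle) {
  getStyle = getStyle || window.getComputedStyle.bind(window);
  var wanted = text.toLowerCase();
  var matches = [];
  for (var i = 0; i < elements.length; i++) {
    var content = getStyle(elements[i], '::after').content || '';
    // The computed `content` value is typically a quoted string, e.g. '"Sponsored"'.
    var unquoted = content.replace(/^["']|["']$/g, '').toLowerCase();
    if (unquoted === wanted) {
      matches.push(elements[i]);
    }
  }
  return matches;
}
```

Inside page.evaluate this could be called as findByAfterContent(document.querySelectorAll('a'), 'Sponsored'), although the selector and the exact markup are guesses, since the post does not show Facebook's actual HTML.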

Paranoid. I told you.

So, I offer this question to the community of JavaScript, web-scraping, and PhantomJS developers out there. What the heck is going on? Can Facebook know what my PhantomJS program is doing inside of the page.evaluate block?

If so, how? Would my phantom commands appear in a key logger program embedded in the page, for instance?

What are some of your theories?

Answer 1:

It is perfectly possible to detect PhantomJS even if the user agent is spoofed. There are plenty of little ways in which it differs from other browsers, among others:

  • Wrong order of headers
  • Lack of media plugins and latest JS capabilities
  • PhantomJS-specific methods, like window.callPhantom
  • PhantomJS name in the stack trace

and many others.
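As a rough illustration of the client-side checks listed above, here is a minimal sketch of what a page script might look for. The function name phantomSignals and the signal labels are hypothetical, and this is not a claim about what Facebook actually runs; it only covers the global-object, plugin, and stack-trace hints, since header order can only be inspected on the server.

```javascript
// Hypothetical sketch: collects hints that a window-like object belongs to PhantomJS.
// `win` is the window (or a stand-in object); `stack` is an optional stack-trace
// string taken from a caught error. Both parameters are assumptions for illustration.
function phantomSignals(win, stack) {
  var signals = [];
  // PhantomJS exposes its own globals on the page's window object.
  if (typeof win.callPhantom === 'function' || win._phantom) {
    signals.push('phantom-global');
  }
  // Headless PhantomJS typically reports no browser plugins.
  var nav = win.navigator || {};
  if (!nav.plugins || nav.plugins.length === 0) {
    signals.push('no-plugins');
  }
  // Stack traces of errors raised in injected code can mention PhantomJS.
  if (typeof stack === 'string' && /phantomjs/i.test(stack)) {
    signals.push('stack-trace');
  }
  return signals;
}
```

A page could call this as phantomSignals(window, err.stack) from inside a catch block. A single signal is weak evidence on its own (an ordinary browser can also have no plugins), so a real detector would presumably combine several of them.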

Please refer to this excellent article and presentation linked there for details: https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/

Maybe Puppeteer would be a better fit for your needs, as it is based on a real, up-to-date Chromium browser.