Language for web scraping JavaScript content

Posted 2019-07-17 03:51

Question:

I think the title asks the question. I usually use PHP for parsing and web scraping, but I have a really hard time scraping JavaScript-generated content; in most cases I can't do it.

Example: parsing a div that only appears after some JavaScript has executed.

I read that Ruby has a parser library for JavaScript, so the question is: which language should I use to write a web scraper that can effectively scrape JavaScript-generated content? Is there a library for PHP, like the one for Ruby, for parsing JavaScript content?

Answer 1:

There are a handful of strategies for this. Depending on your needs, consider programmatically instantiating a browser instance that you can hook into and read the page from.

The idea is to let the browser do the work, since the page is made for a browser and not for your bot. You can then tap in and scrape away using a browser plugin that feeds data to your primary application.

This may be way overkill for what you need though. I'll leave it up to you to decide.



Answer 2:

You should look at some GUI-less (headless) browsers. There are some written for Java. I didn't find one for PHP.

Look at:

  • HtmlUnit
  • Golf


Answer 3:

You can try using something like Selenium, which allows you to automate browser tasks.
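
As a rough sketch of that approach in Python (assuming the `selenium` package and a local Chrome install are available; the URL and the div id are placeholders you would supply, and neither comes from the question), you can let a headless browser run the page's scripts, wait for the div to show up, and then parse the rendered HTML. The `render_page` and `extract_div_text` names are made up for this example.

```python
from html.parser import HTMLParser


def render_page(url, div_id, timeout=10):
    """Load url in headless Chrome and return the HTML after scripts have run.

    Assumes Selenium 4+ and Chrome are installed; the import is kept inside
    the function so the parsing helper below works without Selenium.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript-generated div exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, div_id))
        )
        return driver.page_source
    finally:
        driver.quit()


class DivTextExtractor(HTMLParser):
    """Collects the text inside the first-matched <div id=...>, nested divs included."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0  # greater than 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.div_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, div_id):
    """Return the whitespace-normalized text of the div with the given id."""
    parser = DivTextExtractor(div_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

You would call something like `extract_div_text(render_page(url, "result"), "result")`. The same split (browser renders, your code parses the snapshot) should carry over to other Selenium language bindings.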

On the other hand, you can look into the details of what happens when the JS code is executed. For example, if the JS code is requesting something from the server by POSTing some data, you could emulate that request yourself with a regular HTTP client.
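
A minimal Python sketch of emulating such a request, using only the standard library. It assumes you have already found the endpoint and form fields in your browser's network tab; the URL and field names used here are hypothetical, as is the assumption that the server answers with JSON. The `X-Requested-With` header is something some servers check for AJAX calls, so it may or may not be needed in your case.

```python
import json
import urllib.parse
import urllib.request


def build_post_request(url, form_fields):
    """Build the same form-encoded POST that the page's JavaScript would send."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # Assumption: mimics a typical AJAX call; drop it if not required.
            "X-Requested-With": "XMLHttpRequest",
        },
        method="POST",
    )


def fetch_json(request, timeout=10):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(build_post_request("https://example.com/api/items", {"page": 2}))` would send the request and return the decoded data, with no browser involved. The same idea works from PHP with cURL.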



Answer 4:

You should look at PhantomJS and CasperJS (headless browsers).



Answer 5:

In the Ruby world, the gem for running PhantomJS would be Poltergeist.

There is another article about some of the options you have in Ruby here too (however, they are not all JS-capable).