Scraping Google Translate

Posted 2019-05-04 16:17

Question:

I would like to scrape Google Translate with Node.js and the cheerio library:

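var request = require("request");
var cheerio = require("cheerio");

request("http://translate.google.de/#de/en/hallo%20welt", function(err, resp, body) {
    if (err) throw err;

    // Parse the downloaded HTML and count the spans inside the result box
    var $ = cheerio.load(body);
    console.log($('#result_box').find('span').length);
});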

But it can't find the necessary span elements in the translation box (result_box). In the website's source code it looks like this:

<span id="result_box">
    <span class="hps">hello</span>
    <span class="hps">world</span>
</span>

So I thought I could wait 5-10 seconds until Google has created all the span elements, but no, that doesn't seem to be it:

setTimeout(function() {
        $ = cheerio.load(body);
        console.log($('#result_box').find('span').length);    
    }, 15000);

Could you help me, please? :)


Solution:

Instead of loading the page and parsing it with cheerio, I call the translation endpoint directly with http.get:

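var http = require("http");

// prepareURL is my own helper (not shown here)
http.get(
  this.prepareURL("http://translate.google.de/translate_a/t?client=t&sl=de&tl=en&hl=de&ie=UTF-8&oe=UTF-8&oc=2&otf=1&ssel=5&tsel=5&pc=1&q=Hallo"),
  function(result) {
    result.setEncoding('utf8');
    // "data" can fire several times; each chunk is one part of the response
    result.on("data", function(chunk) {
        console.log(chunk);
    });
});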

This gives me a result string containing the translation. The URL used is the request that the page itself sends to the server.
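
A slightly fuller sketch of the same idea, assuming the unofficial translate_a/t endpoint and the query parameters from the URL above still behave as they did when this was written (Google does not document them and may change or block them at any time). The names buildTranslateUrl and fetchBody are made up for illustration; they are not the prepareURL helper used above. The body is assembled from all "data" chunks, since a response can arrive in more than one piece.

```javascript
const http = require('http');

// Hypothetical helper: rebuilds the request URL from the parameters seen above.
// URLSearchParams takes care of encoding the text to translate.
function buildTranslateUrl(text, from, to) {
  const url = new URL('http://translate.google.de/translate_a/t');
  url.search = new URLSearchParams({
    client: 't', sl: from, tl: to, hl: 'de',
    ie: 'UTF-8', oe: 'UTF-8',
    oc: '2', otf: '1', ssel: '5', tsel: '5', pc: '1',
    q: text
  }).toString();
  return url.toString();
}

// Hypothetical helper: collects the whole response body before handing it on.
function fetchBody(url, callback) {
  http.get(url, function (res) {
    let body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { callback(null, body); });
  }).on('error', callback);
}

// Usage (performs a real network request, so it is left commented out):
// fetchBody(buildTranslateUrl('Hallo Welt', 'de', 'en'), function (err, body) {
//   if (err) throw err;
//   console.log(body);
// });
```

Waiting for "end" before using the body avoids working with a half-received response, which logging each chunk separately cannot guarantee.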

Answer 1:

I know you've already resolved this, but I think your code didn't work because you should have written [...].find("span.hps").[...]

At least for me, it has only ever worked with the class selector when a class is present.



Answer 2:

The reason you can't use cheerio in Node to scrape Google Translate is that Google does not render the translation on the server side. It replies to your request with a script; that script then makes an API request that includes your string, and when the response arrives the script runs again on the client and builds the content you see. None of that happens in cheerio, which only parses the HTML it is given.
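
One part of this can be checked without touching the network: in the URL from the question, the text to translate sits after the #, and a URL fragment is never sent to the server. So the server cannot know what to translate when it builds the HTML. A minimal sketch using Node's built-in URL class (the variable names are only for illustration):

```javascript
const pageUrl = new URL('http://translate.google.de/#de/en/hallo%20welt');

// What an HTTP client actually requests from the server: path plus query string.
const requestTarget = pageUrl.pathname + pageUrl.search;

console.log(requestTarget);  // "/"
console.log(pageUrl.hash);   // "#de/en/hallo%20welt" (stays on the client)
```

The language pair and the text only exist in the fragment, which the page's own script reads after it has loaded.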

So you need to send the request to the API yourself. But this is Google, and they can detect scraping, so they will block you after a few attempts.

You can still fake user behavior, but it will take a long time and they may block you at any time.