How to extract source from Google search result “2

Posted 2019-06-05 19:21

The search results page for a local Google search typically looks like this, containing 20 results.

To get the full contact details for any result in the left-hand list, the result has to be clicked. After a lengthy wait, this brings up an overlay (I'm not sure of the technical term) over the Google Maps pane. That is the behaviour in Firefox; other web browsers do something different.

I am extracting the business name, address, phone and website with Python and WebDriver like this:

address = driver.find_element_by_xpath("//div[@id='akp_uid_0']/div/div/ol/li/div/div/div/ol/table/tbody/tr[2]/td/li/div/div/span[2]").text

name = driver.find_element_by_css_selector(".kno-ecr-pt").text.encode('raw_unicode_escape')
phone = driver.find_element_by_css_selector("div._mr:nth-child(2) > span:nth-child(2)").text

website = driver.find_element_by_css_selector("a.lua-button:nth-child(1)").get_attribute("href")

This works reliably but is extremely slow: loading each Maps overlay can take tens of seconds. I tried PhantomJS via WebDriver, but was quickly blocked by Google's bot detection.

If my reading of Firebug is correct, each of these links on the left hand side is defined like so:

<a data-ved="0CA4QyTMwAGoVChMIj66ruJHGxwIVTKweCh03Sgw0" data-async-trigger="" data-height="0" data-cid="11660382088875336582" data-akp-stick="H4sIAAAAAAAAAGOovnz8BQMDgycHm5SIoaGZmYGxhZGBhYWFuamxsZmphZESVtEoyeSMzKL8gqLE5JL8omLtvNRyhcr8omztvMrkA51e-lt5XiW0n3kw-e7MFfkJwUIAxqbXGGYAAAA" data-akp-oq="Body in Balance Chiropractic New York, NY" jsl="$x 3;" data-rtid="ifLMvGmjeYOk" jsaction="r.UQJvbqFUibg" class="ifLMvGmjeYOk-6WH35iSZ2V0 rllt__link rllt__content" tabindex="0" role="link"><div class="_Ml"><div class="_pl _ki"><div role="heading" aria-level="3" style="margin-right:0px" class="_rl">Body in Balance <wbr></wbr>Chiropractic</div><div class="_lg"><span aria-hidden="true" class="rtng" style="margin-right:5px">5.0</span><g-review-stars><span aria-label="Rated 5.0 out of 5" class="_pxg _Jxg"><span style="width:70px"></span></span></g-review-stars><div style="display:inline;font-size:13px;margin-left:5px"><span>20 reviews</span></div></div><div class="_tf"><span>Chiropractor</span>&nbsp;·&nbsp;W 45th St</div><div class="_CRe"><div><span>Opens at 8:00 am</span></div></div></div></div></a>

My knowledge of CSS and JavaScript is practically nil, so I may not be asking the right question. Is there a way to get at the underlying source of what eventually appears over the Maps pane (there's probably a more technical term for it) without having to click the link on the left-hand side to bring it up? My thinking is that if I can get and parse that HTML without actually having to trigger the overlay, I can save a lot of time.
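
A minimal sketch of the "read it without clicking" idea, assuming the results page source is already in hand (for example from `driver.page_source`). It only shows that the `data-cid` and `data-akp-oq` attributes of each result link can be read straight from the markup shown above; it does not establish whether the overlay's address or phone details can be fetched from those values. `ResultLinkParser` and `extract_result_links` are hypothetical names, and the sample string is a shortened copy of the link above.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect selected attributes of every <a> whose class list contains 'rllt__link'."""

    def __init__(self):
        super().__init__()
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute holds several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if "rllt__link" in classes:
            self.results.append({
                "cid": attr_map.get("data-cid"),
                "query": attr_map.get("data-akp-oq"),
            })


def extract_result_links(page_source):
    """Return one dict per result link found in the given HTML string."""
    parser = ResultLinkParser()
    parser.feed(page_source)
    return parser.results


# Shortened copy of the link markup from the question.
sample = (
    '<a data-cid="11660382088875336582" '
    'data-akp-oq="Body in Balance Chiropractic New York, NY" '
    'class="ifLMvGmjeYOk-6WH35iSZ2V0 rllt__link rllt__content" role="link">'
    'Body in Balance Chiropractic</a>'
)
print(extract_result_links(sample))
```

This needs one read of the page source for all 20 results, so it costs no extra page loads, but it only yields what is already present in the link markup.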

1 Answer
贪生不怕死
#2 · 2019-06-05 20:21

I checked the DOM structure of the page you provided. IE behaves very differently from Firefox on this page: IE redirects to another page once you click one of the left-hand-side items.

Because of limits in my environment, I could only test this in IE. For Firefox, you can try the following code. There might be minor issues (apologies, I am unable to test it).

Note: I wrote a Java demo (only for finding the phone number) because I am familiar with Java. I am also not good with CSS selectors, so I used XPath instead. Hope it helps.

        driver.get("https://www.google.com/search?q=chiropractors%2Bnew%20york%2Bny&rflfq=1&tbm=lcl&tbs=lf:1,lf_ui:2&oll=40.754671143320074,-73.97722375000001&ospn=0.017814865199625274,0.040340423583984375&oz=15&fll=40.75807315356519,-73.99290368792725&fspn=0.01641614335274255,0.040340423583984375&fz=15&ved=0CJIBENAnahUKEwj1jtnmtcbHAhVTCo4KHfkkCYM&bav=on.2,or.r_cp.&biw=1360&bih=608&dpr=1&sei=y4LdVYvcFsa7uATo_LngCQ&ei=4YTdVbWaENOUuAT5yaSYCA&emsg=NCSR&noj=1&rlfi=hd:;si:#emsg=NCSR&rlfi=hd:;si:&sei=y4LdVYvcFsa7uATo_LngCQ");

        //0. Not needed unless your connection to Google is slow.
        Thread.sleep(5000);


        //1. The XPath on class '_gt' extracts the divs of all 20 results on the left-hand side. Works in both IE and Firefox.
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='_gt']"));

        //2. Traverse all of the results, using 'data-cid' as the identifier. Note: this only works in Firefox; in IE there are no data-cid attributes.
        for(int i=0; i<elements.size(); i++) {
            WebElement e = elements.get(i);


            WebElement aTag = e.findElement(By.tagName("a"));


            String dataCid = aTag.getAttribute("data-cid");


            //3. The div that contains the info we want can be identified by 'data-cid' in Firefox.
            //   Note: the XPath in the question matches 'akp_uid_0' as an id rather than a class; adjust if needed.
            WebElement parentDivOfTable = driver.findElement(By.xpath("//div[@class='akp_uid_0' and @data-cid='" + dataCid + "']"));

            //4. Get the information table. The leading '.' keeps the search inside parentDivOfTable.
            WebElement table = parentDivOfTable.findElement(By.xpath(".//table[@class='_B5g']"));

            //Get the phone number: the first sibling element after the 'Phone:' label.
            String phoneNum = table.findElement(By.xpath(".//span[text()='Phone:']/following-sibling::*[1]")).getText();
        }
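
Since the question uses Python, here is a rough, untested port of the same loop as a sketch. It assumes the class names from the Java demo (`_gt`, `akp_uid_0`, `_B5g`) still match the page and that the overlay div for each result is present in the DOM without a click, neither of which I could verify. It uses the `find_element(by, value)` form of the Selenium Python bindings and passes the locator strategy as the plain strings "xpath" and "tag name", which are the values of `By.XPATH` and `By.TAG_NAME`. The names `overlay_xpath` and `collect_phones` are made up for this example.

```python
RESULTS_XPATH = "//div[@class='_gt']"
# A leading '.' makes an XPath relative to the element it is called on;
# a bare '//' would search the whole document again.
TABLE_XPATH = ".//table[@class='_B5g']"
PHONE_XPATH = ".//span[text()='Phone:']/following-sibling::*[1]"


def overlay_xpath(data_cid):
    """XPath for the overlay div tied to one result (class name as in the Java demo)."""
    return "//div[@class='akp_uid_0' and @data-cid='%s']" % data_cid


def collect_phones(driver):
    """Map each result's data-cid to the phone text found in its overlay div."""
    phones = {}
    for result in driver.find_elements("xpath", RESULTS_XPATH):
        link = result.find_element("tag name", "a")
        data_cid = link.get_attribute("data-cid")
        if not data_cid:
            # IE reportedly exposes no data-cid, so skip such results.
            continue
        overlay = driver.find_element("xpath", overlay_xpath(data_cid))
        table = overlay.find_element("xpath", TABLE_XPATH)
        phones[data_cid] = table.find_element("xpath", PHONE_XPATH).text
    return phones
```

Call it as `collect_phones(driver)` once the results page has loaded. If an overlay div is missing, `find_element` raises `NoSuchElementException`, so you may want to catch that per result.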