How to yield fragment URLs in Scrapy using Selenium

Posted 2019-06-14 19:07

From my poor knowledge of web scraping, I've run into an issue that is very complex for me, which I will try to explain as best I can (so I'm open to suggestions or edits to this post).

I started using the web crawling framework Scrapy quite a while ago for my web scraping, and it's still the one I use today. Recently I came across this website and found that Scrapy could not iterate over its pages, because the site uses fragment URLs (#) to load the data for the next pages. I then made a post about that problem (without having identified the root cause yet): my post

After that, I realized that my framework can't handle this without a JavaScript interpreter or browser automation, which is where the Selenium library was suggested. I read as much as I could about it (i.e. example1, example2, example3 and example4). I also found this StackOverflow post that gives some hints about my issue.
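To illustrate what those fragment URLs actually carry: everything after the # is just base64-encoded JSON describing the search and the page number. A quick way to check, using the fragment from the URL in my code below:

import base64
import json

fragment = ("eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ==")
print(json.loads(base64.b64decode(fragment)))
# -> {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1}, "config": {"page": "0"}}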

So finally, my biggest questions are:

1 - Is there any way to iterate/yield over the pages of the website shown above, using Selenium together with Scrapy? So far, this is the code I'm using, but it doesn't work...

EDIT:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Required imports
import re
import json
import base64

import scrapy
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)

    return browser

class MySpider(Spider):
    name = "myspider"

    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="

        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part, goes through all available pages """

        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                    "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part, goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0]
                    new_link = new_link.replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))

                if links:
                    ids = self.get_ids(links)
                    for link in links:
                        current_id = self.get_single_id(link)
                        print "\nThe Link -> ", link
                        # If the line below is commented out, the code works; otherwise it doesn't
                        yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)                                                                           

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)

            return ids

        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id

        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
        pass

So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, which is why I ask my second question. And to avoid running into this kind of issue in the future, I ask my third question.
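One thing I suspect (though I may well be wrong) is that, because get_item_links() contains a yield, simply calling it only creates a generator and runs nothing, and parse() never yields any requests back to Scrapy either. A rough, untested sketch of chaining the generators instead, reusing the methods above, would look like this:

    def parse(self, response):
        # Re-yield every request produced by the page-iteration generator,
        # otherwise Scrapy's engine never schedules them.
        for request in self.get_page_links():
            yield request

    def get_page_links(self):
        """ Iterate over the fragment-encoded pages and re-yield the item requests """
        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                        "config": {"page": str(i)}}
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json.dumps(new_data))
            self.browser.get(new_url)
            # get_item_links() is a generator, so it has to be iterated, not just called
            for request in self.get_item_links(self.browser.page_source):
                yield request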

2 - If the answer to the first question is negative, how could I tackle this issue? I'm open to other approaches.

3 - Can anyone tell me or show me pages where I can learn how to handle web scraping combined with JavaScript and Ajax? Nowadays more and more websites use JavaScript and Ajax scripts to load their content.

Many thanks in advance!

3 Answers
地球回转人心会变
Answered 2019-06-14 19:48

Selenium is one of the best tools for scraping dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts; it works just like a user clicking through the browser. Personally, though, it is not the approach I prefer.

For getting dynamic data you can use the Scrapy + Splash combo: Scrapy gives you all the static data, and Splash handles the dynamic content.
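For reference, a minimal scrapy-splash sketch might look like the one below. It assumes a Splash instance running locally (e.g. via Docker) and the middleware setup from the scrapy-splash README; the selector at the end is just illustrative:

# settings.py (excerpt) -- requires a running Splash instance,
# e.g. docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider (excerpt)
import scrapy
from scrapy_splash import SplashRequest

class SplashExampleSpider(scrapy.Spider):
    name = 'splash_example'

    def start_requests(self):
        url = 'http://www.atraveo.com/es_es/islas_canarias'
        # 'wait' gives the page's JavaScript some time to render before
        # Splash returns the HTML snapshot to Scrapy
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML
        for href in response.css('a::attr(href)').getall():
            self.logger.info('Found link: %s', href)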

Explosion°爆炸
Answered 2019-06-14 19:57

You can definitely use Selenium on its own to scrape web pages with dynamic content (like AJAX loading).

Selenium just relies on a WebDriver (basically a web browser) to fetch content over the Internet.

Here are a few of them (the most commonly used):

  • ChromeDriver
  • PhantomJS (my favorite)
  • Firefox

Once you're set up, you can start your bot and parse the HTML content of the webpage.

I included a minimal working example below using Python and ChromeDriver:

from selenium import webdriver
from selenium.webdriver.common.by import By


driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')
# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()
driver.close()

See the documentation for more details!
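If the content is loaded by AJAX after the initial page load, you usually also have to wait for it before reading page_source. A small sketch of that (the class name 'listclickable' is taken from the question's regex and may not match the site's current markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('http://www.atraveo.com/es_es/islas_canarias')
try:
    # Wait up to 10 seconds for the AJAX-loaded result list to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'listclickable'))
    )
    html = driver.page_source  # now includes the dynamically loaded items
finally:
    driver.quit()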

在下西门庆
Answered 2019-06-14 20:02

Have you looked into BeautifulSoup? It's a very popular web scraping library for Python. As for JavaScript, I would recommend something like Cheerio (if you're asking for a scraping library in JavaScript).

If you mean that the website uses HTTP requests to load content, you could always try to reproduce those requests manually with something like the requests library.
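For example, if you can spot the underlying XHR call in the browser's network tab, you can often call it directly. The endpoint and parameters below are purely illustrative, not the site's real API:

import requests

# Hypothetical JSON endpoint discovered via the browser's DevTools network tab;
# the real site may use a different URL, method and parameters
response = requests.get(
    'https://www.example.com/api/search',
    params={'regionId': '920', 'page': 1},
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=10,
)
response.raise_for_status()
for item in response.json().get('results', []):
    print(item)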

Hope this helps
