I have been doing research and so far the Python package I plan on using is Scrapy. Now I am trying to find a good way to build a scraper with Scrapy that crawls a site with infinite scrolling. After digging around I found a package called Selenium, which has a Python module. I have a feeling someone has already combined Scrapy and Selenium to scrape a site with infinite scrolling. It would be great if someone could point me towards an example.
For infinite scrolling, the data is loaded through Ajax calls, so you can fetch it by requesting the same URLs the page does:

1. Open the browser's developer tools and go to the network tab.
2. Clear the previous request history.
3. Scroll the webpage; a new request fired by the scroll event will appear.
4. Open that request's headers to find the request URL.
5. Copy and paste the URL into a separate tab to see the result of the Ajax call.
6. Build the request URL yourself, incrementing its page parameter, and keep fetching until you reach the end of the data.

A minimal Scrapy sketch of the last step follows.
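This is only a rough illustration of paging through such an Ajax endpoint from Scrapy. The URL pattern, the `page` parameter, and the `items` JSON field are hypothetical; substitute whatever you observe in your own network tab.

```python
import json
import scrapy


class AjaxScrollSpider(scrapy.Spider):
    """Page through the Ajax endpoint that backs an infinite-scroll page."""
    name = "ajax_scroll"

    # Hypothetical endpoint discovered via the browser's network tab.
    api_url = "https://example.com/api/items?page={page}"

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1),
                             callback=self.parse,
                             cb_kwargs={"page": 1})

    def parse(self, response, page):
        data = json.loads(response.text)
        items = data.get("items", [])  # field name is an assumption
        for item in items:
            yield item

        # Keep requesting the next page until the endpoint returns nothing.
        if items:
            next_page = page + 1
            yield scrapy.Request(self.api_url.format(page=next_page),
                                 callback=self.parse,
                                 cb_kwargs={"page": next_page})
```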
You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.

Step 1: Install Selenium using pip (`pip install selenium`).

Step 2: Use a scroll loop such as the one sketched below to automate the infinite scroll and extract the page source.

The for loop scrolls through the page repeatedly, after which you can extract the loaded data from the page source.

Step 3: Print the data if required.
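The code this answer refers to is not shown here, so the following is a minimal sketch of what Step 2 typically looks like. It assumes chromedriver is on your PATH and uses a hypothetical target URL; adjust the scroll count and pause time to the site you are scraping.

```python
import time
from selenium import webdriver

driver = webdriver.Chrome()              # assumes chromedriver is on PATH
driver.get("https://example.com/feed")   # hypothetical infinite-scroll page

# Scroll to the bottom a fixed number of times, pausing so new content can load.
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

html = driver.page_source  # the fully loaded markup, ready for parsing
driver.quit()
print(html[:500])
```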
This will open a page, find the bottom-most element with the given `id`, and scroll that element into view. You'll have to keep querying the driver to get the last element as the page loads more, and I've found this to be pretty slow as pages get large. The time is dominated by the call to `driver.find_element_*`, because I don't know of a way to explicitly query the last element on the page. Through experimentation you might find there is an upper limit to the number of elements the page loads dynamically, and it would be best if you wrote something that loaded that number and only then made a call to `driver.find_element_*`. A rough sketch of the scrolling loop is below.
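Since the code this answer describes is not included here, this is a minimal sketch of the approach under stated assumptions: the page URL and the `.feed-item` selector are hypothetical, and the loop keeps scrolling the last matching element into view until no new elements appear.

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                 # assumes chromedriver is on PATH
driver.get("https://example.com/feed")      # hypothetical infinite-scroll page

seen = 0
while True:
    # Hypothetical selector: each loaded post carries the class "feed-item".
    elements = driver.find_elements(By.CSS_SELECTOR, ".feed-item")
    if len(elements) == seen:
        break                               # nothing new loaded; stop scrolling
    seen = len(elements)

    # Scroll the bottom-most element into view to trigger the next load.
    driver.execute_script("arguments[0].scrollIntoView();", elements[-1])
    time.sleep(2)                           # give the page time to fetch more

driver.quit()
```

As the answer notes, re-querying all matching elements on every iteration gets slow on large pages, so if you know roughly how many items the page will ever load, it is cheaper to scroll a fixed number of times first and only then locate elements.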