I'm working on a project that requires extracting all links from a website. With this code I can get all of the links from a single URL:
import requests
from bs4 import BeautifulSoup, SoupStrainer

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')

links = []
for link in soup.find_all('a'):   # every <a> tag on the page
    links.append(str(link))
The problem is that if I want to extract all URLs, I have to write another for loop, and then another one, and so on. I want to extract every URL that exists on this website and on its subdomains. Is there any way to do this without writing nested for loops? And even with nested for loops, I don't know how many I would need to get all the URLs.
Wow, it took about 30 minutes to find a solution, but I found a simple and efficient way to do this. As @αԋɱҽԃ-αмєяιcαη mentioned, if your website links to a BIG website like Google, the crawl won't stop until your memory is full of data, so there are some steps you should consider.
Here is a sample of the code; it should work fine. I actually tested it and it was fun for me. The output will be the list of URLs it collects. I set the limit to 162; you can increase it as much as you want and your RAM allows.
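A minimal sketch of that approach, assuming requests + BeautifulSoup, a queue, a visited set, and the 162-URL cap mentioned above (the crawl and LIMIT names are just placeholders):

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

LIMIT = 162                        # stop after this many URLs; raise it as far as your RAM allows
START = 'https://stackoverflow.com/'

def crawl(start, limit=LIMIT):
    root = urlparse(start).netloc          # e.g. 'stackoverflow.com'
    seen = set()                            # URLs already fetched, so nothing is visited twice
    queue = deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                        # skip pages that time out or refuse the connection
        soup = BeautifulSoup(page.content, 'lxml')
        for a in soup.find_all('a'):
            href = a.get('href')
            if href is None:
                continue
            link = urljoin(url, href)       # resolve relative links against the current page
            if urlparse(link).netloc.endswith(root):   # stay on the site and its sub-domains
                queue.append(link)
    return seen

if __name__ == '__main__':
    for url in crawl(START):
        print(url)

The queue plus the seen set is what replaces the nested for loops: every new page just adds its links to the same queue until the limit is reached.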
How's this?
You can also use a crawling framework, which can help you do many things.
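For instance, with Scrapy (just one possible framework; the spider name and the stackoverflow.com start URL below are placeholders), a small spider that records and follows every on-site link looks roughly like this:

import scrapy

class LinkSpider(scrapy.Spider):
    # run with: scrapy runspider link_spider.py -o links.json
    name = 'links'
    allowed_domains = ['stackoverflow.com']     # keeps the crawl on the site and its sub-domains
    start_urls = ['https://stackoverflow.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield {'url': response.urljoin(href)}              # record the absolute URL
            yield response.follow(href, callback=self.parse)   # and crawl it too

A framework like this gives you duplicate filtering and retries out of the box, which is the main advantage over a hand-written loop.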
Well, actually what you are asking for is possible, but it means an infinite loop which will keep running and running till your memory goes

BoOoOoOm

Anyway, the idea should be like the following.

You will loop with for item in soup.findAll('a') and then call item.get('href') on each item, add the results to a set to get rid of duplicate URLs, and use an if condition with is not None to get rid of None objects. Then keep looping over and over until your set stops growing, checking something like len(urls) on each pass.
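A rough sketch of those steps, just to illustrate the idea (the stackoverflow.com start page and the visited set are my own placeholders):

import requests
from bs4 import BeautifulSoup

urls = {'https://stackoverflow.com/'}   # a set, so duplicate URLs are dropped automatically
visited = set()                         # pages we have already fetched

# keep looping until no unvisited URLs are left,
# i.e. until len(urls - visited) becomes 0
while urls - visited:
    url = (urls - visited).pop()
    visited.add(url)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=10).content, 'lxml')
    except requests.RequestException:
        continue                        # skip pages that fail to load
    for item in soup.findAll('a'):
        href = item.get('href')
        if href is not None and href.startswith('http'):   # drop None and relative links
            urls.add(href)
    print(len(urls), 'URLs collected so far')

As warned above, once the site links out to big external sites this loop can run until memory fills up, so in practice you would add a domain check or a cap on len(urls).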