import requests
import csv
from bs4 import BeautifulSoup
page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")
import re
links = soup.findAll("a")
with open('aaa.csv', 'wb') as myfile:
for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
a = (re.split(":(?=http)",link["href"].replace("/url?q=","")))
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(a)
The output of this code is that I have a CSV file where 28 URLs are saved, however the URLs are not correct. For example this is a wrong URL:-
Instead it should be:-
http://www.imdb.com/title/tt0317219/
How can I remove the second part for each URL if it contains "&sa="
Because then the second part of the URL starting from:-
"&sa="
should be removed, so that all URLs are saved like the second URL.
I am using python 2.7 and Ubuntu 16.04.
If every time redundant part of url starts with
&
, you can applysplit()
to each url:output:
Not the best way, but you could do one more time split, adding one more line after a:
Result: