I've written a script in Python, in combination with BeautifulSoup, to extract the titles of books that get populated upon providing some ISBN numbers in Amazon's search box. I'm providing those ISBN numbers from an Excel file named amazon.xlsx. When I run the following script, it parses the titles accordingly and writes them back to the Excel file as intended.
The link where I put ISBN numbers to populate the results is https://www.amazon.com/s/ref=nb_sb_noss (used in the script below).
import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('amazon.xlsx')
ws = wb['content']

def get_info(num):
    # Search Amazon for the ISBN and follow the first result's detail-page link.
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        get_data(itemlink['href'])

def get_data(link):
    # Scrape the product title from the detail page and write it to the sheet.
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"
    print(itmtitle)
    ws.cell(row=row, column=2).value = itmtitle
    wb.save("amazon.xlsx")

if __name__ == '__main__':
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        get_info(val)
However, when I try to do the same using multiprocessing, I get the following error:
ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined
The changes I made to my script for multiprocessing are:
from multiprocessing import Pool

if __name__ == '__main__':
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        p.map(get_info, isbnlist)
        p.terminate()
        p.join()
A few of the ISBNs I've tried with:
9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461
How can I get rid of that error and get the desired results using multiprocessing?
It does not make sense to reference the global variable row in get_data(). It's a global and will not be shared between the workers in the multiprocessing Pool, because they are actually separate Python processes that do not share globals.
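You can see this with a minimal, self-contained demonstration (the names counter and bump are just for illustration): each worker process mutates its own copy of the module-level global, and the parent's copy is never touched.

import os
from multiprocessing import Pool

counter = 0  # module-level global

def bump(_):
    global counter
    counter += 1  # modifies this worker process's own copy only
    return (os.getpid(), counter)

if __name__ == '__main__':
    with Pool(4) as p:
        print(p.map(bump, range(4)))  # workers report their private counters
    print(counter)  # still 0 in the parent process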
Even if they did, because you're building the entire ISBN list before executing get_info(), the value of row would always be ws.max_row + 1, because the loop has completed. So you would need to provide the row values as part of the data passed to the second argument of p.map(), for example as (row, isbn) pairs, as sketched below.
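Here is a hypothetical sketch of that variant, using Pool.starmap() to unpack each pair; the reworked get_info() signature is my own illustration, not code from the question:

from multiprocessing import Pool
from openpyxl import load_workbook

def get_info(row, num):
    # ...fetch the title for num as before; row is now a local parameter
    # instead of a shared global...
    pass

if __name__ == '__main__':
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    jobs = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        jobs.append((row, ws["A" + str(row)].value))

    with Pool(10) as p:
        p.starmap(get_info, jobs)  # starmap unpacks each (row, num) pair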
But even if you were to do that, writing to and saving the spreadsheet from multiple processes is a bad idea due to Windows file locking, race conditions, etc. You're better off just building the list of titles with multiprocessing, and then writing them out once when that's done, as in the following:
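A minimal sketch of that approach, reusing the question's selectors and workbook layout: the worker functions only return data, and the parent process writes everything back and saves exactly once.

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool

def get_info(num):
    # Runs in a worker process: search for the ISBN and return the title.
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])
    return "N/A"

def get_data(link):
    # Scrape the product title from the detail page and return it.
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        return soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        return "N/A"

if __name__ == '__main__':
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        isbnlist.append(ws["A" + str(row)].value)

    with Pool(10) as p:
        # map() preserves input order, so the i-th title belongs to row i + 2.
        titles = p.map(get_info, isbnlist)

    # Write the results back once, in the parent process only.
    for i, itmtitle in enumerate(titles):
        ws.cell(row=i + 2, column=2).value = itmtitle
    wb.save("amazon.xlsx")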