Script throws an error when it is made to run using multiprocessing

Published 2019-09-21 18:36

Question:

I've written a Python script using BeautifulSoup to extract the titles of books that get populated when I provide ISBN numbers to the Amazon search box. I read those ISBN numbers from an Excel file named amazon.xlsx. When I run the following script, it parses the titles and writes them back to the Excel file as intended.

The search URL where I put the ISBN numbers to populate the results appears in the script below.

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('amazon.xlsx')
ws = wb['content']

def get_info(num):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?",params=params)
    soup = BeautifulSoup(res.text,"lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        get_data(itemlink['href'])

def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"

    print(itmtitle)

    ws.cell(row=row, column=2).value = itmtitle
    wb.save("amazon.xlsx")

if __name__ == '__main__':
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        get_info(val)

However, when I try to do the same using multiprocessing I get the following error:

ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined

The changes I made to my script for multiprocessing are:

from multiprocessing import Pool

if __name__ == '__main__':
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        p.map(get_info,isbnlist)
        p.terminate()
        p.join()

A few of the ISBNs I've tried:

9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461

How can I get rid of that error and get the desired results using multiprocessing?

Answer 1:

It does not make sense to reference the global variable row in get_data(), because

  1. It's a global, and it will not be shared between the workers in the multiprocessing Pool, because they are actually separate Python processes that do not share globals.

  2. Even if they did, the entire ISBN list is built before get_info() executes, so by the time any worker runs, the loop has completed and row is simply left at whatever value it held when the loop finished — not the row of the ISBN currently being processed.
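A minimal, self-contained sketch of point 1 (the names here are illustrative, not from the question): each worker mutates its own copy of a module-level global, and the parent process never sees the change.

```python
from multiprocessing import Pool

row = 0  # module-level global, analogous to `row` in the question

def worker(n):
    global row
    row = n              # changes this worker process's own copy only
    return row

def demo():
    with Pool(2) as p:
        results = p.map(worker, [10, 20, 30])
    return results, row  # the parent's `row` is still 0

if __name__ == '__main__':
    print(demo())
```

The workers each return the value they set, but the parent's global is untouched — exactly why writing to ws.cell(row=row, ...) inside a pooled function cannot work.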

So you would need to provide the row values as part of the data passed to the second argument of p.map(). But even if you were to do that, writing to and saving the spreadsheet from multiple processes is a bad idea due to Windows file locking, race conditions, etc. You're better off just building the list of titles with multiprocessing, and then writing them out once when that's done, as in the following:
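For completeness, the tuple-passing option would look roughly like this — fetch_title is a hypothetical stand-in for the real get_info()/get_data() lookup, and all workbook writes stay in the parent process:

```python
from multiprocessing import Pool

def fetch_title(task):
    row, isbn = task
    # hypothetical stand-in for the real Amazon request/parse step
    title = "Title for " + isbn
    return row, title

if __name__ == '__main__':
    tasks = [(2, '9781584806844'), (3, '9780917360664')]
    with Pool(2) as p:
        pairs = p.map(fetch_title, tasks)
    # only the parent touches the workbook afterwards:
    # for row, title in pairs:
    #     ws.cell(row=row, column=2).value = title
```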

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool


def get_info(isbn):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': isbn
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])


def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"

    return itmtitle


def main():
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    # the with-block terminates and joins the pool on exit,
    # so explicit terminate()/join() calls are unnecessary
    with Pool(10) as p:
        titles = p.map(get_info, isbnlist)

    # enumerate over the titles actually collected, so a list shortened
    # by the early break above cannot cause an IndexError
    for i, title in enumerate(titles):
        ws.cell(row=i + 2, column=2).value = title

    wb.save("amazon.xlsx")


if __name__ == '__main__':
    main()