Python web crawler sometimes returns half of the source code

Posted 2019-09-04 04:45

Question:

I have a spreadsheet of patent numbers that I'm getting extra data for by scraping Google Patents, the USPTO website, and a few others. I mostly have it running, but there's one thing I've been stuck on all day. When I request a page from the USPTO site, it sometimes gives me the whole source and works wonderfully, but other times it only gives me about the second half (and what I'm looking for is in the first half).
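One obvious band-aid would be to retry the request until a string that belongs to the missing half shows up, assuming the truncation is transient. A rough sketch (the helper name and marker text are placeholders, not part of my scraper):

import time
import requests

def fetch_until_marker(url, marker, attempts=3, delay=2):
    # Re-request the page until the marker text appears in the HTML,
    # in case the truncated responses are transient
    text = ""
    for _ in range(attempts):
        text = requests.get(url).text
        if marker in text:
            break
        time.sleep(delay)
    return text

But I'd rather understand the cause than retry blindly.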

I've searched around here quite a bit, and I haven't seen anyone with this exact issue. Here's the relevant piece of code (it's got some redundancies since I've been trying to figure this out for a while now, but I'm sure that's the least of its problems):

from bs4 import BeautifulSoup
import csv
import urllib.request
import requests

# Base URLs for Google Patents and the USPTO full-text search
gpatbase = "https://www.google.com/patents/US"
ptobase = "http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/"

# Bring in the patent numbers
# (the csv.writer that appends the new info needs its own writable
#  file handle, so it isn't created here on the read-only one)
with open(r'C:\Users\Filepathblahblahblah\Patent Data\scrapeThese.csv', newline='') as csvfile:
    patreader = csv.reader(csvfile)

    for row in patreader:
        patnum = row[0]
        print(patnum)

        # Append the patent number to each base URL to get the actual URLs
        gpaturl = gpatbase + patnum
        ptourl = ptobase + patnum

        # Download and parse the Google Patents page
        gpatreq = requests.get(gpaturl)
        gpatsource = gpatreq.text
        soup = BeautifulSoup(gpatsource, "html5lib")

        # Find the number of academic citations on that patent

        # From the Google Patents page, find the link labeled USPTO and extract
        # its URL; fall back to the constructed URL if the link isn't there
        uspto_link = ptourl
        for tag in soup.find_all("a"):
            if tag.next_element == "USPTO":
                uspto_link = tag.get('href')

        # Download and parse the USPTO page
        requested = urllib.request.urlopen(uspto_link)
        source = requested.read()
        pto_soup = BeautifulSoup(source, "html5lib")
        print(uspto_link)

        # From the USPTO page, find the examiner's name and save it
        prim = "Not found"
        for italics in pto_soup.find_all("i"):
            if italics.next_element == "Primary Examiner:":
                prim = italics.next_element
                break

        if prim != "Not found":
            examiner = prim.next_element
        else:
            examiner = "Not found"

        print(examiner)

As of now, it's about 50-50 on whether I'll get the examiner name or "Not found," and I don't see anything that the patents in either group have in common, so I'm all out of ideas.
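In case it helps with diagnosis: logging the length of each USPTO response next to the patent number makes the two groups easy to compare, along the lines of:

# Inside the loop, right after reading the USPTO page
print(patnum, len(source))  # truncated fetches should show up as much shorter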

Answer 1:

I still don't know what's causing the issue, but if someone has a similar problem, I was able to figure out a workaround: if you write the source code out to a file instead of trying to work with it directly, it won't be cut off. I guess the issue comes after the data is downloaded but before it's imported into the 'workspace'. Here's the piece of code I wrote into the scraper:

import sys

# Remember where stdout normally points so it can be restored later
console_out = sys.stdout

if examiner == "Examiner not found":
    filename = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
    # Redirect stdout into the log file and dump the prettified source there
    logfile = open(filename, 'w')
    sys.stdout = logfile
    print(patnum)
    print(pto_soup.prettify())
    sys.stdout = console_out
    logfile.close()  # make sure everything is flushed before reading it back

    # Read that logged code back in and look for the examiner names
    sec = "Not found"
    prim = "Not found"
    with open(filename) as scraped_code:
        scrapedsoup = BeautifulSoup(scraped_code.read(), 'html5lib')

    # Walk every italics (<i>) tag
    for italics in scrapedsoup.find_all("i"):
        for desc in italics.descendants:
            # Check whether it contains the words "Primary Examiner"
            if "Primary Examiner:" in desc:
                prim = desc.next_element.strip()
            # Same for "Assistant Examiner"
            if "Assistant Examiner:" in desc:
                sec = desc.next_element.strip()

    # If there is an assistant examiner, use that name;
    # otherwise fall back to the primary examiner
    if sec != "Not found":
        examiner = sec
    elif prim != "Not found":
        examiner = prim
    else:
        examiner = "Examiner not found"

    # Show the new result in the console
    print(examiner)
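Side note on the snippet above: instead of swapping sys.stdout, the same round-trip can be done by writing the file directly, which avoids leaving the console redirected if something throws partway through. A minimal variant (reusing the same filename variable):

# Write the prettified source straight to the log file
with open(filename, 'w') as logfile:
    logfile.write(patnum + '\n')
    logfile.write(pto_soup.prettify())

# Read it back in for parsing
with open(filename) as scraped_code:
    scrapedsoup = BeautifulSoup(scraped_code.read(), 'html5lib')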