I have a spreadsheet of patent numbers that I'm getting extra data for by scraping Google Patents, the USPTO website, and a few others. I mostly have it running, but there's one thing I've been stuck on all day. When I request a page from the USPTO site and read its source, it sometimes gives me the whole document and works wonderfully, but other times it only gives me roughly the second half (and what I'm looking for is in the first half).
I've searched around here quite a bit, and I haven't seen anyone with this exact issue. Here's the relevant piece of code (it's got some redundancies since I've been trying to figure this out for a while now, but I'm sure that's the least of its problems):
from bs4 import BeautifulSoup
import html5lib
import re
import csv
import urllib.request  # plain "import urllib" does not reliably expose urllib.request
import requests

# This is the base URL for Google Patents
gpatbase = "https://www.google.com/patents/US"
ptobase = "http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/"

# Bring in the patent numbers and define the writer we'll use to add the new info we get
with open(r'C:\Users\Filepathblahblahblah\Patent Data\scrapeThese.csv', newline='') as csvfile:
    patreader = csv.reader(csvfile)
    # Not used yet in this snippet; the file is opened read-only here
    writer = csv.writer(csvfile)
    for row in patreader:
        patnum = row[0]
        #print(row)
        print(patnum)

        # Take each patent and append it to the base URL to get the actual one
        gpaturl = gpatbase + patnum
        ptourl = ptobase + patnum
        gpatreq = requests.get(gpaturl)
        gpatsource = gpatreq.text
        soup = BeautifulSoup(gpatsource, "html5lib")

        # Find the number of academic citations on that patent

        # From the Google Patents page, find the link labeled USPTO and extract the url
        for tag in soup.find_all("a"):
            if tag.next_element == "USPTO":
                uspto_link = tag.get('href')
        #uspto_link = ptourl
        requested = urllib.request.urlopen(uspto_link)
        source = requested.read()
        pto_soup = BeautifulSoup(source, "html5lib")
        print(uspto_link)

        # From the USPTO page, find the examiner's name and save it
        # (prim is reassigned on every <i> tag the loop visits)
        for italics in pto_soup.find_all("i"):
            if italics.next_element == "Primary Examiner:":
                prim = italics.next_element
            else:
                prim = "Not found"
        if prim != "Not found":
            examiner = prim.next_element
        else:
            examiner = "Not found"
        print(examiner)
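To tell whether the response itself is cut short or whether something later in the pipeline is losing the first half, one option is to inspect the raw bytes before they reach BeautifulSoup. The following is only a sketch: `describe_source` is a hypothetical helper, not part of any library, and the two sample byte strings are made up for illustration rather than taken from a real USPTO page.

```python
def describe_source(raw, marker=b"Primary Examiner"):
    """Summarize a raw HTTP body so truncation is easy to spot.

    `raw` is the bytes object returned by requested.read().
    `marker` is the text we expect to find in a complete page (assumption).
    """
    lowered = raw.lower()
    return {
        "length": len(raw),
        "has_opening_html": b"<html" in lowered,
        "has_closing_html": b"</html>" in lowered,
        "has_marker": marker.lower() in lowered,
    }


# Made-up sample bodies: one complete, one missing its first half.
full_page = b"<HTML><BODY><I>Primary Examiner:</I> Doe; Jane</BODY></HTML>"
second_half = b"<I>Attorney, Agent or Firm:</I> Example LLP</BODY></HTML>"

print(describe_source(full_page))
print(describe_source(second_half))
```

In the scraper this could be called as `print(describe_source(source))` right after `source = requested.read()`. If the marker and the opening tag are present in the raw bytes even for the patents that print "Not found", the download is probably fine and the problem is more likely in the parsing or the loop.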
As of now, it's about 50-50 on whether I'll get the examiner name or "Not found," and I don't see anything that the members of either group have in common with each other, so I'm all out of ideas.
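A network-free way to check whether the loop itself could explain the split is to run the same if/else pattern over a plain list. This is a minimal sketch under stated assumptions: the label lists are hypothetical stand-ins for the text of the `<i>` tags in page order (I have not confirmed the real order), and `last_wins` and `first_match` are names made up for this example. Unlike the loop in the scraper, `last_wins` initializes `prim` so that it also works on an empty list.

```python
TARGET = "Primary Examiner:"

# Hypothetical tag orders: the target is last in one and followed by another label in the other.
labels_target_last = ["Inventors:", "Primary Examiner:"]
labels_target_early = ["Inventors:", "Primary Examiner:", "Attorney, Agent or Firm:"]


def last_wins(labels):
    """Same shape as the scraper's loop: every non-matching item resets the result."""
    prim = "Not found"
    for label in labels:
        if label == TARGET:
            prim = label
        else:
            prim = "Not found"
    return prim


def first_match(labels):
    """Variant that stops at the first match instead of continuing."""
    for label in labels:
        if label == TARGET:
            return label
    return "Not found"


print(last_wins(labels_target_last))
print(last_wins(labels_target_early))
print(first_match(labels_target_early))
```

With this pattern the result depends only on the last item visited, so a page where any other `<i>` tag follows the examiner label would report "Not found" even though the source is complete. Whether that is what happens on the real pages still needs to be checked against the actual HTML.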