My input file is actually multiple XML files appended into one file (it's from Google Patents). It has the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
Python's xml.dom.minidom can't parse this non-standard file. What's a better way to parse it? I'm also not sure whether the code below performs well:
XMLstring = ""
for line in infile:
    if line.startswith('<?xml version="1.0" encoding="UTF-8"?>'):
        if XMLstring:
            xmldoc = minidom.parseString(XMLstring)
        XMLstring = line
    else:
        XMLstring += line
if XMLstring:
    xmldoc = minidom.parseString(XMLstring)
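For reference, here is a self-contained sketch of that accumulate-and-split idea, using minidom.parseString (the minidom call that parses from a string rather than a file). The helper name and the sample data are made up for illustration:

```python
from xml.dom import minidom

def split_and_parse(lines):
    """Accumulate lines until the next XML declaration, then parse each chunk."""
    docs = []
    chunk = []
    for line in lines:
        if line.startswith('<?xml') and chunk:
            docs.append(minidom.parseString(''.join(chunk)))
            chunk = []
        chunk.append(line)
    if chunk:  # don't forget the final document
        docs.append(minidom.parseString(''.join(chunk)))
    return docs

# Tiny stand-in for the concatenated patent file
sample = [
    '<?xml version="1.0"?>\n', '<a><b>1</b></a>\n',
    '<?xml version="1.0"?>\n', '<a><b>2</b></a>\n',
]
docs = split_and_parse(sample)
```

Whether this performs well enough depends on the file size; building a list and joining once per chunk avoids repeated string concatenation.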
Here's my take on it, using a generator and lxml.etree. The extracted information is purely for example:
import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data, separator=lambda x: x.startswith('<?xml')):
    buff = []
    for line in data:
        if separator(line):
            if buff:
                yield ''.join(buff)
                buff[:] = []
        buff.append(line)
    yield ''.join(buff)

def first(seq, default=None):
    """Return the first item from sequence, seq or the default(None) value"""
    for item in seq:
        return item
    return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]
if not os.path.exists(filename):
    with open(filename, 'wb') as file_write:
        r = urllib2.urlopen(datasrc)
        file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 10: break
    doc = etree.XML(item)
    docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
    print "DocID: {0}\nTitle: {1}\nAssignee: {2}\n".format(docID, title, assignee)
Yields:
DocID: US-D0629996-S1-20110104
Title: Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC
DocID: US-D0629997-S1-20110104
Title: Belt sleeve
Assignee: None
DocID: US-D0629998-S1-20110104
Title: Underwear
Assignee: X-Technology Swiss GmbH
DocID: US-D0629999-S1-20110104
Title: Portion of compression shorts
Assignee: Nike, Inc.
DocID: US-D0630000-S1-20110104
Title: Apparel
Assignee: None
DocID: US-D0630001-S1-20110104
Title: Hooded shirt
Assignee: None
DocID: US-D0630002-S1-20110104
Title: Hooded shirt
Assignee: None
DocID: US-D0630003-S1-20110104
Title: Hooded shirt
Assignee: None
DocID: US-D0630004-S1-20110104
Title: Headwear cap
Assignee: None
DocID: US-D0630005-S1-20110104
Title: Footwear
Assignee: Vibram S.p.A.
I'd opt for parsing each chunk of XML separately.
You seem to already be doing that in your sample code. Here's my take on your code:
def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk
Once you've broken the file down to individual XML blocks, how you actually do the parsing depends on your requirements and, to some extent, your preference. Options are lxml, minidom, elementtree, expat, BeautifulSoup, etc.
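For instance, once a chunk has been isolated, the standard-library elementtree can pull out fields without any third-party dependency. The chunk below is a simplified stand-in (real grant XML has a DOCTYPE line and many more elements; the element names follow the structure used in the examples above):

```python
import xml.etree.ElementTree as ET

# Simplified single-document chunk for illustration only
chunk = '''<?xml version="1.0"?>
<us-patent-grant>
  <us-bibliographic-data-grant>
    <invention-title>Hooded shirt</invention-title>
    <publication-reference>
      <document-id><doc-number>D0630001</doc-number></document-id>
    </publication-reference>
  </us-bibliographic-data-grant>
</us-patent-grant>'''

root = ET.fromstring(chunk)
# find() with the './/tag' form searches all descendants
title = root.find('.//invention-title').text
doc_number = root.find('.//doc-number').text
```

ElementTree only supports a subset of XPath; if you need full XPath expressions like the ones in the lxml answer, use lxml instead.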
Update:
Starting from scratch, here's how I would do it (using BeautifulSoup):
#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

for xml_string in separated_xml("ipgb20110104.xml"):
    soup = BeautifulSoup(xml_string)
    for num in soup.findAll("doc-number"):
        print num.contents[0]
This returns:
D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...
I don't know about minidom, nor much about XML parsing in general, but I have used XPath to parse XML/HTML, e.g. via the lxml module.
You can find some XPath examples here: http://www.w3schools.com/xpath/xpath_examples.asp
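As a tiny illustration of what an XPath query looks like with lxml (the element names are taken from the assignee structure queried in the earlier answer; this assumes lxml is installed):

```python
from lxml import etree

# A minimal fragment matching the assignee structure used above
doc = etree.XML('<assignee><addressbook><orgname>Nike, Inc.</orgname></addressbook></assignee>')

# xpath() returns a list of matches; text() selects the text nodes
names = doc.xpath('//orgname/text()')
```

Here `names` is `['Nike, Inc.']`; an empty list means no match, which is why the `first()` helper in the lxml answer is handy.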