Question:
I am trying to parse huge XML files (20 MB to 3 GB). The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).
Below is a small part of a sample file. A namespace exists in all files. An interesting feature of the files is that they have more node attributes than text.
<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">
    <instrumentConfiguration id="QTOF">
        <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
        <componentList count="4">
            <source order="1">
                <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
            </source>
            <analyzer order="2">
                <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
            </analyzer>
            <analyzer order="3">
                <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
            </analyzer>
            <detector order="4">
                <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
            </detector>
        </componentList>
    </instrumentConfiguration>
Small but complete file is here
So far I have been using findall for every element of interest.
import xml.etree.ElementTree as ET

tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of each instrumentConfiguration element
    print(ins.attrib["id"])
How can I access all children and grandchildren of each instrumentConfiguration element in s?
s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')
Example of what I want
InstrumentConfiguration
-----------------------
Id:QTOF
Parameter1: Q-Tof ultima
source:nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
Is there an efficient way of parsing element/subelement/subelement when a namespace exists? Or do I have to use find/findall every time to access a particular element in the tree with a namespace? This is just a small example; I have to parse a more complex element hierarchy.
Any suggestions?
Edit
I didn't get a correct answer, so I had to edit once more!
Answer 1:
Here's a script that parses one million <instrumentConfiguration/> elements (967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.
The throughput is 24MB/s. The cElementTree page (2005) reports 47MB/s.
#!/usr/bin/env python
from itertools import imap, islice, izip
from operator import itemgetter
from xml.etree import cElementTree as etree

def parsexml(filename):
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it) # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))] # cvParam
            componentList_count = int(next(it).get('count'))
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2]
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear() # preserve memory

def print_values(it):
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

print_values(parsexml(filename))
Output
$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps
Note: The code is fragile: it assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
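One way to reduce that fragility, offered as a sketch under assumptions rather than as part of the answer: listen for 'end' events, so each <instrumentConfiguration/> is complete when it is inspected, and look children up by tag instead of by position. The function name parse_instrument_configurations and the generic 'Parameter' key are made up for illustration. The code targets Python 3, where xml.etree.ElementTree picks up the C accelerator automatically, and it has not been benchmarked against the script above.

```python
import xml.etree.ElementTree as ET

NS = '{http://psi.hupo.org/ms/mzml}'


def parse_instrument_configurations(source):
    """Yield a list of (key, value) pairs per <instrumentConfiguration>.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first 'start' event delivers the root
    for event, elem in context:
        if event != 'end' or elem.tag != NS + 'instrumentConfiguration':
            continue
        values = [('Id', elem.get('id'))]
        # Direct cvParam children describe the instrument itself.
        for param in elem.findall(NS + 'cvParam'):
            values.append(('Parameter', param.get('name')))
        # Each component (source, analyzer, detector) holds its own cvParam.
        for component in elem.findall(NS + 'componentList/*'):
            key = component.tag.partition('}')[2]  # strip the namespace
            for param in component.findall(NS + 'cvParam'):
                values.append((key, param.get('name')))
        yield values
        root.clear()  # as in the script above: release what was parsed
```

Usage would look like `for conf in parse_instrument_configurations('plgs_example.mzML'): ...`, with each conf being a list of pairs that can be printed or mapped onto Django model fields.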
On performance
ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.
If you replace root.clear() with elem.clear(), the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as for cElementTree, but it consumes 20 (root.clear()) / 2 (elem.clear()) times as much memory (500MB).
Answer 2:
If this is still a current issue, you might try pymzML, a Python interface to mzML files. Website:
http://pymzml.github.com/
Answer 3:
In this case I would use findall to find all the instrumentList elements. Then, on those results, just access the data as if instrumentList and instrument were arrays: you get all the elements and don't have to search for each of them.
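A rough sketch of that idea, applied to the instrumentConfiguration elements from the question's sample (the answer's instrumentList name does not appear there). It relies on Element objects behaving like sequences of their children and on the namespaces argument of findall. The 'mz' prefix, the describe_instruments name and the 'Parameter' key are arbitrary choices made for this example. Note that it takes a fully parsed tree, so it suits the smaller files rather than the multi-gigabyte ones.

```python
import xml.etree.ElementTree as ET

# 'mz' is an arbitrary prefix chosen for this example only.
NSMAP = {'mz': 'http://psi.hupo.org/ms/mzml'}


def local_name(elem):
    """Return the tag without its '{namespace}' part."""
    return elem.tag.partition('}')[2]


def describe_instruments(root):
    """Return one list of 'key: value' lines per instrumentConfiguration."""
    reports = []
    for conf in root.findall('.//mz:instrumentConfiguration', NSMAP):
        lines = ['Id: %s' % conf.get('id')]
        for child in conf:  # an element iterates over its direct children
            if local_name(child) == 'cvParam':
                lines.append('Parameter: %s' % child.get('name'))
            elif local_name(child) == 'componentList':
                for component in child:      # source, analyzer, detector
                    for param in component:  # their cvParam children
                        lines.append('%s: %s' % (local_name(component),
                                                 param.get('name')))
        reports.append(lines)
    return reports
```

Called as `describe_instruments(ET.parse('plgs_example.mzML').getroot())`, it needs only one findall per file; everything below each match is reached by plain iteration.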
Answer 4:
If your files are huge, have a look at the iterparse() function. Be sure to read this article by ElementTree's author, especially the part about "incremental parsing".
Answer 5:
I know that this is old, but I ran into this issue while doing XML parsing, where my XML files were really large.
J.F. Sebastian's answer is indeed correct, but the following issue came up.
What I noticed is that sometimes the values in elem.text (if you have values inside the XML elements and not as attributes) are not read correctly (sometimes None is returned) if you iterate through the 'start' events. I had to iterate through the 'end' events like this:
it = imap(itemgetter(1),
          iter(etree.iterparse(filename, events=('end',))))
root = next(it) # caution: with 'end' events only, this is the first element to close, not the root
If you want to get the text inside an XML tag (and not an attribute), you should iterate through the 'end' events and not 'start'.
However, if all the values are in attributes, then the code in J.F. Sebastian's answer is the better fit.
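The "sometimes None" behaviour can be reproduced on purpose. The sketch below is not from the answer: it assumes Python 3.4+ and uses xml.etree.ElementTree.XMLPullParser to control exactly where the input is split, standing in for the chunked reads that iterparse() performs on a large file. The helper name text_at_events is made up for this example.

```python
import xml.etree.ElementTree as ET


def text_at_events(chunks, tag):
    """Feed XML piece by piece; record elem.text of `tag` at each event."""
    parser = ET.XMLPullParser(events=('start', 'end'))
    seen = []
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if elem.tag == tag:
                seen.append((event, elem.text))
    return seen


# The split falls inside the text of <name>. When the 'start' event is
# handled, the parser has only seen part of the text, so elem.text is
# not reliable yet (typically None); at 'end' it is complete.
chunks = ['<data><country><name>Liech', 'tenstein</name></country></data>']
for event, text in text_at_events(chunks, 'name'):
    print(event, repr(text))
```

On a small file the whole document usually fits into the first read, so the text may already be there at 'start'; that would explain why the problem only shows up occasionally and mostly on large inputs.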
XML example for my case:
<data>
    <country>
        <name>Liechtenstein</name>
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
    </country>
    <country>
        <name>Singapore</name>
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
    </country>
    <country>
        <name>Panama</name>
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
    </country>