Here's one XML doc I'm working with:
<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
<extent>
<charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
</extent>
</span><span type="sentence">
<extent>
<charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
I'm looping through a set of XML docs to retrieve all sentences that begin with a space. I have no trouble capturing all the errors (leading spaces) with this:
>>> import re, os, sys
>>> import xml.etree.ElementTree as etree
>>> sentences = {}
>>> xmlAddresses = getListOfFilesInFolders(['XMLFiles'],ending=u'.xml') # my function to grab all XML files
>>> for docAddr in xmlAddresses:
...     parser = etree.XMLParser(encoding=u'utf-8')
...     tree = etree.parse(docAddr, parser=parser)
...     sentences = getTokenTextFeature(docAddr,tree,sentences)
...
>>> rgxLeadingSpace = re.compile('^\"? .')
>>> for sent in sentences.keys():
...     text = sentences[sent]['sentence']
...     if rgxLeadingSpace.findall(text):
...         print text # the second sentence is from the above XML doc
...
" It rallied on ideas the market was oversold , " a trader said .
" The result of the second year-half is expected to improve on the early part of the year , " Atria said .
" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow .
What I need to do is, after finding the errors, loop through all the XML files which contain those errors and adjust their START attributes. For example, this is a sentence from the above XML doc that contained a leading space:
<charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
It should look like this:
<charseq START="207" END="310">The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
I think I provided all the necessary code. If someone can help me I will create a million StackOverflow accounts and upvote you a million times! :) Thanks!
Rather than extracting the sentences into a separate dictionary and then searching that, I would check each sentence element against your pattern while traversing the nodes of the DOM. That way, when you find a match, you already have the element object you're visiting, so you can modify its START attribute directly and then simply write out the modified DOM to a new (or replacement) XML file.
I don't know what getTokenTextFeature does, but here is a program that modifies the XML in the manner you asked for.
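A minimal sketch of such a program, assuming every sentence sits in a charseq element with integer START and END attributes, as in the XML from the question. The names fix_leading_space and fix_file and the path arguments are hypothetical, and the pattern is a slightly narrowed version of the one in the question (optional quote, then one space):

```python
import re
import xml.etree.ElementTree as ET

# Optional opening quote followed by a single space at the start of a sentence.
LEADING = re.compile(r'^"? ')


def fix_leading_space(root):
    """Strip the leading quote/space from each charseq and shift START to match.

    Returns the number of charseq elements that were changed.
    """
    changed = 0
    for charseq in root.iter('charseq'):
        text = charseq.text or ''
        match = LEADING.match(text)
        if match:
            skipped = match.end()  # characters dropped from the front
            charseq.text = text[skipped:]
            charseq.set('START', str(int(charseq.get('START')) + skipped))
            changed += 1
    return changed


def fix_file(src_path, dst_path):
    """Parse src_path, fix its sentences, and write dst_path only if something changed."""
    tree = ET.parse(src_path)
    if fix_leading_space(tree.getroot()):
        tree.write(dst_path, encoding='utf-8', xml_declaration=True)
```

For the second sentence in the question, the match covers the quote and the space (two characters), so START moves from 205 to 207 while END stays at 310. Calling fix_file once per XML file, with whatever list of paths you already have, replaces the separate sentences dictionary.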