Editing attributes of multiple XML docs

Here's one XML doc I'm working with:

<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

I'm looping through a set of XML docs to retrieve all sentences that begin with a space. I have no trouble capturing all the errors (leading spaces) with this:

>>> import re, os, sys
>>> import xml.etree.ElementTree as etree
>>> sentences = {}

>>> xmlAddresses = getListOfFilesInFolders(['XMLFiles'],ending=u'.xml') # my function to grab all XML files

>>> for docAddr in xmlAddresses:
>>>    parser = etree.XMLParser(encoding=u'utf-8') 
>>>    tree = etree.parse(docAddr, parser=parser) 
>>>    sentences = getTokenTextFeature(docAddr,tree,sentences) 

>>> rgxLeadingSpace = re.compile('^\"? .')
>>> for sent in sentences.keys():
>>>    text = sentences[sent]['sentence']
>>>    if rgxLeadingSpace.findall(text):    
>>>        print text                        # the second sentence is from the above XML doc

" It rallied on ideas the market was oversold , " a trader said . 

" The result of the second year-half is expected to improve on the early part of the year , " Atria said .

" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow .

What I need to do is, after finding the errors, loop through all the XML files which contain those errors and adjust their START attributes. For example, this is a sentence from the above XML doc that contained a leading space:

<charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

It should look like this:

<charseq START="207" END="310">The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

I think I provided all the necessary code. If someone can help me I will create a million StackOverflow accounts and upvote you a million times! :) Thanks!

Tags: python xml xml-parsing ipython elementtree

2 Answers

啃猪蹄的小仙女

#2 · 2019-09-03 19:36

Rather than extracting the sentences into a separate structure and searching that afterwards, I would check each sentence element against your pattern while traversing the nodes of the DOM. That way, when you find a match, you can modify the START attribute directly on the element object you are visiting, and then simply write the modified DOM out to a new (or replacement) XML file.
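As a rough illustration of that idea, here is a minimal sketch using xml.etree.ElementTree. The names fix_leading_space and fix_files are hypothetical helpers made up for this example, and the sketch assumes every sentence lives in a charseq element with an integer START attribute, as in the sample document. It avoids print, so it should run on both Python 2.7 and Python 3.

```python
import re
import xml.etree.ElementTree as etree

# Optional opening quote followed by one or more spaces at the start of the text.
LEADING = re.compile(r'^"? +')

def fix_leading_space(tree):
    """Strip the leading quote/space prefix from each charseq and shift START.

    Hypothetical helper: returns the number of charseq elements modified.
    """
    changed = 0
    for charseq in tree.iter('charseq'):
        text = charseq.text or ''          # guard against empty elements
        match = LEADING.match(text)
        if match and charseq.get('START') is not None:
            prefix = match.group(0)
            # Move START forward by the number of characters removed.
            charseq.set('START', str(int(charseq.get('START')) + len(prefix)))
            charseq.text = text[len(prefix):]
            changed += 1
    return changed

def fix_files(paths):
    """Hypothetical driver: rewrite in place only the files that changed."""
    for path in paths:
        tree = etree.parse(path)
        if fix_leading_space(tree):
            tree.write(path, encoding='utf-8', xml_declaration=True)
```

Calling fix_files(xmlAddresses) with the list from the question would then overwrite the affected files, so it is probably worth trying it on a copy of the folder first.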


SAY GOODBYE

#3 · 2019-09-03 19:53

I don't know what getTokenTextFeature does, but here is a program that modifies the XML in the manner you asked for.

xml='''<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>
</extent></span></document>
'''

import re
import xml.etree.ElementTree as etree

root = etree.XML(xml)
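for charseq in root.findall(".//span[@type='sentence']/extent/charseq[@START]"):
  # Capture an optional opening quote plus the spaces after it; DOTALL keeps multi-line text intact
  match = re.match(r'^("? +)(.*)', charseq.text or '', re.DOTALL)
  if match:
    space,text = match.groups()
    # Shift START by the number of characters removed from the front
    charseq.set('START', str(int(charseq.get('START')) + len(space)))
    charseq.text = text
print(etree.tostring(root).decode())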

