Word breaks in text extraction with lxml XPath

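I want to extract the struck-through (deleted) text from the <w:delText> tags. The expression I am using extracts it successfully, except that some words come out broken. For example, the word "They" shows up as 'T' and 'hey'. Below is an XML sample in which the problem occurs: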

<w:delText
    xml:space="preserve">.
    </w:delText></w:r><w:r
    w:rsidR="0020338C"
    w:rsidDel="00147CFE"><w:rPr><w:rFonts
    w:ascii="Times
    New
    Roman"
    w:hAnsi="Times
    New
    Roman"/><w:sz
    w:val="24"/></w:rPr><w:delText>T</w:delText></w:r><w:r
    w:rsidR="00DF6A7D"
    w:rsidDel="00147CFE"><w:rPr><w:rFonts
    w:ascii="Times
    New
    Roman"
    w:hAnsi="Times
    New
    Roman"/><w:sz
    w:val="24"/></w:rPr><w:delText>hey</w:delText></w:r></w:del><w:ins
    w:id="5"
    w:author="Author"
    w:date="2014-08-13T10:08:00Z"><w:r
    w:rsidR="00147CFE"><w:rPr><w:rFonts
    w:ascii="Times
    New
    Roman"
    w:hAnsi="Times
    New
    Roman"/><w:sz
    w:val="24"/></w:rPr><w:t
    xml:space="preserve">
    that
    helps
    them</w:t></w:r></w:ins>

I am using the following code:

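from lxml import etree
# Collect the text nodes of every delText element inside a paragraph (w:p); lxml_tree is the parsed document.
find = etree.XPath("//w:p//.//*[local-name() = 'delText']//text()", namespaces={'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
list_of_deleted_words = find(lxml_tree)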

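How can I solve this problem?

Edit:

I realised that this problem occurs only with words that contain capital letters; for example, the words "She" and "He" also get split.

That is, the word "They" should be counted as one word, not as two (which is what my code currently does).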

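The problem arises because stretches of text are split arbitrarily across several so-called "runs". In OOXML, text is organized inside w:p elements (paragraphs) like this (simplified structure):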

<w:p>
  <w:r>
    <w:t>Simpli</w:t>
  </w:r>
  <w:r>
    <w:t>fied structures</w:t>
  </w:r>
</w:p>

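As you can see, the actual text sits inside w:t elements, which are in turn inside a w:r element, or "run". Unfortunately, the division into separate runs is so haphazard that it can only be called arbitrary. As far as I know, nobody knows how the point at which a new run starts is chosen.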
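To make the run structure concrete, here is a minimal sketch that reads the simplified paragraph above and joins the w:t fragments back together. It uses the standard library's xml.etree.ElementTree so that it is self-contained; as far as I know, the same calls (fromstring, iterfind with a namespace map) are also available in lxml.etree. The helper name paragraph_text is made up for this illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# The simplified paragraph from above, with the namespace declared.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r>
    <w:t>Simpli</w:t>
  </w:r>
  <w:r>
    <w:t>fied structures</w:t>
  </w:r>
</w:p>"""


def paragraph_text(paragraph):
    """Join the text of every w:t in document order, ignoring run boundaries."""
    return "".join(t.text or "" for t in paragraph.iterfind(".//w:t", NS))


paragraph = ET.fromstring(SAMPLE)

# Taken one by one, the fragments split the word "Simplified" in two.
fragments = [t.text for t in paragraph.iterfind(".//w:t", NS)]
print(fragments)

# Joined per paragraph, the word is whole again.
print(paragraph_text(paragraph))
```

Reading each w:t on its own reproduces the broken-word effect from the question; joining per paragraph before splitting into words avoids it.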

Now, back to your problem: w:delText is inside runs, too. And, likewise, the fragmentation into runs seems to be purely arbitrary.

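With your current approach, there is no way to know whether the text content of a particular w:delText was a whole word or not. To find out, you have to take into account the full sequence of runs, both those that contain normal text and those that contain deleted text.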
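One possible way to do that, as a rough sketch rather than a finished solution: walk the w:t and w:delText elements of a paragraph in document order, merge neighbouring fragments of the same kind, and only then split the deleted stretches into words. The sample paragraph and the helper names text_segments and deleted_words are hypothetical. The sketch uses the standard library's xml.etree.ElementTree; to my knowledge the calls used here behave the same way in lxml.etree.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
T = f"{{{W}}}t"               # qualified name of w:t
DEL_TEXT = f"{{{W}}}delText"  # qualified name of w:delText

# Hypothetical, simplified paragraph: the deleted word "They" is stored in two runs.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Tools </w:t></w:r>
  <w:del w:id="4" w:author="Author">
    <w:r><w:delText>T</w:delText></w:r>
    <w:r><w:delText xml:space="preserve">hey </w:delText></w:r>
  </w:del>
  <w:ins w:id="5" w:author="Author">
    <w:r><w:t xml:space="preserve">that helps them </w:t></w:r>
  </w:ins>
  <w:r><w:t>work</w:t></w:r>
</w:p>"""


def text_segments(paragraph):
    """Yield (is_deleted, text) pairs, merging adjacent fragments of the same kind."""
    pieces = [
        (node.tag == DEL_TEXT, node.text or "")
        for node in paragraph.iter()  # document order
        if node.tag in (T, DEL_TEXT)
    ]
    for is_deleted, group in groupby(pieces, key=lambda piece: piece[0]):
        yield is_deleted, "".join(text for _, text in group)


def deleted_words(paragraph):
    """Split each merged stretch of deleted text on whitespace."""
    words = []
    for is_deleted, text in text_segments(paragraph):
        if is_deleted:
            words.extend(text.split())
    return words


paragraph = ET.fromstring(SAMPLE)
print(list(text_segments(paragraph)))
print(deleted_words(paragraph))
```

Because the fragments are merged before splitting, 'T' and 'hey ' end up as the single word 'They'. Note that a deletion that starts or ends in the middle of a word would still produce a partial word; handling that case would require looking at the neighbouring normal text as well.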

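Chances are that this works, because deleted text stays in a run at the position where it was deleted. Shown here in OpenXML 2003, which is slightly different, but that does not matter: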

<w:r>
  <w:t>Normal Text before deletion </w:t>
</w:r>
<aml:annotation aml:id="0"
               w:type="Word.Deletion"
               aml:author="Mathias Müller"
               aml:createdate="2014-09-26T22:25:00Z">
  <aml:content>
     <w:r wsp:rsidDel="00F647B7">
        <w:delText>T</w:delText>
     </w:r>
  </aml:content>
</aml:annotation>
<aml:annotation aml:id="1"
               w:type="Word.Deletion"
               aml:author="Mathias Müller"
               aml:createdate="2014-09-26T22:24:00Z">
  <aml:content>
     <w:r wsp:rsidDel="00F647B7">
        <w:delText>hey </w:delText>
     </w:r>
  </aml:content>
</aml:annotation>
<w:r>
  <w:t>Normal Text after deletion </w:t>
</w:r>

Put another way,