How do I get the correct starting/ending locations of an XML tag with SAX?

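There is a Locator in SAX, and it keeps track of the current location. However, when I call it in my startElement(), it always returns the ending location of the XML tag.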

How can I get the starting location of the tag? Is there a graceful way to solve this problem?

Answer 1:

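Unfortunately, the Locator interface provided by the Java system library in the org.xml.sax package does not, by definition, allow for more detailed information about the document location. To quote from the documentation of the getColumnNumber method (highlights added by me):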

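"The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display."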

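According to that specification, you will always get the position "of the first character after the text associated with the document event", on a best-effort basis by the SAX driver. So the short answer to the first part of your question is: no, the Locator does not provide information about the start location of a tag. Also, if your documents contain multi-byte characters, e.g. Chinese or Japanese text, the position you get from the SAX driver is probably not what you want.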
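To make the "approximation" caveat concrete, here is a minimal Python sketch (not part of the original answer) showing that the same text has a different "width" depending on whether columns are counted in code points, UTF-16 code units, or UTF-8 bytes. The helper name column_widths and the sample strings are made up for illustration; which unit a given SAX driver actually counts in is not specified here.

```python
def column_widths(text):
    """Measure one piece of text in three units a parser might count columns in."""
    return {
        "code_points": len(text),                           # Unicode code points
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit units, as in Java char
        "utf8_bytes": len(text.encode("utf-8")),            # bytes in a UTF-8 file
    }


# Hypothetical sample strings, chosen to cover the cases named in the quote.
SAMPLES = {
    "ascii": "abc",
    "cjk": "\u4e2d\u6587",           # two CJK characters
    "surrogate_pair": "\U0001F600",  # one character outside the BMP
    "combining": "e\u0301",          # 'e' followed by a combining acute accent
}

WIDTHS = {label: column_widths(text) for label, text in SAMPLES.items()}

if __name__ == "__main__":
    for label, widths in WIDTHS.items():
        print(label, widths)
```

Only the ASCII sample yields the same number in all three units, which is why a reported column cannot be relied on for editing the original text.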

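If you are after exact positions for tags, or want even more fine-grained information about attributes, attribute content etc., you will have to implement your own location provider.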

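With all the potential encoding issues, Unicode characters etc. involved, I guess this is too big a project to post here, and the implementation will also depend on your specific requirements.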

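Just a quick warning from personal experience: writing a wrapper around the InputStream you pass into the SAX parser is dangerous, because you don't know at what point the SAX parser will report its events relative to what it has already read from the stream.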
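The warning can be illustrated with a short Python sketch (Python's xml.sax rather than the Java API discussed above, so treat it as an analogy). CountingStream and OffsetRecorder are hypothetical names; the sketch records how many bytes the parser had already pulled from the stream at the moment each startElement event fired.

```python
import io
import xml.sax


class CountingStream(io.BytesIO):
    """Hypothetical stream wrapper that counts the bytes handed to the parser."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


class OffsetRecorder(xml.sax.ContentHandler):
    """Records the stream's read counter whenever an element starts."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream
        self.offsets = []

    def startElement(self, name, attrs):
        self.offsets.append((name, self.stream.bytes_read))


def record_offsets(document):
    stream = CountingStream(document)
    handler = OffsetRecorder(stream)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.offsets


DOC = b"<root>\n  <child/>\n  <child/>\n</root>\n"
OFFSETS = record_offsets(DOC)

if __name__ == "__main__":
    for name, offset in OFFSETS:
        print(name, "reported after", offset, "of", len(DOC), "bytes were read")
```

With CPython's expat-based parser, which reads the input in large chunks, a document this small has been read completely before the first event arrives, so the counter says nothing about where each tag sits.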

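You could start by doing some counting of your own in the characters(char[], int, int) method of your ContentHandler, checking for line breaks, tabs etc., in addition to using the Locator information. That should give you a better picture of where in the document you actually are. By remembering the position of the last event, you can calculate the start position of the current one. Keep in mind, though, that you might not see all line breaks, because some can appear inside tags and will never show up in characters; you can, however, deduce those from the Locator information.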
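The counting described above can be sketched as a small helper (in Python, for brevity). The function name advance is hypothetical, and the sketch assumes 0-based columns, counts a tab as a single column, and relies on the parser delivering line ends normalized to "\n", as XML processors are required to do.

```python
def advance(line, column, text):
    """Return the (line, column) reached after consuming `text` from (line, column)."""
    newlines = text.count("\n")
    if newlines == 0:
        # Still on the same line: just move right.
        return line, column + len(text)
    # The column restarts after the last line break in the chunk.
    return line + newlines, len(text) - text.rfind("\n") - 1


# Example: starting at line 1, column 5, a chunk containing one line break.
EXAMPLE = advance(1, 5, "ab\ncd")

if __name__ == "__main__":
    print(EXAMPLE)
```

A handler would call this from characters() and ignorableWhitespace() with the text it receives, and resynchronize with the Locator after each tag, since line breaks inside tags never reach those callbacks.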

Answer 2:

Which SAX parser are you using? Some, I am told, do not provide a Locator facility.

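The output of the simple Python program below gives you the starting row and column number of every element in your XML file. For example, if you indent your XML by two spaces: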

Element: MyRootElem
starts at row 2 and column 0

Element: my_first_elem
starts at row 3 and column 2

Element: my_second_elem
starts at row 4 and column 4

Run it like this: python sax_parser_filename.py my_xml_file.xml

#!/usr/bin/python

import sys
from xml.sax import ContentHandler, make_parser
from xml.sax.xmlreader import Locator

class MySaxDocumentHandler(ContentHandler):
    """
    The document handler class is instantiated as an
    event handler that acts on the various events
    coming from the parser.
    """
    def __init__(self):
        ContentHandler.__init__(self)
        # placeholder locator; the parser installs its own before parsing starts
        self.setDocumentLocator(Locator())

    def startElement(self, name, attrs):
        print("Element: %s" % name)
        print("starts at row %s and column %s\n" % (
            self._locator.getLineNumber(), self._locator.getColumnNumber()))

    def endElement(self, name):
        pass

def mysaxparser(inFileName):
    # create a handler
    handler = MySaxDocumentHandler()
    # create a parser
    parser = make_parser()
    # associate our content handler to the parser
    parser.setContentHandler(handler)
    # open in binary mode so the parser can apply the encoding declared in the XML
    with open(inFileName, 'rb') as inFile:
        # start parser
        parser.parse(inFile)

def main():
    mysaxparser(sys.argv[1])

if __name__ == '__main__':
    main()
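
For a quick check without a file on disk, the same idea can be run against an in-memory document. This is a sketch, not part of the original answer: PositionCollector, element_positions, and SAMPLE are names made up here, and the positions are whatever CPython's default expat-based parser reports.

```python
import xml.sax


class PositionCollector(xml.sax.ContentHandler):
    """Collects (name, line, column) for every element start."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def startElement(self, name, attrs):
        # self._locator is set by ContentHandler.setDocumentLocator(),
        # which the parser calls before it starts reporting events.
        self.positions.append(
            (name, self._locator.getLineNumber(), self._locator.getColumnNumber())
        )


# Hypothetical sample document, indented by two spaces per level.
SAMPLE = b"""<?xml version="1.0"?>
<MyRootElem>
  <my_first_elem>
    <my_second_elem/>
  </my_first_elem>
</MyRootElem>
"""


def element_positions(document):
    handler = PositionCollector()
    xml.sax.parseString(document, handler)
    return handler.positions


POSITIONS = element_positions(SAMPLE)

if __name__ == "__main__":
    for name, line, column in POSITIONS:
        print("Element: %s starts at row %s and column %s" % (name, line, column))
```

With this sample the reported rows and columns should match the listing shown earlier in this answer: rows are 1-based and columns are 0-based.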

Answer 3:

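After all this, I finally figured out a solution. (I was too lazy to post it earlier, sorry.) The characters(), endElement() and ignorableWhitespace() methods are crucial here: in each of them the locator points to a possible starting point of your tag. In characters(), the locator points to the closest ending point of the non-tag content. In endElement(), it points to the ending position of the last tag, which is the starting point of this tag if the two are directly adjacent. In ignorableWhitespace(), it points to the end of a run of whitespace and tabs. As long as we keep track of the ending positions reported in these three methods, we can find the starting point of this tag, and we can already get its ending position from the locator in endElement(). So both the starting point and the ending point of the XML tag can be found.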

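import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class Example extends DefaultHandler{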
private Locator locator;
private SourcePosition startElePoint = new SourcePosition();

public void setDocumentLocator(Locator locator) {
    this.locator = locator;
}
/**
* <a> <- the locator points to here
*   <b>
* </a>
*/
public void startElement(String uri, String localName, 
    String qName, Attributes attributes) {

}
/**
* <a>
*   <b>
* </a> <- the locator points to here
*/
public void endElement(String uri, String localName, String qName)  {
    /* here we can get our source position */
    SourcePosition tag_source_starting_position = this.startElePoint;
    SourcePosition tag_source_ending_position = 
        new SourcePosition(this.locator.getLineNumber(),
            this.locator.getColumnNumber());

    // do your things here

    //update the starting point for the next tag
    this.updateElePoint(this.locator);
}

/**
* some other words <- the locator points to here
* <a>
*   <b>
* </a>
*/
public void characters(char[] ch, int start, int length) {
    this.updateElePoint(this.locator);//update the starting point
}
/**
*the locator points to here-> <a>
*                               <b>
*                             </a>
*/
public void ignorableWhitespace(char[] ch, int start, int length) {
    this.updateElePoint(this.locator);//update the starting point
}
private void updateElePoint(Locator lo){
    SourcePosition item = new SourcePosition(lo.getLineNumber(), lo.getColumnNumber());
    if(this.startElePoint.compareTo(item)<0){
        this.startElePoint = item;
    }
}

class SourcePosition implements Comparable<SourcePosition>{
    private int line;
    private int column;
    public SourcePosition(){
        this.line = 1;
        this.column = 1;
    }
    public SourcePosition(int line, int col){
        this.line = line;
        this.column = col;
    }
    public int getLine(){
        return this.line;
    }
    public int getColumn(){
        return this.column;
    }
    public void setLine(int line){
        this.line = line;
    }
    public void setColumn(int col){
        this.column = col;
    }
    public int compareTo(SourcePosition o) {
        if(o.getLine() > this.getLine() || 
            (o.getLine() == this.getLine() 
                && o.getColumn() > this.getColumn()) ){
            return -1;
        }else if(o.getLine() == this.getLine() && 
            o.getColumn() == this.getColumn()){
            return 0;
        }else{
            return 1;
        }
    }
}

}

Source: How do I get the correct starting/ending locations of a xml tag with SAX?