javax.swing.text.ElementIterator weird behavior

I'm getting a weird behavior with javax.swing.text.ElementIterator(). It never shows all elements, and it shows a different amount of elements depending on what type of ParserCallback I use. The test below is done with the website that is in my profile, but can be done with any other big html file.

// some imports shown in case its an import mixup
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Shows whats in an element, recursively
public void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException
{
    AttributeSet attributes = element.getAttributes();
    System.out.println("element: '" + element.toString().trim() + "', name: '" + element.getName() + "', children: " + element.getElementCount() + ", attributes: " + attributes.getAttributeCount() + ", leaf: " + element.isLeaf());
    Enumeration attrEnum = attributes.getAttributeNames();
    while (attrEnum.hasMoreElements())
    {
        Object attr = attrEnum.nextElement();
        System.out.println("\tAttribute: '" + attr + "', Val: '" + attributes.getAttribute(attr) + "'");
        if (attr == StyleConstants.NameAttribute
                && attributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
        {
            int startOffset = element.getStartOffset();
            int endOffset = element.getEndOffset();
            int length = endOffset - startOffset;
            System.out.printf("\t\tContent (%d-%d): '%s'\n", startOffset, endOffset, htmlDoc.getText(startOffset, length).trim());
        }
    }
    for (int i = 0; i < element.getElementCount(); i++)
    {
        Element child = element.getElement(i);
        printElement(htmlDoc, child);
    }
}

public void tryParse(String filename) 
        throws FileNotFoundException, IOException, BadLocationException
{
    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filename)));

    Parser parser = new ParserDelegator();
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
    ParserCallback callback2 = htmlDoc.getReader(0);
    ParserCallback callback1 =
            new HTMLEditorKit.ParserCallback()
            {
            };

    parser.parse(in, callback2, true);
    ElementIterator iterator = new ElementIterator(htmlDoc);
    Element element;
    while ((element = iterator.next()) != null)
        printElement(htmlDoc, element);
    in.close();
}

In the test above, the results vary if I use callback1 or callback2. Even weirder, if I do fill the callbacks with the appropriate functions and have them output something, they show that the parser does handle the whole website, but the ElementIterator still doesn't have it all.

I've also tried to use htmlKit.read() instead of parser.parse(), but it still doesn't work.

Although I'm now getting my desired results by using the parser callback functions (not shown here), I still wonder why ElementIterator doesn't work as expected in case I need it later, so I wonder if anyone here has experience with that ElementIterator and can answer.

Update: Complete Java Source uploaded here: http://home.snafu.de/tilman/tmp/Main.java

标签： java parsing swing html-parsing

1条回答

叼着烟拽天下

2楼-- · 2019-08-01 11:04

Using the approach seen here, I haven't noticed the problem you describe. I added a println(), and all the elements seem to be there.

Addendum: I'm not sure how your tryParse() fails, but your printElement() seems to work from my main():

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Enumeration;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

/** @see https://stackoverflow.com/questions/2882782 */
public class NewMain {

    public static void main(String args[]) throws Exception {
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlKit.read(new BufferedReader(new FileReader("test.html")), htmlDoc, 0);
        ElementIterator iterator = new ElementIterator(htmlDoc);
        Element element;
        while ((element = iterator.next()) != null) {
            printElement(htmlDoc, element);
        }
    }
    private static void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException {
        AttributeSet attrSet = element.getAttributes();
        System.out.println(""
            + "Element: '" + element.toString().trim()
            + "', name: '" + element.getName()
            + "', children: " + element.getElementCount()
            + ", attributes: " + attrSet.getAttributeCount()
            + ", leaf: " + element.isLeaf());
        Enumeration attrNames = attrSet.getAttributeNames();
        while (attrNames.hasMoreElements()) {
            Object attr = attrNames.nextElement();
            System.out.println("  Attribute: '" + attr + "', Value: '"
                + attrSet.getAttribute(attr) + "'");
            Object tag = attrSet.getAttribute(StyleConstants.NameAttribute);
            if (attr == StyleConstants.NameAttribute
                && tag == HTML.Tag.CONTENT) {
                int startOffset = element.getStartOffset();
                int endOffset = element.getEndOffset();
                int length = endOffset - startOffset;
                System.out.printf("    Content (%d-%d): '%s'\n", startOffset,
                    endOffset, htmlDoc.getText(startOffset, length).trim());
            }
        }
    }
}

0人赞添加讨论(0) 举报

javax.swing.text.ElementIterator weird behavior

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间