javax.swing.text.ElementIterator weird behavior

Posted 2019-08-01 09:59

I'm getting weird behavior from javax.swing.text.ElementIterator. It never shows all elements, and it shows a different number of elements depending on which type of ParserCallback I use. The test below uses the website in my profile, but it can be reproduced with any other large HTML file.

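// some imports shown in case it's an import mix-up
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Enumeration;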
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Shows what's in an element, recursively
public void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException
{
    AttributeSet attributes = element.getAttributes();
    System.out.println("element: '" + element.toString().trim() + "', name: '" + element.getName() + "', children: " + element.getElementCount() + ", attributes: " + attributes.getAttributeCount() + ", leaf: " + element.isLeaf());
    Enumeration<?> attrEnum = attributes.getAttributeNames();
    while (attrEnum.hasMoreElements())
    {
        Object attr = attrEnum.nextElement();
        System.out.println("\tAttribute: '" + attr + "', Val: '" + attributes.getAttribute(attr) + "'");
        if (attr == StyleConstants.NameAttribute
                && attributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
        {
            int startOffset = element.getStartOffset();
            int endOffset = element.getEndOffset();
            int length = endOffset - startOffset;
            System.out.printf("\t\tContent (%d-%d): '%s'\n", startOffset, endOffset, htmlDoc.getText(startOffset, length).trim());
        }
    }
    for (int i = 0; i < element.getElementCount(); i++)
    {
        Element child = element.getElement(i);
        printElement(htmlDoc, child);
    }
}

public void tryParse(String filename) 
        throws FileNotFoundException, IOException, BadLocationException
{
    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filename)));

    Parser parser = new ParserDelegator();
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
    ParserCallback callback2 = htmlDoc.getReader(0);
    ParserCallback callback1 =
            new HTMLEditorKit.ParserCallback()
            {
            };

    parser.parse(in, callback2, true);
    ElementIterator iterator = new ElementIterator(htmlDoc);
    Element element;
    while ((element = iterator.next()) != null)
        printElement(htmlDoc, element);
    in.close();
}

In the test above, the results differ depending on whether I pass callback1 or callback2 to the parser. Even weirder, if I fill in the callbacks with the appropriate methods and have them print something, they show that the parser does process the whole website, but the ElementIterator still doesn't see all of it.

I've also tried to use htmlKit.read() instead of parser.parse(), but it still doesn't work.

Although I now get the results I want by using the parser callback methods (not shown here), I still wonder why ElementIterator doesn't work as expected, in case I need it later. Does anyone here have experience with ElementIterator who can explain this?

Update: Complete Java Source uploaded here: http://home.snafu.de/tilman/tmp/Main.java

1 answer
叼着烟拽天下
#2 · 2019-08-01 11:04

Using the approach seen here, I haven't noticed the problem you describe. I added a println(), and all the elements seem to be there.

Addendum: I'm not sure how your tryParse() fails, but your printElement() seems to work from my main():

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Enumeration;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

/** @see https://stackoverflow.com/questions/2882782 */
public class NewMain {

    public static void main(String[] args) throws Exception {
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        // try-with-resources closes the reader even if read() throws
        try (BufferedReader in = new BufferedReader(new FileReader("test.html"))) {
            htmlKit.read(in, htmlDoc, 0);
        }
        // ElementIterator visits every element in the document tree
        ElementIterator iterator = new ElementIterator(htmlDoc);
        Element element;
        while ((element = iterator.next()) != null) {
            printElement(htmlDoc, element);
        }
    }
    private static void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException {
        AttributeSet attrSet = element.getAttributes();
        System.out.println(""
            + "Element: '" + element.toString().trim()
            + "', name: '" + element.getName()
            + "', children: " + element.getElementCount()
            + ", attributes: " + attrSet.getAttributeCount()
            + ", leaf: " + element.isLeaf());
        Enumeration<?> attrNames = attrSet.getAttributeNames();
        while (attrNames.hasMoreElements()) {
            Object attr = attrNames.nextElement();
            System.out.println("  Attribute: '" + attr + "', Value: '"
                + attrSet.getAttribute(attr) + "'");
            Object tag = attrSet.getAttribute(StyleConstants.NameAttribute);
            if (attr == StyleConstants.NameAttribute
                && tag == HTML.Tag.CONTENT) {
                int startOffset = element.getStartOffset();
                int endOffset = element.getEndOffset();
                int length = endOffset - startOffset;
                System.out.printf("    Content (%d-%d): '%s'\n", startOffset,
                    endOffset, htmlDoc.getText(startOffset, length).trim());
            }
        }
    }
}
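
A possible explanation for the difference, offered as a guess rather than something checked against your Main.java: tryParse() never calls flush() on the callback returned by htmlDoc.getReader(0), whereas HTMLEditorKit.read() does call it once parsing has finished. The reader buffers what it parses and pushes it into the document in batches, so anything still buffered when parsing ends stays out of the tree that ElementIterator walks. An empty ParserCallback such as callback1 never touches the document at all. The sketch below uses a hypothetical class name (FlushDemo) and a made-up HTML string to show where the flush() call would go:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical demo class; the name and the sample markup are illustrative only. */
public class FlushDemo {

    /** Parses html into a fresh HTMLDocument, optionally flushing the reader afterwards. */
    static HTMLDocument parse(String html, boolean flush)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        HTMLEditorKit.ParserCallback reader = doc.getReader(0);
        new ParserDelegator().parse(new StringReader(html), reader, true);
        if (flush) {
            // Pushes any still-buffered changes into the document.
            reader.flush();
        }
        return doc;
    }

    /** Counts the elements that ElementIterator visits in the document. */
    static int countElements(HTMLDocument doc) {
        ElementIterator iterator = new ElementIterator(doc);
        int count = 0;
        while (iterator.next() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input, not taken from the question.
        String html = "<html><body><p>Hello</p><p>World</p></body></html>";
        HTMLDocument flushed = parse(html, true);
        HTMLDocument unflushed = parse(html, false);
        System.out.println("with flush():    " + countElements(flushed) + " elements");
        System.out.println("without flush(): " + countElements(unflushed) + " elements");
    }
}
```

Note also that ElementIterator already visits every element in the tree, so a recursive printElement() called from the iterator loop would print each subtree more than once.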