Parsing HTML Data using Java (DOM parse) [closed]

Posted 2019-01-29 02:08

I've worked on this for a while and didn't find anything related on Stack Overflow. I'm using a parser that is intended to capture snippets of HTML. With the code further below, the output file keeps growing in size: it captures the fields (li) I need, but it is very repetitive, capturing the same data over and over again.

Here's the file that I'm reading from (the full file has over 100 lines; only an excerpt is included here for this post):

<html xlmns=http://www.w3.org/1999/xhtml>
<name>Name: J0719</name>
<bracket><description>Description: <ol><li>Hop Counts: 2</li><li>State: 3</li></eol></description></bracket> 
<name>Name: J0716</name>
<bracket><description>Description: <ol><li>Hop Counts: 3</li><li>State: 2</li></eol></description></bracket> 
<name>Name: J0718</name> 
<bracket><description>Description: <ol><li>Hop Counts: 1</li><li>State: 5</li></eol></description></bracket>
<name>Name: J0726</name>
<bracket><description>Description: <ol><li>Hop Counts: 8</li><li>State: 4</li></eol></description></bracket> 
</html>

My full code is here:

package ReadXMLFile_part2;

import java.io.*;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;


import java.util.Enumeration;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ReadXMLFile_part2 {

public static void main(String[] args) throws Exception {

PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/results2.xml"));
System.setOut(out);

System.out.println("*** JSOUP ***");

File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/output2_TEST.html");
Document doc = null;
    try {
        doc = Jsoup.parse(input,"UTF-8", "http://www.w3.org/1999/xhtml" );
    } catch (IOException ex) {
        Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
    }
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

//For loops to capture the <li> fields in the file
Element bracket = doc.getElementsByTag("bracket").first();
Elements trs = bracket.getElementsByTag("description");
for (Element description : trs) {
    for (Element li : description.getAllElements()) {
        System.out.println(li.text());
    }
}
System.out.println();

//read a line from the console
String lineFromInput = in.readLine();

//output to the file a line
out.println(lineFromInput);                                 
out.close();    
}

}

My question is: how do I parse the fields marked by "li" in the input file so that my output file has a new line for each "li" tag? The ideal output would look like this (and would avoid the infinite loop):

Name: J0719
Hop Counts: 2
State: 3
Name: J0716
Hop Counts: 3
State: 2
Name: J0718
Hop Counts: 1
State: 5
Name: J0726
Hop Counts: 8
State: 4

Thanks and appreciate any help on this!

Sep 2nd update: previousElementSibling was useful on its own, but I needed another nested loop of some sort when also trying to pull out the "Description" fields (otherwise previousElementSibling just kept returning the same first previous element each time). The much quicker workaround I found was to change the tags around in the original input file so that it now looks like this:

Updated XML file:

<html xlmns=http://www.w3.org/1999/xhtml>
<bracket><li>Name: J0719</li>
<description>Description: <ol><li>Hop Counts 2</li><li>State: 3</li></eol></description></bracket>
<bracket><li>Name: J0716</li>
<description>Description: <ol><li>Hop Counts 3</li><li>State: 2</li></eol></description></bracket>
<bracket><li>Name: J0718</li>
<description>Description: <ol><li>Hop Counts 1</li><li>State: 5</li></eol></description></bracket>
<bracket><li>Name: J0719</li>
<description>Description: <ol><li>Hop Counts 8</li><li>State: 4</li></eol></description></bracket>
</html>

Aside from the following 'for' loops, everything else in the original code remained the same:

//Updated Code:
//For loops to capture the (li) fields in the file
Elements brackets = doc.getElementsByTag("bracket");


    for (Element bracket : brackets) {
        Elements lis = bracket.select("li");

            for (Element li : lis){
                System.out.println(li.text());

        }
        break;
    }
    System.out.println();

The only other thing is that I have to manually press the 'stop' button a while after starting the run, once I see that the file size has stopped growing. The output file still contains the desired results.
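
A likely reason for having to press 'stop': the original code still calls in.readLine() on System.in after the loops, and that call blocks until a line is typed into the console. The following is a minimal sketch of writing the results without any console read, assuming the lines have already been collected into a list. The class name WriteResultsSketch, the method writeLines, and the temporary file used in main are hypothetical and only for illustration.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class, not part of the original program.
class WriteResultsSketch {

    // Writes one line per entry to the target file. try-with-resources closes
    // the stream when the block ends, so nothing waits on console input.
    static void writeLines(Path target, List<String> lines) throws IOException {
        try (PrintStream out = new PrintStream(Files.newOutputStream(target), true, "UTF-8")) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real results path (assumption).
        Path target = Files.createTempFile("results2", ".xml");
        writeLines(target, List.of("Name: J0719", "Hop Counts: 2", "State: 3"));
        System.out.println("Wrote " + target);
    }
}
```

With this shape the program ends on its own once the last line is written, rather than idling on System.in.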

1 answer
我想做一个坏孩纸
#2 · 2019-01-29 02:53

If I understand your problem correctly, you are struggling with the fact that the name and bracket nodes in your XML are not children of a common parent node, but simply follow one another. I think one way to get the correct name element when you have the bracket element is to use jsoup's DOM navigation methods, i.e. previousElementSibling().

Here is what your loop could look like:

Elements brackets = doc.getElementsByTag("bracket");
for (Element bracket : brackets) {
    Elements lis = bracket.select("li"); // select() returns Elements, not a single Element
    Element name = bracket.previousElementSibling();
    System.out.println(name.text());
    for (Element li : lis){
      System.out.println(li.text());
    }       
}
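
For comparison, the same "walk the siblings in order" idea can be sketched with the JDK's built-in DOM parser (javax.xml.parsers) instead of jsoup. This is only a sketch under stated assumptions. The input is first repaired into well-formed XML (a quoted xmlns attribute and </ol> in place of </eol>), because a strict XML parser rejects the file as posted. The class name SiblingWalkSketch and the method extractLines are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: uses the JDK DOM parser, not jsoup.
class SiblingWalkSketch {

    // Visits the root's child elements in document order and collects one
    // line per <name> element and one per <li> nested under each <bracket>.
    static List<String> extractLines(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> lines = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes between elements
            }
            Element el = (Element) child;
            if (el.getTagName().equals("name")) {
                lines.add(el.getTextContent().trim());
            } else if (el.getTagName().equals("bracket")) {
                NodeList lis = el.getElementsByTagName("li");
                for (int j = 0; j < lis.getLength(); j++) {
                    lines.add(lis.item(j).getTextContent().trim());
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Repaired two-record excerpt of the input (assumption: </ol>, quoted attribute).
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
                + "<name>Name: J0719</name>\n"
                + "<bracket><description>Description: <ol><li>Hop Counts: 2</li>"
                + "<li>State: 3</li></ol></description></bracket>\n"
                + "<name>Name: J0716</name>\n"
                + "<bracket><description>Description: <ol><li>Hop Counts: 3</li>"
                + "<li>State: 2</li></ol></description></bracket>\n"
                + "</html>";
        for (String line : extractLines(xml)) {
            System.out.println(line);
        }
    }
}
```

Because each element is visited exactly once, every name and li value is emitted once, in file order, and no lookup of a previous sibling is needed.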