Apache POI: Get Headings from a DOC File

Posted 2019-08-17 05:35

Question:

I am experimenting with Apache POI to manipulate Word documents. Is there any way to get the headings from a DOC file? I am able to get the plain text from the document, but I need to tell the headings apart from the rest of the text. Is there a function in the Apache POI API that returns only the headings of an MS Word file?

Answer 1:

Promoting a comment to an answer

There are two ways to make a "Heading" in Word. The "proper" way, and the way that most people seem to do it...

  1. In the styles dropdown, pick the appropriate heading style, write your text, then go back to the normal paragraph style for the next line.

  2. Highlight a line, and bump up the font size and/or make it bold or italic.

If your users are doing #2, you have basically no real hope of reliably identifying the headings. Short of writing some fuzzy matching logic to try to spot when the font size jumps, you're out of luck.
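The fuzzy matching mentioned above could look roughly like the following. This is only a sketch under stated assumptions: `FontJumpHeuristic`, `Line`, and `MAX_HEADING_CHARS` are hypothetical names, not part of Apache POI, and the text, font size, and bold flag of each paragraph are assumed to have been read from the document already. It treats the most frequent font size as body text and flags short lines that are larger than that or bold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FontJumpHeuristic {

    // Assumption: headings are rarely longer than this many characters.
    static final int MAX_HEADING_CHARS = 80;

    /** Hypothetical holder for what was read from one paragraph. */
    public static final class Line {
        final String text;
        final int fontSize; // in points
        final boolean bold;

        public Line(String text, int fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /** Returns the most frequent font size, which is taken to be the body text size. */
    static int bodyFontSize(List<Line> lines) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = -1;
        int bestCount = 0;
        for (Line line : lines) {
            int count = counts.merge(line.fontSize, 1, Integer::sum);
            if (count > bestCount) {
                best = line.fontSize;
                bestCount = count;
            }
        }
        return best;
    }

    /** Flags short, non-empty lines that are larger than body text or bold. */
    public static List<String> guessHeadings(List<Line> lines) {
        int body = bodyFontSize(lines);
        List<String> headings = new ArrayList<>();
        for (Line line : lines) {
            String text = line.text.trim();
            boolean shortLine = !text.isEmpty() && text.length() <= MAX_HEADING_CHARS;
            boolean standsOut = line.fontSize > body || line.bold;
            if (shortLine && standsOut) {
                headings.add(line.text);
            }
        }
        return headings;
    }
}
```

A heuristic like this will produce both false positives (a short bold warning line) and false negatives (headings that differ only by color or spacing), so the thresholds would need tuning against real documents.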

For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with Heading (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.

If you look at the Apache Tika MS Word parser, which is built on top of POI, you'll see a good example there of iterating over the paragraphs and checking the styles.
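The style-name check described for #1 can be sketched in plain Java as below. `HeadingStyles` and `headingLevel` are hypothetical helpers, not POI classes; the style name is assumed to have been obtained from POI already (for example from the style description of the paragraph's style). The check is case-insensitive and also pulls out the heading number when one is present.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadingStyles {

    // Matches names such as "Heading 1", "heading 2" or "HEADING3" after lower-casing.
    private static final Pattern HEADING = Pattern.compile("^heading\\s*(\\d+)?");

    /**
     * Returns the heading level encoded in a style name,
     * 0 if the name starts with "heading" but carries no number,
     * or -1 if the name does not look like a heading style at all.
     */
    public static int headingLevel(String styleName) {
        if (styleName == null) {
            return -1;
        }
        Matcher m = HEADING.matcher(styleName.trim().toLowerCase(Locale.ROOT));
        if (!m.find()) {
            return -1;
        }
        String digits = m.group(1);
        return digits == null ? 0 : Integer.parseInt(digits);
    }

    /** Convenience wrapper: true for any style name that starts with "heading". */
    public static boolean isHeading(String styleName) {
        return headingLevel(styleName) >= 0;
    }
}
```

While walking the paragraphs, one would call `HeadingStyles.isHeading(name)` for each paragraph's style name and collect the paragraph text whenever it returns true.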



Answer 2:

As Gagravarr says:

For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with Heading (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.

With Apache POI, the code for a .docx file looks like this:

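        import java.io.FileInputStream;
        import java.util.List;
        import java.util.Locale;
        import org.apache.poi.xwpf.usermodel.XWPFDocument;
        import org.apache.poi.xwpf.usermodel.XWPFParagraph;
        import org.apache.poi.xwpf.usermodel.XWPFStyle;
        import org.apache.poi.xwpf.usermodel.XWPFStyles;

        public class DocxHeadings {
            public static void main(String[] args) throws Exception {
                // try-with-resources closes both the stream and the document
                try (FileInputStream fis = new FileInputStream("test.docx");
                     XWPFDocument xdoc = new XWPFDocument(fis)) {
                    XWPFStyles styles = xdoc.getStyles();
                    List<XWPFParagraph> paragraphs = xdoc.getParagraphs();
                    for (int i = 0; i < paragraphs.size(); i++) {
                        XWPFParagraph paragraph = paragraphs.get(i);
                        String styleId = paragraph.getStyleID();
                        System.out.println("paragraph style id " + (i + 1) + ":" + styleId);
                        // Paragraphs without an explicit style have no style id
                        XWPFStyle style = (styleId == null || styles == null) ? null : styles.getStyle(styleId);
                        if (style == null || style.getName() == null) {
                            continue;
                        }
                        System.out.println("Style name:" + style.getName());
                        // Compare case-insensitively: the stored name may be "heading 1" or "Heading 1"
                        if (style.getName().toLowerCase(Locale.ROOT).startsWith("heading")) {
                            // this is a heading
                            System.out.println("Heading: " + paragraph.getText());
                        }
                    }
                }
            }
        }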


Answer 3:

At least for HWPF (i.e. the old binary .doc format), and assuming you have a properly formatted file (type #1 in the other answers), you should not rely exclusively on the style name. The name may be language-dependent ("Heading" in English, "Titre" in French, etc.).

Paragraph.getLvl(), which encodes the level at which the paragraph is shown in Word's outline view, often makes a good secondary source. 1 is the most significant level, subsequent numbers up to 8 stand for less significant heading candidates, and 9 is the value that Word assigns to ordinary (non-heading) paragraphs by default.
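One way to combine the two signals is sketched below. `OutlineLevelCheck` and `isHeadingCandidate` are hypothetical names, and the sketch assumes the style name and the value of Paragraph.getLvl() have already been read for each paragraph. It treats 9 as body text, as described above, and accepts a paragraph as a heading candidate if either the style name or the outline level says so.

```java
import java.util.Locale;

public class OutlineLevelCheck {

    // Outline level that Word assigns to ordinary (non-heading) paragraphs.
    static final int BODY_TEXT_LEVEL = 9;

    /**
     * Returns true if either the (possibly localized) style name or the
     * outline level suggests that the paragraph is a heading.
     */
    public static boolean isHeadingCandidate(String styleName, int outlineLevel) {
        boolean nameSaysHeading = styleName != null
                && styleName.toLowerCase(Locale.ROOT).startsWith("heading");
        // Any valid level below the body text level counts as a heading candidate.
        boolean levelSaysHeading = outlineLevel >= 0 && outlineLevel < BODY_TEXT_LEVEL;
        return nameSaysHeading || levelSaysHeading;
    }
}
```

With this, a French document whose heading style is named "Titre 1" would still be picked up through its outline level, while an English "Heading 1" paragraph would be picked up by name even if its level were left at the default.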