Getting text style from docx using Apache poi

Posted 2019-02-18 11:33

Question:

I'm trying to get the style information from an MS docx file. I have no problem writing file content with added styles like bold, italic, font size, etc., but reading the file content and getting the style information is not so clear. I've tried using XWPFDocument, but this API does not seem to have the ability to read the styles. I'm now trying XWPFWordExtractor, which seems a bit more promising, but I'm still stuck getting the style information for the text.

The type of content I am reading looks similar to the following.

"Hello, this is bold text and this is italic text and this is bold-italic text"

Any pointers to an example would be great.

Answer 1:

Okay, so based on the comments from Gagravarr, the solution is below, and it does exactly what I wanted. Gagravarr basically answered the question, but I'm not sure how to give him credit apart from saying it here.

for (XWPFParagraph paragraph : docx.getParagraphs()) {
    int pos = 0; // running character offset within the paragraph
    for (XWPFRun run : paragraph.getRuns()) {
        // Each run is a stretch of text that shares the same formatting.
        System.out.println("Current run IsBold : " + run.isBold());
        System.out.println("Current run IsItalic : " + run.isItalic());
        for (char c : run.text().toCharArray()) {
            System.out.print(c);
            pos++;
        }
        System.out.println();
    }
}


Output:

Current run IsBold : false
Current run IsItalic : false
"Hello, this is
Current run IsBold : true
Current run IsItalic : false
bold text
Current run IsBold : false
Current run IsItalic : false
and this is
Current run IsBold : false
Current run IsItalic : true
italic text
Current run IsBold : false
Current run IsItalic : false
a
Current run IsBold : false
Current run IsItalic : false
n
Current run IsBold : false
Current run IsItalic : false
d this is
Current run IsBold : true
Current run IsItalic : true
bold-italic text
Current run IsBold : false
Current run IsItalic : false
"



Answer 2:

I gave up trying to use Apache POI and found another library called docx4j, which seems to do what I need: the properties I want to look at are now available. Once the docx file is loaded, you can view the content of the file in XML format, like below.

`

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:ns27="http://schemas.openxmlformats.org/schemaLibrary/2006/main" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" mc:Ignorable="w14 wp14">
   <w:body>
      <w:p w:rsidR="009A66AB" w:rsidRDefault="000F4AD1">
         <w:r>
            <w:rPr>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>&quot;Hello, this is</w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="apple-converted-space"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t> </w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="Strong"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:bdr w:val="none" w:color="auto" w:sz="0" w:space="0" w:frame="true"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>bold text</w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="apple-converted-space"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t> </w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>and this is</w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="apple-converted-space"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t> </w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="Emphasis"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:bdr w:val="none" w:color="auto" w:sz="0" w:space="0" w:frame="true"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>italic text</w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="apple-converted-space"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t> </w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>an</w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>d this is</w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="apple-converted-space"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t> </w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rStyle w:val="Emphasis"/>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:b/>
               <w:bCs/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:bdr w:val="none" w:color="auto" w:sz="0" w:space="0" w:frame="true"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>bold-italic text</w:t>
         </w:r>
         <w:r>
            <w:rPr>
               <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/>
               <w:color w:val="222222"/>
               <w:sz w:val="23"/>
               <w:szCs w:val="23"/>
               <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
            </w:rPr>
            <w:t>&quot;</w:t>
         </w:r>
      </w:p>
      <w:sectPr w:rsidR="009A66AB">
         <w:pgSz w:w="11906" w:h="16838"/>
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
         <w:cols w:space="708"/>
         <w:docGrid w:linePitch="360"/>
      </w:sectPr>
   </w:body>
</w:document>

`
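In the XML above, the "bold text" and "italic text" runs carry no direct w:b or w:i element; their formatting comes from the character styles Strong and Emphasis referenced by w:rStyle, and only the last styled run adds a direct w:b. A minimal sketch of listing both kinds of information per run, using only the JDK's DOM parser, could look like the following. The class and method names are hypothetical (they are not part of docx4j or Apache POI), and treating "false", "0" and "off" as the only "not set" values is a simplifying assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of docx4j or Apache POI.
final class RunFormatting {
    private static final String W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one line per w:r element: text, character style id, direct bold/italic flags.
    static List<String> describeRuns(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match elements by the w: namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
            List<String> lines = new ArrayList<>();
            NodeList runs = doc.getElementsByTagNameNS(W, "r");
            for (int i = 0; i < runs.getLength(); i++) {
                Element run = (Element) runs.item(i);
                lines.add(textOf(run) + " | style=" + styleOf(run)
                        + " | b=" + isOn(run, "b") + " | i=" + isOn(run, "i"));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    // Concatenates the content of all w:t elements inside the run.
    private static String textOf(Element run) {
        StringBuilder text = new StringBuilder();
        NodeList parts = run.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < parts.getLength(); i++) {
            text.append(parts.item(i).getTextContent());
        }
        return text.toString();
    }

    // Returns the w:val of the first w:rStyle in the run, or "-" if there is none.
    private static String styleOf(Element run) {
        NodeList styles = run.getElementsByTagNameNS(W, "rStyle");
        if (styles.getLength() == 0) {
            return "-";
        }
        return ((Element) styles.item(0)).getAttributeNS(W, "val");
    }

    // True if the toggle element (for example w:b) is present and not switched off.
    private static boolean isOn(Element run, String localName) {
        NodeList found = run.getElementsByTagNameNS(W, localName);
        if (found.getLength() == 0) {
            return false;
        }
        String val = ((Element) found.item(0)).getAttributeNS(W, "val");
        return !(val.equals("false") || val.equals("0") || val.equals("off"));
    }
}
```

To decide whether a run is really bold or italic, the style ids reported here would still have to be looked up in styles.xml, since this sketch only reports what is written on the run itself.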



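Answer 3:

You can use paragraph.getCTP().getPPr().getRPr().isSetB(). Note that this reads the run properties attached to the paragraph mark rather than those of an individual run, and getPPr() or getRPr() can return null when the paragraph has no such properties, so check for null first.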



Answer 4:

I found a very nice way to copy styles from one document to another. It is not as direct as I would have hoped but it works.

  1. Rename the source Word document so that it has a .zip extension
  2. Extract the contents
  3. Copy styles.xml into a string constant or read the file
  4. Copy the styles into your output document with the following code

    public void copyStylesXml(String stylesXmlString) {
       try {
          CTStyles ctStyle = CTStyles.Factory.parse(stylesXmlString);
          XWPFStyles styles = getDoc().createStyles();
          styles.setStyles(ctStyle);
       } catch (Exception e) {
          log.warn(e, e);
       }
    }
    

The same approach works for copying list formats.
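Steps 1 to 3 can probably be skipped by reading styles.xml straight out of the .docx, since the file is already a zip archive. Here is a minimal sketch using the JDK's zip file system. The class and method names are hypothetical, and it assumes the entry lives at word/styles.xml and is UTF-8 encoded, which is what Word normally writes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of Apache POI.
final class DocxStylesReader {
    // Assumed location of the styles part inside a .docx package.
    private static final String STYLES_ENTRY = "word/styles.xml";

    // Opens the .docx as a zip file system and returns styles.xml as a String.
    static String readStylesXml(Path docx) throws IOException {
        // The cast selects the (Path, ClassLoader) overload on every JDK version.
        try (FileSystem zip = FileSystems.newFileSystem(docx, (ClassLoader) null)) {
            byte[] bytes = Files.readAllBytes(zip.getPath(STYLES_ENTRY));
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

The returned string could then be passed to copyStylesXml(...) from the code above in place of a hand-copied constant.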



Answer 5:

Here is a very good way to copy styles from another document. A little background: a docx file is really a zip file containing a number of XML files, including styles.xml. In the following code sample I read styles.xml, parse it into a CTStyles object, then set it in the current document. Here is most of the code. You can use the same approach to copy numbering.xml for your Word numbering.

// Copy an existing styles.xml into this document to get its styles.
public void copyStylesFromDocument(String documentFileName) {
    log.debug("fileName " + documentFileName);
    try {
        InputStream is = CertificationReportHelper.getInputStreamFromZipFile(documentFileName, FILE_NAME_STYLES);
        CTStyles ctStyle = CTStyles.Factory.parse(is);
        XWPFStyles styles = getDoc().createStyles();
        styles.setStyles(ctStyle);
        log.info("Styles copied from file " + FILE_NAME_STYLES + " in document " + documentFileName);
    } catch (Exception e) {
        String msg = "Error copying styles from file " + FILE_NAME_STYLES + " in document " + documentFileName;
        addErrorMessage(msg, e);
        log.debug(e, e);
    }
}

@SuppressWarnings("resource") // closing the ZipFile also closes the returned input stream, and the operation fails
public static InputStream getInputStreamFromZipFile(String zipFileName, String containedFile) {
    InputStream is = null;
    ZipFile zfile = null;
    try {
        zfile = new ZipFile(zipFileName);
        ZipEntry entry = zfile.getEntry(containedFile);
        log.trace(entry);
        if (entry != null) {
            is = zfile.getInputStream(entry);
            log.trace("created input stream for file " + containedFile + " from zip file " + zipFileName);
        } else {
            String msg = "Error getting input stream for file " + containedFile + " from zip file " + zipFileName;
            // closing stream causes input stream to close and operation fails
            throw new ApplicationRuntimeException(msg);
        }
    } catch (Exception e) {
        String msg = "Error getting input stream for file " + containedFile + " from zip file " + zipFileName + "  Message:"
                + e.getMessage();
        log.warn("*** Throwing exception " + msg);
        throw new ApplicationRuntimeException(msg, e);
    } finally {
        // closing stream causes input stream to close and operation fails
        // try {
        // zfile.close();
        // } catch (IOException e) {
        // log.warn("Catching exception "+e+" closing zip file "+zipFileName);
        // }
    }
    return is;
}
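The helper above leaves the ZipFile open because closing it would also close the stream it handed out. One possible way around that, sketched here with hypothetical names, is to copy the entry into memory first so the ZipFile can be closed before returning. This assumes the entry is small enough to buffer, which should hold for a typical styles.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical alternative to getInputStreamFromZipFile; names are illustrative only.
final class ZipEntryStreams {
    // Returns an in-memory stream over one zip entry; the ZipFile is closed before returning.
    static InputStream openBuffered(String zipFileName, String containedFile) throws IOException {
        try (ZipFile zip = new ZipFile(zipFileName)) {
            ZipEntry entry = zip.getEntry(containedFile);
            if (entry == null) {
                throw new IOException("No entry " + containedFile + " in zip file " + zipFileName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Copy the whole entry so the result no longer depends on the ZipFile.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                return new ByteArrayInputStream(buffer.toByteArray());
            }
        }
    }
}
```

With this variant the caller gets a stream that stays readable after the archive is closed, so the @SuppressWarnings("resource") workaround would no longer be needed.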