How to extract formatting information of word docu

Posted 2019-07-14 14:43

Question:

I am using Apache POI to extract formatting information from MS Word files.

I want to extract information such as whether a paragraph has a bullet, its background color, foreground color, alignment, etc.

There are few tutorials and little documentation available for this, and the Javadoc does not contain much helpful information either.

Where can I find tutorials or good documentation to help me learn the Apache POI API?

Answer 1:

For HWPF (.doc), the classes you probably want are:

  • http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/ParagraphProperties.html
  • http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/CharacterProperties.html
  • http://poi.apache.org/apidocs/org/apache/poi/hwpf/model/StyleDescription.html

Depending on the exact property you want, it may be on the paragraph or the character properties.
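To illustrate that split, here is a minimal sketch that reads paragraph-level properties (style, alignment, list membership) from each Paragraph and character-level properties (font, size, bold, italic, color index) from each CharacterRun. The class name HwpfFormattingDump, the alignmentName helper, and the default path sample.doc are made up for illustration, and the mapping of justification codes to names is my reading of the .doc format rather than something POI documents as constants. Paragraph shading (background color) is left out here.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Hypothetical helper class, not part of POI.
class HwpfFormattingDump {

    // Maps the numeric justification code of a .doc paragraph to a readable
    // name (assumed mapping: 0 left, 1 center, 2 right, 3 justified).
    static String alignmentName(int justification) {
        switch (justification) {
            case 0: return "left";
            case 1: return "center";
            case 2: return "right";
            case 3: return "justified";
            default: return "other(" + justification + ")";
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.doc" is a placeholder; pass a real .doc path as the first argument.
        String path = args.length > 0 ? args[0] : "sample.doc";
        try (InputStream in = new FileInputStream(path);
             HWPFDocument doc = new HWPFDocument(in)) {
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph p = range.getParagraph(i);

                // Paragraph-level properties.
                StyleDescription style =
                        doc.getStyleSheet().getStyleDescription(p.getStyleIndex());
                String styleName = style != null ? style.getName() : "?";
                System.out.printf("paragraph %d: style=%s align=%s inList=%b level=%d%n",
                        i, styleName, alignmentName(p.getJustification()),
                        p.isInList(), p.getIlvl());

                // Character-level properties, one entry per formatting run.
                for (int j = 0; j < p.numCharacterRuns(); j++) {
                    CharacterRun run = p.getCharacterRun(j);
                    System.out.printf(
                            "  run %d: font=%s size=%.1fpt bold=%b italic=%b colorIndex=%d text=%s%n",
                            j, run.getFontName(), run.getFontSize() / 2.0, // stored in half-points
                            run.isBold(), run.isItalic(), run.getColor(), run.text().trim());
                }
            }
        }
    }
}
```

This needs the POI scratchpad jar (where HWPF lives) on the classpath in addition to the core POI jar.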

The best example I can think of for reading a Word document with HWPF, getting its text, and checking styles and formatting is WordExtractor from Apache Tika: https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

(XWPF, for .docx files, is similar.)
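For comparison, a rough XWPF sketch along the same lines: paragraph-level properties come from XWPFParagraph and character-level ones from XWPFRun. The class name XwpfFormattingDump and the describe method are hypothetical; the example builds a small document in memory so that it does not depend on any particular .docx file, and treating a non-null getNumID() as "the paragraph is in a list" is a simplifying assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.ParagraphAlignment;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical helper class, not part of POI.
class XwpfFormattingDump {

    // Returns one line per body paragraph, followed by one line per run in it.
    static List<String> describe(XWPFDocument doc) {
        List<String> lines = new ArrayList<>();
        for (XWPFParagraph p : doc.getParagraphs()) {
            // Assumption: a paragraph with a numbering ID is a list item.
            boolean inList = p.getNumID() != null;
            lines.add("paragraph align=" + p.getAlignment()
                    + " style=" + p.getStyle()
                    + " inList=" + inList);
            for (XWPFRun r : p.getRuns()) {
                lines.add("  run bold=" + r.isBold()
                        + " italic=" + r.isItalic()
                        + " color=" + r.getColor()   // hex RGB string, or null if unset
                        + " font=" + r.getFontFamily()
                        + " text=" + r.text());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny document in memory instead of reading a file.
        try (XWPFDocument doc = new XWPFDocument()) {
            XWPFParagraph p = doc.createParagraph();
            p.setAlignment(ParagraphAlignment.CENTER);
            XWPFRun r = p.createRun();
            r.setBold(true);
            r.setColor("FF0000");
            r.setText("Hello");

            describe(doc).forEach(System.out::println);
        }
    }
}
```

To inspect a real file, the same describe method can be given a document opened with new XWPFDocument(inputStream). XWPF lives in the poi-ooxml jar rather than the scratchpad jar.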