I am working with Apache POI to manipulate Word documents. Is there any way to get the headings from a .doc file? I am able to get the plain text from the document, but I need to distinguish the headings from the rest of the text. Is there any function in the Apache POI API that returns only the headings of an MS Word file?
Promoting a comment to an answer
There are two ways to make a "Heading" in Word. The "proper" way, and the way that most people seem to do it...
1. In the styles dropdown, pick the appropriate header style, write your text, then go back to the normal paragraph style for the next line.
2. Highlight a line, and bump up the font size + make it bold or italic.
If your users are doing #2, you've basically no real hope of identifying the Headings. Short of writing some fuzzy matching logic to try to spot when the font size jumps, you're out of luck
For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with "Heading" (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.
If you look at the Apache Tika MS-Word parser, which is built on top of POI, you'll see a good example there of iterating over the paragraphs and checking the styles.
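The style-name check described above can be sketched in plain Java. This is only an illustration, not code from POI or Tika: the class HeadingStyles and its methods are hypothetical names, and the level parsing assumes the built-in English style names of the form "Heading 1", "Heading 2", and so on. In HWPF the style name itself would typically be read as shown in the comment.

```java
// Hypothetical helper, not part of Apache POI.
// With HWPF, the style name of a paragraph is usually obtained via:
//   doc.getStyleSheet().getStyleDescription(paragraph.getStyleIndex()).getName()
final class HeadingStyles {
    private static final String PREFIX = "Heading";

    private HeadingStyles() {}

    /** True if the style name starts with "Heading", ignoring case. */
    static boolean isHeadingStyle(String styleName) {
        return styleName != null
                && styleName.regionMatches(true, 0, PREFIX, 0, PREFIX.length());
    }

    /**
     * Level parsed from names such as "Heading 2". Returns -1 when the name is
     * not a heading style or has no plain number after the prefix.
     */
    static int headingLevel(String styleName) {
        if (!isHeadingStyle(styleName)) {
            return -1;
        }
        String rest = styleName.substring(PREFIX.length()).trim();
        if (rest.isEmpty() || !rest.chars().allMatch(Character::isDigit)) {
            return -1;
        }
        try {
            return Integer.parseInt(rest);
        } catch (NumberFormatException e) {
            return -1; // digit run too long to fit in an int
        }
    }
}
```

A caller would loop over the paragraphs of the document, pass each style name to isHeadingStyle, and collect the paragraph text when it returns true.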
At least for HWPF (i.e. the old binary doc format) and if you have a properly formatted file (so type #1 of the other answers) you should not rely exclusively on the style name - in fact, this may be a language-dependent value ("Heading" in English, "Titre" in French, etc.).
Paragraph.getLvl(), which encodes the level where the respective paragraph is shown in Word's outline view, often makes a good secondary source. 1 constitutes the most significant level, all subsequent numbers up to 8 stand for less significant heading candidates, and 9 is the value that Word assigns to ordinary (non-heading) paragraphs by default.
Just as Gagravarr says:
For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with Heading (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.
This can be done with Apache POI code like this:
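The following is a minimal sketch, not the poster's original snippet. It shows one way the per-paragraph decision might combine the two signals discussed in this thread: the style name and the outline level. HeadingDetector and isHeadingCandidate are hypothetical names, and the threshold of 9 for body text is taken from the earlier answer rather than verified against a particular POI version. The POI calls that would supply the two inputs appear only in comments.

```java
// Hypothetical helper, not part of Apache POI.
// Inputs would typically come from HWPF, per paragraph p of document doc:
//   String styleName = doc.getStyleSheet()
//                         .getStyleDescription(p.getStyleIndex()).getName();
//   int outlineLevel = p.getLvl();
final class HeadingDetector {
    /** Outline level described in this thread as the default for body text. */
    static final int BODY_TEXT_LEVEL = 9;

    private HeadingDetector() {}

    /**
     * Treats a paragraph as a heading candidate if its style name starts with
     * "Heading" (case insensitive) or its outline level is below the body-text
     * level. The second test still works when the style name is localized.
     */
    static boolean isHeadingCandidate(String styleName, int outlineLevel) {
        boolean nameMatches = styleName != null
                && styleName.regionMatches(true, 0, "Heading", 0, "Heading".length());
        boolean outlined = outlineLevel >= 0 && outlineLevel < BODY_TEXT_LEVEL;
        return nameMatches || outlined;
    }
}
```

Accepting either signal favors recall: a French "Titre 1" paragraph is still caught through its outline level, at the cost of also flagging any body paragraph that was manually given an outline level.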