I'm looking for something in Java to read Word documents and process their text. All I need is their text, nothing fancy. I know about Apache POI, but it doesn't support DOCX right now. Is there anything else out there?
Try Apache POI: it can handle doc, docx, xls, xlsx, ppt, and pptx.
Another production-level solution is OpenOffice in headless mode, which can even be used in a server-side scenario.
You could try docx4j; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java
With some googling I found OpenXML4J, which might solve your issue. I have not used it myself, but I am sure someone in the community will have better insight.
Note: This is a duplicate question. This has the solution plus a bit of discussion. Link to the question.
If you don't need formatting information, images, or other fancy stuff, then the job is a lot easier. Some 5 to 10 lines of code will do.
This applies only if you need just the text.
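To illustrate the text-only approach, here is a minimal sketch that uses only the JDK (Java 9 or later). It relies on the fact that a .docx file is a ZIP archive whose main body lives in word/document.xml, with the visible text stored in w:t elements inside w:p paragraphs. The class name DocxText and the method extractText are made up for this example. It is not the code from the answer above, and it deliberately ignores tabs, manual line breaks, headers, footers, and footnotes, and paragraphs nested in text boxes may be repeated.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, named for this example only.
public class DocxText {
    // Namespace of the WordprocessingML elements (w:p, w:t, ...).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                // Read just this entry into memory before parsing.
                byte[] xmlBytes = zip.readAllBytes();

                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations so untrusted files cannot use external entities.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xmlBytes));

                StringBuilder text = new StringBuilder();
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    // Concatenate every text run in the paragraph.
                    NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    text.append('\n');
                }
                return text.toString();
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxText <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractText(in));
        }
    }
}
```

Run it as `java DocxText report.docx` (the file name is a placeholder). If you later need headers, tables with structure, or formatting, one of the libraries mentioned above is probably the better route.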