Why the text extracted from PDF using PDF text ext

Posted 2019-05-14 10:50

Question:

I extracted text from a PDF using both Apache PDFBox and iText, but in both cases the extracted text is completely unstructured and messy.

This is

but the extracted text is:

111111 1111111111111111111111111111111111111111111111111111111111111
US008631488B2
(12) United States Patent (10) Patent No.: US 8,631,488 B2
Oz et al.
(45) Date of Patent: Jan. 14,2014
6,813,682 B2 1112004 Bress et al.
(54) SYSTEMS AND METHODS FOR PROVIDING
7,065,644 B2 Daniell et al.
6/2006
SECURITY SERVICES DURING POWER
Todd et al.
7,076,690 Bl 7/2006
MANAGEMENT MODE
7,086,089 B2 8/2006 Hrastar et al.
7,184,554 B2 2/2007 Freese
(75) Inventors: Ami Oz, Azur (IL); Shlomo Touboul,
7,283,542 B2
10/2007 Mitchell
7,353,533 B2 Wright et al.
Kefar Haim (IL) 4/2008
Maufer et al.
7,359,983 Bl 4/2008
7,360,242 B2 4/2008 Syvanne
(73) Assignee: CUPP Computing AS, Bergen (NO)
7,418,253 B2 8/2008 Kavanagh
(Continued)
Notice: Subject to any disclaimer, the term of this
( * )
patent is extended or adjusted under 35
FOREIGN PATENT DOCUMENTS
U.S.c. 154(b) by 656 days. wo 2000078008 12/2000
Appl. No.: 12/535,650
(21)
WO 2004030308 4/2004
(22) Filed: Aug. 4, 2009
OTHER PUBLICATIONS
Breeden H, John et al., "A Hardware FirewallYou TakeWithYou,"
(65) Prior Publication Data
Government Computer News, located at http:/gcn.com!Articles/
US 2010/0037321 Al Feb. 11,2010
2005/06/0 11A-hardware-firewall-you-take-with-you.aspx?p~1, Jun.
1,2005.

Why is this happening? How can I solve it?

Answer 1:

The PDF format is designed to allow a document to be displayed and printed correctly, not to allow structured access to the text content. Extracting text from a PDF document is similar to running the printed page through OCR software: you may not have to recognize the glyphs and convert them to characters, but the structure and logical text flow of the document still have to be estimated.

If you go beyond the naive text extraction examples, both iText and PDFBox (if I remember correctly) give you much more detailed access to the document elements. In this case you would need both the text content and its position on the page to reconstruct the content in a meaningful way.
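
To make the idea of position-based reconstruction concrete, here is a minimal sketch in plain Java. It does not call PDFBox or iText at all: `Chunk`, `rowMajor`, `columnAware`, the sample coordinates, and the column boundary of 300 are all hypothetical and only stand in for whatever positioned fragments your library gives you. In PDFBox, as far as I know, you can collect such fragments by subclassing `PDFTextStripper` and overriding `writeString(String, List<TextPosition>)`; iText has location-aware extraction strategies for the same purpose. The sketch assumes a simple two-column page with a known split and y measured from the top of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ColumnReflow {

    /** Hypothetical stand-in for one positioned line fragment from a PDF library. */
    static final class Chunk {
        final double x;    // left edge of the fragment
        final double y;    // distance from the top of the page (assumption: grows downward)
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Top-to-bottom first, then left-to-right.
    static final Comparator<Chunk> TOP_DOWN =
            Comparator.comparingDouble((Chunk c) -> c.y).thenComparingDouble(c -> c.x);

    /** Naive order across the full page width; interleaves the two columns. */
    static String rowMajor(List<Chunk> chunks) {
        return chunks.stream().sorted(TOP_DOWN)
                .map(c -> c.text).collect(Collectors.joining("\n"));
    }

    /** Reads everything left of splitX top-down, then everything right of it. */
    static String columnAware(List<Chunk> chunks, double splitX) {
        List<Chunk> ordered = new ArrayList<>();
        chunks.stream().filter(c -> c.x < splitX).sorted(TOP_DOWN).forEach(ordered::add);
        chunks.stream().filter(c -> c.x >= splitX).sorted(TOP_DOWN).forEach(ordered::add);
        return ordered.stream().map(c -> c.text).collect(Collectors.joining("\n"));
    }

    /** Text taken from the question; the coordinates are made up for illustration. */
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk(40, 100, "(54) SYSTEMS AND METHODS FOR PROVIDING"),
                new Chunk(320, 106, "7,086,089 B2 8/2006 Hrastar et al."),
                new Chunk(40, 112, "SECURITY SERVICES DURING POWER"),
                new Chunk(320, 118, "7,184,554 B2 2/2007 Freese"),
                new Chunk(40, 124, "MANAGEMENT MODE"));
    }

    public static void main(String[] args) {
        List<Chunk> page = sample();
        System.out.println(rowMajor(page));        // title lines mixed with citations
        System.out.println("-----");
        System.out.println(columnAware(page, 300)); // title first, then citations
    }
}
```

The row-major output mixes the patent title with the citation list in the same way as the text in the question, while the column-aware output keeps each column together. A real document needs more than a fixed split (headers spanning both columns, tables, detecting the boundary automatically), so treat this as a starting point only.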