currently, if I make a page object of a pdf page with pyPdf, and extractText(), what happens is that lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world" the resulting text returned from extractText() is "helloworld" instead of "hello world." Does anyone know how to fix this, or have suggestions for a work around? I really need the text to have spaces in between the lines because i'm doing text mining on this pdf text and not having spaces in between lines kills it....
相关问题
- how to define constructor for Python's new Nam
- streaming md5sum of contents of a large remote tar
- How to get the background from multiple images by
- Evil ctypes hack in python
- Correctly parse PDF paragraphs with Python
This is a common problem with pdf parsing. You can also expect trailing dashes that you will have to fix in some cases. I came up with a workaround for one of my projects which I will describe here shortly:
I used pdfminer to extract XML from PDF and also found concatenated words in the XML. I extracted the same PDF as HTML and the HTML can be described by lines of the following regex:
The spans are positioned absolutely and have a top-style that you can use to determine if a line break happened. If a line break happened and the last word on the last line does not have a trailing dash you can separate the last word on the last line and the first word on the current line. It can be tricky in the details, but you might be able to fix almost all text parsing errors.
Additionally you might want to run a dictionary library like enchant over your text, find errors and if the fix suggested by the dictionary is like the error word but with a space somewhere, the error word is likely to be a parsing error and can be fixed with the dictionaries suggestion.
Parsing PDF sucks and if you find a better source, use it.