python and pyPdf - how to extract text from the pa

Currently, if I create a page object for a PDF page with pyPdf and call extractText(), the lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world", the text returned from extractText() is "helloworld" instead of "hello world". Does anyone know how to fix this, or have suggestions for a workaround? I really need spaces between the lines because I'm doing text mining on this PDF text, and the missing spaces between lines kill it.

Tags: python text formatting pypdf

1 answer

Anthone

#2 · 2020-07-16 08:55

This is a common problem with PDF parsing. You can also expect trailing dashes that you will have to fix in some cases. I came up with a workaround for one of my projects, which I will describe briefly here:

I used pdfminer to extract XML from the PDF and also found concatenated words in the XML. I then extracted the same PDF as HTML, and the lines of that HTML match the following regex:

<span style="position:absolute; writing-mode:lr-tb; left:[0-9]+px; top:([0-9]+)px; font-size:[0-9]+px;">([^<]*)</span>

The spans are positioned absolutely and have a top style that you can use to determine whether a line break happened. If a line break happened and the last word on the previous line does not have a trailing dash, you can separate the last word on the previous line from the first word on the current line with a space. The details can be tricky, but you might be able to fix almost all text parsing errors this way.
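
A minimal sketch of that idea, assuming the HTML really consists of spans matching the regex above. The function name join_spans is made up for illustration, and treating every trailing dash as a hyphenated word break is a simplification that will be wrong for some texts:

```python
import re

# Regex from the answer; group 1 is the "top" offset, group 2 is the span text.
SPAN_RE = re.compile(
    r'<span style="position:absolute; writing-mode:lr-tb; '
    r'left:[0-9]+px; top:([0-9]+)px; font-size:[0-9]+px;">([^<]*)</span>'
)

def join_spans(html):
    """Join span texts, inserting a space whenever the top offset changes."""
    pieces = []
    prev_top = None
    for top, text in SPAN_RE.findall(html):
        text = text.strip()
        if not text:
            continue
        if prev_top is not None and top != prev_top:
            # The top offset changed, so a new line started.
            if pieces[-1].endswith("-"):
                # Assumption: a trailing dash marks a word split across lines,
                # so drop the dash and rejoin the two halves.
                pieces[-1] = pieces[-1][:-1]
            else:
                pieces.append(" ")
        pieces.append(text)
        prev_top = top
    return "".join(pieces)
```

Spans that share the same top value are joined without a space here; depending on how the PDF was produced, you may need to compare the left offsets as well.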

Additionally, you might want to run a dictionary library like enchant over your text to find errors. If the fix suggested by the dictionary is the same as the error word but with a space somewhere, the error word is likely a parsing error and can be fixed with the dictionary's suggestion.
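
The dictionary check could look roughly like this. It is only a sketch: fix_joined_word is a hypothetical helper, and it accepts any object with check(word) and suggest(word) methods, which is the interface pyenchant's Dict objects provide:

```python
def fix_joined_word(word, dictionary):
    """Return a split version of `word` if the dictionary suggests one.

    `dictionary` is any object with check(word) and suggest(word) methods,
    for example a pyenchant Dict (assumed here, not imported).
    """
    if dictionary.check(word):
        return word  # already a known word, nothing to fix
    for suggestion in dictionary.suggest(word):
        # Accept only suggestions that equal the original with spaces added.
        if " " in suggestion and suggestion.replace(" ", "") == word:
            return suggestion
    return word  # no matching suggestion, leave the word unchanged
```

With pyenchant installed, something like enchant.Dict("en_US") could be passed as the dictionary argument. Whether a joined word such as "helloworld" gets a usable suggestion depends on the dictionary backend.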

Parsing PDF sucks and if you find a better source, use it.
