PDF to HTML Parsing in Python 3

Posted 2019-10-20 00:22

Question:

I have been trying to find a way to convert a PDF to HTML so that I can later use BeautifulSoup to extract all the headings, sub-items, and paragraphs into a tree structure.
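To make the target concrete, here is a minimal sketch of the tree I have in mind. It assumes the converted HTML carries real `<h1>`-`<h6>` and `<p>` tags, which is exactly what I have not managed to get so far. The names `OutlineParser` and `build_outline` and the sample markup are made up for illustration, and the sketch uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {f"h{i}": i for i in range(1, 7)}


class OutlineParser(HTMLParser):
    """Collect headings and paragraphs into a nested outline (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = {"title": None, "level": 0, "paragraphs": [], "children": []}
        self._stack = [self.root]  # path from the root to the current heading
        self._tag = None           # heading or <p> tag currently being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())  # normalise whitespace
        self._tag = None
        if tag == "p":
            if text:
                # A paragraph belongs to the most recent heading.
                self._stack[-1]["paragraphs"].append(text)
            return
        level = HEADING_LEVELS[tag]
        # Pop until the top of the stack is a shallower heading (the parent).
        while self._stack[-1]["level"] >= level:
            self._stack.pop()
        node = {"title": text, "level": level, "paragraphs": [], "children": []}
        self._stack[-1]["children"].append(node)
        self._stack.append(node)


def build_outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.root


# Made-up sample input, not output from any real converter.
sample = """
<h1>Scope</h1><p>General notes.</p>
<h2>Materials</h2><p>Concrete.</p><p>Steel.</p>
<h2>Finishes</h2>
<h1>Schedule</h1><p>Phase one.</p>
"""
outline = build_outline(sample)
```

Each node holds its own paragraphs and a list of child headings, so subheadings nest under the nearest preceding heading of a shallower level.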

I have tried a few of the options available online, but without success so far. Here is the code I used to convert a PDF to HTML with pdfminer.six:

from io import BytesIO

from pdfminer.converter import HTMLConverter
from pdfminer.image import ImageWriter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    outfp = BytesIO()  # HTMLConverter writes the encoded HTML into this buffer
    codec = 'utf-8'
    laparams = LAParams()
    # Pass codec and laparams explicitly; extracted images are saved under 'out'
    device = HTMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                           imagewriter=ImageWriter('out'))
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means all pages
    caching = True
    pagenos = set()
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
    device.close()  # flush the closing HTML before reading the buffer
    # Read from the buffer the converter actually wrote to
    html = outfp.getvalue().decode(codec)
    outfp.close()
    return html

convert_pdf_to_html('PDF - Remraam Ph 1 Mosque.pdf')

However, the code above raises the following error, which I have been unable to fix:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pdfminer\pdftypes.py in decode(self)
    293                 data = ccittfaxdecode(data, params)
    294             elif f == LITERAL_CRYPT:
--> 295                 raise PDFNotImplementedError('Crypt filter is unsupported')
    296             else:
    297                 raise PDFNotImplementedError('Unsupported filter: %r' % f)

TypeError: not all arguments converted during string formatting

Then I tried Apache Tika. The problem with Apache Tika is that I could not get the content out as properly structured HTML: its output wraps everything in '<p>' tags, so there was no way for me to extract heading and subheading tags from the PDF document. Here is the code I used:

from tika import parser
parsed_data_full = parser.from_file('PDF - Remraam Ph 1 Mosque.pdf',xmlContent=True)
parsed_data_full = parsed_data_full['content']
print(parsed_data_full)

I'm not sure how to convert a PDF to HTML properly, so that I can use the HTML tags to identify the document's headings and subheadings.
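One fallback I can imagine, if no converter emits real heading tags, is to guess headings from font size. The sketch below is only a heuristic under stated assumptions: it assumes the HTML contains `<span>` elements with an inline `font-size:NNpx` style (as far as I can tell, this is what pdfminer's HTMLConverter writes), and it treats the size that covers the most text as body text and anything larger as a heading. The function name `classify_spans` and the sample markup are made up for illustration.

```python
import re
from collections import Counter

# Matches spans with an inline pixel font size (assumed markup shape).
SPAN_RE = re.compile(
    r'<span style="[^"]*font-size:\s*(\d+)px[^"]*">(.*?)</span>',
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def classify_spans(html):
    """Return (kind, text) pairs, where kind is 'heading' or 'body'."""
    spans = []
    for size, inner in SPAN_RE.findall(html):
        # Drop nested tags such as <br> and normalise whitespace.
        text = " ".join(TAG_RE.sub(" ", inner).split())
        if text:
            spans.append((int(size), text))
    if not spans:
        return []
    # Assumption: the font size covering the most characters is body text.
    weight = Counter()
    for size, text in spans:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    return [
        ("heading" if size > body_size else "body", text)
        for size, text in spans
    ]


# Made-up sample input that imitates the assumed span markup.
sample = (
    '<div><span style="font-family: Arial-BoldMT; font-size:18px">1. General\n<br></span></div>'
    '<div><span style="font-family: ArialMT; font-size:11px">The contractor shall supply all materials.\n<br></span></div>'
    '<div><span style="font-family: Arial-BoldMT; font-size:14px">1.1 Scope\n<br></span></div>'
    '<div><span style="font-family: ArialMT; font-size:11px">This section covers the mosque works.\n<br></span></div>'
)
labelled = classify_spans(sample)
```

The distinct heading sizes could then be ranked to tell headings from subheadings, but that would break on documents where headings differ from body text only by weight or font family.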

I would appreciate any help. Thank you for reading this long question.