I have been trying to find a way to convert a PDF to HTML so that I can later use BeautifulSoup to extract all the headings, sub-items, and paragraphs into a tree structure.
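To make the goal concrete, here is a minimal sketch of the kind of tree I have in mind, assuming the HTML already contained real heading tags (h1 to h6) and p tags. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; OutlineBuilder, build_outline, and the sample markup are made up for illustration and are not output from any converter.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class OutlineBuilder(HTMLParser):
    """Nest headings by level and attach paragraphs to the nearest heading."""

    def __init__(self):
        super().__init__()
        self.root = {"title": "ROOT", "level": 0, "paragraphs": [], "children": []}
        self.stack = [self.root]  # path from the root to the current heading
        self.tag = None           # tag whose text is being collected, if any
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS or tag == "p":
            self.tag, self.text = tag, []

    def handle_data(self, data):
        if self.tag is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != self.tag:
            return
        content = " ".join("".join(self.text).split())
        self.tag = None
        if tag == "p":
            self.stack[-1]["paragraphs"].append(content)
            return
        level = HEADING_LEVELS[tag]
        # Close every open heading at the same or a deeper level.
        while self.stack[-1]["level"] >= level:
            self.stack.pop()
        node = {"title": content, "level": level, "paragraphs": [], "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)


def build_outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Made-up sample markup, not output from any real converter.
sample = """
<h1>Scope</h1><p>Intro text.</p>
<h2>Materials</h2><p>Concrete.</p><p>Steel.</p>
<h2>Finishes</h2>
<h1>Schedule</h1><p>Twelve months.</p>
"""
outline = build_outline(sample)
```

Each node keeps its own paragraphs and child headings, so a sub-heading ends up nested under the closest preceding heading of a higher level.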
I have tried a few of the options available online, but with no success so far. Here is the code I used to convert a PDF to HTML with pdfminer.six:
    from io import BytesIO

    from pdfminer.converter import HTMLConverter
    from pdfminer.image import ImageWriter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_html(path):
        rsrcmgr = PDFResourceManager()
        outfp = BytesIO()  # HTMLConverter writes the encoded HTML here
        laparams = LAParams()
        device = HTMLConverter(rsrcmgr, outfp, codec='utf-8', laparams=laparams,
                               imagewriter=ImageWriter('out'))
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        with open(path, 'rb') as fp:
            # maxpages=0 and an empty pagenos set both mean "all pages"
            for page in PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                          caching=True, check_extractable=True):
                interpreter.process_page(page)
        device.close()
        # Return what the converter actually wrote, decoded back to text
        return outfp.getvalue().decode('utf-8')


    convert_pdf_to_html('PDF - Remraam Ph 1 Mosque.pdf')
However, the code above raises the following error, which I have been unable to fix:
    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pdfminer\pdftypes.py in decode(self)
        293                 data = ccittfaxdecode(data, params)
        294             elif f == LITERAL_CRYPT:
    --> 295                 raise PDFNotImplementedError('Crypt filter is unsupported')
        296             else:
        297                 raise PDFNotImplementedError('Unsupported filter: %r' % f)
    TypeError: not all arguments converted during string formatting
Then I tried Apache Tika. The problem there is that I could not get the content out as properly structured HTML: Tika's output wraps everything in '<p>' tags, so I found no way to pick out the document's headings and subheadings. Here is the code I used:
    from tika import parser

    parsed_data_full = parser.from_file('PDF - Remraam Ph 1 Mosque.pdf', xmlContent=True)
    parsed_data_full = parsed_data_full['content']
    print(parsed_data_full)
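A quick way to confirm which tags the XHTML actually contains is to count them. The sketch below does that with the standard library's html.parser; TagCounter and count_tags are hypothetical helper names, and sample_xhtml is a made-up stand-in for parsed_data_full, so the real Tika output will look different.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(xhtml):
    counter = TagCounter()
    counter.feed(xhtml)
    counter.close()
    return counter.counts


# Made-up stand-in for parsed_data_full; real Tika output will differ.
sample_xhtml = (
    "<html><body><div class='page'>"
    "<p>1. GENERAL</p><p>1.1 Scope</p><p>This section covers the works.</p>"
    "</div></body></html>"
)
tag_counts = count_tags(sample_xhtml)
# Total number of h1..h6 tags; zero means there are no heading tags to rely on.
heading_count = sum(tag_counts[f"h{n}"] for n in range(1, 7))
```

If heading_count comes back as zero for the real output too, the heading structure has to be inferred from something other than tag names.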
I'm not sure how to convert a PDF to HTML properly, so that I can use the HTML tags to identify the document's headings and subheadings.
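One possible fallback, sketched here under a loud assumption, is to guess headings from font size instead of tag names. The sketch assumes the converter emits span elements with an inline style such as "font-size:18px"; whether a given tool does so needs to be checked against its real output. SpanCollector, guess_headings, and the sample markup are all hypothetical, and "larger than the most common size" is only a heuristic.

```python
import re
from collections import Counter
from html.parser import HTMLParser

FONT_SIZE = re.compile(r"font-size:\s*(\d+(?:\.\d+)?)px")


class SpanCollector(HTMLParser):
    """Collect (font_size, text) pairs from <span style="...font-size:Npx">."""

    def __init__(self):
        super().__init__()
        self.runs = []
        self.size = None  # font size of the span being read, if any
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            match = FONT_SIZE.search(dict(attrs).get("style") or "")
            self.size = float(match.group(1)) if match else None
            self.text = []

    def handle_data(self, data):
        if self.size is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.size is not None:
            content = " ".join("".join(self.text).split())
            if content:
                self.runs.append((self.size, content))
            self.size = None


def guess_headings(html):
    collector = SpanCollector()
    collector.feed(html)
    collector.close()
    if not collector.runs:
        return []
    # Treat the font size that covers the most characters as body text.
    weight = Counter()
    for size, content in collector.runs:
        weight[size] += len(content)
    body_size = weight.most_common(1)[0][0]
    return [(size, content) for size, content in collector.runs if size > body_size]


# Made-up sample markup; the inline style format is an assumption.
sample = (
    '<span style="font-family: Arial; font-size:18px">1. General</span>'
    '<span style="font-family: Arial; font-size:11px">This section describes the works in detail.</span>'
    '<span style="font-family: Arial; font-size:14px">1.1 Scope</span>'
    '<span style="font-family: Arial; font-size:11px">The scope covers all trades on site.</span>'
)
headings = guess_headings(sample)
```

Sorting the distinct sizes that come back would give a rough heading level per size, but documents that mark headings with bold text or numbering alone would slip through.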
I would appreciate any help. Thank you for reading this long question.