Camelot is a fantastic Python library for extracting tables from a PDF file as data frames. However, I'm looking for a solution that also returns the description text written right above each table.
The code I'm using to extract tables from the PDF is this:
import camelot
tables = camelot.read_pdf('test.pdf', pages='all', lattice=True, suppress_stdout=True)
I'd like to extract the text written above the table, i.e. THE PARTICULARS, as shown in the image below.
What would be the best approach for this? I'd appreciate any help. Thank you.
You can create the Lattice parser directly.
Then you have access to parser.layout, which contains all the components on the page. These components all have a bbox (x0, y0, x1, y1), and the extracted tables also have a bbox object. You can find the component closest to the top of the table and extract its text.
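The "closest component above the table" search can be sketched as below. This is a minimal sketch, not Camelot's own API: the helper name text_above_table and the sample coordinates are hypothetical, and it works on plain (bbox, text) tuples so it runs without a PDF. It assumes PDF coordinates where y grows upward, so in (x0, y0, x1, y1) the value y1 is the top edge. How you read the boxes out of parser.layout and out of each table depends on your Camelot version, so treat the comment showing that step as an assumption to verify.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) in PDF coordinates: origin bottom-left, y grows upward.
BBox = Tuple[float, float, float, float]


def text_above_table(
    table_bbox: BBox,
    components: List[Tuple[BBox, str]],
    max_gap: Optional[float] = None,
) -> Optional[str]:
    """Return the text of the nearest component sitting above the table.

    `components` is a list of (bbox, text) pairs. With Camelot they might be
    built roughly like this (assumption, check against your version):
        [(c.bbox, c.get_text()) for c in parser.layout if hasattr(c, "get_text")]
    """
    tx0, _, tx1, table_top = table_bbox
    best_text, best_gap = None, None
    for (cx0, c_bottom, cx1, _), text in components:
        gap = c_bottom - table_top
        if gap < 0:
            continue  # component starts below the table's top edge
        if cx1 < tx0 or cx0 > tx1:
            continue  # no horizontal overlap with the table
        if not text.strip():
            continue  # ignore empty text boxes
        if max_gap is not None and gap > max_gap:
            continue  # too far away to be a caption
        if best_gap is None or gap < best_gap:
            best_text, best_gap = text.strip(), gap
    return best_text


# Made-up coordinates for illustration only.
table = (50.0, 100.0, 500.0, 400.0)
components = [
    ((50.0, 410.0, 300.0, 425.0), "THE PARTICULARS\n"),
    ((50.0, 700.0, 300.0, 720.0), "Page header"),
    ((50.0, 50.0, 300.0, 60.0), "Footer"),
    ((600.0, 405.0, 700.0, 415.0), "Side note"),
]
caption = text_above_table(table, components)
```

Passing max_gap (in PDF points) is optional; it keeps a distant page header from being picked up when a table has no caption of its own.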