Tabula gives us the option to extract tables from a PDF document by specifying their coordinates. For Windows users, getting the coordinates means uploading the PDF file to the Tabula web page, exporting the script that contains the coordinates, and then copying those coordinates into your code. Mac users just have to use the Preview app and its crop inspector. I'm wondering whether there are any third-party programs or plug-ins that offer the same thing to Windows users? I think this would be handy in the following situations:
- When you do not have internet access.
- When accuracy matters: I think the Preview app will be more accurate, because I have had inaccurate coordinates from the Tabula web page.
I would be grateful if anyone could point me to where I can find such a thing. Many thanks.
Tabula needs areas to be specified in PDF units, which are defined to be 1/72 of an inch. If using Acrobat Reader DC, you can use the Measure tool and, when its readings are in inches, multiply them by 72.
Tabula needs the area to be specified as the top, left, bottom and right distances. To obtain them, you can measure the distances from the top of the page to the beginning of the table and so on.
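As a rough sketch of that conversion, assuming the Measure tool reports inches and every distance is taken from the top-left corner of the page (the function name and the sample readings are made up for illustration):

```python
POINTS_PER_INCH = 72  # one PDF unit is 1/72 of an inch


def area_from_inches(top, left, bottom, right):
    """Convert four distances in inches to a (top, left, bottom, right) tuple in points."""
    return tuple(round(v * POINTS_PER_INCH, 2) for v in (top, left, bottom, right))


# Hypothetical Measure tool readings, in inches
area = area_from_inches(1.5, 0.75, 4.0, 7.5)
print(area)
```

The resulting tuple is in the order Tabula expects for its area option.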
I had the same problem: the code seemed to ignore the area argument. I fixed it by passing guess=False in the read_pdf call, like so (note I'm using version 1.2.1):
import tabula

# guess=False stops tabula from guessing the table region itself,
# so the explicit area and columns below are the ones that get used
df = tabula.read_pdf(file_folder + file_name,
                     guess=False, pages=1, stream=True, encoding="utf-8",
                     area=(200.8125, 64.6425, 352.2825, 496.1025),
                     columns=(65.3, 196.86, 294.96, 351.81, 388.21, 429.77))
Tabula understands coordinates given in points.
On Windows you can measure the coordinates of your area with Adobe Acrobat DC or Acrobat Reader DC.
If you have Adobe Acrobat DC:
Tools >> Edit PDF >> Select Your Area and Press Enter >> Change Units to Points
Top 100 pt = A
Left 50 pt = B
Cropped page size 370 x 225 pt = C x D
If you have Adobe Acrobat DC or Acrobat Reader DC:
Edit >> Preferences >> Units >> Change Page Units to Points >> OK >>
Tools >> Measure
Top = A = 100
Left = B = 50
Area width = C = 370
Area height = D = 225
Then you have to do this calculation:
area=[A,B,A+D,B+C]
area=[100,50,100+225,50+370]
In code:
df = read_pdf(pdf_path, area=[[100, 50, 325, 420]])
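The calculation above can be wrapped in a small helper. This is only a sketch; the function name is made up, and it assumes all four inputs are already in points:

```python
def area_from_box(top, left, width, height):
    """Return [top, left, bottom, right], each measured from the top-left of the page."""
    return [top, left, top + height, left + width]


# A = 100, B = 50, C = 370, D = 225 from the Acrobat readings
area = area_from_box(top=100, left=50, width=370, height=225)
print(area)  # [100, 50, 325, 420]
```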
Reader only allows measurements if the creator of the PDF has allowed them.
Found this instead:
https://graphicdesign.stackexchange.com/a/81666
Brief steps:
- Download SumatraPDF. It is also available as a zip, no install needed.
- Open the PDF with the Sumatra reader.
- Press 'm' - this shows the cursor position in the top left corner.
- Use tabula with options -p for page, -a for area (top,left,bottom,right).
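For the last step, here is a sketch of assembling the command line from Python. It assumes tabula is run as a jar through java; the jar name, the PDF name and the area values are placeholders, and only the -p and -a options come from the steps above:

```python
import shlex


def tabula_command(jar_path, pdf_path, page, area):
    """Build the argument list for a tabula call with -p (page) and -a (area)."""
    top, left, bottom, right = area
    # -a takes the four numbers comma-separated, in top,left,bottom,right order
    area_arg = ",".join(str(v) for v in (top, left, bottom, right))
    return ["java", "-jar", jar_path, "-p", str(page), "-a", area_arg, pdf_path]


# Placeholder file names and coordinates (points, as read off SumatraPDF)
cmd = tabula_command("tabula.jar", "report.pdf", page=1, area=(100, 50, 325, 420))
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd) as it is; printing it first makes it easy to check the order of the coordinates.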
The accepted answer here is missing 'top + height', which you can call bottom if you like. Note that this is NOT the distance from the bottom of the page to the table, but rather the distance from the top of the page to the bottom of the table.
All the necessary details are summarised in the wiki here, but this is the relevant bit:
Note the left, top, height, and width parameters and calculate the following:
y1 = top
x1 = left
y2 = top + height
x2 = left + width
Then pass them in this order: y1,x1,y2,x2
I can offer a few practical tips about getting the job done. My PDF viewer did not measure, so I experimented with the Linux program 'screenruler' (sudo apt install screenruler), but it was a bit of a pain and also needed calibration, as described here
In the end, however, I got the most accurate results with old-school methods. I printed a page with the table on A4 paper, ruled lines for all the dimensions, and took all the measurements with a transparent ruler to an estimated fraction of a millimetre. The other side of the ruler only went down to a sixteenth of an inch, which is not as fine-grained, so I went with the metric side and used a pocket calculator to multiply centimetres by 28.346456693 to get PDF units. Maybe you have one of those rulers lying around which goes down to a sixty-fourth of an inch ;)
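The factor comes from 72 points per inch divided by 2.54 centimetres per inch. If you would rather not use the pocket calculator, a minimal sketch (the function name is my own):

```python
CM_PER_INCH = 2.54
POINTS_PER_INCH = 72


def cm_to_points(cm):
    """Convert a ruler reading in centimetres to PDF units (points)."""
    return cm * POINTS_PER_INCH / CM_PER_INCH


print(round(cm_to_points(1), 6))     # 28.346457
print(round(cm_to_points(29.7), 2))  # height of an A4 page in points
```

This only holds if the page was printed at actual size; any "fit to page" scaling would throw the readings off.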
The column measurements are all taken from the left of the page and cover only the internal dividing lines between columns; don't include the line on the far left or the far right of the table.
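If you measured every vertical line of the table, the outer two can be dropped like this. It is just a sketch, with a made-up function name and made-up positions in points from the left edge of the page:

```python
def column_dividers(vertical_lines):
    """Keep only the internal dividing lines, dropping the far-left and far-right borders."""
    xs = sorted(vertical_lines)
    return xs[1:-1]


# Hypothetical x positions of all vertical lines, borders included
lines = [50, 120, 210.5, 300, 420]
columns = column_dividers(lines)
print(columns)  # [120, 210.5, 300]
```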
You might find, for very compressed columns where you had to guess the small dimensions, that a character from one column spills over into the next. In this case you can tweak the column dimensions and iterate until it's right.