I want to extract all rows from this PDF while ignoring the column headers as well as all page headers, i.e. "Supported Devices".
pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
| sed '$d' \
| sed -r 's/ +/,/g; s/ //g' \
> output.csv
The resulting file should be in CSV spreadsheet format (comma-separated value fields).
In other words, I want to improve the above command so that the output doesn't break at all. Any ideas?
What you want is rather easy, but you're also having a different problem (I'm not sure you are aware of it...).
First, you should add -nopgbrk ("No page breaks, please!") to your command. These pesky ^L (form feed) characters, which otherwise appear in the output, then need not be filtered out later.

Adding a grep -vE '(Supported Devices|^$)' will then filter out all the lines you do not want, including empty lines and lines with only spaces.
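Put together, the improved command could look like this (a sketch only: it keeps your original space-to-comma sed substitution, which, as explained next, is still problematic):

pdftotext -layout -nopgbrk DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
| grep -vE '(Supported Devices|^$)' \
| sed -r 's/ +/,/g' \
> output.csv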
However, your other problem is this: some of the table cells are empty. pdftotext renders such an empty cell with the -layout option as a series of space characters, sometimes even two in the same row. Your sed command cannot tell these empty cells apart from the gaps between columns, so runs of spaces do not work reliably as a field separator.

There is a workaround for this: use the -x ... -y ... -W ... -H ... parameters to pdftotext to crop the PDF column-wise, extracting each table column into its own text file, and later glue the files back together with the help of paste and column.

The following command extracts the first column:
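A sketch only: the crop coordinates below are placeholders, not measured values for this particular PDF (how to find the real ones is shown a bit further down).

pdftotext -layout -nopgbrk \
-x 72 -y 90 -W 120 -H 650 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf col1.txt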
These are the corresponding commands for the second, third and fourth columns:
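Again with placeholder coordinates; only the -x and -W values change from column to column:

pdftotext -layout -nopgbrk -x 200 -y 90 -W 150 -H 650 DAC06E7D1302B790429AF6E84696FCFAB20B.pdf col2.txt
pdftotext -layout -nopgbrk -x 360 -y 90 -W 150 -H 650 DAC06E7D1302B790429AF6E84696FCFAB20B.pdf col3.txt
pdftotext -layout -nopgbrk -x 520 -y 90 -W  70 -H 650 DAC06E7D1302B790429AF6E84696FCFAB20B.pdf col4.txt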
BTW, I cheated a bit: in order to get a clue about what values to use for -x, -y, -W and -H, I did first run a command to find the exact coordinates of the column header words:
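A sketch of one way to do that, assuming a Poppler pdftotext recent enough to support -bbox (the command used originally is not preserved here):

# -bbox writes XHTML in which every word carries its xMin/yMin/xMax/yMax
# coordinates; grep for your actual column header words to read them off:
pdftotext -f 1 -l 1 -bbox DAC06E7D1302B790429AF6E84696FCFAB20B.pdf headers.xhtml
grep -i 'Device' headers.xhtml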
It's always good if you know how to read and make use of pdftotext -h. :-)

Anyway, how to append the four text files as columns side by side, with the proper CSV separator in between, you should find out yourself. Or ask a new question. :-)
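For the impatient, a minimal sketch of that last step (assuming the four column files line up row by row):

paste -d, col1.txt col2.txt col3.txt col4.txt > output.csv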
I'll offer you another solution as well.
While in this case the pdftotext method works with reasonable effort, there may be cases where not every page has the same column widths (unlike your rather benign PDF). Here the not-so-well-known, but pretty cool Free and Open Source Software Tabula-Extractor is the best choice.

I myself am using the direct GitHub checkout:
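A sketch of obtaining it (tabula-extractor is written in JRuby, so a working JRuby installation is assumed):

git clone https://github.com/tabulapdf/tabula-extractor.git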
I wrote myself a pretty simple wrapper script like this:
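The original script is not shown here; a hypothetical reconstruction, with the checkout path and flag names as assumptions to verify against the tool's --help:

#!/bin/bash
# Hypothetical ~/bin/extract-tables: run tabula-extractor over all pages of
# the given PDF and write a CSV file next to it.
pdf="$1"
jruby /path/to/tabula-extractor/bin/tabula \
--pages all --format CSV --outfile "${pdf%.pdf}.csv" "$pdf"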
Since ~/bin/ is in my $PATH, I just run the wrapper to extract all the tables from all pages and convert them to a single CSV file.
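With the hypothetical script name from above, that would be:

extract-tables DAC06E7D1302B790429AF6E84696FCFAB20B.pdf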
The first ten (out of a total of 8727) lines of the CSV look like this:
which in the original PDF look like this:
It even got these lines on the last page, 293, right:
which look on the PDF page like this:
TabulaPDF and Tabula-Extractor are really, really cool for jobs like this!
Update

Here is an asciinema screencast (which you can also download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:

For the case where you want to extract such tabular data from a PDF over which you have control at creation time (for timesheets or contracts your employees have to sign), the following solution will be cleaner:
Create a PDF form with field IDs.
Let people fill and save the PDF forms.
Use Apache PDFBox, an open source tool that allows extracting form data from a PDF. It includes a command-line example tool, PrintFields, that you would call as follows to print the desired field information:
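A sketch: it assumes a pdfbox-app jar built with the examples included (for instance from a source checkout); the jar name and version are assumptions.

# Prints the name and value of every form field in the given PDF:
java -cp pdfbox-app-2.0.9.jar \
org.apache.pdfbox.examples.interactive.form.PrintFields filled-form.pdf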
For other options, see this question.
As an alternative to the above workflow, maybe you could also use a digital signature web service that allows PDF form filling and export of the data to tables, such as SignRequest, which lets you create templates and later export the data of signed documents. (Not affiliated, just found this myself.)
As Martin R commented, tabula-java is the new version of tabula-extractor and is actively maintained; 1.0.0 was released on July 21st, 2017.

Download the jar file and, with the latest Java:
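A sketch, with the jar name following the release's artifact naming and the flags as listed by tabula-java's --help (the output format defaults to CSV):

java -jar tabula-1.0.0-jar-with-dependencies.jar \
--pages all --outfile output.csv DAC06E7D1302B790429AF6E84696FCFAB20B.pdf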
This can be done easily with an IntelliGet (http://akribiatech.com/intelliget) script as below: