Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
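As a minimal sketch of what that could look like, assuming pdftools is installed: `pdf_text()` returns one character string per page, so splitting on newlines gives you individual lines. The helper names and the file name `report.pdf` are made up for illustration.

```r
# Split the per-page strings returned by pdftools::pdf_text() into lines.
# (Hypothetical helper name, for illustration only.)
split_pages_into_lines <- function(pages) {
  unlist(strsplit(pages, "\n", fixed = TRUE))
}

# Read a PDF and return its text as a character vector of lines.
pdf_lines <- function(path) {
  pages <- pdftools::pdf_text(path)  # one character string per page
  split_pages_into_lines(pages)
}

# Usage, assuming a file called report.pdf exists in the working directory:
# lines <- pdf_lines("report.pdf")
# head(lines)
```

Note that this only helps for PDFs with an embedded text layer; scanned pages would need OCR first.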
A purely R solution could be:
then you'll have pdf lines in an array.
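One way to get there, sketched under the assumption that a recent version of the tm package is installed along with pdftools as its PDF engine. `readPDF()` builds a reader function, which is then applied to the file. The helper name `read_pdf_lines_tm` and the file name `example.pdf` are hypothetical, and this is not necessarily the exact code the answer had in mind.

```r
# Hypothetical helper: read one PDF with tm's readPDF() reader and return
# its text as a character vector of lines.
read_pdf_lines_tm <- function(file, engine = "pdftools") {
  reader <- tm::readPDF(engine = engine)  # returns a reader function
  doc <- reader(elem = list(uri = file), language = "en", id = "id1")
  # Depending on the engine, the content may hold whole pages or single
  # lines, so split on newlines to be safe.
  unlist(strsplit(NLP::content(doc), "\n", fixed = TRUE))
}

# Usage, assuming a file called example.pdf exists:
# pdf_lines <- read_pdf_lines_tm("example.pdf")
```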
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert the PDF to text.
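A rough sketch of calling the converter from R, assuming a pdftotext binary is on the PATH (or that you pass its full path as `exe`). The helper names are made up for illustration. The `-layout` flag asks pdftotext to keep the physical layout, which tends to help with tables.

```r
# Output file next to the input: foo.pdf -> foo.txt.
# (Hypothetical helper name.)
txt_path_for <- function(pdf) {
  sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
}

# Convert one PDF with an external pdftotext binary and read the result back
# into R as a character vector of lines.
pdf_to_lines <- function(pdf, exe = Sys.which("pdftotext")) {
  stopifnot(file.exists(pdf), nzchar(exe))
  txt <- txt_path_for(pdf)
  # system2() quotes the command itself; the file arguments are quoted here.
  status <- system2(exe, args = c("-layout", shQuote(pdf), shQuote(txt)))
  if (status != 0) stop("pdftotext exited with status ", status)
  readLines(txt, warn = FALSE)
}

# Usage on Windows, with a placeholder path to the executable:
# lines <- pdf_to_lines("foo.pdf", exe = "C:/path/to/pdftotext.exe")
```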
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.