I'm trying to extract data from tables inside some pdf reports.
I've seen some examples using pdftools and similar packages, and I was successful in getting the text. However, I just want to extract the tables.
Is there a way to use R to recognize and extract only tables?
I would love to know the answer to this as well. But from my experience, you need to use regular expressions to get the data in a format that you want. You can see the following as an example:
From here you can loop over the data to build the table you want. But as you can see in the link, the PDF is not only a table.
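A minimal sketch of that regex-and-loop idea, in case it helps. The page text, the column layout, and the names `page_text`, `row_pattern` and `sales_table` are all made up for illustration; in practice the text would come from something like `pdftools::pdf_text()` and the pattern would have to be adapted to your report.

```r
# Hypothetical page text, standing in for one element of pdftools::pdf_text("report.pdf")
page_text <- paste(
  "Annual Report 2018",
  "Some introductory paragraph.",
  "Region      Sales   Units",
  "North       1200    30",
  "South        950    25",
  "End of page.",
  sep = "\n"
)

# Split the page into individual lines
lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]

# Assumed row layout: one word followed by two whole numbers
row_pattern <- "^\\s*([A-Za-z]+)\\s+([0-9]+)\\s+([0-9]+)\\s*$"
table_lines <- grep(row_pattern, lines, value = TRUE)

# Loop over the matching lines and pull out the captured groups
rows <- lapply(table_lines, function(x) {
  m <- regmatches(x, regexec(row_pattern, x))[[1]]
  data.frame(
    region = m[2],
    sales = as.numeric(m[3]),
    units = as.numeric(m[4]),
    stringsAsFactors = FALSE
  )
})
sales_table <- do.call(rbind, rows)
print(sales_table)
```

The header line and the surrounding prose are dropped only because they don't match the pattern, so this is as fragile as the pattern is.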
Awesome question, I wondered about the same thing recently, thanks!
I did it with tabulizer ‘0.2.2’, as @hrbrmstr suggests too. If you are using R version 3.5.2, I'm providing the following solution. Install the three packages in specific order:
Update: After just testing the approach again, it looks like it's enough to just do
install.packages("tabulizer")
now. rJava will be installed automatically as a dependency.
Now you are ready to extract tables from your PDF reports.
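For reference, a rough sketch of what the extraction step might look like. The file name `report.pdf` and the helper `matrix_to_df` are hypothetical, and I'm assuming `extract_tables()` returns a list of character matrices with the header in the first row, which may not hold for your PDF.

```r
# Hypothetical helper: turn a character matrix (header in the first row)
# into a data frame with proper column names.
matrix_to_df <- function(m) {
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  df
}

pdf_path <- "report.pdf"  # hypothetical file name, replace with your own

first_table <- NULL
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_path)) {
  # One list element per table that tabulizer detects
  tables <- tabulizer::extract_tables(pdf_path)
  if (length(tables) > 0) {
    first_table <- matrix_to_df(tables[[1]])
  }
}

# Stand-in matrix so the helper can be tried without a PDF or Java
example <- matrix(
  c("Region", "Sales",
    "North", "1200",
    "South", "950"),
  ncol = 2, byrow = TRUE
)
example_df <- matrix_to_df(example)
print(example_df)
```

All columns come back as character, so numeric columns still need converting afterwards.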
Hope it works for you.
Limitations: Certainly the table in this example is quite simple, and maybe you have to mess around with gsub, stringr, tidyr and this kind of stuff.
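As an illustration of that kind of cleanup, here is a small base R sketch. The data frame `raw` is invented; it just mimics the stray whitespace, thousands separators and currency symbols that extracted tables often contain.

```r
# Made-up extraction result: everything is character, with typical debris
raw <- data.frame(
  region = c(" North ", "South"),
  sales = c("1,200", "$950"),
  stringsAsFactors = FALSE
)

clean <- raw
# Strip leading and trailing whitespace from the labels
clean$region <- trimws(clean$region)
# Drop everything except digits and the decimal point, then convert to numeric
clean$sales <- as.numeric(gsub("[^0-9.]", "", clean$sales))
print(clean)
```

Note that this pattern also strips minus signs, so it would need adjusting if your tables contain negative values.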