Recognize PDF table using R

Posted 2020-01-27 04:32

I'm trying to extract data from tables inside some pdf reports.

I've seen some examples using pdftools and similar packages, and I was successful in getting the text. However, I only want to extract the tables.

Is there a way to use R to recognize and extract only tables?

2 answers
神经病院院长
#2 · 2020-01-27 05:05

I would love to know the answer to this as well. In my experience, you need regular expressions to get the data into the format you want. The following is an example:

library(pdftools)
dat <- pdftools::pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
dat <- paste0(dat, collapse = " ")
pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
extract <- regmatches(dat, regexpr(pattern, dat))
extract <- gsub('\n', "  ", extract)
strsplit(extract, "\\s{2,}")

From here you can loop over the resulting pieces to build the table you want. Note that, as the link shows, the PDF contains more than just a table.
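The last step, going from split strings to a table, can be sketched in base R. This is only an illustration under assumptions: the `lines` vector is made-up sample text standing in for what `pdf_text()` returns, and it assumes columns are separated by at least two spaces while single spaces stay inside a cell.

```r
# Hypothetical lines standing in for text pulled out of a PDF page;
# columns are separated by runs of two or more spaces.
lines <- c(
  "Name          City        Phone",
  "Station A     Berlin      030 123456",
  "Station B     Hamburg     040 654321"
)

# Split every line on 2+ whitespace characters (single spaces stay inside a cell)
cells <- strsplit(trimws(lines), "\\s{2,}")

# Keep only the rows whose cell count matches the header row
n_cols <- length(cells[[1]])
rows <- cells[lengths(cells) == n_cols]

# The first row is the header, the rest is data
df <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
names(df) <- rows[[1]]
df
```

Dropping rows with the wrong number of cells is a crude filter for non-table text; real reports will likely need a stricter rule.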

Summer. ? 凉城
#3 · 2020-01-27 05:06

Awesome question, I wondered about the same thing recently, thanks!

I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R version 3.5.2, the following solution should work. Install the three packages in this specific order:

# install.packages("rJava")
# library(rJava) # load and attach 'rJava' now
# install.packages("devtools")
# devtools::install_github("ropensci/tabulizer", args="--no-multiarch")

Update: After testing the approach again, it looks like install.packages("tabulizer") is enough now. rJava will be installed automatically as a dependency.

Now you are ready to extract tables from your PDF reports.

library(tabulizer)

# specify an example and load it into your workspace
report <- "http://www.stat.ufl.edu/~athienit/Tables/Ztable.pdf" 
lst <- extract_tables(report, encoding="UTF-8") 
# peep into the doc for further specs (page, location etc.)!

# after examining the list you will want to do some tidying
# 1st delete blank columns
lst[[1]] <- lst[[1]][, -3]
lst[[2]] <- lst[[2]][, -4]

# 2nd bind the list elements, if you want and create a df...
table <- do.call(rbind, lst)
table <- as.data.frame(table[c(2:37, 40:nrow(table)), ],
                       stringsAsFactors=FALSE) # ...w/o obsolete rows

# 3rd take over colnames, cache rownames to vector
colnames(table) <- table[1, ]
rn <- table[2:71, 1]
table <- table[-1,-1] # and bounce them out of the table

# 4th coerce the values to numeric
table <- as.data.frame(apply(table[1:70,1:10], 2, 
                             function(x) as.numeric(as.character(x))))
rownames(table) <- rn # bring back rownames 

table # voilà

Hope it works for you.

Limitations: The table in this example is quite simple. For messier tables you may have to clean up the result with gsub, stringr, tidyr and similar tools.
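As a rough idea of what that cleanup can look like, here is a base R sketch. The `raw` matrix is invented sample data shaped like a character matrix an extractor might return, and the column names are hypothetical. It trims stray whitespace, drops thousands separators and unit symbols with gsub, and converts the result to numeric.

```r
# Hypothetical character matrix, shaped like what a table extractor might return
raw <- matrix(
  c("Region",  "Sales (EUR)", "Share",
    " North ", "1,234.50",    "12 %",
    "South",   "987.00",      "9 %"),
  ncol = 3, byrow = TRUE
)

# Drop the header row and assign clean column names (names are illustrative)
df <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- c("region", "sales_eur", "share_pct")

# Trim padding, remove thousands separators, strip everything but digits and dots
df$region    <- trimws(df$region)
df$sales_eur <- as.numeric(gsub(",", "", df$sales_eur, fixed = TRUE))
df$share_pct <- as.numeric(gsub("[^0-9.]", "", df$share_pct))
df
```

The separator handling assumes English-style numbers. For reports that use a comma as the decimal mark, the gsub patterns would need to be swapped accordingly.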
