Recognize PDF table using R

I'm trying to extract data from tables inside some pdf reports.

I've seen some examples using pdftools and similar packages, and I was successful in getting the text. However, I just want to extract the tables.

Is there a way to use R to recognize and extract only tables?

Tags: r text-mining pdf-scraping

2 Answers

神经病院院长

Reply #2 · 2020-01-27 05:05

I would love to know the answer to this as well. But from my experience, you need to use regular expressions to get the data in a format that you want. You can see the following as an example:

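library(pdftools)
# Read the PDF; pdf_text() returns one string per page
dat <- pdftools::pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
# Join all pages into a single string
dat <- paste0(dat, collapse = " ")
# Match from the table's header row through a known later value
pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
extract <- regmatches(dat, regexpr(pattern, dat))
# Turn line breaks into wide gaps, then split on runs of 2+ whitespace characters
extract <- gsub('\n', "  ", extract)
strsplit(extract, "\\s{2,}")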

From here you can loop over the split values to build the table you want. But as you can see from the link, the PDF contains more than just a table.
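As a rough sketch of that last step: if every row has the same number of fields and no cell is empty, the split values can be reshaped into a data frame without an explicit loop. The `fields` vector and the column count `n_cols` below are made-up stand-ins, not values taken from the PDF above, so treat this as an illustration only.

```r
# Hypothetical fields, standing in for one element of strsplit(extract, "\\s{2,}")
fields <- c("Name", "Street", "Phone",
            "Station A", "Main St 1", "0123 456",
            "Station B", "Side St 2", "0456 789")

n_cols <- 3  # assumed number of columns; check it against your PDF

# Fill a matrix row by row, then promote the first row to column names
m <- matrix(fields, ncol = n_cols, byrow = TRUE)
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
colnames(df) <- m[1, ]
df
```

If some cells are empty, the whitespace split will shift the values that follow, so a check such as `length(fields) %% n_cols == 0` is worth doing first.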


Summer. ? 凉城

Reply #3 · 2020-01-27 05:06

Awesome question, I wondered about the same thing recently, thanks!

I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R version 3.5.2, the following solution worked for me. Install the three packages in this specific order:

# install.packages("rJava")
# library(rJava) # load and attach 'rJava' now
# install.packages("devtools")
# devtools::install_github("ropensci/tabulizer", args="--no-multiarch")

Update: I just tested the approach again, and it now seems to be enough to run install.packages("tabulizer"). rJava will be installed automatically as a dependency.

Now you are ready to extract tables from your PDF reports.

library(tabulizer)

# specify an example and load it into your workspace
report <- "http://www.stat.ufl.edu/~athienit/Tables/Ztable.pdf" 
lst <- extract_tables(report, encoding="UTF-8") 
# peep into the doc for further specs (page, location etc.)!

# after examining the list you want to do some tidying
# 1st delete blank columns
lst[[1]] <- lst[[1]][, -3]
lst[[2]] <- lst[[2]][, -4]

# 2nd bind the list elements, if you want and create a df...
table <- do.call(rbind, lst)
table <- as.data.frame(table[c(2:37, 40:nrow(table)), ],
                       stringsAsFactors=FALSE) # ...w/o obsolete rows

# 3rd take over colnames, cache rownames to vector
colnames(table) <- table[1, ]
rn <- table[2:71, 1]
table <- table[-1,-1] # and bounce them out of the table

# 4th I'm sure you want to coerce to numeric
table <- as.data.frame(apply(table[1:70,1:10], 2, 
                             function(x) as.numeric(as.character(x))))
rownames(table) <- rn # bring back rownames 

table # voilà

Hope it works for you.

Limitations: The table in this example is admittedly quite simple. For messier tables you may have to clean things up with gsub, stringr, tidyr and the like.
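To give an idea of what that clean-up might look like, here is a minimal base R sketch. The matrix `raw` is invented for illustration (it only mimics the character matrices that extract_tables returns), and `clean_num` is a hypothetical helper, not part of tabulizer.

```r
# Hypothetical raw matrix with typical extraction noise:
# stray spaces, a decimal comma and a footnote marker
raw <- matrix(c("z",    "0.00",   "0.01",
                "-3.4", " .0003", ".0003 ",
                "-3.3", "0,0005", ".0005*"),
              ncol = 3, byrow = TRUE)

# Hypothetical helper: normalise the text, then convert to numeric
clean_num <- function(x) {
  x <- gsub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  x <- gsub("[^0-9.-]", "", x)          # drop everything but digits, "." and "-"
  as.numeric(x)
}

body <- raw[-1, , drop = FALSE]   # data rows without the header row
num <- apply(body, 2, clean_num)  # clean column by column
colnames(num) <- raw[1, ]         # reuse the header row as column names
num
```

The second gsub is deliberately blunt: it would also strip letters from text columns, so it should only be applied to columns you expect to be numeric.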
