Extracting text data from PDF files

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?

In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.

Any suggestions?

Tags: pdf r parser-generator

7 answers

▲ chillily

#2 · 2019-01-10 10:10

This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.

【Aperson】

#3 · 2019-01-10 10:19

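A solution driven entirely from R could be the following. Note that the "-layout" option is passed to the external pdftotext tool, which tm's "xpdf" engine calls, so pdftotext has to be installed and on the PATH:

library('tm')
pdf_file <- 'namefile.pdf'
# The "xpdf" engine shells out to pdftotext; "-layout" keeps the page layout
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))
corpus <- VCorpus(URISource(pdf_file),
      readerControl = list(reader = Rpdf))
# Text of the first (and only) document, one element per line
corpus.array <- content(corpus[[1]])

Then you'll have the PDF's lines in a character vector.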

Luminary・发光体

#4 · 2019-01-10 10:20

A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.

来，给爷笑一个

#5 · 2019-01-10 10:24

I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.

# Set the path to pdftotext.exe and convert each PDF to text.
# pdfFracList (PDF file names) and reportDir are assumed to be defined already.
library(stringr)

exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

for (i in seq_along(pdfFracList)) {
    # Drop the ".pdf" extension to get the base file name
    fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
    pdfSource <- file.path(reportDir, paste0(fileNumber, ".pdf"))
    txtDestination <- file.path(reportDir, paste0(fileNumber, ".txt"))
    print(paste0("File number ", i, ", Processing file ", pdfSource))
    # Quote the paths so that spaces in directory names do not break the call
    system2(exeFile, c("-table", shQuote(pdfSource), shQuote(txtDestination)))
}

forever°为你锁心

#6 · 2019-01-10 10:29

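install.packages("pdftools")
library(pdftools)

# Download in binary mode ("wb") so the PDF is not corrupted on Windows
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf", 
              "56901.DEN.Gamebook", mode = "wb")

# pdf_text() returns a character vector with one element per page
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])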
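
If you need individual lines rather than whole pages, the page strings can be split on newlines. Below is a minimal sketch using base R only; `split_pages` is a hypothetical helper name, and the `pages` vector is made-up sample data standing in for what pdf_text() returns:

```r
# Assumption: `pages` mimics the output of pdftools::pdf_text(),
# i.e. a character vector with one element per page (made-up sample data).
pages <- c("Header line\n  first row   1\n", "second page\r\nlast row   2")

# Hypothetical helper: split each page into trimmed, non-empty lines.
# The pattern "\r?\n" handles both Unix and Windows line endings.
split_pages <- function(pages) {
  lapply(strsplit(pages, "\r?\n"), function(lines) {
    lines <- trimws(lines)
    lines[nzchar(lines)]
  })
}

page_lines <- split_pages(pages)  # list with one character vector per page
all_lines <- unlist(page_lines)   # all lines of the document in order
```

With a real file you would pass the result of pdf_text() instead of the sample vector, e.g. split_pages(txt).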

放荡不羁爱自由

#7 · 2019-01-10 10:32

Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.

That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
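
If you go the pdftotext route, it can still be driven from R. Below is a rough sketch, assuming pdftotext is installed and on the PATH; `pdftotext_args` and `pdf_to_lines` are hypothetical helper names, not part of any package:

```r
# Hypothetical helper: build the argument vector for pdftotext.
# By default the output name is the input name with ".pdf" replaced by ".txt".
pdftotext_args <- function(pdf,
                           txt = sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE),
                           layout = TRUE) {
  # "-layout" asks pdftotext to preserve the physical page layout
  c(if (layout) "-layout", shQuote(pdf), shQuote(txt))
}

# Hypothetical wrapper: convert a PDF and return its text as lines.
pdf_to_lines <- function(pdf, ...) {
  exe <- Sys.which("pdftotext")
  if (!nzchar(exe)) stop("pdftotext not found on PATH")
  txt <- tempfile(fileext = ".txt")
  on.exit(unlink(txt))  # remove the temporary text file afterwards
  status <- system2(exe, pdftotext_args(pdf, txt, ...))
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt, warn = FALSE)
}

# Usage (needs a real PDF and pdftotext installed):
# lines <- pdf_to_lines("foo.pdf")
```

Writing to a temporary file keeps the working directory clean; drop that part if you want to keep the .txt files next to the PDFs.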
