How to scrape a downloaded PDF file with R

Posted 2019-08-18 22:32

Question:

I’ve recently gotten into scraping (and programming in general) for my internship, and I came across PDF scraping. Every time I try to read a scanned PDF with R, I can never get it to work. I’ve tried using the file.choose() function to no avail. Do I need to change my working directory, or how can I get the PDF from my files into R? The code looks something like this:

    > library(pdftools)
    > text=pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
    > text
    [1] ""

Also, using pdftables gives me this error:

    > library(pdftables)
    > convert_pdf("C:/Users/myname/Documents/renewalscan.pdf","my.csv")
    Error in get_content(input_file, format, api_key) : 
    Bad Request (HTTP 400).

Answer 1:

You should use the packages pdftools and pdftables.

If you are trying to read the text inside the PDF, use the pdf_text() function. Its argument is the path (on your computer or on the web) to the PDF. For example:

    tt = pdf_text("C:/Users/Smith/Documents/my_file.pdf")

It would be nice if you were more specific and also gave us a reproducible example.
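
Below is a minimal sketch of how this might look in practice, assuming a text-based (not scanned) PDF at a hypothetical path; pdf_text() returns a character vector with one element per page, and a scanned PDF with no text layer yields only empty strings, as in the question:

    library(pdftools)

    # Hypothetical path for illustration; point this at your own PDF
    path <- "C:/Users/Smith/Documents/my_file.pdf"

    # pdf_text() returns one character string per page
    pages <- pdf_text(path)
    length(pages)   # number of pages

    # A scanned (image-only) PDF has no text layer, so pages will be
    # empty strings, which is why the question's output is "".
    # For a text-based PDF, split a page into lines for inspection:
    first_page_lines <- strsplit(pages[1], "\n")[[1]]
    head(first_page_lines)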



Answer 2:

To use the PDFTables R package, you need to run the following command:

    convert_pdf('test/index.pdf', output_file = NULL, format = "xlsx-single", message = TRUE, api_key = "insert_API_key")
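
As a rough sketch, assuming you have registered for a PDFTables API key (the key below is a placeholder), the call from the question could be written like this, with the output format set explicitly:

    library(pdftables)

    # Placeholder key for illustration; substitute the key from your
    # PDFTables account. A missing or invalid key is one common reason
    # the API rejects the request.
    my_key <- "insert_API_key"

    # Convert the PDF from the question to CSV; format also accepts
    # "xlsx-single" and "xlsx-multiple" for Excel output.
    convert_pdf("C:/Users/myname/Documents/renewalscan.pdf",
                output_file = "my.csv",
                format      = "csv",
                api_key     = my_key)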