How to read PDF metadata from R

Posted 2019-04-21 00:49

Out of curiosity, is there a way to read PDF metadata -- such as the information shown below -- from R?

I could not find anything about this by searching for [r] pdf metadata in the existing questions. Any pointers are very welcome!

Answer 1:

I can't think of a pure R way to do this, but you can probably install your favorite PDF command-line tool (for example, the PDF toolkit, PDFtk) and use that to get at least some of the data you are looking for.

The following is a basic example using PDFtk. It assumes that pdftk is accessible in your path.

x <- getwd() ## I'll run this example in a tempdir to keep things clean
setwd(tempdir())
list.files(pattern = "\\.(txt|pdf)$")  ## 'pattern' is a regular expression, not a glob
# character(0)

pdf(file = "SomeOutputFile.pdf")
plot(rnorm(100))
dev.off()

system("pdftk SomeOutputFile.pdf data_dump output SomeOutputFile.txt")
list.files(pattern = "\\.(txt|pdf)$")
# [1] "SomeOutputFile.pdf" "SomeOutputFile.txt"

readLines("SomeOutputFile.txt")
#  [1] "InfoBegin"                    "InfoKey: Creator"            
#  [3] "InfoValue: R"                 "InfoBegin"                   
#  [5] "InfoKey: Title"               "InfoValue: R Graphics Output"
#  [7] "InfoBegin"                    "InfoKey: Producer"           
#  [9] "InfoValue: R 3.0.1"           "InfoBegin"                   
# [11] "InfoKey: ModDate"             "InfoValue: D:20131102170720" 
# [13] "InfoBegin"                    "InfoKey: CreationDate"       
# [15] "InfoValue: D:20131102170720"  "NumberOfPages: 1"            
# [17] "PageMediaBegin"               "PageMediaNumber: 1"          
# [19] "PageMediaRotation: 0"         "PageMediaRect: 0 0 504 504"  
# [21] "PageMediaDimensions: 504 504"

setwd(x)
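
Since the example relies on pdftk being on the path, a small guard before calling system() may save some confusion. This is only a sketch: it assumes the executable is named pdftk on your system, and it only checks that the executable can be found, not that it runs correctly.

```r
# Sys.which() returns "" when the executable is not found on the PATH
pdftk_path <- Sys.which("pdftk")  # assumes the executable is called "pdftk"

if (nzchar(pdftk_path)) {
  message("pdftk found at: ", pdftk_path)
} else {
  message("pdftk not found; install it or add it to your PATH")
}
```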

I'd look into what other options there are to specify what metadata gets extracted, and see if there's a convenient way to parse this information into a form that is more useful for you.
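
As one possible way to get the dump into a more useful form, here is a minimal sketch that pairs each InfoKey line with its InfoValue line. The helper name parse_pdftk_info is made up for illustration (it is not part of pdftk or any R package), and it assumes every key is followed by exactly one value, as in the output above.

```r
# Hypothetical helper: turn the lines of a pdftk data dump into a
# named character vector of Info entries (names = keys, values = values).
parse_pdftk_info <- function(lines) {
  keys   <- sub("^InfoKey: ",   "", grep("^InfoKey: ",   lines, value = TRUE))
  values <- sub("^InfoValue: ", "", grep("^InfoValue: ", lines, value = TRUE))
  # Assumption: each InfoKey line is paired with exactly one InfoValue line
  stopifnot(length(keys) == length(values))
  setNames(values, keys)
}

# A few lines copied from the dump shown above; in practice you would pass
# readLines("SomeOutputFile.txt") instead.
dump <- c("InfoBegin", "InfoKey: Creator", "InfoValue: R",
          "InfoBegin", "InfoKey: Title", "InfoValue: R Graphics Output",
          "NumberOfPages: 1")

info <- parse_pdftk_info(dump)
info[["Title"]]
# [1] "R Graphics Output"
```

Lines that are not Info entries, such as NumberOfPages or the PageMedia block, are ignored by this sketch; they could be handled separately by splitting on the first ": ".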