Here is the code written by @Ken S to extract data from OCR'd PDFs. It gives a data frame like this:
name       Status       Page              Words
test.pdf   Present      test_1, test_3    gym, school
test1.pdf  Present      test1_4, test1_7  gym, swimming pool
test2.pdf  Not Present  -                 -
But I want the data to be flattened, with one row per word found, so that the output looks like this:
fileName   Status       Page     Words          TEXT
test.pdf   Present      test_1   gym            I go gym and school regularly
test.pdf   Present      test_1   school         I go gym and school regularly
test.pdf   Present      test_3   school         Here is the next school
test1.pdf  Present      test1_4  swimming pool  In swimming pool
test1.pdf  Present      test1_7  gym            next to Gold gym
test2.pdf  Not Present  -        -
fileName = the name of the file.
Status = "Present" if any word is found, otherwise "Not Present".
Page = the page on which the word was found. The suffixes "_1" and "_3" are page numbers: on page "test_1" the word "gym" was found, and on page "test_3" the word "school" was found.
Words = the words that were found. For example, only "gym" and "school" were found on pages 1 and 3 of test.pdf, and only "swimming pool" and "gym" were found on pages 4 and 7 of test1.pdf.
TEXT = the text in which the word was found.
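The layout described above can be sketched in R as follows. This is only a minimal sketch under stated assumptions: `texts` is taken to be a list named by file, holding one lower-cased string per page (hard-coded here as a stand-in for the OCR step), and `flatten_matches` is a hypothetical helper name, not part of any package.

```r
# Hypothetical stand-in for the OCR output: one character vector per file,
# one element per page, already lower-cased.
texts <- list(
  test  = c("i go gym and school regularly",
            "nothing here",
            "here is the next school"),
  test2 = c("no match on this page")
)
strings <- c("school", "gym", "swimming pool")

# Hypothetical helper: one row per (file, page, word) match,
# plus a single placeholder row for files without any match.
flatten_matches <- function(texts, strings) {
  per_file <- lapply(names(texts), function(f) {
    pages <- texts[[f]]
    # Every page/word combination for this file
    grid <- expand.grid(page = seq_along(pages), word = strings,
                        stringsAsFactors = FALSE)
    # TRUE where the word occurs literally in the page text
    hit <- vapply(seq_len(nrow(grid)), function(i) {
      grepl(grid$word[i], pages[grid$page[i]], fixed = TRUE)
    }, logical(1))
    grid <- grid[hit, , drop = FALSE]
    if (nrow(grid) == 0) {
      return(data.frame(fileName = f, Status = "Not Present",
                        Page = "-", Words = "-", TEXT = "-",
                        stringsAsFactors = FALSE))
    }
    grid <- grid[order(grid$page), , drop = FALSE]
    data.frame(fileName = f, Status = "Present",
               Page = paste0(f, "_", grid$page),
               Words = grid$word, TEXT = pages[grid$page],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_file)
}

res <- flatten_matches(texts, strings)
print(res)
```

Building one small data frame per file and binding them at the end keeps every column the same length, so the row count follows the number of matches rather than the number of files.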
This is the code:
library(pdftools)   # pdf_convert()
library(tesseract)  # ocr()

all_files <- Sys.glob("*.pdf")
strings <- c("school", "gym", "swimming pool")

# Read text from pdfs (one lower-cased string per page)
texts <- lapply(all_files, function(x){
  img_file <- pdf_convert(x, format="tiff", dpi=400)
  return( tolower(ocr(img_file)) )
})

# Check for presence of each word in strings
pages <- words <- vector("list", length(texts))
for(d in seq_along(texts)){
  for(w in seq_along(strings)){
    # Indices of the pages on which the word occurs
    intermed <- grep(strings[w], texts[[d]])
    words[[d]] <- c(words[[d]],
                    strings[w][ (length(intermed) > 0) ])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))
Page <- Map(paste0, fileName, "_", pages, collapse=", ")
Page[!grepl(",", Page)] <- "-"
Page <- t(data.frame(Page))
Words <- sapply(words, paste0, collapse=", ")
#Words <- unlist(words, recursive = T)
Status <- ifelse(sapply(Words, nchar) > 0, "Present", "Not Present")
data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)
I tried replacing that line with Words <- unlist(words, recursive = T), but I get this error:
Error in data.frame(row.names = fileName, Status = Status, Page = Page, :
row names supplied are of the wrong length
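A likely cause, shown here as a small sketch with made-up match results rather than real OCR output: `unlist(words)` yields one element per matched word instead of one per file, so its length no longer agrees with `fileName`. Repeating each file name once per match (via `lengths()`) is one way to line the two up again.

```r
# Assumed example values standing in for the loop results
words <- list(c("school", "gym"), c("gym", "swimming pool"), character(0))
fileName <- c("test", "test1", "test2")

flat_words <- unlist(words)
length(flat_words)   # one entry per matched word
length(fileName)     # one entry per file

# Repeat each file name once per matched word so both vectors align
flat_files <- rep(fileName, lengths(words))
data.frame(fileName = flat_files, Words = flat_words)
```

Files without any match drop out of this result, so they would need to be added back as separate "Not Present" rows.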
Any suggestions on what should be changed?
Thanks
P.S : access to sample files