Transpose list elements and match to value in R

Posted 2019-08-09 21:51

Question:

Here is the code written by @Ken S to extract data from OCR'd PDFs, which gives a data frame like:

    name       Status       Page              Words
    test.pdf   Present      test_1, test_3    gym, school
    test1.pdf  Present      test1_4, test1_7  gym, swimming pool
    test2.pdf  Not Present  -                 -

But I want the data to be flattened, so that the output looks like:

fileName   Status       Page     Words          TEXT
test.pdf   Present      test_1   gym            I go gym and school regularly
test.pdf   Present      test_1   school         I go gym and school regularly
test.pdf   Present      test_3   school         Here is the next school
test1.pdf  Present      test1_4  swimming pool  In swimming pool
test1.pdf  Present      test1_7  gym            next to Gold gym
test2.pdf  Not Present  -        -

fileName = name of the file

Status = "Present" if any word is found, else "Not Present"

Page = the page on which the word was found; the suffixes "_1" and "_3" are page numbers, e.g. "gym" was found on page "test_1" and "school" on page "test_3".

Words = the words that were found; e.g. only "gym" and "school" were found on pages 1 and 3 of test.pdf, and only "swimming pool" and "gym" on pages 4 and 7 of test1.pdf.

TEXT = the text in which the word was found
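To make the target layout concrete, here is a minimal base-R sketch that builds one row per (file, page, word) match. It is only an illustration under stated assumptions: the `texts` list below is toy data standing in for the OCR output (one character vector per file, one element per page), and `flatten_file()` is a hypothetical helper name, not part of the original code.

```r
# Toy stand-in for the OCR output (assumption): one character vector
# per file, one element per page, already lower-cased.
texts <- list(
  "test.pdf"  = c("i go gym and school regularly", "nothing here",
                  "here is the next school"),
  "test1.pdf" = c("in swimming pool", "next to gold gym"),
  "test2.pdf" = c("no keywords on this page")
)
strings <- c("school", "gym", "swimming pool")

# flatten_file() is a hypothetical helper: it returns one row per
# (page, word) pair where the word occurs on that page.
flatten_file <- function(file, pages, strings) {
  # Every (page, word) combination for this file.
  grid <- expand.grid(page = seq_along(pages), word = strings,
                      stringsAsFactors = FALSE)
  # Keep only the combinations where the word is present on the page.
  found <- vapply(seq_len(nrow(grid)), function(i) {
    grepl(grid$word[i], pages[grid$page[i]], fixed = TRUE)
  }, logical(1))
  grid <- grid[found, , drop = FALSE]

  # No match at all: a single placeholder row.
  if (nrow(grid) == 0) {
    return(data.frame(fileName = file, Status = "Not Present",
                      Page = "-", Words = "-", TEXT = "-",
                      stringsAsFactors = FALSE))
  }

  grid <- grid[order(grid$page), , drop = FALSE]
  data.frame(fileName = file, Status = "Present",
             Page  = paste0(tools::file_path_sans_ext(file), "_", grid$page),
             Words = grid$word,
             TEXT  = pages[grid$page],
             stringsAsFactors = FALSE)
}

# Stack the per-file data frames into one long table.
result <- do.call(rbind, Map(flatten_file, names(texts), texts,
                             MoreArgs = list(strings = strings)))
rownames(result) <- NULL
print(result)
```

With real data, `texts` would come from the OCR step and would need names matching the PDF file names; the page numbers in the toy data differ from those in the tables above.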

This is the current code:

library(pdftools)   # provides pdf_convert()
library(tesseract)  # provides ocr()

all_files <- Sys.glob("*.pdf")
strings   <- c("school", "gym", "swimming pool")

# Read text from pdfs
texts <- lapply(all_files, function(x){
  img_file <- pdf_convert(x, format="tiff", dpi=400)
  return( tolower(ocr(img_file)) )
})

# Check for presence of each word in strings
pages <- words <- vector("list", length(texts))
for(d in seq_along(texts)){
  for(w in seq_along(strings)){
    intermed   <- grep(strings[w], texts[[d]])
    words[[d]] <- c(words[[d]], 
                    strings[w][ (length(intermed) > 0) ])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))

Page <- Map(paste0, fileName, "_", pages, collapse=", ")
# Mark files without any match; testing for a comma would also blank single-page matches
Page[lengths(pages) == 0] <- "-"
Page <- t(data.frame(Page))

Words    <- sapply(words, paste0, collapse=", ")
#Words <- unlist(words, recursive = T)
Status   <- ifelse(nchar(Words) > 0, "Present", "Not Present")

data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)

I tried replacing that line with Words <- unlist(words, recursive = T), but got this error:

Error in data.frame(row.names = fileName, Status = Status, Page = Page,  : 
  row names supplied are of the wrong length
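A likely cause is a length mismatch: unlist() returns one element per matched word, while fileName has one element per file. The sketch below reproduces the shapes with toy values (assumed for illustration, not real OCR output) and shows one way to bring the lengths back in line by repeating each file name once per match.

```r
# Toy values with the same shapes as in the code above (assumption).
fileName <- c("test", "test1", "test2")
words    <- list(c("school", "gym"), c("gym", "swimming pool"), NULL)

collapsed <- sapply(words, paste0, collapse = ", ")  # one string per file
flat      <- unlist(words)                           # one string per match

length(fileName)   # 3
length(collapsed)  # 3
length(flat)       # 4, so it cannot be paired with 3 row names

# Repeat each file name once per matched word so both columns align.
long <- data.frame(fileName = rep(fileName, lengths(words)),
                   Words    = flat,
                   stringsAsFactors = FALSE)
print(long)
```

Note that files without any match (here "test2") drop out of `long`, so a "Not Present" row would have to be added separately.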

Any suggestions on what should be changed?

Thanks

P.S.: access to sample files