Extracting one text file from multiple zip archives

Posted 2019-09-06 01:21

Question:

I am trying to extract one text file from each of the zip files located in one folder. Then I want to combine those text files into one dataframe.

The folder has multiple Zip files:

pf_0915.zip
pf_0914.zip
pf_0913.zip
.....

Inside each of those zip files are multiple text files. I am only interested in the one called abc.txt. This is a fixed-width format file without a header. I have already set up a read for this file using read_fwf. Since all the extracted text files have the same name, it might be better to rename them according to the name of their archive, i.e. the abc.txt from pf_0915.zip could be called abc_0915.txt. Once they are all read, they should be combined into a large file called abcCombined.txt.

Or, as each new abc.txt file is read, it could be appended to abcCombined.txt.

I have tried various versions of unzip() and unz() without much success, and without looping through all the zip files. Finally, since this directory contains many zip files, is there a way to read only some of them using pattern matching, like grep? For example, I would be interested in reading only the September files, those .._09...txt.

Any hints would be appreciated.

Answer 1:

I can't comment because of my low reputation, so here is a partial answer:

If you know the file name within the various zips the syntax to get just that file would be something like the following:

my_data <- read.csv(unz("pf_0915.zip", "abc.txt"))

This is the code for a CSV, obviously, not a fixed-width text file, but if you already have that read set up, it'll be something like

my_data <- read_fwf(unz("pf_0915.zip", "abc.txt"), ...)

with all your other parameters in the ...

You can do this in a loop if you have many zips and accumulate the results in a data frame, data table, or whatever structure floats your boat, as sketched below.
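For instance, a minimal base-R sketch of that loop could look like the following; the folder path and the widths = c(4, 1, 4, 2) are placeholders to substitute with your own values:

# Sketch: loop over every zip in the folder, read abc.txt out of each via unz(),
# and row-bind the pieces into one data frame. Folder path and widths are made up.
zips <- list.files("path/to/zips", pattern = "\\.zip$", full.names = TRUE)

combined <- NULL
for (z in zips) {
  one <- read.fwf(unz(z, "abc.txt"), widths = c(4, 1, 4, 2))
  one$source_zip <- basename(z)        # remember which archive each row came from
  combined <- rbind(combined, one)
}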



Answer 2:

The following:

  1. Creates a vector of the files in a directory
  2. Uses the list parameter to unzip() to see the metadata for the contents
  3. Builds a regular expression to find only the target file (I did that in the event your use-case generalizes to a broader pattern)
  4. Tests if any of the files meet your criteria
  5. Keeps only those files in a resultant vector
  6. Iterates over that vector and
    • Extracts only the target file into a temporary directory
    • Reads it into a data.frame
    • Ultimately binds the individual data.frames into one big one

You can write out the resultant combined data.frame however you wish.

library(purrr)

target_dir <- "so"          # folder that holds the zip archives
extract_file <- "abc.txt"   # the one file we want from each archive

list.files(target_dir, full.names=TRUE) %>% 
  keep(~any(grepl(sprintf("^%s$", extract_file), unzip(., list=TRUE)$Name))) %>% 
  map_df(function(x) {
    td <- tempdir()
    read.fwf(unzip(x, extract_file, exdir=td), widths=c(4,1,4,2))  # widths here are just an example
  }) -> combined_df
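If you want the combined result written to disk as the abcCombined.txt mentioned in the question, base R's write.table() is one option (the tab separator and dropping row/column names are just one choice):

# Write the combined data.frame out as a single text file.
write.table(combined_df, "abcCombined.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)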

The version below just expands some of the shortcuts in the one above:

# TRUE if the zip at zip_path contains a file named exactly `name`
only_files_with_this_name <- function(zip_path, name) {
  zip_contents <- unzip(zip_path, list=TRUE)  # list=TRUE returns metadata only, extracts nothing
  look_for <- sprintf("^%s$", name)
  any(grepl(look_for, zip_contents$Name))
}

list.files(target_dir, full.names=TRUE) %>% 
  keep(only_files_with_this_name, name=extract_file) %>% 
  map_df(function(x) {
    td <- tempdir()
    file_in_zip <- unzip(x, extract_file, exdir=td)
    res <- read.fwf(file_in_zip, widths=c(4,1,4,2))
    unlink(file_in_zip)  # clean up the extracted copy
    res                  # return the data.frame so map_df() can row-bind it
  }) -> combined_df
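As for reading only the September archives, one way (a sketch assuming the month is the "09" in names such as pf_0915.zip) is to let list.files() do the pattern matching up front and then feed the result into the same keep()/map_df() chain shown above:

# Only archives whose names look like *_09YY.zip (September), then reuse the pipeline above.
september_zips <- list.files(target_dir, pattern = "_09\\d{2}\\.zip$", full.names = TRUE)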