Fast reading (by chunk?) and processing of a file

Published 2019-09-08 08:17

Question:

I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info). For example:

library(gdata)
nx = 150 # ncol of my arrays
ny = 130 # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10
for (i in 1:niter) {
  write(paste(i, 'is the current iteration'), myfile, append=T)
  z = matrix(runif(nx*ny), nrow = ny) # random numbers with dim(nx, ny)
  write.fwf(z, myfile, append=T, rownames=F, colnames=F) #write in fixed width format
}

With nx=5 and ny=2, I would have a file like this:

# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ... 

I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?

Given the output is regular, I thought readr would be a good idea (?). The only way I can think of is to do it manually by chunks in order to eliminate the useless info lines:

library(readr)
ztot = numeric(niter*nx*ny) # allocate a vector with final size 
# (the arrays will be vectorized and successively appended to each other)
for (i in 1:niter) {
  nskip = (i-1)*(ny+1) + 1 # number of lines to skip, including the info lines
  z = read_table(myfile, skip = nskip, n_max = ny, col_names=F)
  z = as.vector(t(z))
  ifirst = (i-1)*ny*nx + 1 # appropriate index
  ztot[ifirst:(ifirst+nx*ny-1)] = z
}

# The arrays are actually spatial rasters. Compute the coordinates 
# and put everything in DF for future analysis:
x = rep(rep(1:nx, ny), niter) 
y = rep(rep(1:ny, each=nx), niter) 

myDF = data.frame(x=x, y=y, z=ztot) # use the accumulated vector ztot, not the last chunk z

But this is not fast enough. How can I achieve this faster?

Is there a way to read everything at once and delete the useless rows afterwards?

Alternatively, is there no reading function accepting a vector with precise locations as skip argument, rather than a single number of initial rows?

PS: note the reading operation is to be repeated on many files (same structure) located in different directories, in case it influences the solution...


EDIT The following solution (reading all lines with readLines, removing the undesirable ones, and then processing the rest) is a faster alternative when niter is very high:

bylines <- readLines(myfile)
dummylines = seq(1, by=(ny+1), length.out=niter)
bylines = bylines[-dummylines] # remove dummy, undesirable lines
asOneChar <- paste(bylines, collapse='\n') # Then process output from readLines
library(data.table)
ztot <- fread(asOneChar)
ztot <- c(t(ztot))

Discussion on how to process the results from readLines can be found here
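As a side note, newer versions of data.table also let you pass the character vector straight to fread via its text argument, which skips the paste(collapse='\n') step. A minimal sketch, assuming bylines from above and a data.table version that provides fread(text = ...):

library(data.table)
# feed the cleaned lines directly to fread (skips the paste step)
ztot <- fread(text = bylines)  # bylines already has the dummy lines removed
ztot <- c(t(ztot))             # flatten row-wise, as above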

Answer 1:

Pre-processing the file with a command line tool (i.e., not in R) is actually way faster. For example with awk:

library(data.table) # needed for fread below

tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand) # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))

Lines can be removed on the basis of a pattern or of indices, for example. This was suggested by @Roland here.
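Since the same operation has to be repeated on many files with the same structure in different directories (see the PS in the question), one option is to wrap the awk call in a small helper and lapply it over list.files(). A minimal sketch, where read_one and "path/to/runs" are placeholder names, not part of the original answer:

library(data.table)

# hypothetical helper: strip the info lines with awk, read the cleaned file,
# and return the array values as one row-wise vector
read_one <- function(f) {
  tmp <- tempfile(fileext = ".txt")
  system(paste("awk '!/is the current iteration/'", shQuote(f), ">", shQuote(tmp)))
  z <- c(t(fread(tmp)))
  unlink(tmp)
  z
}

# adjust pattern/recursive to your directory layout
files <- list.files("path/to/runs", pattern = "\\.txt$",
                    recursive = TRUE, full.names = TRUE)
ztot_list <- lapply(files, read_one)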



Answer 2:

Not sure if I understood your problem correctly. Running your script created a file with 1310 lines, with a "This is iteration N" header printed at these lines:

Line 1: This is iteration 1
Line 132: This is iteration 2
Line 263: This is iteration 3
Line 394: This is iteration 4
Line 525: This is iteration 5
Line 656: This is iteration 6
Line 787: This is iteration 7
Line 918: This is iteration 8
Line 1049: This is iteration 9
Line 1180: This is iteration 10

Now there is data between these lines that you want to read, while skipping these 10 header strings.

You can do this by tricking read.table: set comment.char = "T", which makes read.table treat all lines starting with the letter "T" as comments and skip them.

data<-read.table("bigFile.txt",comment.char = "T")

This will give you a data.frame of 1300 observations of 150 variables:

> dim(data)
[1] 1300  150

For non-consistent strings, read your data with read.table and the fill=TRUE flag. This will not break your input process.

data<-read.table("bigFile.txt",fill=TRUE)

Your data will look like this:

> head(data)

          V1          V2           V3         V4          V5        V6        V7
1: 1.0000000          is          the    current   iteration        NA        NA
2: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
3: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
4: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
5: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
6: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541

Now you can see how the strings are distributed across the columns, so you can simply subset your data set with pattern matching on the columns that contain those strings. For example:

library(data.table)
data<-as.data.table(data)
cleaned_data<-data[!(V3 %like% "the"),]

> head(cleaned_data)
          V1          V2           V3         V4          V5        V6        V7
1: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
2: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
3: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
4: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
5: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
6: 0.3943193 0.508373900 0.2131134905 0.92474343 0.432134031 0.4585807 0.9811607
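One caveat worth adding: because the header words were mixed into the data, read.table will have stored the affected columns as character (or factor, depending on your R version), so after filtering you will likely want to coerce everything back to numeric. A hedged sketch on that assumption:

# coerce every column back to numeric after dropping the header rows
# (as.character first handles the factor case on older R versions)
cleaned_data <- cleaned_data[, lapply(.SD, function(x) as.numeric(as.character(x)))]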


Tags: r import