I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info). For example:
library(gdata)
nx = 150 # ncol of my arrays
ny = 130 # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10
for (i in 1:niter) {
write(paste(i, 'is the current iteration'), myfile, append=T)
z = matrix(runif(nx*ny), nrow = ny) # random numbers in a matrix with ny rows and nx columns
write.fwf(z, myfile, append=T, rownames=F, colnames=F) #write in fixed width format
}
With nx=5 and ny=2, I would have a file like this:
# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...
I want to read the successive arrays as fast as possible and put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?
Given that the output is regular, I thought readr would be a good idea (?).
The only way I can think of is to do it manually by chunks, in order to skip the useless info lines:
library(readr)
ztot = numeric(niter*nx*ny) # allocate a vector with final size
# (the arrays will be vectorized and successively appended to each other)
for (i in 1:niter) {
nskip = (i-1)*(ny+1) + 1 # number of lines to skip, including the info lines
z = read_table(myfile, skip = nskip, n_max = ny, col_names=F)
z = as.vector(t(z))
ifirst = (i-1)*ny*nx + 1 # appropriate index
ztot[ifirst:(ifirst+nx*ny-1)] = z
}
# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in DF for future analysis:
x = rep(rep(seq_len(nx), ny), niter)
y = rep(rep(seq_len(ny), each=nx), niter)
myDF = data.frame(x=x, y=y, z=ztot) # use ztot (all arrays), not z (only the last chunk)
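To check that the x and y coordinates line up with the row-major flattening produced by as.vector(t(z)), a tiny worked example can help. This is only a sketch with made-up sizes (nx = 3, ny = 2) and a made-up matrix, not part of the real data:

```r
# Illustrative sizes only; the real arrays are much larger.
nx <- 3
ny <- 2

# A small matrix whose values make the ordering obvious:
# row 1 = 1 2 3, row 2 = 4 5 6
z <- matrix(1:(nx * ny), nrow = ny, byrow = TRUE)

# Flatten row by row, as done for each chunk read from the file.
zvec <- as.vector(t(z))

# Column index varies fastest, row index changes every nx values.
x <- rep(seq_len(nx), times = ny)
y <- rep(seq_len(ny), each = nx)

# Each (row, column) pair should point back to the matching element of z.
stopifnot(all(z[cbind(y, x)] == zvec))
```

If the final stopifnot passes, x is the column index and y is the row index of each flattened value, which is the convention used for the data.frame.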
But this is not fast enough. How can I achieve this faster?
Is there a way to read everything at once and delete the useless rows afterwards?
Alternatively, is there a reading function that accepts a vector of precise line locations as its skip argument, rather than a single number of initial rows?
PS: note the reading operation is to be repeated on many files (same structure) located in different directories, in case it influences the solution...
EDIT
The following solution (reading all lines with readLines, removing the undesirable ones, and then processing the rest) is a faster alternative when niter is very high:
bylines <- readLines(myfile)
dummylines = seq(1, by=(ny+1), length.out=niter)
bylines = bylines[-dummylines] # remove dummy, undesirable lines
asOneChar <- paste(bylines, collapse='\n') # Then process output from readLines
library(data.table)
ztot <- fread(asOneChar) # fread treats an input string containing '\n' as literal text
ztot <- c(t(ztot))
A discussion of how to process the results from readLines can be found here
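For reference, here is a self-contained sketch of the same read-everything-then-drop-lines idea on a tiny file, ending with the data.frame. It swaps fread for base R's scan(text = ...) so that it runs without extra packages, and it writes the test file with write() instead of write.fwf. The temporary file, the small sizes, and the zlist helper are assumptions for illustration only, and I have not timed it against the fread version:

```r
# Illustrative sizes and a temporary file; not the real data.
nx <- 5
ny <- 2
niter <- 3
myfile <- tempfile(fileext = ".txt")

# Write a file with the same layout: one info line, then ny rows of nx numbers.
set.seed(1)
zlist <- vector("list", niter)  # kept only to compare with what is read back
for (i in seq_len(niter)) {
  write(paste(i, "is the current iteration"), myfile, append = TRUE)
  zlist[[i]] <- matrix(round(runif(nx * ny), 6), nrow = ny)
  # t() so that each written line is one row of the matrix
  write(t(zlist[[i]]), myfile, ncolumns = nx, append = TRUE)
}

# Read every line once, drop the info lines by position, parse the rest.
bylines <- readLines(myfile)
dummylines <- seq(1, by = ny + 1, length.out = niter)
ztot <- scan(text = bylines[-dummylines], quiet = TRUE)

# Coordinates in the same row-major order as the values in the file.
myDF <- data.frame(
  x = rep(rep(seq_len(nx), times = ny), times = niter),
  y = rep(rep(seq_len(ny), each = nx), times = niter),
  z = ztot
)
```

Since the same steps apply to every file, they could be wrapped in a function and called on each path when looping over the directories.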