I'm trying to load this ugly-formatted data set into my R session: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for
Weekly SST data starts week centered on 3Jan1990
               Nino1+2      Nino3        Nino34       Nino4
 Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA
 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3
 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3
 17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.3
So far, I can read the lines with
x = readLines(path)
But the file mixes whitespace with '-' as separators, and I'm not a regex expert. I'd appreciate any help on turning this into a nice and clean R data frame. Thanks!
An easy method for non-programmers (who are willing to go outside of R)
You now have a .csv file that is also easy for a human to read; save it. Load it into Excel, R, or whatever, and continue processing.
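If you then want the data back in R, a one-liner suffices (a sketch; wksst.csv is a hypothetical name for the saved copy):

dat <- read.csv("wksst.csv", stringsAsFactors = FALSE)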
Another way to determine widths...
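A sketch of what such a read.fwf() call might look like; the skip count, widths, and column names below are assumptions inferred from the sample rows in the question, not taken from the linked recipe:

df <- read.fwf(
  file      = url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip      = 4,   # skip the header lines (adjust to the actual file)
  widths    = c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4),
  col.names = c("Week", "SST12", "SSTA12", "SST3", "SSTA3",
                "SST34", "SSTA34", "SST4", "SSTA4")
)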
The -1 in the widths argument says there is a one-character column that should be ignored; the -5 says there is a five-character column that should be ignored, and likewise for the other negative widths.
Reference: https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-4/recipe-4-6 (R Cookbook by Paul Teetor, recipe 4.6)
I document here the list of alternatives for reading fixed-width files in R, and provide some benchmarks for which is fastest.
My preferred approach is to combine fread with stringi; it's competitive as the fastest approach, and has the added benefit (IMO) of storing your data as a data.table:
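A sketch of that approach follows; the local file name, field positions, and column names are assumptions read off the sample rows in the question, not the original benchmark code:

library(data.table)
library(stringi)

# begin/end character positions of each field (inferred from the sample rows above)
col_ends <- list(beg = c(2, 16, 20, 29, 33, 42, 46, 55, 59),
                 end = c(10, 19, 23, 32, 36, 45, 49, 58, 62))
col_nms  <- c("Week", "SST12", "SSTA12", "SST3", "SSTA3",
              "SST34", "SSTA34", "SST4", "SSTA4")

# read each line as one long string, then slice out the fixed-width fields;
# strip.white = FALSE keeps the leading spaces so the positions above line up
DT <- fread("wksst8110.for", header = FALSE, sep = "\n",
            skip = 4, strip.white = FALSE)
DT <- DT[, lapply(seq_along(col_ends$beg), function(ii)
             stri_sub(V1, col_ends$beg[ii], col_ends$end[ii]))]
setnames(DT, col_nms)

# the sliced fields are still character; convert the numeric ones
DT[, (col_nms[-1]) := lapply(.SD, as.numeric), .SDcols = col_nms[-1]]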
Note that fread automatically strips leading and trailing whitespace -- sometimes this is undesirable, in which case set strip.white = FALSE.

We could also have started with a vector of column widths ww and converted it to the begin/end positions.
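For instance (a sketch; the ww values below are assumptions matching the positions used above, with the filler columns included and then dropped):

ww   <- c(1, 9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)   # widths of every column, fillers included
end  <- cumsum(ww)
beg  <- end - ww + 1
keep <- c(2, 4, 5, 7, 8, 10, 11, 13, 14)               # indices of the real data fields
col_ends <- list(beg = beg[keep], end = end[keep])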
And we could have picked which columns to exclude more robustly by using negative indices; then replace col_ends$beg[ii] with abs(col_ends$beg[ii])
and make the corresponding change in the next line.

Lastly, if you want the column names to be read programmatically as well, you could clean up with readLines, as sketched below (note that combining this step with fread would require creating a copy of the table in order to remove the header row, and would thus be inefficient for large data sets).
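A rough sketch of that cleanup; it assumes the group names ("Nino1+2", etc.) sit on header line 3 and the field names ("Week", "SST", "SSTA") on header line 4, so adjust the indices to the actual file:

hdr     <- readLines("wksst8110.for", n = 4)
groups  <- strsplit(trimws(hdr[3]), "\\s+")[[1]]   # "Nino1+2" "Nino3" "Nino34" "Nino4"
fields  <- strsplit(trimws(hdr[4]), "\\s+")[[1]]   # "Week" "SST" "SSTA" ...
col_nms <- c(fields[1], paste(rep(groups, each = 2), fields[-1], sep = "_"))
setnames(DT, col_nms)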
You can now use the read_fwf() function in Hadley Wickham's readr package. A huge performance improvement is to be expected, compared to base read.fwf().
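A sketch of such a call; the widths, column names, and local file name are assumptions based on the sample in the question:

library(readr)
dat <- read_fwf(
  "wksst8110.for",
  fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4),
             col_names = c("Week", "SST12", "SSTA12", "SST3", "SSTA3",
                           "SST34", "SSTA34", "SST4", "SSTA4")),
  skip = 4   # skip the header lines
)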
This is a fixed width file. Use read.fwf() to read it:
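For example (a sketch; the widths and skip count are inferred from the sample above and may need adjusting):

x <- read.fwf(
  file   = url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip   = 4,
  widths = c(12, 7, 4, 9, 4, 9, 4, 9, 4)
)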
Update

The package readr (released April, 2015) provides a simple and fast alternative:
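For instance, a sketch using readr's fwf_positions(); the start/end positions and names are assumptions read off the sample rows above:

library(readr)
dat <- read_fwf(
  "wksst8110.for",
  fwf_positions(start = c(2, 16, 20, 29, 33, 42, 46, 55, 59),
                end   = c(10, 19, 23, 32, 36, 45, 49, 58, 62),
                col_names = c("Week", "SST12", "SSTA12", "SST3", "SSTA3",
                              "SST34", "SSTA34", "SST4", "SSTA4")),
  skip = 4
)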
Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf().
First off, that question is directly from the Coursera "Getting and Cleaning Data" course by Jeff Leek. While there is another part of the question, the tough part is reading the file.
That said, the course is mostly intended for learning.
I hate R's fixed-width procedure. It is slow, and for a large number of variables it very quickly becomes a pain to negate certain columns, etc.
I think it's easier to use readLines() and then use substr() on the result to make your variables.
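A sketch of that approach; the character positions and column names are assumptions read off the sample rows in the question:

x <- readLines("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for")
x <- x[-(1:4)]   # drop the header lines (adjust the count to the actual file)
df <- data.frame(
  Week   = substr(x, 2, 10),
  SST12  = as.numeric(substr(x, 16, 19)),
  SSTA12 = as.numeric(substr(x, 20, 23)),
  SST3   = as.numeric(substr(x, 29, 32)),
  SSTA3  = as.numeric(substr(x, 33, 36)),
  SST34  = as.numeric(substr(x, 42, 45)),
  SSTA34 = as.numeric(substr(x, 46, 49)),
  SST4   = as.numeric(substr(x, 55, 58)),
  SSTA4  = as.numeric(substr(x, 59, 62)),
  stringsAsFactors = FALSE
)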