Read fixed width text file

2019-01-01 10:02发布

问题:

I\'m trying to load this ugly-formatted data-set into my R session: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

Weekly SST data starts week centered on 3Jan1990

Nino1+2      Nino3        Nino34        Nino4
Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA 
03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3 
10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3 
17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.3

So far, i can read the lines with

  x = readLines(path)

But the file mixes \'white space\' with \'-\' as separators, and i\'m not a regex expert. I Appreciate any help on turning this into a nice and clean R data-frame. thanks!

回答1:

This is a fixed width file. Use read.fwf() to read it:

x <- read.fwf(
  file=url(\"http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for\"),
  skip=4,
  widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))

head(x)

            V1   V2   V3   V4   V5   V6   V7   V8  V9
1  03JAN1990   23.4 -0.4 25.1 -0.3 26.6  0.0 28.6 0.3
2  10JAN1990   23.4 -0.8 25.2 -0.3 26.6  0.1 28.6 0.3
3  17JAN1990   24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4  24JAN1990   24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5  31JAN1990   25.1 -0.2 25.8 -0.2 26.7  0.1 28.4 0.2
6  07FEB1990   25.8  0.2 26.1 -0.1 26.8  0.1 28.4 0.3

Update

The package readr (released April, 2015) provides a simple and fast alternative.

library(readr)

x <- read_fwf(
  file=\"http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for\",   
  skip=4,
  fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))

Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf ().



回答2:

Another way to determine widths...

df <- read.fwf(
  file=url(\"http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for\"),
  widths=c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4),
  skip=4
)

The -1 in the widths argument says there is a one-character column that should be ignored,the -5 in the widths argument says there is a five-character column that should be ignored, likewise...

ref : https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-4/recipe-4-6



回答3:

First off, that question is directly from a the Coursera \"Get Data and Clean It\" course by Leeks. While there is another part of the question, the tough part is reading the file.

That said, the course is mostly intended for learning.

I hate R\'s fixed width procedure. It is slow and for large number of variables, it very quickly becomes a pain to negate certain columns, etc.

I think its easier to use readLines() and then from that use substr() to make your variables

x <- readLines(con=url(\"http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for\"))

# Skip 4 lines
x <- x[-(1:4)]

mydata <- data.frame(var1 = substr(x, 1, 10),
                     var2 = substr(x, 16, 19),
                     var3 = substr(x, 20, 23),
                     var4 = substr(x, 29, 32)  # and so on and so on
                     )


回答4:

You can now use the read_fwf() function in Hadley Wickham\'s readr package.

  • Annoucement: http://blog.rstudio.org/2015/04/09/readr-0-1-0/
  • Development page: https://github.com/hadley/readr
  • CRAN page: http://cran.r-project.org/web/packages/readr/index.html

A huge performance improvement is to be expected, compared to base read.fwf().



回答5:

I document here the list of alternatives for reading fixed-width files in R, as well as providing some benchmarks for which is fastest.

My preferred approach is to combine fread with stringi; it\'s competitive as the fastest approach, and has the added benefit (IMO) of storing your data as a data.table:

library(data.table)
library(stringi)

col_ends <- 
  list(beg = c(1, 10, 15, 19, 23, 28, 32, 36,
               41, 45, 49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35,
               40, 44, 48, 53, 57, 61))

data = fread(
  \"http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for\", 
  header = FALSE, skip = 4L, sep = NULL
  )[, lapply(1:(length(col_ends$beg)),
             function(ii) 
               stri_sub(V1, col_ends$beg[ii], col_ends$end[ii]))
    ][ , paste0(\"V\", c(2, 5, 8, 11)) := NULL]
#              V1   V3   V4   V6   V7   V9  V10  V12  V13
#    1: 03JAN1990 23.4 -0.4 25.1 -0.3 26.6  0.0 28.6  0.3
#    2: 10JAN1990 23.4 -0.8 25.2 -0.3 26.6  0.1 28.6  0.3
#    3: 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6  0.3
#    4: 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4  0.2
#    5: 31JAN1990 25.1 -0.2 25.8 -0.2 26.7  0.1 28.4  0.2
#   ---                                                  
# 1365: 24FEB2016 27.1  0.9 28.4  1.8 29.0  2.1 29.5  1.4
# 1366: 02MAR2016 27.3  1.0 28.6  1.8 28.9  1.9 29.5  1.4
# 1367: 09MAR2016 27.7  1.2 28.6  1.6 28.9  1.8 29.6  1.5
# 1368: 16MAR2016 27.5  1.0 28.8  1.7 28.9  1.7 29.6  1.4
# 1369: 23MAR2016 27.2  0.9 28.6  1.4 28.8  1.5 29.5  1.2

Note that fread automatically strips leading and trailing whitespace -- sometimes, this is undesirable, in which case set strip.white = FALSE.


We could also have started with a vector of column widths ww by doing:

ww <- c(9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)
nd <- cumsum(ww)

col_ends <-
  list(beg = c(1, nd[-length(nd)]+1L),
       end = nd)

And we could have picked which columns to exclude more robustly by using negative indices like:

col_ends <- 
  list(beg = c(1, -10, 15, 19, -23, 28, 32, -36,
               41, 45, -49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35,
               40, 44, 48, 53, 57, 61))

Then replace col_ends$beg[ii] with abs(col_ends$beg[ii]) and in the next line:

paste0(\"V\", which(col_ends$beg < 0))

Lastly, if you want the column names to be read programatically as well, you could clean up with readLines:

cols <-
  gsub(\"\\\\s\", \"\", 
       sapply(1:(length(col_ends$beg)),
              function(ii) 
                stri_sub(readLines(URL, n = 4L)[4L], 
                         col_ends$beg[ii]+1L,
                         col_ends$end[ii]+1L)))

cols <- cols[cols != \"\"]

(note that combining this step with fread would require creating a copy of the table in order to remove the header row, and would thus be inefficient for large data sets)



回答6:

I don\'t know a thing about R, but I can provide you with a regex that will match such lines:

\\s[0-9]{2}[A-Z]{3}[0-9]{4}(\\s{5}[0-9]+\\.[0-9]+[ -][0-9]+\\.[0-9]+){4}


回答7:

The most direct way is to use read.fwf as pointed out above.

Well if the final goal is to get it into R, you could always read it into Excel for a start, use the \"text to cloumns\" feature to visually cut out the columns, then save the final file as a CSV. After that read the CSV into R.



回答8:

An easy method if for non-programmers (who are willing to go outside of R)

  1. Open the page in a web browser.
  2. Copy and paste the lines of data to a text editor. Omit column headers.
  3. Use search-and-replace to change multiple spaces with a single space (Replace space-space by space. Keep clicking until there are no double spaces left. Takes mere seconds).
  4. Use search-and-replace to replace dash (-) with a space
  5. ForUse search-and-replace to replace space by comma-space.

You now have a .csv file that is also easy for a human to read; save it. Load it into Excel, R, or whatever, and continue to processing.



标签: r fixed-width