Reading two-line headers in R

2019-01-07 16:23发布

What is the best way to read a file into R when the header has two necessary lines for the header?

This happens to me all the time, as people often use one line for the column name and then include another line underneath it for the unit of measurement. I don't want to skip anything. I want the names and the units to carry through.

Here is what a typical file with two headers might look like:

trt   biomass    yield
crop    Mg/ha    bu/ac
C2      17.76   205.92
C2      17.96   207.86
CC      17.72   197.22
CC      18.42   205.20
CCW     18.15   200.51
CCW     17.45   190.59
P       3.09    0.00
P       3.34    0.00
S2      5.13    49.68
S2      5.36    49.72

标签: r file-io
4条回答
疯言疯语
2楼-- · 2019-01-07 16:49

Almost the same method to the other answers, just shortening to 2 statements:

dat <- "trt   biomass    yield
crop    Mg/ha    bu/ac
C2      17.76   205.92
C2      17.96   207.86
CC      17.72   197.22
CC      18.42   205.20
CCW     18.15   200.51
CCW     17.45   190.59
P       3.09    0.00
P       3.34    0.00
S2      5.13    49.68
S2      5.36    49.72"

header <- sapply(read.table(text=dat, nrow=2), paste, collapse="_")
result <- read.table(text=dat, skip=2, col.names=header)

Result:

> head(result,2)
  trt_crop biomass_Mg/ha yield_bu/ac
1       C2         17.76      205.92
2       C2         17.96      207.86
...
查看更多
我命由我不由天
3楼-- · 2019-01-07 17:01

A slightly different explained step by step approach:

  1. Read only the first two lines of the files as data (without headers):

    headers <- read.table("data.txt", nrows=2, header=FALSE)
    
  2. Create the headers names with the two (or more) first rows, sappy allows to make operations over the columns (in this case paste) - read more about sapply here :

    headers_names <- sapply(headers,paste,collapse="_")
    
  3. Read the data of the files (skipping the first 2 rows):

    data <- read.csv(file="data.txt", skip = 2, header=FALSE)
    
  4. And assign the headers of step two to the data:

    names(data) <- headers_names
    

The advantage is that you would have clear control of the the parameters of read.table (such as sep for commas, and stringAsFactors - for both the headers and the data)

查看更多
别忘想泡老子
4楼-- · 2019-01-07 17:02

I would do two steps, assuming we know that the first row contains the labels, and there are always two headers.

header <- scan("file.txt", nlines = 1, what = character())
data <- read.table("file.txt", skip = 2, header = FALSE)

Then add the character vector header on as the names component:

names(data) <- header

For your data this would be

header <- scan("data.txt", nlines = 1, what = character())
data <- read.table("data.txt", skip = 2, header = FALSE)
names(data) <- header

head(data)

>     head(data)
  trt biomass  yield
1  C2   17.76 205.92
2  C2   17.96 207.86
3  CC   17.72 197.22
4  CC   18.42 205.20
5 CCW   18.15 200.51
6 CCW   17.45 190.59

If you want the units, as per @DWin's answer, then do a second scan() on line 2

header2 <- scan("data.txt", skip = 1, nlines = 1, what = character())
names(data) <- paste0(header, header2)

> head(data)
  trtcrop biomassMg/ha yieldbu/ac
1      C2        17.76     205.92
2      C2        17.96     207.86
3      CC        17.72     197.22
4      CC        18.42     205.20
5     CCW        18.15     200.51
6     CCW        17.45     190.59
查看更多
Viruses.
5楼-- · 2019-01-07 17:10

Use readLines with 2 for the limit, parse it, paste0 them together, then read in with read.table with skip =2 and header=FALSE (the default). Finish the process off with assignment of the column names:

dat <- "trt biomass yield
 crop   Mg/ha   bu/ac
 C2 17.76   205.92
 C2 17.96   207.86
 CC 17.72   197.22
 CC 18.42   205.20
 CCW    18.15   200.51
 CCW    17.45   190.59
 P  3.09    0.00
 P  3.34    0.00
 S2 5.13    49.68
 S2 5.36    49.72
 "

You would probably use a file argument but using the text argument to the read-functions makes this more self-contained:

 readLines(textConnection(dat),n=2)
#[1] "trt\tbiomass\tyield" "crop\tMg/ha\tbu/ac" 
 head2 <- read.table(text=readLines(textConnection(dat),n=2), sep="\t", stringsAsFactors=FALSE)
 with(head2, paste0(head2[1,],head2[2,]) )
# [1] "trtcrop"      "biomassMg/ha" "yieldbu/ac"  
 joinheadrs <- with(head2, paste0(head2[1,],head2[2,]) )

newdat <- read.table(text=dat, sep="\t",skip=2)
colnames(newdat)<- joinheadrs
#-------------------
> newdat
   trtcrop biomassMg/ha yieldbu/ac
1       C2        17.76     205.92
2       C2        17.96     207.86
3       CC        17.72     197.22
4       CC        18.42     205.20
5      CCW        18.15     200.51
6      CCW        17.45     190.59
7        P         3.09       0.00
8        P         3.34       0.00
9       S2         5.13      49.68
10      S2         5.36      49.72

Might be better to use paste with an underscore-sep:

joinheadrs <- with(head2, paste(head2[1,],head2[2,] ,sep="_")  )
joinheadrs
#[1] "trt_crop"      "biomass_Mg/ha" "yield_bu/ac"  
查看更多
登录 后发表回答