How to read a data set where variables are stored

2019-09-20 19:50发布

问题:

I have a fairly large (about 1200 variables with 14 observations each) data set, stored in a text file, with a very weird and definitely not tidy structure. In practice each variable is stored as a row, instead than as a column, and the first and second row are respectively the variable name, and the measurement unit for that variable. Here is a sample data set:

Date    --- 1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016
PT-AMB#SRV  V   1.403400    1.403207    1.403265    1.403326    1.403454    1.403783    1.404924    1.404962    1.405291    1.404951    1.404685    1.404812    1.404433    1.404428
PS1-SEC20#SRV   V   2.395769    2.416003    2.362276    2.253045    2.139873    1.939328    2.450442    2.294791    2.085946    1.929666    2.634747    3.067008    3.081949    3.095456

The first variable is called Date and it's nondimensional (units ---), the second one is PT-AMB#SRV and measured in volts V, and so on. NOTE: two entries on the same row are separated by a tab in the original file. I'm not sure if this is conserved once I copy&paste data here on Stack Overflow.

First of all, I tried to read in data like this:

df=read.table("TEST.txt",sep="\t")

I get the following error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 16 elements

The error goes away if I manually edit the second (and the third) variable names, changing the # to a -.

The first question is: why is this happening, and how can I prevent it from happening? If I need to change all # to - in all variable names, how can I do that automatically? Preferably in R, otherwise the command line is fine (I work in Windows).

Second question: after modifying all the # (just two in this sample data set), I read it with

 df=read.table("TEST.txt",sep="\t")

I get:

             V1  V2        V3        V4        V5        V6        V7        V8        V9       V10       V11       V12       V13       V14       V15       V16
1          Date --- 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016
2    PT-AMB-SRV   V  1.403400  1.403207  1.403265  1.403326  1.403454  1.403783  1.404924  1.404962  1.405291  1.404951  1.404685  1.404812  1.404433  1.404428
3 PS1-SEC20-SRV   V  2.395769  2.416003  2.362276  2.253045  2.139873  1.939328  2.450442  2.294791  2.085946  1.929666  2.634747  3.067008  3.081949  3.095456

I then try to transpose df, so that variables are stored in columns:

df_t=t(df)

I get:

    [,1]        [,2]         [,3]           
V1  "Date"      "PT-AMB-SRV" "PS1-SEC20-SRV"
V2  "---"       "V"          "V"            
V3  "1/19/2016" "1.403400"   "2.395769"     
V4  "1/19/2016" "1.403207"   "2.416003"     
V5  "1/19/2016" "1.403265"   "2.362276"     
V6  "1/19/2016" "1.403326"   "2.253045"     
V7  "1/19/2016" "1.403454"   "2.139873"     
V8  "1/19/2016" "1.403783"   "1.939328"     
V9  "1/19/2016" "1.404924"   "2.450442"     
V10 "1/19/2016" "1.404962"   "2.294791"     
V11 "1/19/2016" "1.405291"   "2.085946"     
V12 "1/19/2016" "1.404951"   "1.929666"     
V13 "1/19/2016" "1.404685"   "2.634747"     
V14 "1/19/2016" "1.404812"   "3.067008"     
V15 "1/19/2016" "1.404433"   "3.081949"     
V16 "1/19/2016" "1.404428"   "3.095456" 

No more a data frame, but an array of characters. Definitely not what I want to. How can I make it so that variables are stored in columns (tidy data set)? I thought the problem was the column containing the measurement units, but removing it before transposing with

df[,"V2"]=NULL   

doesn't solve anything. Maybe tidyr could help here, but I don't know how.

回答1:

Read the data into DF0, transpose it and use type.convert to get appropriate classes for the columns. Set the names and convert the first column to "Date" class using the appropriate format string.

# replace text = Lines with file = "myfile.dat"
DF0 <- read.table( text = Lines, colClasses = "character", comment = "" )
L <- lapply( as.data.frame( tail( t(DF0), -2 ), stringsAsFactors = FALSE ), type.convert )
DF <- setNames( as.data.frame(L), DF0[[1]] )
DF$Date <- as.Date( DF$Date, format = "%m/%d/%Y" )

The result is:

> DF
         Date PT-AMB#SRV PS1-SEC20#SRV
1  2016-01-19   1.403400      2.395769
2  2016-01-19   1.403207      2.416003
3  2016-01-19   1.403265      2.362276
4  2016-01-19   1.403326      2.253045
5  2016-01-19   1.403454      2.139873
6  2016-01-19   1.403783      1.939328
7  2016-01-19   1.404924      2.450442
8  2016-01-19   1.404962      2.294791
9  2016-01-19   1.405291      2.085946
10 2016-01-19   1.404951      1.929666
11 2016-01-19   1.404685      2.634747
12 2016-01-19   1.404812      3.067008
13 2016-01-19   1.404433      3.081949
14 2016-01-19   1.404428      3.095456

Note: We used this input:

Lines <- 
"Date    --- 1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016
PT-AMB#SRV  V   1.403400    1.403207    1.403265    1.403326    1.403454    1.403783    1.404924    1.404962    1.405291    1.404951    1.404685    1.404812    1.404433    1.404428
PS1-SEC20#SRV   V   2.395769    2.416003    2.362276    2.253045    2.139873    1.939328    2.450442    2.294791    2.085946    1.929666    2.634747    3.067008    3.081949    3.095456"


标签: r import