I have a fairly large data set (about 1200 variables with 14 observations each), stored in a text file, with a very weird and definitely not tidy structure. In practice each variable is stored as a row, rather than as a column, and the first and second entries of each row are, respectively, the variable name and the measurement unit for that variable. Here is a sample data set:
Date --- 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016
PT-AMB#SRV V 1.403400 1.403207 1.403265 1.403326 1.403454 1.403783 1.404924 1.404962 1.405291 1.404951 1.404685 1.404812 1.404433 1.404428
PS1-SEC20#SRV V 2.395769 2.416003 2.362276 2.253045 2.139873 1.939328 2.450442 2.294791 2.085946 1.929666 2.634747 3.067008 3.081949 3.095456
The first variable is called Date and it's nondimensional (units ---), the second one is PT-AMB#SRV and measured in volts (V), and so on. NOTE: two entries on the same row are separated by a tab in the original file. I'm not sure whether this is preserved once I copy & paste the data here on Stack Overflow.
First of all, I tried to read in the data like this:
df=read.table("TEST.txt",sep="\t")
I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 16 elements
The error goes away if I manually edit the second (and third) variable names, changing the # to a -.
The first question is: why is this happening, and how can I prevent it from happening? If I need to change all # to - in all variable names, how can I do that automatically? Preferably in R, otherwise the command line is fine (I work in Windows).
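I imagine something along these lines might work for the replacement (an untested sketch; TEST_clean.txt is just a placeholder name for the cleaned copy), but I don't know whether it's the right approach:

# Untested idea: read the raw lines, replace every "#" with "-",
# write a cleaned copy, then read that copy instead
lines <- readLines("TEST.txt")
lines <- gsub("#", "-", lines, fixed = TRUE)
writeLines(lines, "TEST_clean.txt")
df <- read.table("TEST_clean.txt", sep = "\t")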
Second question: after modifying all the # (just two in this sample data set), I read the file with
df=read.table("TEST.txt",sep="\t")
I get:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 Date --- 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016
2 PT-AMB-SRV V 1.403400 1.403207 1.403265 1.403326 1.403454 1.403783 1.404924 1.404962 1.405291 1.404951 1.404685 1.404812 1.404433 1.404428
3 PS1-SEC20-SRV V 2.395769 2.416003 2.362276 2.253045 2.139873 1.939328 2.450442 2.294791 2.085946 1.929666 2.634747 3.067008 3.081949 3.095456
I then try to transpose df, so that variables are stored in columns:
df_t=t(df)
I get:
[,1] [,2] [,3]
V1 "Date" "PT-AMB-SRV" "PS1-SEC20-SRV"
V2 "---" "V" "V"
V3 "1/19/2016" "1.403400" "2.395769"
V4 "1/19/2016" "1.403207" "2.416003"
V5 "1/19/2016" "1.403265" "2.362276"
V6 "1/19/2016" "1.403326" "2.253045"
V7 "1/19/2016" "1.403454" "2.139873"
V8 "1/19/2016" "1.403783" "1.939328"
V9 "1/19/2016" "1.404924" "2.450442"
V10 "1/19/2016" "1.404962" "2.294791"
V11 "1/19/2016" "1.405291" "2.085946"
V12 "1/19/2016" "1.404951" "1.929666"
V13 "1/19/2016" "1.404685" "2.634747"
V14 "1/19/2016" "1.404812" "3.067008"
V15 "1/19/2016" "1.404433" "3.081949"
V16 "1/19/2016" "1.404428" "3.095456"
The result is no longer a data frame, but a character matrix, which is definitely not what I want. How can I make it so that variables are stored in columns (a tidy data set)? I thought the problem was the column containing the measurement units, but removing it before transposing with df[,"V2"]=NULL doesn't solve anything. Maybe tidyr could help here, but I don't know how.
Read the data into DF0, transpose it and use type.convert to get appropriate classes for the columns. Set the names and convert the first column to "Date" class using the appropriate format string. The result is a tidy data frame with one properly typed column per variable.
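A minimal sketch of these steps, assuming the original file is TEST.txt and suppressing read.table's default comment character (#) so the # in the variable names does not truncate the lines:

# Read the raw file; comment.char = "" keeps "#" from starting a comment
DF0 <- read.table("TEST.txt", sep = "\t", comment.char = "", stringsAsFactors = FALSE)
# First column holds the variable names, second the units; the rest are observations
DF <- as.data.frame(t(DF0[, -(1:2)]), stringsAsFactors = FALSE)
names(DF) <- DF0[[1]]
rownames(DF) <- NULL
# Let type.convert pick an appropriate class for each column
DF[] <- lapply(DF, type.convert, as.is = TRUE)
# The dates stay character, so parse them explicitly with the matching format string
DF$Date <- as.Date(DF$Date, format = "%m/%d/%Y")
str(DF)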
Note: We used the sample data shown in the question as input.