How can one read in a text file in which each record is a paragraph and each newline denotes a separate field? The complication is that some records have 4 lines and some have 6. @DWin nailed my question when the difference in the number of fields was one, but it all fell apart when it was two. You can have a look at his answer here.
So here is my latest simulation of the starting text:
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:56
blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:58
blay blay blah who knows what

TheInstitute 5467
telephone line 412552999 x 4999
bump phone line 4125527777
bump pony pony oops 4125527777
datetime 2011110516 12:59
blay blay blah who knows what

TheInstitute 5467
telephone line 4125526987 x 4567
bump phone line 4125527777
bump pony pony oops 4125527777
datetime 2011110516 13:51
blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 14:56
blay blay blah who knows what
Here is what the output should look like. In fact, this is one step removed from what I need. I am placing an ASCII text representation of an R data.frame below. You will see that everything is in a data frame, but the field values are shifted by two columns because some records have two extra fields.
structure(list(institution = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "TheInstitute 5467", class = "factor"),
telephoneline = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c("telephone line 4125526987 x 4567",
"telephone line 412552999 x 4999"), class = "factor"), date.or.bump = structure(c(2L,
3L, 1L, 1L, 4L), .Label = c("bump phone line 4125527777",
"datetime 2011110516 12:56", "datetime 2011110516 12:58",
"datetime 2011110516 14:56"), class = "factor"), field4 = structure(c(2L,
1L, 3L, 3L, 1L), .Label = c("blay blay blah who knows what",
"blay blay blah who knows what, but anyway it may have a comma",
"bump pony pony oops 4125527777"), class = "factor"), field5 = structure(c(1L,
1L, 2L, 3L, 1L), .Label = c("", "datetime 2011110516 12:59",
"datetime 2011110516 13:51"), class = "factor"), field6 = structure(c(1L,
1L, 2L, 3L, 1L), .Label = c("", "blay blay blah who knows what",
"blay blay blah who knows what, but anyway it may have a comma"
), class = "factor")), .Names = c("institution", "telephoneline",
"date.or.bump", "field4", "field5", "field6"), class = "data.frame", row.names = c(NA,
-5L))
PS: Am I correct to believe that one posts a data frame by using dput, or can one save a .Rdata file directly here?
There is probably a more elegant way, but this should get the job done.
Update:
Here's another solution using plyr::rbind.fill:
Another strategy is to use a string of your choosing (call it EOL) to mark the end of each line, and then paste all of the lines together. You can then use two rounds of strsplit to first break out records, and then break out fields within records. (Records will be separated by two consecutive EOLs, while fields will be separated by a single EOL.)

This method appeals to me because it's close to what I'd like to do when I read in the file in the first place (i.e. use "\n\n" as the sep character), but am not able to do with either scan or readLines.
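A minimal sketch of the EOL approach described above, assuming that records are separated by a single blank line. The marker string "@EOL@", the two sample records, and all variable names are illustrative assumptions, not the original answer's code.

```r
# Hypothetical sample input: what readLines() would return for a file whose
# records are separated by one blank line (an assumption about the file).
lines <- c("TheInstitute 5467",
           "telephone line 4125526987 x 4567",
           "datetime 2011110516 12:56",
           "blay blay blah who knows what, but anyway it may have a comma",
           "",
           "TheInstitute 5467",
           "telephone line 412552999 x 4999",
           "bump phone line 4125527777",
           "bump pony pony oops 4125527777",
           "datetime 2011110516 12:59",
           "blay blay blah who knows what")

EOL <- "@EOL@"  # assumed marker; any string that cannot occur in the data

# Paste all lines into one string, marking each line end with EOL.
# A blank line then appears as two consecutive EOLs.
txt <- paste(lines, collapse = EOL)

# Round 1: split on a double EOL to separate the records.
records <- strsplit(txt, paste0(EOL, EOL), fixed = TRUE)[[1]]

# Round 2: split each record on a single EOL to separate its fields.
fields <- strsplit(records, EOL, fixed = TRUE)

# Pad short records with "" so every record has the same number of fields.
n <- max(lengths(fields))
padded <- lapply(fields, function(f) c(f, rep("", n - length(f))))
result <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
```

Like the data frame in the question, this leaves the values of the short records in the leftmost columns, so the columns still need to be realigned afterwards.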
Read data in:
dat <- readLines("filename.txt")
Split data by records (inspired by Josh O'Brien's solution).
Transform data to named vectors (assuming the last field is a comment and the data in every other field starts with a numeric value).
Get the unique field names in the data.
Combine the fields into a matrix and give it names.
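The steps above can be sketched roughly as follows. This is a reconstruction under stated assumptions, not the original answer's code: records are assumed to be separated by blank lines, a field's name is taken to be the text before its first digit, and the helper to_named and the other variable names are hypothetical.

```r
# Hypothetical input standing in for readLines("filename.txt");
# blank lines between records are assumed.
dat <- c("TheInstitute 5467",
         "telephone line 4125526987 x 4567",
         "datetime 2011110516 12:56",
         "blay blay blah who knows what, but anyway it may have a comma",
         "",
         "TheInstitute 5467",
         "telephone line 412552999 x 4999",
         "bump phone line 4125527777",
         "bump pony pony oops 4125527777",
         "datetime 2011110516 12:59",
         "blay blay blah who knows what")

# Split the lines into records at the blank lines.
is_blank <- dat == ""
rec_id <- cumsum(is_blank)
records <- split(dat[!is_blank], rec_id[!is_blank])

# Turn one record into a named character vector (hypothetical helper).
# Assumes the last line is a free-text comment and that every other
# line is "<name> <value>", with the value starting at the first digit.
to_named <- function(rec) {
  n <- length(rec)
  body <- rec[-n]
  nms <- trimws(sub("[0-9].*$", "", body))  # text before the first digit
  vals <- sub("^[^0-9]*", "", body)         # first digit onward
  c(setNames(vals, nms), comment = rec[n])
}
named <- lapply(records, to_named)

# Unique field names, in order of first appearance.
all_names <- unique(unlist(lapply(named, names)))

# One row per record; fields a record lacks become NA.
mat <- t(sapply(named, function(x) x[all_names]))
colnames(mat) <- all_names
```

Because fields are matched by name rather than by position, the datetime and comment values end up in the same columns for 4-line and 6-line records alike. The column order follows first appearance, so it may need reordering.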
SECOND solution (using gawk)