Parsing data in R, alternative to rbind() which ca

Posted 2019-09-01 10:37

Question:

Let's say I have a data table called YC that looks like this:

Categories:           colsums:   tillTF:
ID: cat               NA         0 
  MA                  NA         0 
    spayed            NA         0
      declawed        NA         0 
        black         NA         0
          3           NA         0
            no        57         1
        claws         NA         0
          calico      NA         0
            4         NA         0
              no      42         1
           striped    NA         0
              0.5     NA         0
                yes   84         1
      not fixed       NA         0
         declawed     NA         0 
            black     NA         0 
              0.2     NA         0
                yes   19         1
              0.2     NA         0
                yes   104        1
  NH                  NA         0
    spayed            NA         0 
       claws          NA         0
          striped     NA         0
             12       NA         0 
               no     17         1
           black      NA         0
              4       NA         0
               yes    65         1
ID: DOG               NA         0 
 MA                   NA         0
...           

Only 1) it's not actually a pivot table, it's just inconsistently formatted to look like one, and 2) the data is much more complicated and was entered inconsistently over the course of a few decades. The only assumption that can safely be made about the data is that there are 12 variables associated with each record, and they are always entered in the same order.

My goal is to parse this data so that each attribute and its associated numeric record are in the appropriate columns of a single row, like this:

Cat  MA  spayed    declawed  black    3    no  57
Cat  MA  spayed    claws     calico   0.5  no  42
Cat  MA  not fixed declawed  black    0.2  yes 19
Cat  MA  not fixed declawed  black    0.2  yes 104
Cat  NH  spayed    claws     striped  12   no  17
Cat  NH  spayed    claws     black    4    yes 65
Dog  MA ....

I've written a for loop which identifies a "record" and then re-writes values in an array by reading backwards up the column in the data table until another "record" is reached. I'm new to R, and so wrote out my ideal loop without knowing whether it was possible.

array<-rep(0, length(7))
    for (i in 1:7)
      if(YC$tillTF[i]==1){
        array[7]<-(YC$colsums[i])
        array[6]<-(YC$Categories[i])
        array[5]<-(YC$Categories[i-1])
        array[4]<-(YC$Categories[i-2])
        array[3]<-(YC$Categories[i-3])
        array[2]<-(YC$Categories[i-4])
        array[1]<-(YC$Categories[i-5])
      }

    YC_NT<-rbind(array)

Once array is filled in, I want to loop through YC and create a new row in YC_NT for each unique record:

for (i in 8:length(YC$tillTF))
  if (YC$tillTF[i]==1){
    array[8]<-(YC$colsums[i])
    array[7]<-(YC$Categories[i])
    if (YC$tillTF[i-1]==0){
      array[6]<-YC$Categories[i-1]
            }else{ 
              rbind(array, YC_NT)}
    if (YC$tillTF[i-2]==0){
      array[5]<-YC$Categories[i-2]
          }else{
            rbind(array, YC_NT)}
    if(YC$tillTF[i-3]==0){
      array[4]<-YC$Categories[i-3]
          }else{
            rbind(array, YC_NT)}
    if(YC$tillTF[i-4]==0){
      array[3]<-YC$Categories[i-4]
          }else{
            rbind(array, YC_NT)}
    if(YC$tillTF[i-5]==0){
      array[2]<-YC$Categories[i-5]
          }else{
            rbind(array, YC_NT)}
    if(YC$tillTF[i-6]==0){
      array[1]<-YC$Categories[i-6]
          }else{
            rbind(array, YC_NT)}
}else{ 
  array<-array}

When I run this loop within a function on my data, I'm getting my YC_NT data table back containing a single row. After spending a few days searching, I don't know that there is an R function which would be able to add the vector array to last row of a data table without giving it a unique name every time. My questions:

1) Is there a function that would add a vector called array to the end of a data table without re-writing a previous row called array?

2) If no such function exists, how could I create a new name for array every time my for loop reached a new numeric record?

Thanks for your help,

Answer 1:

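rbind, or rbind.fill from the plyr package, should do the trick, provided you assign the result back (for example YC_NT <- rbind(YC_NT, array)). Alternatively, you can insert a row more efficiently with code such as: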

df[nrow(df) + 1,] <- newrow
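
A minimal sketch of both approaches inside a loop, assuming a small hypothetical data frame `df` and made-up records (none of these names or values come from the real data). What it tries to show is that rbind() returns a new object instead of changing its argument, so a bare rbind(array, YC_NT) like the one in the question's loop throws its result away.

```r
# Hypothetical empty result table; column names are illustrative only.
df <- data.frame(state = character(0), count = numeric(0),
                 stringsAsFactors = FALSE)

# Made-up records to append.
records <- list(list("MA", 57), list("MA", 42), list("NH", 17))

for (rec in records) {
  newrow <- data.frame(state = rec[[1]], count = rec[[2]],
                       stringsAsFactors = FALSE)
  # rbind() does not modify df in place: assign the result back.
  df <- rbind(df, newrow)
}

# The indexing form from this answer, adding one more row.
df[nrow(df) + 1, ] <- list("NH", 65)

nrow(df)  # 4
```

Either way the table grows by one row per record, with no need to give each row vector a new name.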


Answer 2:

So I'm going to assume a new record begins every time tillTF=1, and that the n variables specified for the next subject are just the last n variables, with all the earlier values remaining the same. I'm also assuming that all records are "complete" in that the last line has tillTF=1. (To make that last statement true, I removed the last two lines from your sample.)

Here's how I might read the data in

dog <- read.fwf("dog.txt", widths=c(22,11,7), skip=1, stringsAsFactors=FALSE)
dog$V1 <- gsub("\\s{2,}","",dog$V1)
dog$V2 <- gsub("\\s","",dog$V2)
dog$V3 <- as.numeric(gsub("\\s","",dog$V3))

So I read in the data here and strip off the extra spaces. Now I will add an ID column giving each record a unique ID, incrementing that value every time tillTF=1. Then I'll split the data on that ID value

dog$ID<-c(0, cumsum(dog$V3[-nrow(dog)]))
dv <- lapply(split(dog, dog$ID), function(x) {
    c(x$V1, x$V2[nrow(x)])}
)
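
To see what the ID step does, here is a small sketch on a made-up five-row data frame `toy` (the values are illustrative, not the real data). Shifting the cumulative sum of the flag column down by one row makes each flagged row the last row of its own group instead of the first row of the next one.

```r
# Toy data in the same shape as `dog`: V1 = label, V2 = count, V3 = record flag.
toy <- data.frame(
  V1 = c("ID: cat", "MA", "no", "NH", "yes"),
  V2 = c(NA, NA, 57, NA, 65),
  V3 = c(0, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Drop the last flag and prepend 0, so each row with V3 == 1 closes its group.
toy$ID <- c(0, cumsum(toy$V3[-nrow(toy)]))
toy$ID  # 0 0 0 1 1

# One character vector per record: its labels followed by its count.
pieces <- lapply(split(toy, toy$ID), function(x) c(x$V1, x$V2[nrow(x)]))
pieces[["1"]]  # "NH" "yes" "65"
```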

Now I'll go through the list with Reduce and each time replace the last n variables with the n variables for a given ID.

trans <- Reduce(function(a,b) {
    a[(length(a)-length(b)+1):length(a)] <- b
    a
}, dv, accumulate=TRUE)
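
As a sketch of how the accumulating Reduce fills in the missing leading values, here it is on a hypothetical three-record list `dv_toy`. It assumes the first record is complete and that no later record is longer than the first.

```r
# Made-up records: the first is complete, later ones list only what changed.
dv_toy <- list(
  c("cat", "MA", "black", "no", "57"),
  c("calico", "no", "42"),
  c("yes", "84")
)

# Overwrite the tail of the running record `a` with the new values in `b`.
filled <- Reduce(function(a, b) {
  a[(length(a) - length(b) + 1):length(a)] <- b
  a
}, dv_toy, accumulate = TRUE)

filled[[2]]  # "cat" "MA" "calico" "no" "42"
filled[[3]]  # "cat" "MA" "calico" "yes" "84"
```

With accumulate = TRUE every intermediate state is kept, so each element of the result is one fully filled-in record.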

Now I'll put all the data together with tabs and then use read.table to process the data, do all the proper data conversions, and create a data frame

dd <- read.table(text=sapply(trans, paste0, collapse="\t"), sep="\t")
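
A small sketch of why the round trip through read.table is convenient, using two made-up records in a hypothetical list `rows`: pasting each record into one tab-separated line lets read.table work out the column types, so the numeric fields come back as numbers rather than strings.

```r
# Two made-up, already filled-in records (all fields are character here).
rows <- list(c("cat", "MA", "3", "no", "57"),
             c("cat", "NH", "0.5", "yes", "84"))

# One tab-separated line per record, parsed as if it were a file.
dd_toy <- read.table(text = sapply(rows, paste0, collapse = "\t"),
                     sep = "\t", stringsAsFactors = FALSE)

sapply(dd_toy, class)
# V1 and V2 are character, V3 is numeric, V4 is character, V5 is integer
```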

That gives

# print(dd)
       V1 V2        V3       V4      V5   V6  V7  V8
1 ID: cat MA    spayed declawed   black  3.0  no  57
2 ID: cat MA    spayed    claws  calico  4.0  no  42
3 ID: cat MA    spayed    claws striped  0.5 yes  84
4 ID: cat MA not fixed declawed   black  0.2 yes  19
5 ID: cat MA not fixed declawed   black  0.2 yes 104
6 ID: cat NH    spayed    claws striped 12.0  no  17
7 ID: cat NH    spayed    claws   black  4.0 yes  65

So as you can see, I left the "ID:" on but it should be easy enough to strip that off. But these commands do the basic reshaping for you. There are fewer arrays and if statements and rbinding in the solution which is nice, but I encourage you to make sure you understand each line if you want to use it.
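
One possible way to strip the prefix, shown on a made-up vector `v1` rather than the real column; the same sub() call could be applied to dd$V1.

```r
# Made-up labels in the same style as the first output column.
v1 <- c("ID: cat", "ID: cat", "ID: DOG")

# Remove a leading "ID:" and any whitespace that follows it.
cleaned <- sub("^ID:\\s*", "", v1)
cleaned  # "cat" "cat" "DOG"
```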

Also note that my output is slightly different than your expected output; you are missing the "84" value and have the calico with "42" listed as "0.5" rather than "4.0". So let me know if I was wrong in how I interpreted the data or perhaps correct the example output.



Tags: r parsing rbind