How to add a counting column based on values in ot

2019-05-02 01:53发布

问题:

I have a relatively large dataset (16,000+ x ~31). In other words, it's large enough that I don't want to manipulate it line by line in Excel. The data is in this form:

block  site     day  X1   X2
1      1        1    0.4  5.1 
1      1        2    0.8  1.1
1      1        3    1.1  4.2
1      2        1    ...  ...
1      2        2
1      2        3
2      3        1
2      3        2
2      3        3
2      4        1
2      4        2
2      4        3

As you can see, the site count is continuous but I would like a column where the site number resets with each block. For example, I would like something like this below:

block  site     day  X1   X2    site2
1      1        1    0.4  5.1   1
1      1        2    0.8  1.1   1
1      1        3    1.1  4.2   1
1      2        1    ...  ...   2
1      2        2               2
1      2        3               2
2      3        1               1
2      3        2               1
2      3        3               1
2      4        1               2
2      4        2               2
2      4        3               2

I was thinking about using the R function rle but am not sure if it will work because of complications with day. Otherwise, I would try something like:

Data$site2 <- sequence(rle(Data$block)$lengths)

Does anyone have any suggestions for adding a column counting (sequence) the number of sites within each block? If it helps, there are the same number of days (263) recorded for each site but there are a different number of sites per block.

回答1:

Here's a slightly clumsy solution using plyr and ddply:

ddply(df,.(block),transform,
                  site1 = rep(1:length(unique(site)),
                             times = rle(site)$lengths))

Or a slightly slicker version:

ddply(df,.(block),transform,site1 = as.integer(as.factor(site)))

There may be a clever way of doing this directly, though, using the various seq, sequence and rle functions, but my brain is a bit hazy at the moment. If you leave this open for a bit someone will likely come along with a slick non-plyr solution.



回答2:

Using tapply could work

# Make some fake data
dat <- data.frame(block = rep(1:3, each = 4), site = rep(1:6, each  = 2), val = rnorm(12))
# For each block reset the count
dat$site2 <- unlist(tapply(dat$site, dat$block, function(x){x - min(x) + 1}))


回答3:

Via ave:

df1 <- structure(list(block = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), 
    site = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), day = c(1, 
    2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)), .Names = c("block", "site", 
"day"), row.names = c("2", "3", "4", "5", "6", "7", "8", "9", 
"10", "11", "12", "13"), class = "data.frame")

df1$site2 <- ave(df1$site,df1$block,FUN=function(x) match(x,sort(unique(x))))


标签: r sequence