-->

Convert year-month string column into quarterly bi

2019-08-11 16:01发布

问题:

I am currently working with a large phenology data set, where there are multiple observations of trees for a given month. I want to assign these observations into three month clusters or bins. I am currently using the following code:

Cluster.GN <- ifelse(Master.feed.parts.gn$yr.mo=="2007.1", 1,
              ifelse(Master.feed.parts.gn$yr.mo=="2007.11", 1,....     
              ifelse(Master.feed.parts.gn$yr.mo=="2014.05", 17, NA)

This code works, but it is very cumbersome as there are over 50 months. I have had trouble finding another solution because this "binning" is not based on number of observations (as within each month there can be up to 4000 observations) and it is not chronological, as some months are missing. Any help you can provide would be highly appreciated.

UPDATE I: I used the "cut" function in R. I tried setting the breaks to 17, as that is how many three month bins I should have. But when I use table(Cluster.GN) it shows that only the odd numbered "bins" have observations (sorry but I can't figure out how to get the table uploaded here). >Cluster.GN <- cut(Master.feed.parts.gn$yr.mo, breaks= 17, c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"), include.lowest=TRUE)

回答1:

UPDATE: this answer was a quick hack, I didn't check zoo library. For the right way to do it, please see G Grothendieck's answer using zoo::as.yearqtr()


All you need to do is convert the yr.mo field from a year-month string (e.g. 2007.11) into an integer in the range 1..17, binning on every quarter (i.e. months 1..3 into first bin, 4..6 into second bin etc.). (I don't see how 8 years (2007..2014) * 4 quarters = 32 bins reduces to only 17 bins, unless your data is sparse. But anyway...)

No need for cumbersome ifelse ladders.

And for higher performance, use stringi library, stri_split_fixed()

sample_wr <- function(...) sample(..., replace=T)

# Generate sample data (you're supposed to provide this to code, to make your issue reproducible)
set.seed(123)
N <- 20
df <- data.frame(yr.mo =
          paste(sample_wr(2007:2014, N), sample_wr(1:12, N), sep='.') )
# [1] "2009.11" "2013.9"  "2010.8"  "2014.12" "2014.8"  "2007.9"  "2011.7" 
# [8] "2014.8"  "2011.4"  "2010.2"  "2014.12" "2010.11" "2012.9"  "2011.10"
#[15] "2007.1"  "2014.6"  "2008.10" "2007.3"  "2009.4"  "2014.3" 

yearmonth_to_integer <- function(xx) {
    yy_mm <- as.integer(unlist(strsplit(xx, '.', fixed=T)))
    return( (yy_mm[1] - 2006) + (yy_mm[2] %/% 3) )
}

Cluster.GN <- sapply(x, yearmonth_to_integer)

# 2009.11  2013.9  2010.8 2014.12  2014.8  2007.9  2011.7 
#    6      10       6      12      10       4       7 
# 2014.8  2011.4  2010.2 2014.12 2010.11  2012.9 2011.10 
#   10       6       4      12       7       9       8 
# 2007.1  2014.6 2008.10  2007.3  2009.4  2014.3 
#    1      10       5       2       4       9 

and for higher performance, use dplyr or data.table library:

require(dplyr)

# something like the following, currently doesn't work,
# you have to handle two intermediate columns from yy_mm
# You get to fix this :)

df %>% mutate(yy_mm = as.integer(unlist(strsplit(yr.mo, '.', fixed=T))),
              quarter = yy_mm[1]-2006 + yy_mm[2] %/% 3 )