In R, is it possible to include the same row in mu

Posted 2019-09-17 15:03

Question:

I've measured N2O flux from soil at multiple timepoints in the day (not equally spaced). I'm trying to calculate the total N2O flux from soil for a subset of days by finding the area under the curve for each given day. I know how to do this using only the measures from the given day. However, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.

Here's an example to give a more concrete idea:

library(MESS)
library(lubridate)
library(dplyr)

Generate Reproducible Example

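# Both endpoints are POSIXct; recent lubridate versions return a Date from ymd(), which seq() rejects as 'to'
datetime <- seq(ymd_hm('2015-04-07 11:20'), ymd_hm('2015-04-13 00:00'), by = 'hours')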
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))

useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate, by = "day")

Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.

  out <- dat %>%
      mutate(lagDateGood = lag(DateGood),
             leadDateGood = lead(DateGood)) %>%
      filter(lagDateGood != "No" | leadDateGood != "No")
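
To make it easier to see which rows this condition keeps, here is a small base R sketch on made-up flags (the vector `flag` and the helpers `lag_flag` and `lead_flag` are illustrative, not part of the original data). Note that the condition looks only at a row's neighbours, not at the row's own flag, so it appears to rely on every good day having at least two rows: a good run of a single row would itself be dropped.

```r
# Made-up per-row flags: two "bad" rows, two "good" rows, two "bad" rows
flag <- c("No", "No", "Yes", "Yes", "No", "No")

# Base R stand-ins for dplyr::lag() and dplyr::lead()
lag_flag  <- c(NA, head(flag, -1))
lead_flag <- c(tail(flag, -1), NA)

# Same condition as in the filter() call above
keep <- lag_flag != "No" | lead_flag != "No"

# Rows 1 and 6 evaluate to NA and are dropped, as filter() would drop them
which(keep)   # 2 3 4 5
```

Rows 2 and 5 are the "bad" neighbours that border the good run, which is the padding the question asks for.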

Now I need to calculate the area under the curve, but the following is not correct:

out2 <- out %>%
    group_by(day) %>%
    mutate(hourOfday = hour(datetime) + minute(datetime)/60) %>%
    summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))

The trouble is that this does not include the measurements at the end of the previous day and the start of the following day when calculating AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.

I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example the last measurement on day 8 would be used in estimating AUC for day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?

Answer 1:

For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:

#Now I need to calculate the area under the curve for each day
n <- nrow(out)
extract <- function(ix) out[seq(max(1, min(ix)-1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)
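
To see what `extract` does in isolation, here is the same pattern on a toy data frame (`toy` and its values are made up for illustration). Each day's row indices are widened by one row on either side and clamped to the bounds of the table.

```r
# Toy data: three days, three rows each (values are made up for illustration)
toy <- data.frame(day = rep(1:3, each = 3), value = 1:9)

n <- nrow(toy)
# Widen each day's row range by one row on either side, clamped to [1, n]
extract <- function(ix) toy[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
res <- lapply(split(seq_len(n), toy$day), extract)

res[["2"]]$value   # 3 4 5 6 7: day 2 plus one neighbour on each side
res[["1"]]$value   # 1 2 3 4: no earlier row exists, so only the later neighbour is added
```

This assumes the rows are already sorted by time, so that adjacent rows are adjacent measurements.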

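calcTotalFlux <- function(df) {
    if (nrow(df) < 10) {              # make sure the day has at least 10 measures
        NA
    } else {
        # Row 1 is usually the previous day's last measure, so row 2 falls on the target day
        day_midnight <- floor_date(df$datetime[2], "day")
        df %>%
            # Minutes since midnight; explicit units avoid difftime's automatic unit choice
            mutate(time = as.numeric(difftime(datetime, day_midnight, units = "mins"))) %>%
            summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
    }
}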

do.call("rbind",lapply(res, calcTotalFlux))

    TotalFlux
7         NA
8   585230.2
9   579017.3
10        NA
11  563689.7
12        NA
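
As a rough plausibility check on these values: the simulated flux has mean 400 and the integration runs over 0 to 1440 minutes, so a day's area should be near 400 × 1440 = 576000. The sketch below uses a plain trapezoid rule in base R rather than `MESS::auc`; the helper `trapz` is hypothetical and is only there for the arithmetic.

```r
# Hypothetical helper: trapezoid-rule area under y(x); stands in for MESS::auc here
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

minutes <- seq(0, 1440, by = 60)      # hourly grid across one day, in minutes
flux <- rep(400, length(minutes))     # constant flux at the simulated mean

trapz(minutes, flux)                  # 400 * 1440 = 576000
```

The reported totals are within a few percent of this figure, which is consistent with time being measured in minutes.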


Answer 2:

Here's another way, more in line with the suggestions of @Alex Brown.

 # Another way
last <- out %>%
    group_by(day) %>%
    filter(datetime == max(datetime)) %>%
    ungroup() %>%
    mutate(day = day + 1)

first <- out %>%
    group_by(day) %>%
    filter(datetime == min(datetime)) %>%
    ungroup() %>%
    mutate(day = day - 1)

d <- rbind(out, last, first) %>%
    group_by(day) %>%
    arrange(datetime)

n_measures_per_day <- d %>%
    summarize(n = n())

d <- left_join(d, n_measures_per_day, by = "day") %>%
    filter(n > 4)

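TotalFluxDF <- d %>%
    # Row 3 of each remaining group falls on the target day; time is minutes since its midnight
    mutate(timeAtMidnight = floor_date(datetime[3], "day"),
           time = as.numeric(difftime(datetime, timeAtMidnight, units = "mins"))) %>%
    summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))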

TotalFluxDF

Source: local data frame [3 x 2]

    day      auc
  (dbl)    (dbl)
1     8 585230.2
2     9 579017.3
3    11 563689.7
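
The idea behind this second approach is to let a row belong to more than one group by copying it and relabelling the copy's group key. A minimal base R sketch on toy data (the data frame `toy` and the names `last_rows`, `first_rows`, and `padded` are made up for illustration, and the rows are assumed to be sorted by time within each day):

```r
# Toy data: three days, two measurements each (values are made up for illustration)
toy <- data.frame(day = rep(1:3, each = 2), t = 1:6)

# Last row of each day, relabelled so that it also joins the following day's group
last_rows <- toy[!duplicated(toy$day, fromLast = TRUE), ]
last_rows$day <- last_rows$day + 1

# First row of each day, relabelled so that it also joins the preceding day's group
first_rows <- toy[!duplicated(toy$day), ]
first_rows$day <- first_rows$day - 1

padded <- rbind(toy, last_rows, first_rows)
padded <- padded[order(padded$day, padded$t), ]

padded$t[padded$day == 2]   # 2 3 4 5: day 2's own rows plus one neighbour on each side
```

The relabelling also creates stub groups with a single row (days 0 and 4 here), which is why the answer filters on the number of rows per day before integrating.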