In R, average row values until hitting a specific condition

Published 2019-06-08 17:23

Question:

I am working with GPS data and trying to figure out how to average the 11th-15th fixes for latitude and longitude. I have seen solutions in similar questions for how to average every n rows. The problem is that occasionally the satellites bomb out and the fixes stop at 13 or 14, so in those cases I only want to average 3 or 4 values instead of 5. In other words, I want to average the latitude and longitude values starting from where the number in series is 11 until the number in series drops again (the averaging needs to include the last set, which never drops back down to a low number). I have started by deleting all rows where the number in series is NOT in my desired 11-15 range. So, for an example dummy data set, this leaves me with:

      Date      Time     Long       Lat     NoInSeries
12  17/11/2014 22:09:17 115.9508 -31.82850    11
13  17/11/2014 22:09:18 115.9508 -31.82846    12
14  17/11/2014 22:09:19 115.9513 -31.82864    13
15  17/11/2014 22:09:21 115.9511 -31.82863    14
26  18/11/2014 00:07:14 115.9509 -31.82829    11
27  18/11/2014 00:07:15 115.9509 -31.82829    12
28  18/11/2014 00:07:16 115.9509 -31.82830    13
29  18/11/2014 00:07:17 115.9509 -31.82830    14
30  18/11/2014 00:07:18 115.9509 -31.82831    15
56  18/11/2014 10:00:24 115.9513 -31.82670    11
57  18/11/2014 10:00:25 115.9514 -31.82670    12
58  18/11/2014 10:00:26 115.9514 -31.82669    13
59  18/11/2014 10:00:27 115.9514 -31.82668    14
60  18/11/2014 10:00:28 115.9514 -31.82668    15

My desired output would be something like this, with the first one averaging 4 (11-14) and the next two averaging 5 (11-15):

     Date      Time     Long       Lat     NoInSeries  AvgLong      AvgLat
12  17/11/2014 22:09:17 115.9508 -31.82850    11       115.9510   -31.82856
13  17/11/2014 22:09:18 115.9508 -31.82846    12          NA          NA
14  17/11/2014 22:09:19 115.9513 -31.82864    13          NA          NA
15  17/11/2014 22:09:21 115.9511 -31.82863    14          NA          NA
26  18/11/2014 00:07:14 115.9509 -31.82829    11       115.9509   -31.82830
27  18/11/2014 00:07:15 115.9509 -31.82829    12          NA          NA
28  18/11/2014 00:07:16 115.9509 -31.82830    13          NA          NA
29  18/11/2014 00:07:17 115.9509 -31.82830    14          NA          NA
30  18/11/2014 00:07:18 115.9509 -31.82831    15          NA          NA
56  18/11/2014 10:00:24 115.9513 -31.82670    11       115.9514   -31.82669
57  18/11/2014 10:00:25 115.9514 -31.82670    12          NA          NA
58  18/11/2014 10:00:26 115.9514 -31.82669    13          NA          NA
59  18/11/2014 10:00:27 115.9514 -31.82668    14          NA          NA
60  18/11/2014 10:00:28 115.9514 -31.82668    15          NA          NA

I would then go through and delete all rows where AvgLong is NA, so my final output would just have the rows where the number in series is 11, together with the averages.

I really don't know where to start with the code for this... the examples I have found all discuss averaging an exact number of rows rather than a variable number.

For instance:

c( tapply( x, (row(x)-1)%/%5, mean ) )

Or:

idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(dd), by=5)]
# bind them to give a matrix
res <- do.call(rbind, res)

Also, the answers I have seen generally output the averages as a new data frame... Ultimately, I also want this averaging to run under a condition: if the schedule is 'Multifix', I want to average from 11 up to however high the series goes (up to 15), whereas if the schedule is 'Continuous', I want to average from 181 up to however high each one goes. Something like this:

if(import.list$Schedule=='Multifix'){
...code to average Long and Lat for Number in Series from 11 up to however high it goes (up to 15)...
} else {
...code to average Long and Lat for Number in Series from 241 up to however high it goes...
}

Or perhaps I could use an if/else statement to define a variable and then use that variable in the function that does the averaging?

...but I imagine this condition could complicate things if the output creates a new data frame, which is why I was aiming to just add values to new columns "AvgLong" and "AvgLat". Thanks for any help!

Answer 1:

# dput shows the data I reconstructed from your question:

dput(df1)
structure(list(ID = c(12L, 13L, 14L, 15L, 26L, 27L, 28L, 29L, 
30L, 56L, 57L, 58L, 59L, 60L), Date = c("17/11/2014", "17/11/2014", 
"17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
"18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
"18/11/2014", "18/11/2014"), Time = c("22:09:17", "22:09:18", 
"22:09:19", "22:09:21", "00:07:14", "00:07:15", "00:07:16", "00:07:17", 
"00:07:18", "10:00:24", "10:00:25", "10:00:26", "10:00:27", "10:00:28"
), Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509, 
115.9509, 115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514, 
115.9514, 115.9514), Lat = c(-31.8285, -31.82846, -31.82864, 
-31.82863, -31.82829, -31.82829, -31.8283, -31.8283, -31.82831, 
-31.8267, -31.8267, -31.82669, -31.82668, -31.82668), NoInSeries = c(11L, 
12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L, 11L, 12L, 13L, 14L, 15L
)), .Names = c("ID", "Date", "Time", "Long", "Lat", "NoInSeries"
), class = "data.frame", row.names = c(NA, -14L))

# get.counter returns the length of each ascending run, i.e. how many rows each burst of fixes spans before NoInSeries drops again.

get.counter <- function(x){
  a1 = x
  counter = 0
  a2 = c()
  for( i in 1:length(a1)){  
    if(i < length(a1)){
      if(a1[i+1] > a1[i]){
        counter = counter + 1
      }else{
        counter = counter + 1
        a2 = c(a2, counter)
        counter = 0
      }
    }else{
      counter = counter + 1
      a2 = c(a2, counter)
    }
  }
  return(a2)
}
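As a quick sanity check: applied to the NoInSeries column of the sample data, get.counter should return 4 5 5 (one run length per burst of fixes). The snippet below reproduces that result with a self-contained vectorized equivalent built from base R's diff, cumsum and rle:

```r
# Sample NoInSeries values from the question: three ascending runs
x <- c(11, 12, 13, 14,
       11, 12, 13, 14, 15,
       11, 12, 13, 14, 15)
# A run ends wherever the next value drops, so counting rows per
# drop-delimited group gives the same answer as get.counter(x)
run_lengths <- rle(cumsum(c(0, diff(x) < 0)))$lengths
run_lengths
# [1] 4 5 5
```

Either version yields the segment lengths that avg.seg.col consumes below.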

# avg.seg.col outputs a data frame with the segment-wise average of one column. df1 is the input data frame, colvar is the column name (e.g. "Long" or "Lat"), and get_counter is the output of the get.counter function.

avg.seg.col <- function(df1, colvar, get_counter){
  long <- c()
  start <- 1
  for(i in cumsum(get_counter)){
    end <- i
    b1 <- subset(df1, select = colvar)[start:end, ]
    # Segment mean in the first row, NA for the remaining rows
    long <- c(long, mean(b1), rep(NA, length(b1) - 1))
    start <- end + 1
  }
  return(data.frame(long, stringsAsFactors = FALSE))
}
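To make the padding logic concrete, here is a minimal sketch with made-up values (vals and lens here are hypothetical, not from the GPS data): each segment contributes its mean in the first position followed by NA for its remaining rows, which is exactly the shape avg.seg.col builds:

```r
vals <- c(1, 3, 2, 4, 6)            # two made-up segments: (1, 3) and (2, 4, 6)
lens <- c(2, 3)                     # segment lengths, as get.counter would return
grp  <- rep(seq_along(lens), lens)  # expand lengths into group ids: 1 1 2 2 2
# Segment mean in the first row of each segment, NA elsewhere
out <- unlist(lapply(split(vals, grp),
                     function(v) c(mean(v), rep(NA, length(v) - 1))))
unname(out)
# [1]  2 NA  4 NA NA
```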

# Read in the data from a text file using read.table. The file must exist in the current working directory, which can be set with setwd("path to working directory").

df1 <- read.table(file = "file1.txt", 
                  header = TRUE, 
                  sep = "\t", 
                  stringsAsFactors = FALSE)

# apply get.counter function with a vector from df1$NoInSeries

get_counter <- get.counter(df1$NoInSeries)

# Apply avg.seg.col function for Long column

AvgLong <- avg.seg.col(df1, "Long", get_counter)

# Apply avg.seg.col function for Lat column

AvgLat <- avg.seg.col(df1, "Lat", get_counter)

# Merge data frames by column

df2 <- do.call("cbind", list(df1, AvgLong, AvgLat))

# Assign column names

colnames(df2) <- c(colnames(df2)[1:(ncol(df2)-2)], "AvgLong", "AvgLat")

Output:

     print(df2)
   ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
2  13 17/11/2014 22:09:18 115.9508 -31.82846         12       NA        NA
3  14 17/11/2014 22:09:19 115.9513 -31.82864         13       NA        NA
4  15 17/11/2014 22:09:21 115.9511 -31.82863         14       NA        NA
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
6  27 18/11/2014 00:07:15 115.9509 -31.82829         12       NA        NA
7  28 18/11/2014 00:07:16 115.9509 -31.82830         13       NA        NA
8  29 18/11/2014 00:07:17 115.9509 -31.82830         14       NA        NA
9  30 18/11/2014 00:07:18 115.9509 -31.82831         15       NA        NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669
11 57 18/11/2014 10:00:25 115.9514 -31.82670         12       NA        NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669         13       NA        NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668         14       NA        NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668         15       NA        NA

# After removing rows with NA, the output looks like this:

df2[!is.na(df2$AvgLong), ]
   ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669


Answer 2:

You can do it using cumsum, diff, aggregate and merge.

x
##          Date     Time     Long       Lat NoInSeries SeriesNo
## 1  17/11/2014 22:09:17 115.9508 -31.82850         11        0
## 2  17/11/2014 22:09:18 115.9508 -31.82846         12        0
## 3  17/11/2014 22:09:19 115.9513 -31.82864         13        0
## 4  17/11/2014 22:09:21 115.9511 -31.82863         14        0
## 5  18/11/2014 00:07:14 115.9509 -31.82829         11        1
## 6  18/11/2014 00:07:15 115.9509 -31.82829         12        1
## 7  18/11/2014 00:07:16 115.9509 -31.82830         13        1
## 8  18/11/2014 00:07:17 115.9509 -31.82830         14        1
## 9  18/11/2014 00:07:18 115.9509 -31.82831         15        1
## 10 18/11/2014 10:00:24 115.9513 -31.82670         11        2
## 11 18/11/2014 10:00:25 115.9514 -31.82670         12        2
## 12 18/11/2014 10:00:26 115.9514 -31.82669         13        2
## 13 18/11/2014 10:00:27 115.9514 -31.82668         14        2
## 14 18/11/2014 10:00:28 115.9514 -31.82668         15        2

cumsum(c(0, diff(x$NoInSeries) < 0)) will give you a new column which increments every time diff of NoInSeries is negative.

# Define a new variable which increments after every drop in NoInSeries
x$SeriesNo <- cumsum(c(0, diff(x$NoInSeries) < 0))
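Evaluated on the sample NoInSeries values, the intermediate steps look like this:

```r
no_in_series <- c(11, 12, 13, 14, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15)
d <- diff(no_in_series)           # negative exactly where a new series starts
d
# [1]  1  1  1 -3  1  1  1  1 -4  1  1  1  1
series_no <- cumsum(c(0, d < 0))  # running count of drops = group id per row
series_no
# [1] 0 0 0 0 1 1 1 1 1 2 2 2 2 2
```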

Now you aggregate using the new SeriesNo column:

# Breakdown ...  First aggregate Long, Lat by Series No with Function mean
aggregate(cbind(Long, Lat) ~ SeriesNo, data = x, FUN = mean)
##   SeriesNo     Long       Lat
## 1        0 115.9510 -31.82856
## 2        1 115.9509 -31.82830
## 3        2 115.9514 -31.82669



# merge it back with original data with only rows where NoInSeries = 11

# Final Desired Result in one line
merge(x[x$NoInSeries == 11, c("Date", "Time", "SeriesNo")], aggregate(cbind(Long, 
    Lat) ~ SeriesNo, data = x, FUN = mean))
##   SeriesNo       Date     Time     Long       Lat
## 1        0 17/11/2014 22:09:17 115.9510 -31.82856
## 2        1 18/11/2014 00:07:14 115.9509 -31.82830
## 3        2 18/11/2014 10:00:24 115.9514 -31.82669


Answer 3:

Seems that using aggregate does most of the work:

> aggregate(df1[ ,c("ID", "Long","Lat")], list( (df1$ID-1) %/% 5), mean)
  Group.1   ID     Long       Lat
1       2 13.5 115.9510 -31.82856
2       5 28.0 115.9509 -31.82830
3      11 58.0 115.9514 -31.82669

I needed to shift the ID variable by one to get the modulo division to deliver the groups you wanted. If you want something aligned with the original data, the ave function is designed to deliver that:

> df1$aveLong <- ave( df1$Long, (df1$ID-1) %/% 5, 
          FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1$aveLat <- ave( df1$Lat, (df1$ID-1) %/% 5, 
          FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1
   ID       Date     Time     Long       Lat NoInSeries  aveLong
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510
2  13 17/11/2014 22:09:18 115.9508 -31.82846         12       NA
3  14 17/11/2014 22:09:19 115.9513 -31.82864         13       NA
4  15 17/11/2014 22:09:21 115.9511 -31.82863         14       NA
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509
6  27 18/11/2014 00:07:15 115.9509 -31.82829         12       NA
7  28 18/11/2014 00:07:16 115.9509 -31.82830         13       NA
8  29 18/11/2014 00:07:17 115.9509 -31.82830         14       NA
9  30 18/11/2014 00:07:18 115.9509 -31.82831         15       NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514
11 57 18/11/2014 10:00:25 115.9514 -31.82670         12       NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669         13       NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668         14       NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668         15       NA
      aveLat
1  -31.82856
2         NA
3         NA
4         NA
5  -31.82830
6         NA
7         NA
8         NA
9         NA
10 -31.82669
11        NA
12        NA
13        NA
14        NA
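For readers unfamiliar with ave: it applies FUN within each group and writes the results back into the original row positions, which is what keeps the output aligned with the input. A tiny self-contained illustration (the values here are made up, not from the GPS data):

```r
vals <- c(1, 3, 5, 7)   # two made-up groups of two values each
grp  <- c(0, 0, 1, 1)
# Group mean in each group's first position, NA in the rest
res <- ave(vals, grp, FUN = function(x) c(mean(x), rep(NA, length(x) - 1)))
res
# [1]  2 NA  6 NA
```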


Answer 4:

I've read that for loops are usually needed for iterative actions like this, which is why I like @Chinmay Patil's use of cumsum and diff to avoid one. I don't have enough reputation to comment on his elegant answer, so here is a slightly different approach using an explicit loop.

df$group <- 0     #Create a dummy grouping variable

for(i in 2:length(df$NoInSeries)) {        # Start on row 2 and walk to the end
  # Check whether the series resets at row i (TRUE = 1, FALSE = 0)
  check <- df[i-1, "NoInSeries"] > df[i, "NoInSeries"]
  df[i, "group"] <- df[i-1, "group"] + check    # Add check to the previous group number
}     # This yields a group number for each series

require(plyr)
ddply(df, .(group), summarise, 
    Date= min(Date), Time=min(Time), Long=mean(Long), Lat= mean(Lat))

#  group       Date     Time     Long       Lat
#1     0 17/11/2014 22:09:17 115.9510 -31.82856
#2     1 18/11/2014 00:07:14 115.9509 -31.82830
#3     2 18/11/2014 10:00:24 115.9514 -31.82669

You can report the Lat/Long for each group by its first time (min, as above), last time (max), or average time (mean). Note, however, that ddply can sometimes have issues when the data frame contains POSIXct dates/times.