I am working with GPS data and trying to average the 11th-15th fixes for latitude and longitude. I have seen solutions in similar questions for averaging every n rows. The problem is that occasionally the satellites bomb out and the fixes stop at 13 or 14, so in those cases I only want to average 3 or 4 values instead of 5. In other words, I want to average latitude and longitude starting from where NoInSeries is 11 until NoInSeries drops again (the last set will not drop back down to a low number, so it needs to be included too).

I have started by deleting all rows where NoInSeries is NOT in my desired 11-15 range. So, for an example dummy data set, this leaves me with:
Date Time Long Lat NoInSeries
12 17/11/2014 22:09:17 115.9508 -31.82850 11
13 17/11/2014 22:09:18 115.9508 -31.82846 12
14 17/11/2014 22:09:19 115.9513 -31.82864 13
15 17/11/2014 22:09:21 115.9511 -31.82863 14
26 18/11/2014 00:07:14 115.9509 -31.82829 11
27 18/11/2014 00:07:15 115.9509 -31.82829 12
28 18/11/2014 00:07:16 115.9509 -31.82830 13
29 18/11/2014 00:07:17 115.9509 -31.82830 14
30 18/11/2014 00:07:18 115.9509 -31.82831 15
56 18/11/2014 10:00:24 115.9513 -31.82670 11
57 18/11/2014 10:00:25 115.9514 -31.82670 12
58 18/11/2014 10:00:26 115.9514 -31.82669 13
59 18/11/2014 10:00:27 115.9514 -31.82668 14
60 18/11/2014 10:00:28 115.9514 -31.82668 15
My desired output would be something like this, with the first one averaging 4 (11-14) and the next two averaging 5 (11-15):
Date Time Long Lat NoInSeries AvgLong AvgLat
12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA NA
14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA NA
15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA NA
26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA NA
28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA NA
29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA NA
30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA NA
56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA NA
58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA NA
59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA NA
60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA NA
I would then go through and delete all rows where AvgLong is NA, so my final output would just have all the rows where NoInSeries is 11, together with the averages.
I really don't know where to start with the code for this... the examples I have found have all discussed averaging an exact number of rows, rather than a variable number.
For instance:
c( tapply( x, (row(x)-1)%/%5, mean ) )
Or:
idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(dd), by=5)]
# bind them to give a matrix
res <- do.call(rbind, res)
Also, the answers I have seen generally output the averages as a new data frame... Ultimately, I also want to apply this averaging under a condition: if Schedule is 'Multifix', I want to average from 11 up to however high it goes (up to 15), whereas if Schedule is 'Continuous', I want to average from 181 up to however high each one goes. Something like this:
if(import.list$Schedule=='Multifix'){
...code to average Long and Lat for Number in Series from 11 up to however high it goes (up to 15)...
} else {
...code to average Long and Lat for Number in Series from 181 up to however high it goes...
}
Or perhaps I have an if else statement to define a variable and then use that variable in the function to do the averaging?
...but I imagine this condition could complicate things if the output creates a new dataframe, which is why I was aiming for just adding values to new columns "AvgLong" and "AvgLat." Thanks for any help!!
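For what it's worth, the "define a variable with if/else, then use it" idea can be sketched like this (the `pick_start` helper and the tiny data frame are hypothetical; the Schedule values and thresholds are taken from the description above):

```r
# Hypothetical sketch: derive the starting fix number from the schedule,
# then subset before averaging (pick_start and this tiny data frame are
# made up for illustration).
pick_start <- function(schedule) {
  if (schedule == "Multifix") 11 else 181
}

df1 <- data.frame(NoInSeries = c(11, 12, 13, 181, 182),
                  Long = c(115.950, 115.951, 115.951, 115.960, 115.961))
start_fix <- pick_start("Multifix")
sub <- df1[df1$NoInSeries >= start_fix, ]
```

The resulting `start_fix` can then feed whichever averaging approach you end up using, so the condition need not complicate the grouping itself.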
# dput() shows the data I was working with, taken from your question.
dput(df1)
structure(list(ID = c(12L, 13L, 14L, 15L, 26L, 27L, 28L, 29L,
30L, 56L, 57L, 58L, 59L, 60L), Date = c("17/11/2014", "17/11/2014",
"17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014",
"18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014",
"18/11/2014", "18/11/2014"), Time = c("22:09:17", "22:09:18",
"22:09:19", "22:09:21", "00:07:14", "00:07:15", "00:07:16", "00:07:17",
"00:07:18", "10:00:24", "10:00:25", "10:00:26", "10:00:27", "10:00:28"
), Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509,
115.9509, 115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514,
115.9514, 115.9514), Lat = c(-31.8285, -31.82846, -31.82864,
-31.82863, -31.82829, -31.82829, -31.8283, -31.8283, -31.82831,
-31.8267, -31.8267, -31.82669, -31.82668, -31.82668), NoInSeries = c(11L,
12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L, 11L, 12L, 13L, 14L, 15L
)), .Names = c("ID", "Date", "Time", "Long", "Lat", "NoInSeries"
), class = "data.frame", row.names = c(NA, -14L))
# get.counter returns the length of each ascending run, i.e. it counts rows
# until the value of the column drops back down instead of increasing.
get.counter <- function(x){
a1 = x
counter = 0
a2 = c()
for( i in 1:length(a1)){
if(i < length(a1)){
if(a1[i+1] > a1[i]){
counter = counter + 1
}else{
counter = counter + 1
a2 = c(a2, counter)
counter = 0
}
}else{
counter = counter + 1
a2 = c(a2, counter)
}
}
return(a2)
}
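As a loop-free sanity check (my own addition, not part of the original function): the run lengths that get.counter produces for the dummy NoInSeries values can also be recovered by grouping with cumsum and counting with table:

```r
# Sanity check: a new run starts wherever the value fails to increase,
# so tag each row with a group id via cumsum() and count rows per group
no_in_series <- c(11, 12, 13, 14, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15)
grp <- cumsum(c(0, diff(no_in_series) < 0))
run_lengths <- as.vector(table(grp))
run_lengths
# [1] 4 5 5
```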
# avg.seg.col function outputs a data frame with the segmented average of a column. df1 is the input data frame, colvar is the column name (Eg: Long or Lat), and get_counter is the output from get.counter function.
avg.seg.col <- function(df1, colvar, get_counter){
long <- c()
start = 1
for(i in cumsum(get_counter)){
end = i
b1 = subset(df1, select = colvar)[start:end,]
mean_b1 = mean(b1)
long = c(long, mean_b1, rep(NA, (length(b1)-1)))
start = end+1
}
return(data.frame(long, stringsAsFactors = FALSE))
}
# Read in the data from a text file using read.table. Make sure the file
# exists in the current working directory, which can be set with setwd().
df1 <- read.table(file = "file1.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
# apply get.counter function with a vector from df1$NoInSeries
get_counter <- get.counter(df1$NoInSeries)
# Apply avg.seg.col function for Long column
AvgLong <- avg.seg.col(df1, "Long", get_counter)
# Apply avg.seg.col function for Lat column
AvgLat <- avg.seg.col(df1, "Lat", get_counter)
# Merge data frames by column
df2 <- do.call("cbind", list(df1, AvgLong, AvgLat))
# Assign column names
colnames(df2) <- c(colnames(df2)[1:(ncol(df2)-2)], "AvgLong", "AvgLat")
Output:
print(df2)
ID Date Time Long Lat NoInSeries AvgLong AvgLat
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
2 13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA NA
3 14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA NA
4 15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA NA
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
6 27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA NA
7 28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA NA
8 29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA NA
9 30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
11 57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA NA
# After removing the rows where AvgLong is NA, the output looks like below
df2[!is.na(df2$AvgLong), ]
ID Date Time Long Lat NoInSeries AvgLong AvgLat
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
You can do it using cumsum, diff, aggregate and merge.
x
## Date Time Long Lat NoInSeries SeriesNo
## 1 17/11/2014 22:09:17 115.9508 -31.82850 11 0
## 2 17/11/2014 22:09:18 115.9508 -31.82846 12 0
## 3 17/11/2014 22:09:19 115.9513 -31.82864 13 0
## 4 17/11/2014 22:09:21 115.9511 -31.82863 14 0
## 5 18/11/2014 00:07:14 115.9509 -31.82829 11 1
## 6 18/11/2014 00:07:15 115.9509 -31.82829 12 1
## 7 18/11/2014 00:07:16 115.9509 -31.82830 13 1
## 8 18/11/2014 00:07:17 115.9509 -31.82830 14 1
## 9 18/11/2014 00:07:18 115.9509 -31.82831 15 1
## 10 18/11/2014 10:00:24 115.9513 -31.82670 11 2
## 11 18/11/2014 10:00:25 115.9514 -31.82670 12 2
## 12 18/11/2014 10:00:26 115.9514 -31.82669 13 2
## 13 18/11/2014 10:00:27 115.9514 -31.82668 14 2
## 14 18/11/2014 10:00:28 115.9514 -31.82668 15 2
The expression cumsum(c(0, diff(x$NoInSeries) < 0)) will give you a new column which increments every time the diff of NoInSeries is negative.
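A minimal illustration (my own toy vector, not from the original answer) of how the counter bumps exactly at each reset:

```r
# Toy vector: the series resets after 13 and again after 12,
# so the group id increments at each of those positions
n <- c(11, 12, 13, 11, 12, 11)
cumsum(c(0, diff(n) < 0))
# [1] 0 0 0 1 1 2
```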
# Define a new variable which increments after every drop in NoInSeries
x$SeriesNo <- cumsum(c(0, diff(x$NoInSeries) < 0))
Now you aggregate using the new SeriesNo column.
# Breakdown ... First aggregate Long, Lat by Series No with Function mean
aggregate(cbind(Long, Lat) ~ SeriesNo, data = x, FUN = mean)
## SeriesNo Long Lat
## 1 0 115.9510 -31.82856
## 2 1 115.9509 -31.82830
## 3 2 115.9514 -31.82669
# merge it back with original data with only rows where NoInSeries = 11
# Final Desired Result in one line
merge(x[x$NoInSeries == 11, c("Date", "Time", "SeriesNo")], aggregate(cbind(Long,
Lat) ~ SeriesNo, data = x, FUN = mean))
## SeriesNo Date Time Long Lat
## 1 0 17/11/2014 22:09:17 115.9510 -31.82856
## 2 1 18/11/2014 00:07:14 115.9509 -31.82830
## 3 2 18/11/2014 10:00:24 115.9514 -31.82669
Seems that using aggregate does most of the work:
> aggregate(df1[ ,c("ID", "Long","Lat")], list( (df1$ID-1) %/% 5), mean)
Group.1 ID Long Lat
1 2 13.5 115.9510 -31.82856
2 5 28.0 115.9509 -31.82830
3 11 58.0 115.9514 -31.82669
Needed to shift the ID variable by one to get the modulo division to deliver the groups you wanted. If you wanted to have something aligned with the original data, then the ave function is designed to deliver:
> df1$aveLong <- ave( df1$Long, (df1$ID-1) %/% 5,
FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1$aveLat <- ave( df1$Lat, (df1$ID-1) %/% 5,
FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1
ID Date Time Long Lat NoInSeries aveLong
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510
2 13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA
3 14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA
4 15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509
6 27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA
7 28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA
8 29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA
9 30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514
11 57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA
aveLat
1 -31.82856
2 NA
3 NA
4 NA
5 -31.82830
6 NA
7 NA
8 NA
9 NA
10 -31.82669
11 NA
12 NA
13 NA
14 NA
I've read that for loops are rarely necessary for iterative actions, which is why I like Chinmay's use of cumsum and diff. I don't have enough reputation to comment on @Chinmay Patil's elegant answer, so here is a slightly different approach.
df$group <- 0 #Create a dummy grouping variable
for(i in 2:length(df$NoInSeries)) { #Starting on row 2 to the end
#Check if the series resets (True = 1, False = 0)
check <- df[i-1, "NoInSeries"] > df[i, "NoInSeries"]
df[i, "group"] <- df[i-1, "group"] + check #Add check value to previous row
} #This yields a number for each series
require(plyr)
ddply(df, .(group), summarise,
Date= min(Date), Time=min(Time), Long=mean(Long), Lat= mean(Lat))
# group Date Time Long Lat
#1 0 17/11/2014 22:09:17 115.9510 -31.82856
#2 1 18/11/2014 00:07:14 115.9509 -31.82830
#3 2 18/11/2014 10:00:24 115.9514 -31.82669
You can report the Lat/Lon by the first time (min, as above), last time (max), or average time (mean). However, sometimes I have issues with ddply when I have POSIXct dates/times in the data frame.