I have a data frame called ants with multiple entries per site. It looks like this:
Site Date Time Temp SpCond Salinity Depth Turbidity Chlorophyll
1 71 6/8/2010 14:50:35 14.32 49.88 32.66 0.397 0.0 1.3
2 71 6/8/2010 14:51:00 14.31 49.94 32.70 1.073 0.0 2.0
3 71 6/8/2010 14:51:16 14.32 49.95 32.71 1.034 -0.1 1.6
4 71 6/8/2010 14:51:29 14.31 49.96 32.71 1.030 -0.2 1.6
5 70 6/8/2010 14:53:55 14.30 50.04 32.77 1.002 -0.2 1.2
6 70 6/8/2010 14:54:09 14.30 50.03 32.77 0.993 -0.5 1.2
Sites have different numbers of entries, usually 3 but sometimes fewer or more. Where both the date and the site number match, I would like to write a new data frame with one entry per site giving the mean reading for each parameter. Empty or "na" cells should be omitted from the calculation and from the resulting data frame.
I'm not sure if this is an apply function or a version of rowMeans maybe? Very stuck, any help much appreciated!
Here is a complete new answer with a full log also covering your new specification:
R> Lines <- " Site Date Time Temp SpCond Salinity Depth Turbidity Chlorophyll
+ 71 6/8/2010 14:50:35 14.32 49.88 32.66 0.397 0.0 1.3
+ 71 6/8/2010 14:51:00 14.31 49.94 32.70 1.073 0.0 2.0
+ 71 6/8/2010 14:51:16 14.32 49.95 32.71 1.034 -0.1 1.6
+ 71 6/8/2010 14:51:29 14.31 49.96 32.71 1.030 -0.2 1.6
+ 70 6/8/2010 14:53:55 14.30 50.04 32.77 1.002 -0.2 1.2
+ 70 6/8/2010 14:54:09 14.30 50.03 32.77 0.993 -0.5 1.2
+ "
R> con <- textConnection(Lines)
R> df <- read.table(con, sep="", header=TRUE, stringsAsFactors=FALSE)
R> close(con)
R> df$pt <- as.POSIXct(strptime(paste(df$Date, df$Time), "%m/%d/%Y %H:%M:%S"))
R> library(plyr)
R> newdf <- ddply(df, .(Site,Date), function(x) sapply(x[,-(1:3)], mean, na.rm=TRUE))
R> newdf$pt <- as.POSIXct(newdf$pt, origin="1970-01-01")
R> newdf
Site Date Temp SpCond Salinity Depth Turbidity Chlorophyll pt
1 70 6/8/2010 14.30 50.03 32.77 0.9975 -0.350 1.200 2010-06-08 20:54:02
2 71 6/8/2010 14.32 49.93 32.70 0.8835 -0.075 1.625 2010-06-08 20:51:05
R>
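The pt column in the log above is averaged as a number of seconds and then converted back to a date-time. Here is a minimal sketch of that round trip using the two site 70 timestamps from the question. Fixing tz = "UTC" on both conversions is my assumption, made so that parsing and converting back use the same time zone; the six-hour shift visible in the log is consistent with the two steps using different zones.

```r
# Parse the two site-70 timestamps; tz = "UTC" is an assumption for this sketch
stamps <- as.POSIXct(strptime(c("6/8/2010 14:53:55", "6/8/2010 14:54:09"),
                              "%m/%d/%Y %H:%M:%S", tz = "UTC"))

# Averaging works on seconds since 1970-01-01
secs <- mean(as.numeric(stamps))

# Convert the averaged seconds back to a date-time in the same zone
mid <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(mid, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```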
Nico's answer looks like what mine would have been, except that I would have added a named argument to be passed to mean() so that NAs in the aggregated columns would not sabotage the results. (I could not tell whether the OP knew or suspected that the NAs were in the by variables or in the other variables.)
aggregate(df, by=list(df$Site, df$Date), FUN=mean, na.rm=TRUE)
You would probably need to also run aggregate or tapply calls in parallel to count the number of non-NA values.
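As a rough sketch of that counting step, the same by= grouping can be reused with a function that counts non-NA values. The toy data frame and the name n_obs are made up for illustration; only Temp and Depth are kept to keep it short.

```r
# Hypothetical toy data: one Temp reading is missing at site 70
toy <- data.frame(
  Site  = c(70, 70, 71, 71),
  Date  = "6/8/2010",
  Temp  = c(14.30, NA, 14.32, 14.31),
  Depth = c(1.002, 0.993, 0.397, 1.073)
)

# Number of non-NA readings behind each mean, per Site/Date group
n_obs <- aggregate(toy[, c("Temp", "Depth")],
                   by = list(Site = toy$Site, Date = toy$Date),
                   FUN = function(v) sum(!is.na(v)))
n_obs
```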
aggregate's formula method may give different results, since its default na.action=na.omit drops every row that contains an NA before FUN is called:
aggregate( . ~Site +Date, data=df, FUN=mean, na.rm=TRUE)
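To make the difference concrete, here is a small sketch on made-up data with one missing Temp value. It uses cbind(Temp, Depth) on the left-hand side rather than the dot, which is my choice to restrict the example to numeric columns; the names toy, by_means and fm_means are hypothetical.

```r
# Hypothetical toy data: one Temp reading is missing at site 70
toy <- data.frame(
  Site  = c(70, 70, 71, 71),
  Date  = "6/8/2010",
  Temp  = c(14.30, NA, 14.32, 14.31),
  Depth = c(1.002, 0.993, 0.397, 1.073)
)

# by= interface: na.rm = TRUE drops NAs column by column,
# so both Depth readings at site 70 are still used
by_means <- aggregate(toy[, c("Temp", "Depth")],
                      by = list(Site = toy$Site, Date = toy$Date),
                      FUN = mean, na.rm = TRUE)

# formula interface: na.action = na.omit removes the whole row first,
# so the Depth value on the row with the missing Temp is discarded too
fm_means <- aggregate(cbind(Temp, Depth) ~ Site + Date, data = toy,
                      FUN = mean)
```

With these numbers the two calls agree on Temp but not on Depth for site 70, because the formula version only sees one of the two rows.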
Here is one way using the plyr package and its ddply() function:
R> df
Site Date Time Temp SpCond Salinity Depth Turbidity Chlorophyll
1 71 6/8/2010 14:50:35 14.32 49.88 32.66 0.397 0.0 1.3
2 71 6/8/2010 14:51:00 14.31 49.94 32.70 1.073 0.0 2.0
3 71 6/8/2010 14:51:16 14.32 49.95 32.71 1.034 -0.1 1.6
4 71 6/8/2010 14:51:29 14.31 49.96 32.71 1.030 -0.2 1.6
5 70 6/8/2010 14:53:55 14.30 50.04 32.77 1.002 -0.2 1.2
6 70 6/8/2010 14:54:09 14.30 50.03 32.77 0.993 -0.5 1.2
R> library(plyr)
R> ddply(df, .(Site,Date), function(x) colMeans(x[,-(1:3)], na.rm=TRUE))
Site Date Temp SpCond Salinity Depth Turbidity Chlorophyll
1 70 6/8/2010 14.300 50.035 32.770 0.9975 -0.350 1.200
2 71 6/8/2010 14.315 49.933 32.695 0.8835 -0.075 1.625
R>
I used a custom anonymous function to skip the first three columns.
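To see what such a function does for a single group, here is a minimal base R sketch using the site 71 rows from the question, cut down to the Temp and Depth columns. It uses colMeans(), which works directly on a data frame of numeric columns; recent versions of R no longer compute column means when mean() is handed a data frame. The names site71 and res are made up for illustration.

```r
# Site 71 rows from the question, reduced to two measurement columns
site71 <- data.frame(
  Site  = 71,
  Date  = "6/8/2010",
  Time  = c("14:50:35", "14:51:00", "14:51:16", "14:51:29"),
  Temp  = c(14.32, 14.31, 14.32, 14.31),
  Depth = c(0.397, 1.073, 1.034, 1.030)
)

# Drop the first three (non-numeric or grouping) columns, then average the rest
res <- colMeans(site71[, -(1:3)], na.rm = TRUE)
res
```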
You can also use aggregate:
aggregate(df, by=list(df$Site, df$Date), FUN=mean, na.rm=TRUE)
You were close with rowMeans(), but you need colMeans() instead. The others have shown how to use built-in or add-on functionality and I would certainly recommend you use them. However, it might be useful to see how to do something like this by hand:
## using df from Dirk's answer, we split the data in Site Date combinations
df.sp <- with(df,
              split(data.frame(Temp, SpCond, Salinity, Depth, Turbidity,
                               Chlorophyll),
                    list(Site = Site, Date = Date)))
## The above gives a list of data frames one per date-site combo,
## to which we apply the colMeans() function
df.mean <- data.frame(t(sapply(df.sp, colMeans, na.rm = TRUE)))
At this point we need to do some extra tidying if you want the output to be nice like the others' answers:
## Process the rownames on df.mean
name.parts <- strsplit(rownames(df.mean), "\\.")
## pull out the Site part (before the '.')
df.mean <- within(df.mean, Site <- as.numeric(sapply(name.parts, `[`, 1)))
## pull out the Date part (after the '.')
df.mean <- within(df.mean, Date <- sapply(name.parts, `[`, 2))
## rearrange the columns
df.mean <- df.mean[, c(7:8,1:6)]
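As a possible variation on the by-hand approach, the grouping columns can be carried through each piece so that the row names never need to be parsed. This is only a sketch on a cut-down copy of the data (Temp and Depth only); the names toy, pieces, rows and out are made up, and drop = TRUE is my addition to skip Site/Date combinations that have no rows.

```r
# Cut-down copy of the question's data (two measurement columns only)
toy <- data.frame(
  Site  = c(71, 71, 70, 70),
  Date  = "6/8/2010",
  Temp  = c(14.32, 14.31, 14.30, 14.30),
  Depth = c(0.397, 1.073, 1.002, 0.993),
  stringsAsFactors = FALSE
)

# One data frame per Site/Date combination that actually occurs
pieces <- split(toy, list(toy$Site, toy$Date), drop = TRUE)

# For each piece, keep its Site and Date and append the column means
rows <- lapply(pieces, function(g) {
  data.frame(Site = g$Site[1], Date = g$Date[1],
             t(colMeans(g[, c("Temp", "Depth")], na.rm = TRUE)))
})

# Stack the one-row data frames and discard the generated row names
out <- do.call(rbind, rows)
rownames(out) <- NULL
out
```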
Note again that for most cases you should use the canned functions described in the other answers. Sometimes, however, it might be quicker to cook up your own solution, and the above might serve as a guide to doing so.