Sample of 2 (made-up) example rows in df:
userid facultyid courseid schoolid
167 265 NA 1678
167 71111 301 NA
Suppose that I have a couple hundred duplicate userid like in the above example. However, the vast majority of userid have different values.
How can I combine rows with duplicate userid in such a way as to stick to the column values in the 1st (of the 2) row unless the first value is NA (in which case the NA will be repopulated with whatever value came from the second row)?
In essence, drawing from the above example, my ideal output would contain:
userid facultyid courseid schoolid
167 265 301 1678
aggregate(x = df1, by = list(df1$userid), FUN = function(x) na.omit(x)[1])[,-1]
or use dplyr
library:
library(dplyr)
df1 %>%
group_by(userid) %>%
summarise_each(funs(first(na.omit(.))))
# initialize a vector that will contain row numbers which should be erased
rows.to.erase <- c()
# loop over the rows, starting from top
for(i in 1:(nrow(dat)-1)) {
if(dat$userid[i] == dat$userid[i+1]) {
# loop over columns to recuperate data when a NA is present
for(j in 2:4) {
if(is.na(dat[i,j]))
dat[i,j] <- dat[i+1,j]
}
rows.to.erase <- append(rows.to.erase, i+1)
}
}
dat.clean <- dat[-rows.to.erase,]
dat.clean
# userid facultyid courseid schoolid
# 1 167 265 301 1678
Here's a different approach using ddply
:
# requires the plyr package
library(plyr)
# Your example dataframe with added lines
schoolex <- data.frame(userid = c(167, 167, 200, 203, 203), facultyid = c(265, 71111, 200, 300, NA),
courseid = c(NA, 301, 302, 303, 303), schoolid = c(1678, NA, 1678, NA, 1678))
schoolex_duprm <- ddply(schoolex, .(userid), summarize, facultyid2 = facultyid[!is.na(facultyid)][1],
courseid2 = courseid[!is.na(courseid)][1],
schoolid2 = schoolid[!is.na(schoolid)][1])
Here's a simple one-liner from plyr. I wrote it a bit more generally than you asked:
a <- data.frame(x=c(1,2,3,1,2,3,1,2,3),y=c(2,3,1,1,2,3,2,3,1),
z=c(NA,1,NA,2,NA,3,4,NA,5),zz=c(1,NA,2,NA,3,NA,4,NA,5))
ddply(a,~x+y,summarize,z=first(z[!is.na(z)]),zz=first(zz[!is.na(zz)]))
Specifically answering the original question, if your data frame is named a, :
ddply(a,~userid,summarize,facultyid=first(facultyid[!is.na(facultyid)]),
courseid=first(courseid[!is.na(courseid)],
schoolid=first(schoolid[!is.na(schoolid)])