I have some almost identical rows in a dataframe, see ex., the criteria to establish they are related are some variables "sel1,sel2" in this example, the other variables, var1 and var2, must be integrated by the following criteria: 1. discarding NA, or 2. discarding the shorter string (in var2 in the example). So, until now I have discarded the NA, but not find a way to at the same time discard the shorter string. The strings are complex and might have commas, spaces and several types of characters.
df <- read.table(text =
" sel1 sel2 var1 var2
1 pseudorepeated1 x NA \"longer string\" # keep longer string instead of shortstring
2 pseudorepeated1 x 2 \"short string\" # keep 2 instead of NA
3 pseudorepeated2 y NA \"longer string 2\" # keep longer string 2
4 pseudorepeated2 y 4 \"short string2\" # keep 4
5 3 x gs as
6 4 y fg df
7 5 x eg af
8 6 y df fd", header = TRUE, stringsAsFactors=F)
df
df[is.na(df)] <- ""
df2<-aggregate(. ~ sel1 + sel2,data=df,FUN=function(X)paste(unique((X))) )
paste_noNA <- function(x,sep=", ")
gsub(", " ,sep, toString(x[!is.na(x) & x!="" & x!="NA"] ) )
df3<-as.data.frame(lapply(df2, function(X) unlist(lapply(X, function(x) paste_noNA(x)) ) ),
stringsAsFactors=F )
The expected output does not have the ", short string" text in this table.
df3
sel1 sel2 var1 var2
1.1 3 x gs as
1.3 5 x eg af
1.5 pseudorepeated1 x 2 longer string, short string# only longer string desired
2.2 4 y fg df
2.4 6 y df fd
2.6 pseudorepeated2 y 4 longer string 2, short string2# only longer string 2 desired
group by
sel1
andsel2
and removeNA
in var1, and replace shorter string with longer string invar2
. Finally, remove the duplicates in it.EDIT: with many columns
Data:
Code:
Output:
EDIT 2: avoiding
for
loop, and using.SDcols
and a column names variable