Hi I need to concatenate strings by groups for multiple columns. I realise that versions of this question has been asked several times (see Aggregating by unique identifier and concatenating related values into a string), but they usually involve concatenating values of a single column.
My dataset is something like:
Sample group Gene1 Gene2 Gene3
A 1 a NA NA
A 2 b NA NA
B 1 NA c NA
C 1 a NA d
C 2 b NA e
C 3 c NA NA
I want to get it into a format where each samples takes only 1 row (the group column is optional):
Sample group Gene1 Gene2 Gene3
A 1,2 a,b NA NA
B 1 NA c NA
C 1,2,3 a,b,c NA d,e
Since the number of genes can go up to the thousands, I can't simply specify the columns that I wish to concatenate.
I know aggregate
or dplyr
can be used to get the groups but I can't figure out how to do it for multiple columns.
Thanks in advance!
Edit
As my dataset is very large containing thousands of genes, I realised dplyr is too slow. I've been experimenting with data.table and the following code can also get what I want:
setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]
The output is now:
Sample group Gene1 Gene2 Gene3
1: A 1,2 a,b
2: B 1 c
3: C 1,2,3 a,b,c d,e
Thanks for all your help!