Is there an elegant/fastR way to combine all pairs of columns in a data.frame?
For example, using mapply()
and paste()
we can turn this data.frame:
mydf <- data.frame(a.1 = letters, a.2 = 26:1, b.1 = letters, b.2 = 1:26)
head(mydf)
a.1 a.2 b.1 b.2
1 a 26 a 1
2 b 25 b 2
3 c 24 c 3
4 d 23 d 4
5 e 22 e 5
6 f 21 f 6
into this data.frame:
mydf2 <- mapply(function(x, y) {
paste(x, y, sep = ".")},
mydf[ ,seq(1, ncol(mydf), by = 2)],
mydf[ ,seq(2, ncol(mydf), by = 2)])
head(mydf2)
a.1 b.1
[1,] "a.26" "a.1"
[2,] "b.25" "b.2"
[3,] "c.24" "c.3"
[4,] "d.23" "d.4"
[5,] "e.22" "e.5"
[6,] "f.21" "f.6"
However, this feels clumsy and is a bit slow when applied to big datasets. Any suggestions, perhaps using a Hadley package?
EDIT: The ideal solution would easily scale to large numbers of columns, such that the names of the columns would not need to be included in the function call. Thanks!
It's amusing to note that the OP's solution appears to be the fastest one:
Results:
[Updated Benchmark]
I've added one solution from @thelatemail, which I missed in the original answer, and one solution from @akrun:
Benchmark:
An option using
set
fromdata.table
. It should be fast for large datasets as it modifies by reference and the overhead of[.data.table
is avoided. Assuming that the columns are ordered for each pair of columns.Instead of creating a new result dataset, we can also update the same or a copy of the original dataset. There will be some warnings about type conversion, but I guess this would be a bit faster (not benchmarked)
Benchmarks
I tried the benchmarks on a slightly bigger data with many columns.
data
Compared
f1
,f.jazurro
(fastest) (from @Marat Talipov's post) withf.akrun2
In this,
f.jazurro
is slighly better thanf.akrun2
. I think if I increase the group size, nrows etc, it would be an interesting comparisonI'm not sure this is the best approach. See if the below code gives any speed improvement
Answer updated based on comment :-)