I have two columns and would like to retain only the non-commutative rows. For the data below, my output should contain only one combination of (1 2), i.e. for my purposes (1 2) is the same as (2 1). Is there a simple way to do this in R? I already tried transposing and retaining the upper triangular matrix, but it becomes a pain to re-transpose the data back.
A B prob
1 2 0.1
1 3 0.2
1 4 0.3
2 1 0.3
2 3 0.1
2 4 0.4
My final output should be:
A B prob
1 2 0.1
1 3 0.2
1 4 0.3
2 3 0.1
2 4 0.4
Here is another solution using base R. The idea is to search the second half of df (using sapply()) to see whether there are any duplicates there. We then get back the secondHalf vector, and we use it to remove those rows from df. This should work:
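A minimal sketch of this idea, not necessarily the exact code the answer had in mind. Only the names df, sapply() and secondHalf come from the answer; the assumption here is that secondHalf is a logical vector marking rows whose reversed pair (B, A) already appeared in an earlier row.

```r
# Sample data from the question
df <- data.frame(A = c(1, 1, 1, 2, 2, 2),
                 B = c(2, 3, 4, 1, 3, 4),
                 prob = c(0.1, 0.2, 0.3, 0.3, 0.1, 0.4))

# For each row i, check whether an earlier row holds the reversed pair (B, A).
# seq_len(i - 1) is empty for i = 1, so the first row is never flagged.
secondHalf <- sapply(seq_len(nrow(df)), function(i) {
  earlier <- seq_len(i - 1)
  any(df$A[earlier] == df$B[i] & df$B[earlier] == df$A[i])
})

# Drop the flagged rows
result <- df[!secondHalf, ]
result
```

Note that this sketch only detects reversed pairs; rows that repeat exactly the same (A, B) would need an additional duplicated() check.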
Be aware that this will keep the FIRST occurrence of each pattern.
We can independently sort() each row and then use !duplicated() to find which rows to preserve.
Data
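The sample data from the question can be rebuilt as a data frame, and the approach described above then fits in one line. This is a sketch: the name df matches the answers, while the integer column types and the helper name keep are assumptions.

```r
# Sample data from the question
df <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                 B = c(2L, 3L, 4L, 1L, 3L, 4L),
                 prob = c(0.1, 0.2, 0.3, 0.3, 0.1, 0.4))

# Sort each (A, B) pair, then keep rows whose sorted pair has not been seen before
keep <- !duplicated(t(apply(df[c("A", "B")], 1L, sort)))
result <- df[keep, ]
result
```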
Explanation
The first step is to extract just the two columns of interest. Then we independently sort each row with apply() and sort(). apply() returns its results in an unexpected transposition, so we have to fix it with t() to prepare for the upcoming duplicated() call. Now we can use duplicated() to get a logical vector indicating which rows are duplicates of previous rows. We then invert the logical vector with a negation, to get just those rows that are not duplicates of any previous rows. Finally, we can use the resulting logical vector to index out just those rows of df that are not duplicates of any previous rows.
Therefore, the first occurrence of every set of post-sort duplicates will be retained, and the remainder will be removed.
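The steps above can be sketched one at a time as follows. The intermediate names (pairs, sorted, sorted_rows, dup, keep) are illustrative and not taken from the answer. The last statement is a variant that skips t() by asking duplicated() to compare columns through its MARGIN argument.

```r
# Sample data from the question
df <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                 B = c(2L, 3L, 4L, 1L, 3L, 4L),
                 prob = c(0.1, 0.2, 0.3, 0.3, 0.1, 0.4))

pairs <- df[c("A", "B")]           # 1. the two columns of interest
sorted <- apply(pairs, 1L, sort)   # 2. sort each row; result is 2 x n (one column per row of df)
sorted_rows <- t(sorted)           # 3. transpose back to n x 2
dup <- duplicated(sorted_rows)     # 4. TRUE where a row repeats an earlier row
keep <- !dup                       # 5. rows that are not duplicates
result <- df[keep, ]               # 6. subset the original data frame

# Variant: compare the columns of the untransposed matrix instead of calling t()
keep_margin <- !duplicated(sorted, MARGIN = 2L)
result
```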
Excellent suggestion from @RichardScriven; we can replace the t() call with the MARGIN argument of duplicated(), which will likely be slightly faster.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), group by pmin(A, B) and pmax(A, B), and if the number of rows in a group is greater than 1, we get the first row, or else return the rows.
Or we can just use duplicated() on the pmax(), pmin() output to return a logical index and subset the data based on that.
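Both data.table variants might look like the sketch below, assuming the data.table package is installed. The name df1 follows the answer, while dt, res_group and res_dup are illustrative. The sketch uses as.data.table() to work on a copy; setDT(df1) would convert df1 in place instead.

```r
library(data.table)

# Sample data from the question
df1 <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                  B = c(2L, 3L, 4L, 1L, 3L, 4L),
                  prob = c(0.1, 0.2, 0.3, 0.3, 0.1, 0.4))
dt <- as.data.table(df1)

# Variant 1: group by the smaller and larger value of each pair;
# keep only the first row of any group that has more than one row
res_group <- dt[, if (.N > 1) head(.SD, 1) else .SD,
                by = .(A = pmin(A, B), B = pmax(A, B))]

# Variant 2: build a logical index from the pmin/pmax pairs and subset with it
res_dup <- dt[!duplicated(cbind(pmin(A, B), pmax(A, B)))]

res_group
res_dup
```

One difference worth knowing: the grouped variant reports each pair in sorted order (smaller value in A), whereas the duplicated() variant returns the surviving rows exactly as they appear in the input.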