I have two columns in a large file, say:
pro1 lig1
pro2 lig2
pro3 lig3
pro4 lig1
.....
The second column is redundant (its values repeat). I want new random combinations, twice as many as the original rows, that do not match any given combination. For example:
pro1 lig2
pro1 lig4
pro2 lig1
pro2 lig3
pro3 lig4
pro3 lig2
pro4 lig2
pro4 lig3
.....
Thanks.
If you want exactly two results for each value in column one, I'd brute force the non-matching part, with something like this:
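A minimal sketch of that brute-force idea, assuming the file has already been read into a list of `(protein, ligand)` tuples. The names `pairs` and `two_new_partners` are hypothetical, chosen only for illustration:

```python
import random

# Hypothetical in-memory stand-in for the two-column file.
pairs = [("pro1", "lig1"), ("pro2", "lig2"), ("pro3", "lig3"), ("pro4", "lig1")]

def two_new_partners(pairs, per_key=2):
    """Pick `per_key` random ligands per protein that are not already paired with it."""
    known = set(pairs)
    ligands = sorted({lig for _, lig in pairs})
    result = []
    # dict.fromkeys keeps each protein once, in first-seen order.
    for pro in dict.fromkeys(p for p, _ in pairs):
        # Guard against an endless loop when too few new ligands exist.
        allowed = [lig for lig in ligands if (pro, lig) not in known]
        if len(allowed) < per_key:
            raise ValueError(f"not enough unused ligands for {pro}")
        chosen = set()
        # Brute force: keep drawing until enough non-matching ligands are found.
        while len(chosen) < per_key:
            lig = random.choice(ligands)
            if (pro, lig) not in known:
                chosen.add(lig)
        result.extend((pro, lig) for lig in sorted(chosen))
    return result

new_pairs = two_new_partners(pairs)
for pro, lig in new_pairs:
    print(pro, lig)
```

The retry loop is cheap as long as most ligands are still unused for a given protein; if nearly every combination already exists, sampling directly from `allowed` would likely be faster.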
Using some sorting, filtering, chaining and list comprehensions, you can try:
The main trick is to use a random function as the `key` for `sorted`.
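A rough sketch of how sorting with a random key can be combined with filtering, chaining and list comprehensions. It assumes the rows are available as a list of tuples; `pairs`, `shuffled_candidates` and `new_pairs` are hypothetical names, not taken from the original answer:

```python
import random
from itertools import chain

# Hypothetical in-memory stand-in for the two-column file.
pairs = [("pro1", "lig1"), ("pro2", "lig2"), ("pro3", "lig3"), ("pro4", "lig1")]

known = set(pairs)
proteins = list(dict.fromkeys(p for p, _ in pairs))  # unique, first-seen order
ligands = sorted({lig for _, lig in pairs})          # unique ligands

def shuffled_candidates(pro):
    """Ligands not yet paired with `pro`, in random order."""
    allowed = [lig for lig in ligands if (pro, lig) not in known]
    # sorted() calls the key once per element, so a random key acts as a shuffle.
    return sorted(allowed, key=lambda _: random.random())

# Take the first two shuffled candidates per protein and flatten the result.
new_pairs = list(chain.from_iterable(
    [(pro, lig) for lig in shuffled_candidates(pro)[:2]] for pro in proteins
))

for pro, lig in new_pairs:
    print(pro, lig)
```

If a protein has fewer than two unused ligands, the slice simply returns what is available, so that protein gets fewer than two new pairs.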
Say you have two columns:
Then the most straightforward way to do this would be to use `itertools.product` and `random.sample` as below:
If `col1` and `col2` contain duplicate items, you can extract the unique items by doing `set(col1)` and `set(col2)`.
Note that `list(product(...))` will generate an `N * M` element list, where `N` and `M` are the numbers of unique items in the columns. This may cause problems if `N * M` ends up being a very large number.
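The `itertools.product` and `random.sample` approach described above might look roughly like this. The sample contents of `col1` and `col2` are made up, and filtering out the existing pairs is an added assumption based on the question, not necessarily part of the original answer:

```python
import random
from itertools import product

# Hypothetical column contents; in practice these come from the file.
col1 = ["pro1", "pro2", "pro3", "pro4"]
col2 = ["lig1", "lig2", "lig3", "lig1"]

known = set(zip(col1, col2))  # combinations that already exist

# All N * M combinations of the unique items, minus the existing ones.
candidates = [
    pair
    for pair in product(sorted(set(col1)), sorted(set(col2)))
    if pair not in known
]

# Ask for twice as many pairs as input rows, capped at what is available.
k = min(2 * len(col1), len(candidates))
new_pairs = random.sample(candidates, k)

for pro, lig in new_pairs:
    print(pro, lig)
```

Unlike the per-protein approaches, sampling from the whole product does not guarantee exactly two new pairs for each item in the first column; it only guarantees `k` distinct, previously unseen combinations.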