Suppose I have a set of tuples representing URLS with "scores":
{(0.75, 'http://www.foo.com'), (0.33, 'http://www.bar.com'), (0.5, 'http://www.foo.com'), (0.66, 'http://www.bar.com')}
.
What is a concise way for me to filter out duplicate URLS, returning only the URL with the lowest score? That is, from the example set above, I want to get the following set, where each URL appears only once, with the lowest corresponding score from the original set:
{(0.5, 'http://www.foo.com'),(0.33, 'http://www.bar.com')}
I came up with the following solution:
from collections import defaultdict
seen = defaultdict(lambda:1)
for score, url in s:
if score < seen[url]:
seen[url] = score
filtered = {(v,k) for k,v in seen.items()}
... but I feel like there is probably some simpler and more efficient way to do this without using the intermediary dict to keep track of the max element, and then regenerate the set from that. What is the best way to filter a set of tuples by the min/max of the first element?
A very simple approach:
Another solution:
You've already implemented the simplest approach I can think of. The only change I'd make would be to the loop—a slightly more concise version is using
min
.If you really want a shorter solution, like I said, it isn't the simplest approach, but it is a one liner. Most of the challenge is interchanging the URL and the score so you can use the URL as a key when dropping duplicates. It goes without saying that sorting is a pre-condition here (that's why I don't like this solution as much as the one above).
This solution becomes so much shorter if
s
looks like this:You'd only then need to do
Without any tricks or additional code for reuse you are pretty close. I came up with something similar that's a little cleaner in my opinion:
You can also utilize other libraries, such as boltons. You can use the unique method like so:
Update: if the tuples have all relevant scoring as the first elements, this solution would work for arbitrary length of tuples (assuming you change the key function)