How to filter a set of (int, str) tuples, to retur

2020-03-26 04:37发布

Suppose I have a set of tuples representing URLS with "scores":

{(0.75, 'http://www.foo.com'), (0.33, 'http://www.bar.com'), (0.5, 'http://www.foo.com'), (0.66, 'http://www.bar.com')}.

What is a concise way for me to filter out duplicate URLS, returning only the URL with the lowest score? That is, from the example set above, I want to get the following set, where each URL appears only once, with the lowest corresponding score from the original set:

{(0.5, 'http://www.foo.com'),(0.33, 'http://www.bar.com')}

I came up with the following solution:

from collections import defaultdict

seen = defaultdict(lambda:1)
for score, url in s:
    if score < seen[url]:
        seen[url] = score

filtered = {(v,k) for k,v in seen.items()}

... but I feel like there is probably some simpler and more efficient way to do this without using the intermediary dict to keep track of the max element, and then regenerate the set from that. What is the best way to filter a set of tuples by the min/max of the first element?

4条回答
姐就是有狂的资本
2楼-- · 2020-03-26 05:15

A very simple approach:

L=sorted(s,key=lambda t: (t[1],t[0]))
[L[0]] + [L[i] for i in range(1,len(L)) if L[i][1]!=L[i-1][1]]
查看更多
Deceive 欺骗
3楼-- · 2020-03-26 05:20

Another solution:

seen = {}
for score, url in s:
    if seen.setdefault(url, score) > score:
        seen[url] = score
filtered = {(v,k) for k,v in seen.items()}
print(filtered)
查看更多
Deceive 欺骗
4楼-- · 2020-03-26 05:21

You've already implemented the simplest approach I can think of. The only change I'd make would be to the loop—a slightly more concise version is using min.

seen = defaultdict(lambda: 1)  # `lambda: float('inf')` if scores can be > 1
for score, url in s:
    seen[url] = min(seen[url], score)

{(v,k) for k,v in seen.items()}
# {(0.33, 'http://www.bar.com'), (0.5, 'http://www.foo.com')}

If you really want a shorter solution, like I said, it isn't the simplest approach, but it is a one liner. Most of the challenge is interchanging the URL and the score so you can use the URL as a key when dropping duplicates. It goes without saying that sorting is a pre-condition here (that's why I don't like this solution as much as the one above).

{(v, k) for k, v in dict(sorted(((v, k) for k, v in s), reverse=True)).items()}
# {(0.33, 'http://www.bar.com'), (0.5, 'http://www.foo.com')}

This solution becomes so much shorter if s looks like this:

s2 = {(v,k) for k, v in s}
s2 
# {('http://www.bar.com', 0.33), ('http://www.bar.com', 0.66), ...}

You'd only then need to do

list(dict(sorted(s2, reverse=True)).items())
# [('http://www.foo.com', 0.5), ('http://www.bar.com', 0.33)]
查看更多
仙女界的扛把子
5楼-- · 2020-03-26 05:24

Without any tricks or additional code for reuse you are pretty close. I came up with something similar that's a little cleaner in my opinion:

seen = set()
filtered = []
for score, url in sorted(urls):
    if url in seen:
        continue
    filtered.append((score, url))
    seen.add(url)

You can also utilize other libraries, such as boltons. You can use the unique method like so:

import operator
from boltons.iterutils import unique
filtered = unique(sorted(urls), key=operator.itemgetter(1))

Update: if the tuples have all relevant scoring as the first elements, this solution would work for arbitrary length of tuples (assuming you change the key function)

查看更多
登录 后发表回答