Efficiently identifying whether part of string is

Posted 2019-06-24 14:47

I have a lot of (>100,000) lowercase strings in a list, where a subset might look like this:

str_list = ["hello i am from denmark", "that was in the united states", "nothing here"]

I also have a dict like this (in reality it will have around 1000 entries):

dict_x = {"denmark" : "dk", "germany" : "ger", "norway" : "no", "united states" : "us"}

For all strings in the list which contain any of the dict's keys, I want to replace the entire string with the corresponding dict value. The expected result should thus be:

str_list = ["dk", "us", "nothing here"]

What is the most efficient way to do this given the number of strings I have and the length of the dict?

Extra info: There is never more than one dict key in a string.

5 answers
ら.Afraid
#2 · 2019-06-24 15:06

Something like this would work. Note that each string is replaced by the value of the first key found in it, in the dict's iteration order. If a string can contain several keys, you may want to adjust the logic to fit your use case.

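strings = str_list  # the list from the question
converted = []
for string in strings:
    # Keep the original string unless one of the keys is found in it.
    updated_string = string
    for key, value in dict_x.items():
        if key in string:
            updated_string = value
            break  # stop at the first matching key
    converted.append(updated_string)
print(converted)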
淡お忘
#3 · 2019-06-24 15:17

You can subclass dict and use a list comprehension.

In terms of performance, I advise you to try a few different methods and measure which one works best on your data.

class dict_contains(dict):
    def __getitem__(self, value):
        key = next((k for k in self.keys() if k in value), None)
        return self.get(key)

str1 = "hello i am from denmark"
str2 = "that was in the united states"
str3 = "nothing here"

lst = [str1, str2, str3]

dict_x = dict_contains({"denmark" : "dk", "germany" : "ger", "norway" : "no", "united states" : "us"})

res = [dict_x[i] or i for i in lst]

# ['dk', 'us', 'nothing here']
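
To compare methods as suggested above, something like the following could be used. It is only a rough sketch: the sample list is a small hypothetical stand-in for the real data, the function names `with_loop` and `with_comprehension` are made up for illustration, and the timings will depend on your machine and on the real strings and keys.

```python
import timeit

dict_x = {"denmark": "dk", "germany": "ger", "norway": "no", "united states": "us"}

# Hypothetical workload: the three example strings repeated many times.
sample = ["hello i am from denmark",
          "that was in the united states",
          "nothing here"] * 1000


def with_loop(strings, mapping):
    """Explicit nested loop with for/else."""
    out = []
    for s in strings:
        for key, value in mapping.items():
            if key in s:
                out.append(value)
                break
        else:
            out.append(s)
    return out


def with_comprehension(strings, mapping):
    """List comprehension using next() and dict.get() defaults."""
    return [mapping.get(next((k for k in mapping if k in s), None), s)
            for s in strings]


# Both variants should agree before their timings are compared.
assert with_loop(sample, dict_x) == with_comprehension(sample, dict_x)

for fn in (with_loop, with_comprehension):
    seconds = timeit.timeit(lambda: fn(sample, dict_x), number=20)
    print(f"{fn.__name__}: {seconds:.3f} s for 20 runs")
```

Run it on a realistic slice of your own list and dict before drawing any conclusions.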
疯言疯语
#4 · 2019-06-24 15:18

Assuming:

lst = ["hello i am from denmark", "that was in the united states", "nothing here"]
dict_x = {"denmark" : "dk", "germany" : "ger", "norway" : "no", "united states" : "us"}

You can do:

res = [dict_x.get(next((k for k in dict_x if k in my_str), None), my_str) for my_str in lst]

which returns:

print(res)  # -> ['dk', 'us', 'nothing here']

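The cool thing about this (apart from it being a list comprehension, a Python ninja's favorite weapon) is the combination of two defaults: next is given a default of None, which it returns instead of raising StopIteration when no key matches, and get is given a default of my_str, which it returns because None is not a key in the dict.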
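
The two defaults can be seen separately in a small sketch. The helper names `first_matching_key` and `replace_if_match` are made up here for illustration and are not part of the answer above.

```python
dict_x = {"denmark": "dk", "germany": "ger", "norway": "no", "united states": "us"}


def first_matching_key(text, mapping):
    # next() returns its second argument (None here) when the generator
    # is exhausted, instead of raising StopIteration.
    return next((k for k in mapping if k in text), None)


def replace_if_match(text, mapping):
    # None is not a key of the mapping, so get() falls back to its
    # default, which is the original string.
    return mapping.get(first_matching_key(text, mapping), text)


print(first_matching_key("hello i am from denmark", dict_x))  # denmark
print(first_matching_key("nothing here", dict_x))             # None
print(replace_if_match("hello i am from denmark", dict_x))    # dk
print(replace_if_match("nothing here", dict_x))               # nothing here
```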

萌系小妹纸
#5 · 2019-06-24 15:19

This seems to be a good way:

input_strings = ["hello i am from denmark",
                 "that was in the united states",
                 "nothing here"]
dict_x = {"denmark" : "dk", "germany" : "ger", "norway" : "no", "united states" : "us"}

output_strings = []

for string in input_strings:
    for key, value in dict_x.items():
        if key in string:
            output_strings.append(value)
            break
    else:
        output_strings.append(string)
print(output_strings)
成全新的幸福
#6 · 2019-06-24 15:20

Try

str_list = ["hello i am from denmark", "that was in the united states", "nothing here"]

dict_x = {"denmark" : "dk", "germany" : "ger", "norway" : "no", "united states" : "us"}

for k, v in dict_x.items():
    for i in range(len(str_list)):
        if k in str_list[i]:
            str_list[i] = v

print(str_list)

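This iterates over the key/value pairs in your dictionary and, for each pair, checks every string to see whether it contains the key. If it does, the string is replaced in place with the value. Note that a replaced entry is still tested against the remaining keys, so a value that happens to contain another key would be replaced again.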
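
The loops in this thread test every key against every string. One alternative that none of the answers here use is to compile all keys into a single regular expression and scan each string once. The following is a minimal sketch, assuming the keys are plain literal strings, the dict is non-empty, and (as stated in the question) a string never contains more than one key. The function name `replace_matches` is hypothetical, and whether this is actually faster for your data is something to measure.

```python
import re

str_list = ["hello i am from denmark", "that was in the united states", "nothing here"]
dict_x = {"denmark": "dk", "germany": "ger", "norway": "no", "united states": "us"}

# One alternation of all keys. re.escape keeps special characters literal;
# sorting longest-first lets a longer key win over a shorter key that is
# its prefix. Assumes dict_x is non-empty.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(dict_x, key=len, reverse=True))
)


def replace_matches(strings, mapping, compiled):
    result = []
    for s in strings:
        match = compiled.search(s)
        # The matched text is exactly one of the keys, so it can be
        # looked up directly; strings without a match are kept as-is.
        result.append(mapping[match.group(0)] if match else s)
    return result


print(replace_matches(str_list, dict_x, pattern))  # ['dk', 'us', 'nothing here']
```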
