Question:
I have a string (for example: "alpha beta charlie, delta&epsilon foxtrot") and a list (for example ["zero","omega virginia","apple beta charlie"]). Is there a convenient way to iterate through every word and combination of words in the string in order to search for it in the list?
Answer 1:
Purpose
You say combinations, but combinations are semantically unordered. What you mean is that you intend to find the intersection between a target list and all ordered permutations of the words, joined by spaces.
To begin with, we need to import the libraries we intend to use.
import re
import itertools
Splitting the string
Don't split on characters; you're doing a semantic search for words, exclusive of strange characters.
Regular expressions, powered by the re module, are perfect for this. In a raw Python string, r'', we use the regular expression for the edge of a word, \b, around one or more (+) alphanumeric characters or underscores (\w). re.findall returns a list of every match.
re_pattern = r'\b\w+\b'
silly_string = 'alpha beta charlie, delta&epsilon foxtrot'
words = re.findall(re_pattern, silly_string)
Here, words is our wordlist:
>>> print(words)
['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
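To make the behaviour of that pattern more concrete, here is a small sketch that runs it against a few extra inputs. The sample strings and variable names are made up for illustration and are not from the question; the point is only to show what \w does and does not treat as part of a word.

```python
import re

re_pattern = r'\b\w+\b'

# Punctuation and symbols act as separators between words.
symbols = re.findall(re_pattern, 'delta&epsilon, foxtrot')
print(symbols)      # ['delta', 'epsilon', 'foxtrot']

# Underscores and digits count as word characters, so these stay whole.
identifiers = re.findall(re_pattern, 'snake_case v2')
print(identifiers)  # ['snake_case', 'v2']

# Apostrophes and hyphens are not word characters, so they split a word.
contractions = re.findall(re_pattern, "don't re-run")
print(contractions) # ['don', 't', 're', 'run']
```

If contractions or hyphenated terms matter for your data, the pattern would need adjusting; for the example string in the question it is sufficient.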
Creating the Permutations
Continuing, we prefer to manipulate our data with generators to avoid unnecessarily materializing data before we need it and holding large datasets in memory. The itertools library has some nice functions that neatly suit our needs for providing all permutations of the above words and chaining them in a single iterable:
_gen = (itertools.permutations(words, i + 1) for i in range(len(words)))
all_permutations_gen = itertools.chain.from_iterable(_gen)
Listing all_permutations_gen with list(all_permutations_gen) would give us:
[('alpha',), ('beta',), ('charlie',), ('delta',), ('epsilon',), ('foxtrot',), ('alpha', 'beta'), ('alpha', 'charlie'), ('alpha', 'delta'), ('alpha', 'epsilon'), ('alpha', 'foxtrot'), ('beta', 'alpha'), ('beta', 'charlie'), ('beta', 'delta'), ('beta', 'epsilon'), ('beta', 'foxtrot'), ('charlie', 'alpha'), ('charlie', 'beta'), ('charlie', 'delta'), ('charlie', 'epsilon'), ('charlie', 'foxtrot'), ('delta', 'alpha'), ('delta', 'beta'), ('delta', 'charlie'), ('delta', 'epsilon'), ('delta', 'foxtrot'), ('epsilon', 'alpha'), ('epsilon', 'beta'), ('epsilon', 'charlie'), ('epsilon', 'delta'), ('epsilon', 'foxtrot'), ('foxtrot', 'alpha'), ('foxtrot', 'beta'), ('foxtrot', 'charlie'), ('foxtrot', 'delta'), ('foxtrot', 'epsilon'), ('alpha', 'beta', 'charlie'), ('alpha', 'beta', 'delta'), ...
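The listing is truncated because the number of permutations grows very quickly. As a rough check, the sketch below counts them for the six example words in two ways: with the standard formula n! / (n - k)! summed over every length k, and by walking the permutations without storing them. The variable names are only illustrative.

```python
import itertools
import math

words = ['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
n = len(words)

# The number of ordered selections of k items out of n is n! / (n - k)!.
by_formula = sum(
    math.factorial(n) // math.factorial(n - k) for k in range(1, n + 1)
)

# Cross-check by iterating over the permutations without keeping them.
by_counting = sum(
    1
    for k in range(1, n + 1)
    for _ in itertools.permutations(words, k)
)

print(by_formula, by_counting)  # 1956 1956
```

For ten words the same formula gives 9,864,100 tuples, so this approach is only practical for short strings.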
If we joined the permutations with spaces and materialized them in a list rather than a set, printing the first 20 items would show us:
>>> print(all_permutations[:20])  # this only works if all_permutations is a list, not a set
['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot', 'alpha beta', 'alpha charlie', 'alpha delta', 'alpha epsilon', 'alpha foxtrot', 'beta alpha', 'beta charlie', 'beta delta', 'beta epsilon', 'beta foxtrot', 'charlie alpha', 'charlie beta', 'charlie delta', 'charlie epsilon']
But that would exhaust the generator before we're ready. So instead, we build the set of all space-joined permutations of those words:
all_permutations = set(' '.join(i) for i in all_permutations_gen)
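The remark about exhausting the generator can be illustrated with a smaller word list. This is a minimal sketch with made-up names: it consumes the same chained generator twice and shows that the second pass yields nothing.

```python
import itertools

short_words = ['alpha', 'beta', 'charlie']

# Same construction as above, on three words to keep the output small.
perm_gen = itertools.chain.from_iterable(
    itertools.permutations(short_words, k)
    for k in range(1, len(short_words) + 1)
)

first_pass = [' '.join(p) for p in perm_gen]
second_pass = [' '.join(p) for p in perm_gen]  # perm_gen is already exhausted

print(len(first_pass), len(second_pass))  # 15 0
```

This is why the generator should be consumed exactly once, directly into the container you intend to keep.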
Checking for Membership of any Permutations in Target List
With this set, we can now search for an intersection with the target list:
>>> target_list = ["zero","omega virginia","apple beta charlie"]
>>> all_permutations.intersection(target_list)
set([])
For the examples given, we get the empty set. But if the target list contains a string that is in our set of permutations, the intersection returns it:
>>> target_list_2 = ["apple beta charlie", "foxtrot alpha beta charlie"]
>>> all_permutations.intersection(target_list_2)
set(['foxtrot alpha beta charlie'])
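Because the set of permutations grows factorially, it may be worth checking each target directly instead. The sketch below is an alternative technique, not the approach used above: it assumes that a target matches exactly when splitting it on single spaces yields only words from the string, each used no more often than it occurs there. The function name matches_in_targets is hypothetical.

```python
import re
from collections import Counter


def matches_in_targets(text, targets):
    """Return the targets equal to some space-joined permutation of words in text."""
    available = Counter(re.findall(r'\b\w+\b', text))
    found = set()
    for target in targets:
        # split(' ') rather than split(): permutations are joined by
        # exactly one space, so doubled spaces must not match.
        needed = Counter(target.split(' '))
        # Each word of the target must occur in text at least as often
        # as the target uses it; a missing word counts as zero.
        if all(available[word] >= count for word, count in needed.items()):
            found.add(target)
    return found


silly_string = 'alpha beta charlie, delta&epsilon foxtrot'
result_1 = matches_in_targets(
    silly_string, ["zero", "omega virginia", "apple beta charlie"])
result_2 = matches_in_targets(
    silly_string, ["apple beta charlie", "foxtrot alpha beta charlie"])

print(result_1)  # set()
print(result_2)  # {'foxtrot alpha beta charlie'}
```

On the two example target lists this gives the same answers as the intersection, while doing work proportional to the size of the target list rather than to the number of permutations.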