I'm trying to write a function that checks whether a given phrase matches at least one item from a list of phrases and returns the matching items. The inputs are the phrase, a list of phrases, and a dictionary that maps words to lists of synonyms. The point is to make it general.
Here is the example:
phrase = 'This is a little house'
dictSyns = {'little': ['small', 'tiny', 'little'],
            'house': ['cottage', 'house']}
listPhrases = ['This is a tiny house','This is a small cottage','This is a small building','I need advice']
I can write code that handles this particular example and returns a bool:
if any('This'+' '+'is'+' '+'a'+' '+x+' '+y == phrase for x in dictSyns['little'] for y in dictSyns['house']):
    print('match')
The first point is that the function has to be general, i.e. not hard-coded to this example. The second is that I want the function to return the list of matched phrases.
Can you give me some advice on how to do that, so that the function returns ['This is a tiny house','This is a small cottage'] in this case?
The output would be like:
>>> getMatches(phrase, dictSyns, listPhrases)
['This is a tiny house','This is a small cottage']
I would approach this as follows:
import itertools

def new_phrases(phrase, syns):
    """Generate new phrases from a base phrase and synonyms."""
    words = [syns.get(word, [word]) for word in phrase.split(' ')]
    for t in itertools.product(*words):
        yield ' '.join(t)

def get_matches(phrase, syns, phrases):
    """Generate acceptable new phrases based on a whitelist."""
    phrases = set(phrases)
    for new_phrase in new_phrases(phrase, syns):
        if new_phrase in phrases:
            yield new_phrase
The root of the code is the assignment of words, in new_phrases, which transforms the phrase and syns into a more usable form, a list where each element is a list of the acceptable choices for that word:
>>> [syns.get(word, [word]) for word in phrase.split(' ')]
[['This'], ['is'], ['a'], ['small', 'tiny', 'little'], ['cottage', 'house']]
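To make the next step concrete, here is a small sketch of how that list of choices expands into candidate phrases, using the example data from the question. The names choices and candidates are only illustrative and do not appear in the code above.

```python
import itertools

phrase = 'This is a little house'
syns = {'little': ['small', 'tiny', 'little'],
        'house': ['cottage', 'house']}

# One list of acceptable words per position; words without synonyms
# map to a single-element list containing themselves.
choices = [syns.get(word, [word]) for word in phrase.split(' ')]

# itertools.product picks one word from each position, keeping word order.
candidates = [' '.join(t) for t in itertools.product(*choices)]

print(len(candidates))  # 1 * 1 * 1 * 3 * 2 = 6 combinations
print(candidates[0])    # This is a small cottage
```

The number of candidates is the product of the list lengths, so it grows quickly once several words have many synonyms.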
Note the following:
- Use of generators to deal more efficiently with large numbers of combinations (not building the whole list at once);
- Use of a set for efficient (O(1), vs. O(n) for a list) membership testing;
- Use of itertools.product to generate the possible combinations of phrase based on the syns (you could also use itertools.ifilter in implementing this); and
- Style guide compliance.
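As a rough sketch of the filter-based variant mentioned in the list, assuming Python 3, where the built-in filter is lazy and takes the place of itertools.ifilter. The name get_matches_filter is hypothetical, and new_phrases is repeated only so that the snippet runs on its own.

```python
import itertools


def new_phrases(phrase, syns):
    """Same generator as above, repeated so this snippet is self-contained."""
    words = [syns.get(word, [word]) for word in phrase.split(' ')]
    for t in itertools.product(*words):
        yield ' '.join(t)


def get_matches_filter(phrase, syns, phrases):
    """Hypothetical variant of get_matches built on filter."""
    allowed = set(phrases)  # set gives O(1) membership tests
    # In Python 3 the built-in filter returns a lazy iterator.
    return filter(allowed.__contains__, new_phrases(phrase, syns))


phrase = 'This is a little house'
syns = {'little': ['small', 'tiny', 'little'],
        'house': ['cottage', 'house']}
phrases = ['This is a tiny house', 'This is a small cottage',
           'This is a small building', 'I need advice']

matches = get_matches_filter(phrase, syns, phrases)
first = next(matches)  # candidates are generated only up to the first hit
rest = list(matches)   # consumes the remaining combinations
```

Because nothing is built up front, asking for only the first match stops the combination generator early, which is the benefit the generator point above refers to.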
In use:
>>> list(get_matches(phrase, syns, phrases))
['This is a small cottage', 'This is a tiny house']
Things to think about:
- What about the case of characters (e.g. how should "House of Commons" be treated)?
- What about punctuation?
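One possible answer to both questions, offered only as a sketch: normalise every phrase before comparing, by lower-casing it and stripping ASCII punctuation. The helper name normalise is hypothetical, and whether this treatment is right depends on your data, since it makes "House of Commons" and "house of commons" identical.

```python
import string


def normalise(text):
    """Lower-case text, drop ASCII punctuation, collapse whitespace.

    Hypothetical helper; this is an assumption about what "the same
    phrase" should mean, not part of the code above.
    """
    kept = ''.join(ch for ch in text.lower() if ch not in string.punctuation)
    return ' '.join(kept.split())


print(normalise('This is a TINY house!'))  # this is a tiny house
print(normalise('House of Commons'))       # house of commons
```

If you go this way, the keys and values of the synonym dictionary would need the same normalisation, otherwise lookups such as syns.get(word, [word]) will miss.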
I went about it this way:
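# Start from the words of the original phrase, then add every synonym.
acceptable = phrase.split()
for value in dictSyns:
    acceptable = acceptable + dictSyns[value]
for each_phrase in listPhrases:
    # Print the phrase only if none of its words is unacceptable.
    if all(word in acceptable for word in each_phrase.split()):
        print(each_phrase)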
Probably not hugely efficient. It creates a list of acceptable words. It then compares each word in each string to that list and if there are no unacceptable words it prints the phrase.
EDIT: I've also realised this doesn't check for grammatical sense. For example the phrase 'little little this a' would still return as correct. It's simply checking for each word. I'll leave this here to display my shame.
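One way to address the word-order problem described in the edit, sketched under the assumption that a match must have exactly as many words as the original phrase: check each word of a candidate only against the choices allowed at that position. The name matches_by_position is hypothetical.

```python
def matches_by_position(phrase, syns, phrases):
    """Return the phrases whose i-th word is acceptable at position i.

    Hypothetical sketch; assumes one-to-one word positions, so synonyms
    made of several words are not handled.
    """
    # Acceptable words for each position of the original phrase.
    choices = [set(syns.get(word, [word])) for word in phrase.split()]
    result = []
    for candidate in phrases:
        words = candidate.split()
        if len(words) == len(choices) and all(
                w in allowed for w, allowed in zip(words, choices)):
            result.append(candidate)
    return result


phrase = 'This is a little house'
dictSyns = {'little': ['small', 'tiny', 'little'],
            'house': ['cottage', 'house']}
listPhrases = ['This is a tiny house', 'This is a small cottage',
               'This is a small building', 'I need advice']

print(matches_by_position(phrase, dictSyns, listPhrases))
# ['This is a tiny house', 'This is a small cottage']
```

This walks the list of phrases once instead of generating every combination, and a phrase such as 'little little this a' is rejected because its words are in the wrong positions.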