I'm reading in a long list of words, and I made a node for every word in the list. Each node has an attribute 'word' for their position in the list.
I am trying to connect a node to the next node if the next node is the previous node, with an addition of just one letter
I also alphabetically ordered each word per character, so that CAT -> ACT
I want to draw an edge from each unique starting word, to all of the possible chains, so I can see all the possible chains in the list.
For example
A -> AN -> TAN -> RANT
However A --x-> T
This is my attempt
for i in range(0, G.number_of_nodes()-1):
if ( ( (len(G.node[i]['word'])+1) == len(G.node[i+1]['word']) ) and (G.node[i]['word'] in G.node[i+1]['word'])):
print G.node[i]['word'], G.node[i+1]['word']
Gave me this,
DGO DGOS
DGOS DGOSS
I IN
ELLMS ELLMSS
AEPRS AEPRSS
INW DINW
DINW DINWY
What the word list and the alphabetical list looks like
Why do I not see IN INW?
Also, AGNRT AGNRST should be on there but I don't understand why, along with a lot of other pairs
Where do you think I went wrong?
The problem is that you are only comparing words that appear right next to each other in the list, i.e. words i
and i+1
, e.g. I
and IN
appear next to each other, as do WIN
and WIND
, but IN
and WIND
are far apart. It seems you want to compare all possible words, which requires a more sophisticated algorithm. Here's an idea:
- Make a dictionary where they keys are sorted words and the values are lists of actual words, e.g.
{"ACT": ["CAT", "ACT", "TAC], ...}
. A collections.defaultdict(list)
will be useful for this.
- Sort the full input list of words by length. You can use
list.sort(key=len)
assuming you have just a list of words.
- Iterate through the list sorted by length. For each word, go through every subset of length
n-1
. Something like for i in range(len(word)): process(word[:i] + word[i+1:])
. You may want to be careful about duplicates here.
- For each subset, sort the subset and look it up in the dictionary. Make a link from every word in the dictionary's value (a list of actual words) to the bigger word.
Looks like a formal languages problem. How do you handle looping nodes?
IN INW is in the list you gave.
AGNRT AGNRST are not in the list, because you started out with a single letter, that letter has to be in the next word for example I -> IN, but IN is not in AGNRT or AGNRST
You seem to be comparing each node with just one other node, so
"IN" directly follows "I" in your wordlist, but "INW" is not directly after "IN"
You can use the 3rd party python library, python-levenshtein
, to calculate the Levenshtein Distance which is the string edit distance. In your case, the only allowed 'edit' is the 'insertion' of the character on the next string/word on your list, so you will also need to verify that the length of the next word is 1 plus the previous word.
Here is the sample code that would achieve our stuff:
import Levenshtein as lvst
if len(word2) - len(word1) == 1 and lvst.distance(word1, word2) == 1:
print(word1, word2)
You can install python-levenshtein
by either apt-get
(systemwide) or pip
:
sudo apt-get install python-levenshtein
or
sudo apt-get install python3-levenshtein
or
pip install python-levenshtein