This is how I build a nested dictionary:
dicty = dict()
tmp = dict()
tmp["a"] = 1
tmp["b"] = 2
dicty["A"] = tmp
dicty == {"A" : {"a" : 1, "b" : 2}}
The problem starts when I try to do this for a big file that I read line by line. Printing each line's contents as a list gives:
['proA', 'macbook', '0.666667']
['proA', 'smart', '0.666667']
['proA', 'ssd', '0.666667']
['FrontPage', 'frontpage', '0.710145']
['FrontPage', 'troubleshooting', '0.971014']
I would like to end up with a nested dictionary (ignore decimals):
{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}
As I am reading the file line by line, I have to check whether the first word is still the same as on the previous line (the lines are all grouped by first word) before I add the finished inner dict to the outer dict.
This is my implementation:
def doubleDict(filename):
    dicty = dict()
    with open(filename, "r") as f:
        row = 0
        tmp = dict()
        oldword = ""
        for line in f:
            values = line.rstrip().split(" ")
            print(values)
            if oldword == values[0]:
                tmp[values[1]] = values[2]
            else:
                if oldword is not "":
                    dicty[oldword] = tmp
                    tmp.clear()
                oldword = values[0]
                tmp[values[1]] = values[2]
            row += 1
            if row % 25 == 0:
                print(dicty)
                break #print(row)
    return(dicty)
I would actually like to have this in pandas, but for now I would be happy if this would work as a dict. For some reason after reading in just the first 5 lines, I end up with:
{'proA': {'frontpage': '0.710145', 'troubleshooting': '0.971014'}},
which is clearly incorrect. What is wrong?
While this isn't the ideal way to do things, you're pretty close to making it work.

Your main problem is that you're reusing the same `tmp` dictionary. After you insert it into `dicty` under the first key, you then `clear` it and start filling it with the new values. Because `dicty` holds a reference to that same object, clearing it also empties the entry you just stored. Replace `tmp.clear()` with `tmp = {}` to fix that, so you have a different dictionary for each key, instead of the same one for all keys.

Your second problem is that you're never storing the last `tmp` value in the dictionary when you reach the end, so add another `dicty[oldword] = tmp` after the `for` loop.

Your third problem is that you're checking `if oldword is not "":`. That may be true even if it's an empty string, because you're comparing identity, not equality. Just change that to `if oldword:`. (This one, you'll usually get away with, because small strings are usually interned and will usually share identity… but you shouldn't count on that.)

If you fix all three of those, you get this for the five sample lines:

{'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'},
 'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'}}
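As a rough sketch, the question's function with those three fixes applied might look like the following. To keep it runnable on its own, it takes any iterable of lines instead of a filename and drops the debug printing; the name `double_dict_fixed`, the `sample` list, and the `if oldword:` guard after the loop (which only matters for empty input) are assumptions added here, not part of the original code.

```python
def double_dict_fixed(lines):
    """Build {first: {second: third}} from lines grouped by their first field."""
    dicty = {}
    tmp = {}
    oldword = ""
    for line in lines:
        values = line.rstrip().split(" ")
        if oldword == values[0]:
            tmp[values[1]] = values[2]
        else:
            if oldword:               # fix 3: truthiness test, not `is not ""`
                dicty[oldword] = tmp
                tmp = {}              # fix 1: fresh dict instead of tmp.clear()
            oldword = values[0]
            tmp[values[1]] = values[2]
    if oldword:                       # fix 2: store the last group after the loop
        dicty[oldword] = tmp
    return dicty


# Hypothetical sample input, mirroring the lines shown in the question.
sample = [
    "proA macbook 0.666667\n",
    "proA smart 0.666667\n",
    "proA ssd 0.666667\n",
    "FrontPage frontpage 0.710145\n",
    "FrontPage troubleshooting 0.971014\n",
]
result = double_dict_fixed(sample)
print(result)
```

Passing an open file object works the same way, since iterating over a file yields its lines.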
I'm not sure how to turn this into the format you claim to want, because that format isn't even a valid dictionary. But hopefully this gets you close.
There are two simpler ways to do it.

The first is to group the lines with `itertools.groupby`, then transform each group into a dict and insert it all in one step. This, like your existing code, requires that the input already be batched by `values[0]`.

The second is to add each line straight to the nested dictionary for its key as you read it. `defaultdict` or the `setdefault` method will make this concise, but even if you don't know about those, it's pretty simple to write it out explicitly, and it'll still be less verbose than what you have now.

The second version is already explained very nicely in Martijn Pieters's answer.
The first can be written like this:
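A minimal sketch of the `groupby` approach, assuming every non-blank line has exactly three whitespace-separated fields and that lines sharing a first field are adjacent. The function name `double_dict_grouped` and the `sample` list are made up for illustration.

```python
from itertools import groupby


def double_dict_grouped(lines):
    """Build {first: {second: third}} with one inner dict per run of equal keys."""
    # Split each non-blank line into its fields: [first, second, third].
    rows = (line.split() for line in lines if line.strip())
    # groupby only merges *adjacent* rows with the same key, so the input
    # must already be batched by the first field.
    return {
        key: {inner: value for _, inner, value in group}
        for key, group in groupby(rows, key=lambda row: row[0])
    }


# Hypothetical sample input, mirroring the lines shown in the question.
sample = [
    "proA macbook 0.666667\n",
    "proA smart 0.666667\n",
    "proA ssd 0.666667\n",
    "FrontPage frontpage 0.710145\n",
    "FrontPage troubleshooting 0.971014\n",
]
grouped = double_dict_grouped(sample)
print(grouped)
```

If the same first field shows up again later in the input, the later run replaces the earlier one, which is why the grouping requirement matters.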
Of course that doesn't print out the dict so far after every 25 rows, but that's easy to add by turning the comprehension into an explicit loop (and ideally using `enumerate` instead of keeping an explicit `row` counter).

Use a `collections.defaultdict()` object to auto-instantiate nested dictionaries, and `enumerate()` to generate the line count; that is much simpler than keeping a separate counter going.

Even without a `defaultdict`, you can let the outer dictionary keep the reference to the nested dictionary, and retrieve it again by using `values[0]`; there is no need to keep the `tmp` reference around.

All the `defaultdict` then does is keep us from having to test whether we already created that nested dictionary. Instead of checking for the key and creating an empty dictionary when it is missing, we simply omit the `if` test, as `defaultdict` will create a new dictionary for us if the key was not yet present.
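To make the comparison concrete, here is a sketch of both variants side by side, assuming every line has exactly three whitespace-separated fields as in the question. The function names, the `sample` list, and the optional `report_every` parameter (a stand-in for the question's "print every 25 rows" check) are made up for illustration and are not taken from the original answers.

```python
from collections import defaultdict


def double_dict_default(lines, report_every=None):
    """defaultdict version: missing outer keys get an empty dict automatically."""
    dicty = defaultdict(dict)
    # enumerate() supplies the row count, starting at 1.
    for row, line in enumerate(lines, 1):
        values = line.split()
        dicty[values[0]][values[1]] = values[2]
        if report_every and row % report_every == 0:
            print(dict(dicty))  # progress report, as in the question
    return dicty


def double_dict_plain(lines):
    """Plain dict version: the explicit test that defaultdict lets us omit."""
    dicty = {}
    for line in lines:
        values = line.split()
        if values[0] not in dicty:
            dicty[values[0]] = {}
        dicty[values[0]][values[1]] = values[2]
    return dicty


# Hypothetical sample input, mirroring the lines shown in the question.
sample = [
    "proA macbook 0.666667\n",
    "proA smart 0.666667\n",
    "proA ssd 0.666667\n",
    "FrontPage frontpage 0.710145\n",
    "FrontPage troubleshooting 0.971014\n",
]
from_default = double_dict_default(sample)
from_plain = double_dict_plain(sample)
```

Neither variant depends on the lines being grouped by their first field, because each line is looked up by key rather than compared with the previous line.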