Building Nested dictionary in Python reading in li

Posted 2020-03-30 08:00

Question:

The way I go about nested dictionary is this:

dicty = dict()
tmp = dict()
tmp["a"] = 1
tmp["b"] = 2
dicty["A"] = tmp

dicty == {"A" : {"a" : 1, "b" : 2}}

The problem starts when I try to implement this on a big file, reading it in line by line. Printing the contents of each line as a list gives:

['proA', 'macbook', '0.666667']
['proA', 'smart', '0.666667']
['proA', 'ssd', '0.666667']
['FrontPage', 'frontpage', '0.710145']
['FrontPage', 'troubleshooting', '0.971014']

I would like to end up with a nested dictionary (ignore decimals):

{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
 'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}

As I am reading the file line by line, I have to check whether the first word is still the same as on the previous line (the lines are all grouped by first word) before I add the finished inner dict to the outer dict.

This is my implementation:

def doubleDict(filename):
    dicty = dict()
    with open(filename, "r") as f:
        row = 0
        tmp = dict()
        oldword = ""
        for line in f:
            values = line.rstrip().split(" ")
            print(values)
            if oldword == values[0]:
                tmp[values[1]] = values[2]
            else:
                if oldword is not "":
                    dicty[oldword] = tmp
                tmp.clear()
                oldword = values[0]
                tmp[values[1]] = values[2]
            row += 1
            if row % 25 == 0:
                print(dicty)
                break #print(row)
    return(dicty)

I would actually like to have this in pandas, but for now I would be happy if this would work as a dict. For some reason after reading in just the first 5 lines, I end up with:

{'proA': {'frontpage': '0.710145', 'troubleshooting': '0.971014'}},

which is clearly incorrect. What is wrong?

Answer 1:

Use a collections.defaultdict() object to auto-instantiate nested dictionaries:

from collections import defaultdict

def doubleDict(filename):
    dicty = defaultdict(dict)
    with open(filename, "r") as f:
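        for i, line in enumerate(f, 1):
            outer, inner, value = line.split()
            dicty[outer][inner] = value
            if i % 25 == 0:
                # Debugging aid: show progress after 25 lines, then stop.
                print(dicty)
                break
    return dicty

I used enumerate() to generate the line count here, starting at 1 so that the check fires after the 25th line rather than on the very first one; this is much simpler than keeping a separate counter going.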

Even without a defaultdict, you can let the outer dictionary keep the reference to the nested dictionary, and retrieve it again by using values[0]; there is no need to keep the temp reference around:

>>> dicty = {}
>>> dicty['A'] = {}
>>> dicty['A']['a'] = 1
>>> dicty['A']['b'] = 2
>>> dicty
{'A': {'a': 1, 'b': 2}}

All the defaultdict then does is keep us from having to test if we already created that nested dictionary. Instead of:

if outer not in dicty:
    dicty[outer] = {}
dicty[outer][inner] = value

we simply omit the if test as defaultdict will create a new dictionary for us if the key was not yet present.
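As a small illustration of that behaviour (a sketch using two rows from the question's sample data, not part of the original answer), the snippet below shows the inner dictionary being created on first use. It also shows one caveat worth knowing: merely reading a missing key inserts an empty dict for it.

```python
from collections import defaultdict

dicty = defaultdict(dict)

# Assigning through a missing outer key creates the inner dict on demand.
dicty["proA"]["macbook"] = "0.666667"
dicty["proA"]["smart"] = "0.666667"

# Caveat: just looking up a missing key also inserts an empty dict for it.
_ = dicty["FrontPage"]

# Convert to a plain dict once building is done, so later lookups of
# missing keys raise KeyError again instead of silently adding entries.
plain = dict(dicty)
print(plain)
```

If the automatic insertion on lookup is unwanted, testing with `in` or using `.get()` leaves the defaultdict unchanged.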



Answer 2:

While this isn't the ideal way to do things, you're pretty close to making it work.

Your main problem is that you're reusing the same tmp dictionary. After you insert it into dicty under the first key, you then clear it and start filling it with the new values. Replace tmp.clear() with tmp = {} to fix that, so you have a different dictionary for each key, instead of the same one for all keys.
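To make the aliasing concrete, here is a minimal sketch (the variable names `dicty2`, `tmp2`, `after_clear` and `after_rebind` are only for illustration) comparing `tmp.clear()` with rebinding `tmp` to a new dictionary. It reproduces, in miniature, why the question's output shows FrontPage's values under the 'proA' key.

```python
# Case 1: clear() empties the same object that dicty already refers to.
dicty = {}
tmp = {"macbook": "0.666667"}
dicty["proA"] = tmp            # stores a reference, not a copy
tmp.clear()                    # also empties dicty["proA"]
tmp["frontpage"] = "0.710145"  # ...and this lands under "proA" too
shared = dicty["proA"] is tmp
after_clear = dict(dicty["proA"])

# Case 2: rebinding the name leaves the stored dictionary untouched.
dicty2 = {}
tmp2 = {"macbook": "0.666667"}
dicty2["proA"] = tmp2
tmp2 = {}                      # new object; dicty2["proA"] keeps the old one
tmp2["frontpage"] = "0.710145"
after_rebind = dict(dicty2["proA"])

print(shared, after_clear, after_rebind)
```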

Your second problem is that you're never storing the last tmp value in the dictionary when you reach the end, so add another dicty[oldword] = tmp after the for loop.

Your third problem is that you're checking if oldword is not "". That test compares identity, not equality, so it may be true even when oldword is an empty string. Just change it to if oldword: instead. (You'll usually get away with this one, because small strings are usually interned and will usually share identity, but you shouldn't count on that. Python 3.8 and later also emit a SyntaxWarning for "is" comparisons against a literal.)

If you fix all three of those, you get this:

{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
 'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}
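For reference, here is one way the three fixes could look when applied together. This is a sketch rather than the asker's exact code: the hypothetical `double_dict_fixed` takes any iterable of lines instead of a filename so it can be run without a file, drops the debugging prints, and still assumes the input is grouped by its first column.

```python
def double_dict_fixed(lines):
    dicty = {}
    tmp = {}
    oldword = ""
    for line in lines:
        values = line.split()
        if not values:
            continue               # skip blank lines
        if oldword != values[0]:
            if oldword:            # fix 3: truthiness instead of `is not ""`
                dicty[oldword] = tmp
            tmp = {}               # fix 1: fresh dict instead of tmp.clear()
            oldword = values[0]
        tmp[values[1]] = values[2]
    if oldword:                    # fix 2: store the final group as well
        dicty[oldword] = tmp
    return dicty


sample = [
    "proA macbook 0.666667",
    "proA smart 0.666667",
    "proA ssd 0.666667",
    "FrontPage frontpage 0.710145",
    "FrontPage troubleshooting 0.971014",
]
result = double_dict_fixed(sample)
print(result)
```

An open file object can be passed in place of `sample`, since iterating over a file also yields lines.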

I'm not sure how to turn this into the format you claim to want, because that format isn't even a valid dictionary. But hopefully this gets you close.


There are two simpler ways to do it:

  • Group the values with, e.g., itertools.groupby, then transform each group into a dict and insert it all in one step. This, like your existing code, requires that the input already be batched by values[0].
  • Use the dictionary as a dictionary. You can look up each key as it comes in and add to the value if found, create a new one if not. A defaultdict or the setdefault method will make this concise, but even if you don't know about those, it's pretty simple to write it out explicitly, and it'll still be less verbose than what you have now.

The second version is already explained very nicely in Martijn Pieters's answer.
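The setdefault variant mentioned above could be sketched as follows. The function name `double_dict_setdefault` and the choice to accept an iterable of lines are assumptions made for the example; the sample rows are deliberately not grouped, to show that this approach does not depend on input order.

```python
def double_dict_setdefault(lines):
    dicty = {}
    for line in lines:
        outer, inner, value = line.split()
        # setdefault returns the existing inner dict for `outer`,
        # or inserts and returns a new empty one if the key is missing.
        dicty.setdefault(outer, {})[inner] = value
    return dicty


ungrouped = [
    "proA macbook 0.666667",
    "FrontPage frontpage 0.710145",
    "proA ssd 0.666667",
]
nested = double_dict_setdefault(ungrouped)
print(nested)
```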

The first can be written like this:

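import itertools
import operator

def doubleDict(filename):
    with open(filename, "r") as f:
        # Lazily split each line into [outer, inner, value].
        rows = (line.rstrip().split(" ") for line in f)
        # Consecutive rows sharing the same first field form one group.
        return {k: {values[1]: values[2] for values in g}
                for k, g in itertools.groupby(rows, key=operator.itemgetter(0))}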

Of course that doesn't print out the dict so far after every 25 rows, but that's easy to add by turning the comprehension into an explicit loop (and ideally using enumerate instead of keeping an explicit row counter).
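A possible shape for that explicit loop is sketched below. The name `double_dict_grouped` and the `report_every` parameter are made up for the example, the function takes an iterable of lines rather than a filename, and it assumes there are no blank lines and that the input is grouped by its first field.

```python
import itertools


def double_dict_grouped(lines, report_every=25):
    dicty = {}
    rows = (line.split() for line in lines)
    # Number the rows from 1 so progress can be reported by row count.
    numbered = enumerate(rows, 1)
    # Each item is (row_number, values); group on the first field of values.
    for key, group in itertools.groupby(numbered, key=lambda pair: pair[1][0]):
        inner = {}
        for n, values in group:
            inner[values[1]] = values[2]
            if n % report_every == 0:
                print(f"{n} rows read, {len(dicty)} groups completed")
        dicty[key] = inner
    return dicty


sample = [
    "proA macbook 0.666667",
    "proA smart 0.666667",
    "proA ssd 0.666667",
    "FrontPage frontpage 0.710145",
    "FrontPage troubleshooting 0.971014",
]
grouped = double_dict_grouped(sample, report_every=2)
```

If the input turned out not to be grouped, a key that reappears later would overwrite its earlier inner dict, which is the main reason to prefer the lookup-based approach for unsorted data.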