Generate “special” dictionary structure from just

2019-07-18 21:28发布

问题:

Imagine a tab separated file like this one:

9606    1   GO:0002576  TAS -   platelet degranulation  -   Process
9606    1   GO:0003674  ND  -   molecular_function_z    -   Function
9606    1   GO:0003674  OOO -   molecular_function_z    -   Function
9606    1   GO:0005576  IDA -   extracellular region    -   Component
9606    1   GO:0005576  TAS -   extracellular region    -   Component
9606    1   GO:0005576  OOO -   extracellular region    -   Component
9606    1   GO:0005615  HDA -   extracellular spaces    -   Component
9606    1   GO:0008150  ND  -   biological_processes    -   Process
9606    1   GO:0008150  OOO -   biological_processes    -   Process
9606    1   GO:0008150  HHH -   biological_processes    -   Process
9606    1   GO:0008150  YYY -   biological_processes    -   Process
9606    1   GO:0031012  IDA -   extracellular matrix    -   Component
9606    1   GO:0043312  TAS -   neutrophil degranulat   -   Process

I want to create a function that receive the number of the columns which have the information to be saved and return a "special" dictionary. And I say "special" because in my case that information is always categorical, but it could have different levels, and I am tired to write constantly the logic to add the information for each level. (Maybe there is another way of doing, that I was not able to search for, so, sorry in advanced for my ignorance)

If the specified columns are 8, 2 and 3. Being 8 the column with the highest category and 3 with the lowest, the expected dictionary could be obtained:

three_userinput = "8:2:3"
three = map(lambda x: int(x) - 1, three_userinput.split(":"))
DICT3 = {}
for line in file_handle:
info = line.split("\t")
    if info[three[0]] in DICT3:
        if info[three[1]] in DICT3[info[three[0]]]:
            DICT3[info[three[0]]][info[three[1]]].add(info[three[2]])
        else:
            DICT3[info[three[0]]][info[three[1]]] = set([info[three[2]]])
    else:
        DICT3[info[three[0]]] = {info[three[1]]:set([info[three[2]]])}

pprint.pprint(DICT3)

Output:

{'Component': {'1': set(['GO:0005576', 'GO:0005615', 'GO:0031012'])},
 'Function': {'1': set(['GO:0003674'])},
 'Process': {'1': set(['GO:0002576', 'GO:0008150', 'GO:0043312'])}}

Now with four columns 8, 2, 3 and 4. Being 8 the column with the highest category and 4 with the lowest, the expected dictionary could be obtained:

four_userinput = "8:2:3:4"
four = map(lambda x: int(x) - 1, four_userinput.split(":"))
DICT4 = {}
for line in file_handle:
    info = line.split("\t")
    if info[four[0]] in DICT4:
        if info[four[1]] in DICT4[info[four[0]]]:
            if info[four[2]] in DICT4[info[four[0]]][info[four[1]]]:
                DICT4[info[four[0]]][info[four[1]]][info[four[2]]].add(info[four[3]])
            else:
                DICT4[info[four[0]]][info[four[1]]][info[four[2]]] = set([info[four[3]]])
        else:
            DICT4[info[four[0]]][info[four[1]]] = {info[four[2]]:set([info[four[3]]])}
    else:
        DICT4[info[four[0]]] = {info[four[1]]:{info[four[2]]:set([info[four[3]]])}}

pprint.pprint(DICT4)

Output:

{'Component': {'1': {'GO:0005576': set(['IDA', 'OOO', 'TAS']),
                     'GO:0005615': set(['HDA']),
                     'GO:0031012': set(['IDA'])}},
 'Function': {'1': {'GO:0003674': set(['ND', 'OOO'])}},
 'Process': {'1': {'GO:0002576': set(['TAS']),
                   'GO:0008150': set(['HHH', 'ND', 'OOO', 'YYY']),
                   'GO:0043312': set(['TAS'])}}}

Now when I faced five levels of information (five columns), the code was almost unreadable and really really tedious... I could create specific functions for each number of levels, but.. Is there a way to design a function that could handle any number of levels?

Please If I have not explained myself properly, do not hesitate in asking me.

回答1:

What you need is a defaultdict(). This allows you to update entries without having to first test if they exist. i.e. if it does not exist, a default value is automatically added. As you have multiple levels, you will need to create nested defaultdicts recursively using the build_defaultdict(levels) function. Setting the value would also need to be recursive but the logic would be simpler:

import pprint
import csv
from operator import itemgetter
from collections import defaultdict


def build_defaultdict(levels):
    return defaultdict(set) if levels <= 1 else defaultdict(lambda : build_defaultdict(levels - 1))


def set_value(d, row):
    if len(row) <= 2:
        d[row[0]].add(row[1])
    else:
        d[row[0]] = set_value(d[row[0]], row[1:])

    return d


req_cols = [7, 1, 2, 3]     # counting from col 0

data = build_defaultdict(len(req_cols) - 1)
get_cols = itemgetter(*req_cols)

with open('input.csv', 'r', newline='') as f_input:
    for row in csv.reader(f_input, delimiter='\t'):
        set_value(data, get_cols(row))

pprint.pprint(data)
print(data['Component']['1']['GO:0005576'])        

This would create your dictionary as follows:

defaultdict(<function <lambda> at 0x000002350F481B70>,
    {
        'Component': defaultdict(<function <lambda>.<locals>.<lambda> at 0x000002350F6EB378>,
            {'1': defaultdict(<class 'set'>,
                {'GO:0005576': {'IDA', 'OOO', 'TAS'},
                 'GO:0005615': {'HDA'},
                 'GO:0031012': {'IDA'}})}),
        'Function': defaultdict(<function <lambda>.<locals>.<lambda> at 0x000002350F6EB400>,
            {'1': defaultdict(<class 'set'>,
                {'GO:0003674': {'ND', 'OOO'}})}),
     'Process': defaultdict(<function <lambda>.<locals>.<lambda> at 0x00000235071BE0D0>,
            {'1': defaultdict(<class 'set'>,
                {'GO:0002576': {'TAS'},
                 'GO:0008150': {'HHH', 'ND', 'OOO', 'YYY'},
                 'GO:0043312': {'TAS'}})})})

{'TAS', 'OOO', 'IDA'}

It may display differently to a normal dictionary, but it works the same way as a normal dictionary. Also itemgetter() can be used to extract the required elements from a list into another list.



回答2:

You can define a recursive function which does this.

def update_nested_dict(d, vars):
    if len(vars) > 2:
        try:
            d[vars[0]] = update_nested_dict(d[vars[0]], vars[1:])
        except KeyError:
            d[vars[0]] = update_nested_dict({}, vars[1:])
    else:
        try:
            d[vars[0]] = d[vars[0]].union([vars[1]])
        except KeyError:
            d[vars[0]] = set([vars[1]])
    return d

Preserving as much of your code logic and variable names as needed,

>>> userinput = "8:2:3:4"
>>> cols = map(lambda x: int(x) - 1, userinput.split(":"))
>>> 
>>> DICT = {}
>>> 
>>> for line in file_handle:
>>>     info = line.replace("\n", "").split("\t")
>>>     names = [info[c] for c in cols]
>>>     _ = update_nested_dict(DICT, names)
>>>
>>> for k, v in DICT.iteritems():
...  print k, v
...
Process {'1': {'GO:0002576': set(['TAS']), 'GO:0008150': set(['YYY', 'OOO', 'HHH', 'ND']), 'GO:0043312': set(['TAS'])}}
Function {'1': {'GO:0003674': set(['OOO', 'ND'])}}
Component {'1': {'GO:0005576': set(['OOO', 'IDA', 'TAS']), 'GO:0005615': set(['HDA']), 'GO:0031012': set(['IDA'])}}