Finding common ID's (intersection) in two dict

2019-09-07 21:07发布

问题:

I wrote a piece of code that is supposed to find common intersecting ID's in line[1] in two different files. On my small sample files it works OK, but on my bigger files does not. I cannot figure out why, can you suggest me what is wrong? The exact problem is when my input is i.e. 200 it gives me 90 intersections, if I reduce it to 150, it gives me intersections of 110, logically it cannot be higher.

fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')
#fileA.next()

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in set(dictA).intersection(dictB):
    output.write(dictB[key][0]+'\t'+dictA[key][1]+'\t'+dictA[key][4]+'\t'+dictA[key][5]+'\t'+dictA[key][9]+'\t'+dictA[key][10]+'\n')

My file1 is sorted by line[0] and has 0-15 lines, to make it simpler here I give an example putting only line[0] and line[1],

contig17    GRMZM2G052619_P03  x x x x x x x x x x x x x x
contig33    AT2G41790.1    x x x x x x x x x x x x x x
contig98    GRMZM5G888620_P01  x x x x x x x x x x x x x x  
contig102   GRMZM5G886789_P02  x x x x x x x x x x x x x x  
contig123   AT3G57470.1    x x x x x x x x x x x x x x

My file2 is not sorted and has 0-10 line, I give only line[1]

y GRMZM2G052619_P03 y y y y y y y y         
y GRMZM5G888620_P01 y y y y y y y y     
y GRMZM5G886789_P02 y y y y y y y y     

My desired output,

contig17    GRMZM2G052619_P03  y y y y
contig98    GRMZM5G888620_P01  y y y y  
contig102   GRMZM5G886789_P02  y y y y  

回答1:

Pay attention to this:

output.write(dictB[key][0]+'\t'+dictA[key][1]

It means you print file2 first column than file1 second column. It doesn't correspond with your examples and desired output.

As for intersection routine, it looks quite correct, so probably it's something wrong with your file. Are you sure all keys are unique? What do you mean by "reduce to 150" - do you mean just deleting some lines from this very file.

Also better replace

for key in set(dictA).intersection(dictB):

with

for key in dictA:
   if key in dictB:

It's actually the same, but should be faster and spends less memory.



回答2:

You shall narrow your problem and play a bit on testing. I will not detail on using testing frameworks and show you, how to use assert.

assert has two parameters, first is expression, which is expected to be true. Second one is optional and shall contain positively expressed assumption of what is expected to be true.

Here is modified example with these tests:

fileA_txt = """contig17    GRMZM2G052619_P03  x x x x x x x x x x x x x x
contig33    AT2G41790.1    x x x x x x x x x x x x x x
contig98    GRMZM5G888620_P01  x x x x x x x x x x x x x x  
contig102   GRMZM5G886789_P02  x x x x x x x x x x x x x x  
contig123   AT3G57470.1    x x x x x x x x x x x x x x
"""
# or read it from file
#with open("filaA.txt") as f:
#  fileA_txt = f.read()
fileB_txt = """y GRMZM2G052619_P03 y y y y y y y y         
y GRMZM5G888620_P01 y y y y y y y y     
y GRMZM5G886789_P02 y y y y y y y y     
"""
# or read it from file
#with open("filaB.txt") as f:
#  fileB_txt = f.read()

dictA = dict()
for line1 in fileA_txt.splitlines():
    listA = line1.split()
    dictA[listA[1]] = listA
assert len(dictA) == 5, "fileA_txt shall contain 5 unique IDs"

dictB = dict()
for line1 in fileB_txt.splitlines():
    listB = line1.split()
    dictB[listB[1]] = listB
assert len(dictB) == 3, "fileA_txt shall contain 3 unique IDs"

common_IDs = set(dictA).intersection(dictB)
assert len(common_IDs) == 3, "there shall be just 3 common keys"

You shall than play with your files and narrow down what is working and what not.

Simply replace the fileA_txt (or the alternative, reading it from a file) with other files, which were surprising you before.

Add more asserts, if you find assumptions you may expect (like if your files have always the same number of lines and unique ids, test it, the code would have to be modified)

Keep running your script until assert exceptions appear.