I wrote a piece of code that is supposed to find common intersecting ID's in line[1] in two different files. On my small sample files it works OK, but on my bigger files does not. I cannot figure out why, can you suggest me what is wrong? The exact problem is when my input is i.e. 200 it gives me 90 intersections, if I reduce it to 150, it gives me intersections of 110, logically it cannot be higher.
fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')
#fileA.next()
dictA = dict()
for line1 in fileA:
listA = line1.split('\t')
dictA[listA[1]] = listA
dictB = dict()
for line1 in fileB:
listB = line1.split('\t')
dictB[listB[1]] = listB
for key in set(dictA).intersection(dictB):
output.write(dictB[key][0]+'\t'+dictA[key][1]+'\t'+dictA[key][4]+'\t'+dictA[key][5]+'\t'+dictA[key][9]+'\t'+dictA[key][10]+'\n')
My file1 is sorted by line[0] and has 0-15 lines, to make it simpler here I give an example putting only line[0] and line[1],
contig17 GRMZM2G052619_P03 x x x x x x x x x x x x x x
contig33 AT2G41790.1 x x x x x x x x x x x x x x
contig98 GRMZM5G888620_P01 x x x x x x x x x x x x x x
contig102 GRMZM5G886789_P02 x x x x x x x x x x x x x x
contig123 AT3G57470.1 x x x x x x x x x x x x x x
My file2 is not sorted and has 0-10 line, I give only line[1]
y GRMZM2G052619_P03 y y y y y y y y
y GRMZM5G888620_P01 y y y y y y y y
y GRMZM5G886789_P02 y y y y y y y y
My desired output,
contig17 GRMZM2G052619_P03 y y y y
contig98 GRMZM5G888620_P01 y y y y
contig102 GRMZM5G886789_P02 y y y y
Pay attention to this:
output.write(dictB[key][0]+'\t'+dictA[key][1]
It means you print file2 first column than file1 second column. It doesn't correspond with your examples and desired output.
As for intersection routine, it looks quite correct, so probably it's something wrong with your file. Are you sure all keys are unique? What do you mean by "reduce to 150" - do you mean just deleting some lines from this very file.
Also better replace
for key in set(dictA).intersection(dictB):
with
for key in dictA:
if key in dictB:
It's actually the same, but should be faster and spends less memory.
You shall narrow your problem and play a bit on testing. I will not detail on using testing frameworks and show you, how to use assert
.
assert
has two parameters, first is expression, which is expected to be true.
Second one is optional and shall contain positively expressed assumption of what is expected to be true.
Here is modified example with these tests:
fileA_txt = """contig17 GRMZM2G052619_P03 x x x x x x x x x x x x x x
contig33 AT2G41790.1 x x x x x x x x x x x x x x
contig98 GRMZM5G888620_P01 x x x x x x x x x x x x x x
contig102 GRMZM5G886789_P02 x x x x x x x x x x x x x x
contig123 AT3G57470.1 x x x x x x x x x x x x x x
"""
# or read it from file
#with open("filaA.txt") as f:
# fileA_txt = f.read()
fileB_txt = """y GRMZM2G052619_P03 y y y y y y y y
y GRMZM5G888620_P01 y y y y y y y y
y GRMZM5G886789_P02 y y y y y y y y
"""
# or read it from file
#with open("filaB.txt") as f:
# fileB_txt = f.read()
dictA = dict()
for line1 in fileA_txt.splitlines():
listA = line1.split()
dictA[listA[1]] = listA
assert len(dictA) == 5, "fileA_txt shall contain 5 unique IDs"
dictB = dict()
for line1 in fileB_txt.splitlines():
listB = line1.split()
dictB[listB[1]] = listB
assert len(dictB) == 3, "fileA_txt shall contain 3 unique IDs"
common_IDs = set(dictA).intersection(dictB)
assert len(common_IDs) == 3, "there shall be just 3 common keys"
You shall than play with your files and narrow down what is working and what not.
Simply replace the fileA_txt (or the alternative, reading it from a file) with other files, which were surprising you before.
Add more asserts, if you find assumptions you may expect (like if your files have always the same number of lines and unique ids, test it, the code would have to be modified)
Keep running your script until assert exceptions appear.