Using Interval tree to find overlapping regions

2019-06-14 01:43发布

问题:

I have two files File 1

chr1:4847593-4847993 TGCCGGAGGGGTTTCGATGGAACTCGTAGCA

File 2

Pbsn|X|75083240|75098962| TTTACTACTTAGTAACACAGTAAGCTAAACAACCAGTGCCATGGTAGGCTTGAGTCAGCT CTTTCAGGTTCATGTCCATCAAAGATCTACATCTCTCCCCTGGTAGCTTAAGAGAAGCCA TGGTGGTTGGTATTTCCTACTGCCAGACAGCTGGTTGTTAAGTGAATATTTTGAAGTCC

File 1 has approximately 8000 more lines with different header and sequence below it. I would first like to match the start and end co ordinates from file1 to file 2 or see if its close to each other let say by +- 100 if yes then match the sequence in file 2 and then print out the header info for file 2 and the matched sequence. My approach use interval tree(in python i am still trying to get a hang of it), store the co ordinates ?
I tried using re.match but its not giving me accurate results. Any tips would be highly appreciated. Thanks.

My first try, How ever now i have hit another road block so for my second second file if my start and end is 5000 and 8000 respectively I want to change this by subtracting 2000 so my new start and stop is 3000 and 5000 here is my code

from intervaltree import IntervalTree
from collections import defaultdict
binding_factor = some.txt
genome = dict()
with open('file2', 'r') as rows:
     for row in rows:
     #print row
     if row.startswith('>'):
        row = row.strip().split('|')
        chrom_name = row[5]
        start = int[row[3]
        end = int(row[3])
        # one interval tree per chromosome
        if chrom_name not in genome:
           genome[chrom_name] = IntervalTree()                
            # first time we've encountered this chromosome, createtree                    
            # index the feature
           genome[chrom_name].addi(start,end,row[2])
           #for key,value in genome.iteritems():
           #print key, ":", value

mast = defaultdict(list)
with open(file1', 'r') as f:
     for row in f:
     row = row.strip().split()
     row[0] = row[0].replace('chr', '') if row[0].startswith('chr') else row[0]
     row[0] = 'MT' if row[0] == 'M' else row[0]
     #print row[0]
     mast[row[0]].append({
     'start':int(row[1]),
     'end':int(row[2])
     })
     #for k,v in mast.iteritems():
     #print k, ":", v  

with open(binding_factor, 'w') as f :
     for k,v in mast.iteritems():
         for i in v:
             g = genome[k].search(i['start'],i['end'])
             if g:
                 print g
                 l = gene
                 f.write(str(l)`enter code here` + '\n')