I tagged Python and Perl only because those are what I've used so far. If anyone knows a better way to go about this, I'd certainly be willing to try it out. Anyway, my problem:
I need to create an input file for a gene prediction program that uses the following format:
seq1 5 15
seq1 20 34
seq2 50 48
seq2 45 36
seq3 17 20
Here seq# is the gene ID and the numbers to the right are the positions of exons within an open reading frame. I already have this information in a .gff3 file, but that file also contains a lot of other data. I can open it in Excel and easily delete the columns with irrelevant data. Here's how it's arranged now:
PITG_00002 . gene 2 397 . + . ID=g.1;Name=ORF%
PITG_00002 . mRNA 2 397 . + . ID=m.1;
**PITG_00002** . exon **2 397** . + . ID=m.1.exon1;
PITG_00002 . CDS 2 397 . + . ID=cds.m.1;
PITG_00004 . gene 1 1275 . + . ID=g.3;Name=ORF%20g
PITG_00004 . mRNA 1 1275 . + . ID=m.3;
**PITG_00004** . exon **1 1275** . + . ID=m.3.exon1;P
PITG_00004 . CDS 1 1275 . + . ID=cds.m.3;P
PITG_00004 . gene 1397 1969 . + . ID=g.4;Name=
PITG_00004 . mRNA 1397 1969 . + . ID=m.4;
**PITG_00004** . exon **1397 1969** . + . ID=m.4.exon1;
PITG_00004 . CDS 1397 1969 . + . ID=cds.m.4;
So I need only the data that is in bold, that is, the sequence ID plus the start and end columns of each exon row. For example:
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
Any help you could give would be greatly appreciated, thanks!
Edit: Well, I messed up the formatting. Anything between the ** markers is what I need.
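To make the goal concrete, here is a minimal Python sketch of the filtering I have in mind. It is not a tested solution: it assumes the third column holds the feature type and the fourth and fifth hold the start and end, as in the sample above, and it assumes the first five columns contain no spaces. The function names `extract_exons` and `write_exons` and the file names in the usage note are made up for illustration.

```python
def extract_exons(lines):
    """Yield (seq_id, start, end) for every row whose feature type is 'exon'."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: columns are tab-separated as in a regular GFF3 file;
        # fall back to any whitespace for data pasted like the sample above.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 5 or fields[2] != "exon":
            continue
        yield fields[0], int(fields[3]), int(fields[4])


def write_exons(in_path, out_path):
    """Write one 'seq_id start end' line per exon found in in_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for seq_id, start, end in extract_exons(src):
            dst.write(f"{seq_id} {start} {end}\n")
```

Calling something like `write_exons("annotations.gff3", "exons.txt")` (both file names are placeholders) would then produce the three-column file directly from the original .gff3, with no need to delete columns in Excel first.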