I have a series of strings in a file of the format:
>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
I am trying to find a regex pattern which will remove the newline characters between the lines that follow a > character, up to the next > character. So the final result would look like:
>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
Does anyone know how I can come up with a regex pattern to do this?
Side note: This format is common in computational science as a FASTA format.
Thanks!
As noted in the comments, your best bet is to use an existing FASTA parser. Why not?
Here's how I would join lines based on the leading greater-than:
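def joinup(f):
    buf = []
    for line in f:
        if line.startswith('>'):
            # A new header: flush the lines collected for the previous record.
            if buf:
                yield " ".join(buf)
            yield line.rstrip()
            buf = []
        else:
            buf.append(line.rstrip())
    # Flush the last record, if there is one.
    if buf:
        yield " ".join(buf)

with open("...") as f:
    for joined_line in joinup(f):
        print(joined_line)  # blah blah...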
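To try the generator without a real file, here is a minimal sketch that feeds it an `io.StringIO` object instead. The shortened sample text and the `sample` and `result` names are made up for illustration; `joinup` is repeated so the snippet runs on its own.

```python
import io

def joinup(f):
    """Yield header lines as-is and each record's body lines joined by spaces."""
    buf = []
    for line in f:
        if line.startswith('>'):
            if buf:
                yield " ".join(buf)
            yield line.rstrip()
            buf = []
        else:
            buf.append(line.rstrip())
    if buf:
        yield " ".join(buf)

# StringIO iterates line by line, just like an open text file.
sample = io.StringIO(
    ">HEADER_Text1\nline one\nline two\n>HEADER_Text2\nline three\nline four\n"
)
result = list(joinup(sample))
for joined_line in result:
    print(joined_line)
```

Each header comes out on its own line, followed by a single line holding that record's text.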
You don't have to use a regex:
[ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]
should work.
In [43]: f=open('test.txt')
In [44]: contents=[ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]
In [45]: contents
Out[45]:
['>HEADER_Text1\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada',
'>HEADER_Text2\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada',
'>HEADER_Text3\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada']
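The list above still has one entry per input line, so the text lines are not joined yet. One possible way to finish the job, sketched here with `itertools.groupby` (the shortened `lines` sample and the `output` name are assumptions for illustration, not part of the session above):

```python
from itertools import groupby

# Stand-in for f.readlines() on a small FASTA-like file.
lines = [">HEADER_Text1\n", "line one\n", "line two\n",
         ">HEADER_Text2\n", "line three\n"]

# Same idea as the comprehension above: strip newlines from non-header lines.
contents = [x if x.startswith('>') else x.replace('\n', '') for x in lines]

# Group consecutive entries by whether they are headers, then join each
# run of text lines with spaces.
output = []
for is_header, group in groupby(contents, key=lambda x: x.startswith('>')):
    if is_header:
        output.extend(h.rstrip('\n') for h in group)
    else:
        output.append(" ".join(group))

print("\n".join(output))
```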
This should also work.
import re

sampleText = """>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada"""
# Remove every newline that is not directly followed by '>'.
cleartext = re.sub(r"\n(?!>)", "", sampleText)
print(cleartext)
Output:
>HEADER_Text1Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text2Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text3Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
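Note that the substitution above also removes the newline right after each header and joins the lines without a space. A possible variant, sketched under the assumption that every header starts with `>` at the beginning of a line, replaces a newline only when the line it ends is not a header. The `join_records` name and the shortened sample are illustrative and not from the original answer.

```python
import re

def join_records(text):
    # (?m) makes ^ match at the start of every line.
    # ^(?!>)(.*) captures a whole line that does not start with '>'.
    # \n(?!>) matches that line's newline only if the next line is not a header.
    return re.sub(r"(?m)^(?!>)(.*)\n(?!>)", r"\1 ", text)

sample = (
    ">HEADER_Text1\nline one\nline two\nline three\n"
    ">HEADER_Text2\nline four\nline five"
)
joined = join_records(sample)
print(joined)
```

If the input ends with a newline, that final newline becomes a trailing space, so an `rstrip()` on the result may be wanted.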
Given that the > is always expected to be the first character on a new line, replace "\n([^>])" with " \1".
You really don't want a regex. And for this job, Python and Biopython are superfluous. If that's actually FASTQ format, just use sed:
sed '/^>/ { N; N; N; s/\n/ /2g }' file
This relies on GNU sed (for the 2g flag) and on each header being followed by exactly three lines.
Results:
>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada