Find, replace, and increment at each occurrence of a string

I'm relatively new to scripting and apologize in advance for this painfully simple problem. I believe I've searched pretty thoroughly, but apparently no other answers or cookbooks have been explicit enough for me to understand (like here - still couldn't get it).

I have a file made up of strings of letters (DNA, if you care), one string per line. Above each string I've inserted another line to identify the string below it. For the bioinformaticians among you: I'm trying to make up a test data set in FASTA format, so maybe you have tools for this? Anyway, I've put a distinct word, "num", after each ">" with the intention of using a bash incrementer and sed to create a unique number heading each string. For example, in data.txt, I have...

>num, blah, blah, blah

ATCGACTGAATCGA

>num, blah, blah, blah

ATCGATCGATCGATCG

>num, blah, blah, blah

ATCGATCGATCGATCG

I would like it to be...

>0, blah, blah, blah

ATCGACTGAATCGA

>1, blah, blah, blah

ATCGATCGATCGATCG

>2, blah, blah, blah

ATCGATCGATCGATCG

The solution can be in any language as long as it's complete && gets the job done. I have a little experience with sed, awk, bash, and c++ (little == slightly more than no experience). I know, I know, I need to learn perl, but I've only just started. The question is this: How to replace "num" with a number that increments on each replacement? It doesn't matter if the underlying string is identical to another somewhere else. Thanks for your help in advance!

Answer 1:

perl -ple 's/num/$n++/e' filename

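Do a dry run first to check that it does what you want. As written, the command only prints the result to standard output and leaves the file untouched; add -i to edit the file in place.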
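For comparison, the same idea can be sketched in awk, which the question mentions having a little experience with. This is only an illustrative sketch and not part of the answer above: the function name number_headers is made up, and unlike the Perl one-liner it only touches lines that start with ">", on the assumption that the placeholder never needs replacing inside a sequence line.

```shell
# Hypothetical helper: replace the first "num" on each ">" header line
# with a counter that starts at 0 and goes up by one per replaced header.
number_headers() {
    awk '/^>/ { if (sub(/num/, n + 0)) n++ } { print }' "$@"
}

# Example: number the headers of a small sample read from standard input.
printf '>num, blah\nATCG\n>num, blah\nGGCC\n' | number_headers
```

With a file name as the argument (for example number_headers data.txt) it reads that file instead of standard input; either way the result goes to standard output, so the input is left unchanged.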

Answer 2:

This uses process substitution, which may or may not be available on your system.

jcomeau@intrepid:/tmp$ exec 3< <(cat test.txt)
jcomeau@intrepid:/tmp$ i=0
jcomeau@intrepid:/tmp$ while read -r -u 3 first_word the_rest; do
 if [ "$first_word" = ">num," ]; then
 echo ">$i, $the_rest"; i=$((i + 1)); else
 echo "$first_word${the_rest:+ $the_rest}"; fi; done
>0, blah, blah, blah

ATCGACTGAATCGA

>1, blah, blah, blah

ATCGATCGATCGATCG

>2, blah, blah, blah

ATCGATCGATCGATCG
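If process substitution is not available, a variant of this loop can read the file through an ordinary "<" redirection instead. The following is a minimal sketch assuming a POSIX shell; the function name renumber is hypothetical, and it matches any line beginning with ">num," rather than splitting the line into words, so the rest of each line is passed through unchanged.

```shell
# Hypothetical helper: number the ">num," header lines of the file given
# as the first argument, reading it via plain "<" redirection.
renumber() {
    i=0
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>num,'*)
                # Keep everything after the first "num" and put the counter in its place.
                rest=${line#*num}
                printf '%s\n' ">$i$rest"
                i=$((i + 1))
                ;;
            *)
                # Sequence lines and blank lines are printed as they are.
                printf '%s\n' "$line"
                ;;
        esac
    done < "$1"
}

# Usage (assumes test.txt exists in the current directory):
# renumber test.txt > numbered.txt
```

Writing to a separate output file, as in the commented usage line, keeps the original intact until the result has been checked.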