How to randomly extract FASTA sequences using Pyth

I have the following sequences which is in a fasta format with sequence header and its nucleotides. How can I randomly extract the sequences. For example I would like to randomly select 2 sequences out of the total sequences. There are tools provided to do so is to extract according to percentage but not the number of sequences. Can anyone help me?

A.fasta

>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT

Expected Output

>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG

标签： python extract bioinformatics extraction fasta

6条回答

对你真心纯属浪费

2楼-- · 2019-01-15 21:23

Don't know much about Fasta, but Python has a Fasta module (you need to install it first).

>>> from pyfasta import Fasta

>>> f = Fasta('tests/test1.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

Then you can use the sample function from Python's Random module and pick as many as you want at random...

from random import sample
sample(f, how_many_you_want)

0人赞添加讨论(0) 举报

虎瘦雄心在

3楼-- · 2019-01-15 21:28

Given the file format that you have shown, and assuming that the file is not too large, you don't need any external module (e.g. biopython) to do this:

import random

with open('A.fasta') as f:
    data = f.read().splitlines()
    for i in random.sample(range(0, len(data), 2), 2):
        print data[i]
        print data[i+1]

Example output:

>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG

This simply selects 2 random sequence headers (those lines from A.fasta with even indices in data) and the line following it.

If your file is large then external modules might have optimisations to cope with larger data sets.

0人赞添加讨论(0) 举报

Root（大扎）

4楼-- · 2019-01-15 21:34

Depends if you have unix sort or shuf installed. If so, its very easy Select random 3000 lines from a file with awk codes

Create a list of headers

grep '>' A.fasta >FILE

Then select 2 random lines from that file

CONTIGS=sort -R FILE | head -n2|tr "\n" " "

CONTIGS=shuf -n2 FILE|tr "\n" " "

Then, use samtools to extract

samtools faidx A.fasta $CONTIGS

0人赞添加讨论(0) 举报

我欲成王，谁敢阻挡

5楼-- · 2019-01-15 21:40

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

randseq = int(sys.argv[1])                          #Number of random sequences desired
print "Number of random sequences desired = ", randseq

if randseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseq = []
outlist = []
print "Randomly chosen output sequences:"

for i in range(randseq):
  choose = random.randint(1,totseq-1)               #Choose a random sequence record number
  for j in range(len(outrandseq)):                  #Test to see if the random sequence record number has already been chosen
    if choose == outrandseq[j]:
      choose = random.randint(1,totseq-1)           #Choose a new random sequence record number if the current one has already been chosen
  outrandseq.append(choose)
  print choose
  outseq = seq[choose]
  outlist.append(outseq)                            #Append seq record to output list

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()

0人赞添加讨论(0) 举报

男人必须洒脱

6楼-- · 2019-01-15 21:43

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine. 
# For very large sequence sets it may be too slow -- I just have not tried.

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

numrandseq = int(sys.argv[1])                       #Number of random sequences desired
print "Number of random sequences desired = ", numrandseq

if numrandseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseqset = []
i = 1
for i in range(numrandseq):                         #Create a list of random sequence record numbers for output
  choice = random.randint(1,totseq)
  outrandseqset.append(choice)

i = 1
j = 1
duplicate = 1
while duplicate:                                    #Make sure no sequences are duplicated in the list
    duplicate = 0
    for i in range(numrandseq):
      for j in range(i+1, numrandseq):
        if outrandseqset[i] == outrandseqset[j]:
            outrandseqset[j] = random.randint(1,totseq)
            duplicate = 1


i = 1
print "Randomly chosen output sequences:"
for i in range(numrandseq):
  print outrandseqset[i]

outlist = []
i = 1
for i in range(numrandseq):                         #Create the list of seq records to be written to the output file
  seqnum = outrandseqset[i]
  outseq = seq[seqnum]
  outlist.append(outseq)

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()

0人赞添加讨论(0) 举报

Lonely孤独者°

7楼-- · 2019-01-15 21:48

If you are working with fasta files use BioPython, to get n sequences use random.sample:

from Bio import SeqIO
from random import sample
with open("foo.fasta") as f:
    seqs = SeqIO.parse(f,"fasta")
    print(sample(list(seqs), 2))

Output:

[SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]

You can extract the strings if necessary:

 print([(seq.name,str(seq.seq)) for seq in  sample(list(seqs),2)])
 [('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]

If the lines were always in pairs and you skipped the metadata at the top you could zip:

from random import sample

with open("foo.fasta") as f:
    print(sample(list(zip(f, f)), 2))

Which will give you pairs of lines in tuples:

[('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]

To get the lines ready to be written:

from Bio import SeqIO
from random import sample
with open("foo.fasta") as f:
    seqs = SeqIO.parse(f, "fasta")
    samps = ((seq.name, seq.seq) for seq in  sample(list(seqs),2))
    for samp in samps:
        print(">{}\n{}".format(*samp))

Output:

>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG

0人赞添加讨论(0) 举报

How to randomly extract FASTA sequences using Pyth

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间