I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I randomly extract sequences? For example, I would like to randomly select 2 sequences out of the total. The tools I have found extract by percentage, not by number of sequences. Can anyone help me?
A.fasta
>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
Expected Output
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
I don't know much about FASTA, but there is a FASTA module for Python (you need to install it first).
Then you can use the sample function from Python's random module and pick as many sequences as you want at random.
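As a rough sketch of the random.sample idea, assuming you would rather not install anything: the small hand-written parser below stands in for the FASTA module mentioned above. The names read_fasta and sample_records are made up for this example and do not come from any library. Unlike a line-pair approach, it also copes with sequences wrapped over several lines.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_records(lines, k):
    """Return k distinct records chosen at random."""
    records = list(read_fasta(lines))
    return random.sample(records, k)


# Inline copy of A.fasta so the sketch runs on its own;
# with a real file you would pass the open file handle instead.
example = """>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
""".splitlines()

for header, seq in sample_records(example, 2):
    print(header)
    print(seq)
```

random.sample picks without replacement, so the same sequence is never returned twice, and it raises ValueError if k is larger than the number of records.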
Given the file format that you have shown, and assuming that the file is not too large, you don't need any external module (e.g. biopython) to do this:
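A minimal sketch of this approach, assuming the file strictly alternates one header line and one sequence line. The function name pick_random_pairs is only illustrative, and the inline list stands in for the lines read from A.fasta.

```python
import random


def pick_random_pairs(data, n=2):
    """Return n random (header, sequence) pairs from a list of lines."""
    # Headers sit at even indices 0, 2, 4, ...; each sequence follows its header.
    header_indices = random.sample(range(0, len(data), 2), n)
    return [(data[i], data[i + 1]) for i in header_indices]


# Stand-in for:
#     with open("A.fasta") as f:
#         data = f.read().splitlines()
data = [
    ">chr1:1310706-1310726", "GACGGTTTCCGGTTAGTGGAA",
    ">chr1:901959-901979", "GAGGGCTTTCTGGAGAAGGAG",
    ">chr1:983001-983021", "GTCCGCTTGCGGGACCTGGGG",
    ">chr1:984333-984353", "CTGGAATTCCGGGCGCTGGAG",
    ">chr1:1154147-1154167", "GAGATCGTCCGGGACCTGGGT",
]

for header, seq in pick_random_pairs(data, 2):
    print(header)
    print(seq)
```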
Example output:
This simply selects 2 random sequence headers (those lines from A.fasta with even indices in the data list) and the line following each. If your file is large, external modules might have optimisations to cope with larger data sets.
It depends on whether you have Unix sort or shuf installed. If so, it's very easy: see "Select random 3000 lines from a file with awk codes" for how to pick random lines.
Then use samtools to extract the selected sequences.
If you are working with FASTA files, use Biopython. To get n sequences, use random.sample.
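A minimal sketch of that combination, assuming Biopython is installed. SeqIO.parse and random.sample are the real calls; the wrapper name sample_fasta_records is just for illustration.

```python
import random


def sample_fasta_records(path, n):
    """Return n randomly chosen SeqRecord objects from a FASTA file."""
    # Imported inside the function so Biopython is only needed when it is called.
    from Bio import SeqIO

    # SeqIO.parse is lazy, so materialise the records before sampling.
    records = list(SeqIO.parse(path, "fasta"))
    return random.sample(records, n)


# Typical use, assuming A.fasta is in the current directory:
#     for rec in sample_fasta_records("A.fasta", 2):
#         print(rec.id, rec.seq)
```

Loading every record into a list is fine for a file of this size. For very large files it may be worth sampling record names first and fetching only those.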
You can extract the strings if necessary:
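One way this could look, assuming the sampled items are Biopython SeqRecord objects, which expose the name as .id and the sequence as .seq. The helper name records_to_strings is made up, and the SimpleNamespace object is only a stand-in so the sketch runs without Biopython.

```python
from types import SimpleNamespace


def records_to_strings(records):
    """Return (id, sequence) pairs as plain Python strings."""
    # str() turns the record's Seq object into an ordinary string.
    return [(rec.id, str(rec.seq)) for rec in records]


# Stand-in record; real input would be the list of sampled SeqRecord objects.
demo = [SimpleNamespace(id="chr1:901959-901979", seq="GAGGGCTTTCTGGAGAAGGAG")]

for name, seq in records_to_strings(demo):
    print(">" + name)
    print(seq)
```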
If the lines were always in pairs and you skipped the metadata at the top you could zip:
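A sketch of the zip idea, assuming every record is exactly one header line plus one sequence line. The name sample_line_pairs is illustrative, and io.StringIO stands in for the open file.

```python
import io
import random


def sample_line_pairs(handle, n):
    """Return n random (header_line, sequence_line) tuples from an open file."""
    # zip(handle, handle) pulls two consecutive lines per step from the same iterator.
    pairs = list(zip(handle, handle))
    return random.sample(pairs, n)


# io.StringIO stands in for open("A.fasta") so the sketch is self-contained.
fasta = io.StringIO(
    ">chr1:1310706-1310726\nGACGGTTTCCGGTTAGTGGAA\n"
    ">chr1:901959-901979\nGAGGGCTTTCTGGAGAAGGAG\n"
    ">chr1:983001-983021\nGTCCGCTTGCGGGACCTGGGG\n"
    ">chr1:984333-984353\nCTGGAATTCCGGGCGCTGGAG\n"
    ">chr1:1154147-1154167\nGAGATCGTCCGGGACCTGGGT\n"
)

pairs = sample_line_pairs(fasta, 2)
print(pairs)
```

If the file had metadata lines before the first header, you would advance the handle past them (for example with next(handle)) before zipping, otherwise headers and sequences end up misaligned.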
Which will give you pairs of lines in tuples:
To get the lines ready to be written:
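A possible way to do this, assuming pairs is a list of (header_line, sequence_line) tuples that still carry their trailing newlines. The pairs below are fixed examples rather than a random draw, flatten_pairs is a made-up helper name, and io.StringIO stands in for the output file.

```python
import io
from itertools import chain


def flatten_pairs(pairs):
    """Turn [(header, seq), ...] into a flat [header, seq, header, seq, ...] list."""
    return list(chain.from_iterable(pairs))


# Fixed example pairs; in practice these would come from the random sampling step.
pairs = [
    (">chr1:1154147-1154167\n", "GAGATCGTCCGGGACCTGGGT\n"),
    (">chr1:901959-901979\n", "GAGGGCTTTCTGGAGAAGGAG\n"),
]

lines = flatten_pairs(pairs)

# Stand-in for: with open("sampled.fasta", "w") as out: out.writelines(lines)
out = io.StringIO()
out.writelines(lines)
print(out.getvalue(), end="")
```

Because each line keeps its own newline, writelines produces the header and sequence lines in order with no extra formatting needed.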
Output: