Python - Generating random dna sequences with Nump

2019-02-27 16:55发布

问题:

there are two questions i would like to ask anybody that is familiar with numpy. i have seen very similar questions (and answers) but none of those used numpy which i would like to use since it offers a lot of other options i might want to use within that code in the future. i have tried to generate a list of random nucleotide sequences using "random" in python. since i wanted to have non-uniform probabilities i decided to use numpy instead. however, i get the error message: "ValueError: a must be 1-dimensional or an integer".

import numpy as np

def random_dna_sequence(length):
    return ''.join(np.random.choice('ACTG') for _ in range(length))

with open('dna.txt', 'w+') as txtout:
    for _ in range(10):
        dna = random_dna_sequence(100)
        txtout.write(dna)
        txtout.write("\n")

        print (dna)

i'm a complete scrub and i can't figure out where or how multidimensionality comes into play. i suspect ".join()" but i'm not sure and also unsure how i could replace it. my other question is how to get non-uniform probability. i tried with "np.random.choice('ACTG', p=0.2, 0.2, 0.3, 0.3)" but it doesn't work.

i hope there is somebody out there that can help. thanks in advance.

greetings, bert

回答1:

For the first part of your question, pass a as a list:

def random_dna_sequence(length):
    return ''.join(np.random.choice(list('ACTG')) for _ in range(length))

Or define your bases as a list or tuple:

BASES = ('A', 'C', 'T', 'G')

def random_dna_sequence(length):
    return ''.join(np.random.choice(BASES) for _ in range(length))

The second part has a similar solution: pass the probabilities as a list or tuple:

BASES = ('A', 'C', 'T', 'G')
P = (0.2, 0.2, 0.3, 0.3)

def random_dna_sequence(length):
    return ''.join(np.random.choice(BASES, p=P) for _ in range(length))


回答2:

I had come to a similar solution as mhawke, as far as the random_dna_sequence function is concerned. However, I am generating a sequence as long as chromosome 1 of the human genome, and it was taking almost a minute with my method, so I tried mhawke's method to see if I had any speed gains. On the contrary, it took about 10 times as long. So, for anyone dealing with large sequences, I recommend making the following change to the return statement:

BASES = ('A', 'C', 'G', 'T')
def random_dna_sequence(length):
    return ''.join(np.random.choice(BASES, length))

This basically lets numpy perform the loop, which it does much more efficiently. I hope this helps.