
Random sampling of non-overlapping substrings of l

Posted 2020-06-04 09:35

Question:

Given a string of length n, how would I (pseudo)randomly sample m substrings of size k such that none of the sampled substrings overlap? Most of my scripting experience is in Perl, but an easy-to-run solution in any common language will suffice.

Answer 1:

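If there is a character that cannot occur in the input (X, for example), overwrite each sampled substring with that character and reject any candidate that already contains it: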

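use strict;
use warnings;

my $size = 20;    # length of each substring (k)
my $count = 20;   # number of substrings to sample (m)
my $mark = 'X';   # character that never occurs in the input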
my $input = 'CCACGCATTTTTGTTCATTGTTCTGGCTTCTTACAAGGTTCAGTAGACTTTGTAACACAGTTGTGTCTCTCACAGATTGGCAGATGTTTGGTAAAGGATTGACTTTTCAGCCAACTCATGGGAAAGTGAAATAATGTAAAAAACAGGAAGAATACAGTTTTAGGCCTTTCAAGTGAGGCATGGCTTTCAGCTCTTGGCAAGAACAGGCAAGGAGATGCAAGTTTTAGGACTCTAAGAGGCTAGGCTTTTCAAAGTGCTTCTCTCCCCTTCACCCTCCTTCAGTTACAGCACCAAGCACCACCGAGGTGTTACCTGCAGCCTCACTCTCTACCTGGTTGTGGGATCCTGCCACTTCCTTAACCCACACTGAGTTCCTTGTGGTTCACAGGGTCACACAGAGGGCTGTAGAGATACAAAAGATATATGTGATTTTATATCACCTATCATATGAAGATATATTTATAAAATAGGAAACATATTAACCACTTATCATTTTATATATTTATGGTTTTATGTGTCAAAAATATATTGTTTCATGTATGTATTAAAGGATAAGTATGTATAAGAGGTTTTATAGATGTGTAAAATTATATATTTATACGTATCTTTACAAATTTAAGAATAAAGGAAGGAAAATTCTCAAAGAGGAATTCAGATATCAAGCAGTGCCCTTTGACCAAGAGCCTTGGTTACAACATACCTACAAAAGTGAACTATCATTGAAAGACCTATGGACACTGGATTTCTCTTTCCTTATTTAGAAGGGCAGTCTGTGTCTTGGAAAAGCATACAGTTTGTTGTATCTTGCTGGACAACAGGAGTCA';

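# In the worst case each earlier pick blocks 2*$size-1 start positions, so
# below this bound a free position always remains and the loop terminates.
if (2*$size*$count-$size-$count >= length($input)) {
    die "selection may not complete; choose a shorter length or fewer substrings, or provide a longer input string\n";
}

my @substrings;
while (@substrings < $count) {
    # Pick a random start position that leaves room for a full substring.
    my $pos = int rand(length($input)-$size+1);
    # 4-argument substr returns the original text and overwrites it with marks;
    # candidates that touch an already marked region are skipped.
    push @substrings, substr($input, $pos, $size, $mark x $size)
        if substr($input, $pos, $size) !~ /\Q$mark/;
}

print "$_\n" for @substrings;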
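
For readers who prefer Python, the same mark-and-reject idea can be sketched roughly as follows. The function name `sample_by_masking`, its parameters, and the choice to return `(position, substring)` pairs are assumptions made for illustration, not part of the answer above. The guard reuses the answer's bound.

```python
import random

def sample_by_masking(text, size, count, mark="X", rng=random):
    """Sample `count` non-overlapping substrings of length `size`.

    Hypothetical helper: assumes `mark` is a single character that never
    occurs in `text`. Accepted regions are overwritten with it so that
    later candidates touching them are rejected.
    """
    if len(mark) != 1 or mark in text:
        raise ValueError("mark must be one character absent from the input")
    # Same guard as the Perl version: below this bound a free slot always
    # remains, so the rejection loop is guaranteed to terminate.
    if 2 * size * count - size - count >= len(text):
        raise ValueError("input too short for guaranteed completion")

    chars = list(text)  # mutable copy of the input
    samples = []
    while len(samples) < count:
        pos = rng.randrange(len(chars) - size + 1)
        window = chars[pos:pos + size]
        if mark in window:  # overlaps an earlier pick: reject and retry
            continue
        samples.append((pos, "".join(window)))
        chars[pos:pos + size] = mark * size  # mask the accepted region
    return samples
```

Returning the start positions alongside the substrings makes it easy to check afterwards that no two picks overlap.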


Answer 2:

This is a recursive approach in Python. At each step, randomly select one of the remaining partitions of the string, then randomly select a substring of length k from the chosen partition. Replace that partition with the two pieces left after removing the chosen substring, filter out partitions shorter than k, and repeat. The list of substrings is returned once it contains m of them or no partition of length at least k remains, so it may contain fewer than m substrings.

import random

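def f(l, k, m, result=None):
    # Use None rather than a mutable default so results do not leak between calls.
    if result is None:
        result = []
    if isinstance(l, str):
        l = [l]
    # Keep only partitions long enough to hold a substring of length k;
    # this also builds a new list, so the caller's list is never modified.
    l = [part for part in l if len(part) >= k]
    if len(result) == m or len(l) == 0:
        return result
    # Pick a partition uniformly at random and remove it from the pool.
    partition = l.pop(random.randrange(len(l)))
    start = random.randint(0, len(partition) - k)
    result.append(partition[start:start+k])
    # Put back the pieces on either side of the chosen substring.
    l.extend([partition[:start], partition[start+k:]])
    return f(l, k, m, result)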
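
As a rough illustration, the same partition idea can also be written iteratively while tracking offsets into the original string, which makes the non-overlap property easy to verify. This is only a sketch: the name `sample_by_partition` and the `(start, substring)` return format are assumptions, not part of the answer. Like the recursive version, it chooses a partition uniformly at random and may return fewer than m substrings.

```python
import random

def sample_by_partition(text, k, m, rng=random):
    """Iterative variant of the partition idea; returns (start, substring) pairs."""
    # Each free region is a half-open interval [lo, hi) of `text`.
    regions = [(0, len(text))] if len(text) >= k else []
    picks = []
    while regions and len(picks) < m:
        # Choose one remaining region uniformly at random and remove it.
        lo, hi = regions.pop(rng.randrange(len(regions)))
        # Choose a start so that the substring fits inside the region.
        start = rng.randint(lo, hi - k)
        picks.append((start, text[start:start + k]))
        # Keep only the leftovers that can still hold a substring of length k.
        for piece in ((lo, start), (start + k, hi)):
            if piece[1] - piece[0] >= k:
                regions.append(piece)
    return picks
```

Because every pick is confined to a region that is then removed from the pool, any two returned start positions differ by at least k.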