Find the number of occurrences of a subsequence in

2019-01-12 14:32发布

问题:

For example, let the string be the first 10 digits of pi, 3141592653, and the subsequence be 123. Note that the sequence occurs twice:

3141592653
 1    2  3
   1  2  3

This was an interview question that I couldn't answer and I can't think of an efficient algorithm and it's bugging me. I feel like it should be possible to do with a simple regex, but ones like 1.*2.*3 don't return every subsequence. My naive implementation in Python (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

回答1:

This is a classical dynamic programming problem (and not typically solved using regular expressions).

My naive implementation (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

That would be an exhaustive search approach which runs in exponential time. (I'm surprised it runs for hours though).


Here's a suggestion for a dynamic programming solution:

Outline for a recursive solution:

(Apologies for the long description, but each step is really simple so bear with me ;-)

  • If the subsequence is empty a match is found (no digits left to match!) and we return 1

  • If the input sequence is empty we've depleted our digits and can't possibly find a match thus we return 0

  • (Neither the sequence nor the subsequence are empty.)

  • (Assume that "abcdef" denotes the input sequence, and "xyz" denotes the subsequence.)

  • Set result to 0

  • Add to the result the number of matches for bcdef and xyz (i.e., discard the first input digit and recurse)

  • If the first two digits match, i.e., a = x

    • Add to the result the number of matches for bcdef and yz (i.e., match the first subsequence digit and recurse on the remaining subsequence digits)

  • Return result


Example

Here's an illustration of the recursive calls for input 1221 / 12. (Subsequence in bold font, · represents empty string.)


Dynamic programming

If implemented naively, some (sub-)problems are solved multiple times (· / 2 for instance in the illustration above). Dynamic programming avoids such redundant computations by remembering the results from previously solved subproblems (usually in a lookup table).

In this particular case we set up a table with

  • [length of sequence + 1] rows, and
  • [length of subsequence + 1] columns:

          

The idea is that we should fill in the number of matches for 221 / 2 in the corresponding row / column. Once done, we should have the final solution in cell 1221 / 12.

We start populating the table with what we know immediately (the "base cases"):

  • When no subsequence digits are left, we have 1 complete match:

          

  • When no sequence digits are left, we can't have any matches:

We then proceed by populating the table top-down / left-to-right according to the following rule:

  • In cell [row][col] write the value found at [row-1][col].

    Intuitively this means "The number of matches for 221 / 2 includes all the matches for 21 / 2."

  • If sequence at row row and subseq at column col start with the same digit, add the value found at [row-1][col-1] to the value just written to [row][col].

    Intuitively this means "The number of matches for 1221 / 12 also includes all the matches for 221 / 12."

          

The final result looks as follows:

          

and the value at the bottom right cell is indeed 2.


In Code

Not in Python, (my apologies).

class SubseqCounter {

    String seq, subseq;
    int[][] tbl;

    public SubseqCounter(String seq, String subseq) {
        this.seq = seq;
        this.subseq = subseq;
    }

    public int countMatches() {
        tbl = new int[seq.length() + 1][subseq.length() + 1];

        for (int row = 0; row < tbl.length; row++)
            for (int col = 0; col < tbl[row].length; col++)
                tbl[row][col] = countMatchesFor(row, col);

        return tbl[seq.length()][subseq.length()];
    }

    private int countMatchesFor(int seqDigitsLeft, int subseqDigitsLeft) {
        if (subseqDigitsLeft == 0)
            return 1;

        if (seqDigitsLeft == 0)
            return 0;

        char currSeqDigit = seq.charAt(seq.length()-seqDigitsLeft);
        char currSubseqDigit = subseq.charAt(subseq.length()-subseqDigitsLeft);

        int result = 0;

        if (currSeqDigit == currSubseqDigit)
            result += tbl[seqDigitsLeft - 1][subseqDigitsLeft - 1];

        result += tbl[seqDigitsLeft - 1][subseqDigitsLeft];

        return result;
    }
}

Complexity

A bonus for this "fill-in-the-table" approach is that it is trivial to figure out complexity. A constant amount of work is done for each cell, and we have length-of-sequence rows and length-of-subsequence columns. Complexity is therefor O(MN) where M and N denote the lengths of the sequences.



回答2:

Great answer, aioobe! to complement your answer, some possible implementations in Python:

# straightforward, naïve solution; too slow!

def num_subsequences(seq, sub):
    if not sub:
        return 1
    elif not seq:
        return 0
    result = num_subsequences(seq[1:], sub)
    if seq[0] == sub[0]:
        result += num_subsequences(seq[1:], sub[1:])
    return result

# top-down solution using explicit memoization

def num_subsequences(seq, sub):
    m, n, cache = len(seq), len(sub), {}
    def count(i, j):
        if j == n:
            return 1
        elif i == m:
            return 0
        k = (i, j)
        if k not in cache:
            cache[k] = count(i+1, j) + (count(i+1, j+1) if seq[i] == sub[j] else 0)
        return cache[k]
    return count(0, 0)

# top-down solution using the lru_cache decorator
# available from functools in python >= 3.2

from functools import lru_cache

def num_subsequences(seq, sub):
    m, n = len(seq), len(sub)
    @lru_cache(maxsize=None)
    def count(i, j):
        if j == n:
            return 1
        elif i == m:
            return 0
        return count(i+1, j) + (count(i+1, j+1) if seq[i] == sub[j] else 0)
    return count(0, 0)

# bottom-up, dynamic programming solution using a lookup table

def num_subsequences(seq, sub):
    m, n = len(seq)+1, len(sub)+1
    table = [[0]*n for i in xrange(m)]
    def count(iseq, isub):
        if not isub:
            return 1
        elif not iseq:
            return 0
        return (table[iseq-1][isub] +
               (table[iseq-1][isub-1] if seq[m-iseq-1] == sub[n-isub-1] else 0))
    for row in xrange(m):
        for col in xrange(n):
            table[row][col] = count(row, col)
    return table[m-1][n-1]

# bottom-up, dynamic programming solution using a single array

def num_subsequences(seq, sub):
    m, n = len(seq), len(sub)
    table = [0] * n
    for i in xrange(m):
        previous = 1
        for j in xrange(n):
            current = table[j]
            if seq[i] == sub[j]:
                table[j] += previous
            previous = current
    return table[n-1] if n else 1


回答3:

One way to do it would be with two lists. Call them Ones and OneTwos.

Go through the string, character by character.

  • Whenever you see the digit 1, make an entry in the Ones list.
  • Whenever you see the digit 2, go through the Ones list and add an entry to the OneTwos list.
  • Whenever you see the digit 3, go through the OneTwos list and output a 123.

In the general case that algorithm will be very fast, since it's a single pass through the string and multiple passes through what will normally be much smaller lists. Pathological cases will kill it, though. Imagine a string like 111111222222333333, but with each digit repeated hundreds of times.



回答4:

from functools import lru_cache

def subseqsearch(string,substr):
    substrset=set(substr)
    #fixs has only element in substr
    fixs = [i for i in string if i in substrset]
    @lru_cache(maxsize=None) #memoisation decorator applyed to recs()
    def recs(fi=0,si=0):
        if si >= len(substr):
            return 1
        r=0
        for i in range(fi,len(fixs)):
            if substr[si] == fixs[i]:
                r+=recs(i+1,si+1)
        return r
    return recs()

#test
from functools import reduce
def flat(i) : return reduce(lambda x,y:x+y,i,[])
N=5
string = flat([[i for j in range(10) ] for i in range(N)])
substr = flat([[i for j in range(5) ] for i in range(N)]) 
print("string:","".join(str(i) for i in string),"substr:","".join(str(i) for i in substr),sep="\n")
print("result:",subseqsearch(string,substr))

output (instantly):

string:
00000000001111111111222222222233333333334444444444
substr:
0000011111222223333344444
result: 1016255020032


回答5:

My quick attempt:

def count_subseqs(string, subseq):
    string = [c for c in string if c in subseq]
    count = i = 0
    for c in string:
        if c == subseq[0]:
            pos = 1
            for c2 in string[i+1:]:
                if c2 == subseq[pos]:
                    pos += 1
                    if pos == len(subseq):
                        count += 1
                        break
        i += 1
    return count

print count_subseqs(string='3141592653', subseq='123')

Edit: This one should be correct also if 1223 == 2 and more complicated cases:

def count_subseqs(string, subseq):
    string = [c for c in string if c in subseq]
    i = 0
    seqs = []
    for c in string:
        if c == subseq[0]:
            pos = 1
            seq = [1]
            for c2 in string[i + 1:]:
                if pos > len(subseq):
                    break
                if pos < len(subseq) and c2 == subseq[pos]:
                    try:
                        seq[pos] += 1
                    except IndexError:
                        seq.append(1)
                        pos += 1
                elif pos > 1 and c2 == subseq[pos - 1]:
                    seq[pos - 1] += 1
            if len(seq) == len(subseq):
                seqs.append(seq)
        i += 1
    return sum(reduce(lambda x, y: x * y, seq) for seq in seqs)

assert count_subseqs(string='12', subseq='123') == 0
assert count_subseqs(string='1002', subseq='123') == 0
assert count_subseqs(string='0123', subseq='123') == 1
assert count_subseqs(string='0123', subseq='1230') == 0
assert count_subseqs(string='1223', subseq='123') == 2
assert count_subseqs(string='12223', subseq='123') == 3
assert count_subseqs(string='121323', subseq='123') == 3
assert count_subseqs(string='12233', subseq='123') == 4
assert count_subseqs(string='0123134', subseq='1234') == 2
assert count_subseqs(string='1221323', subseq='123') == 5


回答6:

psh. O(n) solutions are way better.

Think of it by building a tree:

iterate along the string if the character is '1', add a node to the the root of the tree. if the character is '2', add a child to each first level node. if the character is '3', add a child to each second level node.

return the number of third layer nodes.

this would be space inefficient so why don't we just store the number of nodes a each depth:

infile >> in;
long results[3] = {0};
for(int i = 0; i < in.length(); ++i) {
    switch(in[i]) {
        case '1':
        results[0]++;
        break;
        case '2':
        results[1]+=results[0];
        break;
        case '3':
        results[2]+=results[1];
        break;
        default:;
    }
}

cout << results[2] << endl;


回答7:

How to count all three-member sequences 1..2..3 in the array of digits.

Quickly and simply

Notice, we need not FIND all sequences, we need only COUNT them. So, all algorithms that search for sequences, are excessively complex.

  1. Throw off every digit, that is not 1,2,3. The result will be char array A
  2. Make parallel int array B of 0's. Running A from the end, count for the each 2 in A the number of 3's in A after them. Put these numbers into the appropriate elements of B.
  3. Make parallel int array C of 0's.Running A from the end count for the each 1 in A the sum of B after its position. The result put into the appropriate place in C.
  4. Count the sum of C.

That is all. The complexity is O(N). Really, for the normal line of digits, it will take about twice the time of the shortening of the source line.

If the sequence will be longer, of , say, M members, the procedure could be repeated M times. And complexity will be O(MN), where N already will be the length of the shortened source string.



回答8:

I have an interesting O(N) time and O(M) space solution for this problem.
N being length of text and M being length of pattern to be searched for. I will explain the algorithm to you because i implement in C++.

lets suppose the input given is as you provided 3141592653 and the pattern sequence whose count to find is 123 . I will begin by taking a hash map which maps characters to their positions in the input pattern . I also take an array of size M initially initialised to 0.

    string txt,pat;
    cin >> txt >> pat;
    int n = txt.size(),m = pat.size();
    int arr[m];
    map<char,int> mp;
    map<char,int> ::iterator it;
    f(i,0,m)
    {
        mp[pat[i]] = i;
        arr[i] = 0;
    }

I start looking for elements from the back and check if each element is in the pattern or not . If that element is in the pattern . I have to do something.

Now when i start looking from the back if i find a 2 and previous i have not found any 3 . This 2 is of no value to us.Because any 1 found after it will atmost form such sequence 12 and 123 wont be formed Ryt? think. Also at present position i have found a 2 and it will form sequences 123 only with 3's found previously and will form x sequences if we found x 3's previously (if part of sequence before 2 will be found)ryt? So the complete algorithm is whenever i find an element which is present in the array I check for its position j correspondingly at which it was present in the pattern (stored in hash map). I just inc increment

 arr[j] += arr[j+1];

signifying it will contribute to sequences of 3 found before it ryt? and if j found is m-1 i will simply increment it

 arr[j] += 1; 

Check the code snippets below which do these

    for(int i = (n-1);i > -1;i--)
    {
        char ch = txt[i];
        if(mp.find(ch) != mp.end())
        {
            int j = mp[ch];
            if(j == (m-1))
                arr[j]++;
            else if(j < (m-1))
                arr[j] += arr[j+1];
            else
                {;}
        }
    }

Now consider the fact

each index i in the array stores the number of times the substring of the pattern S[i,(m-1)] appers as a sequence of the input string So finally print the value of arr[0]

    cout << arr[0] << endl;

Code with Output(unique chars in pattern) http://ideone.com/UWaJQF

Code with Output(repetitions allowed of chars) http://ideone.com/14DZh7

Extension works only if pattern has unique elements What if pattern has unique elements then complexity may shoot to O(MN) Solution is similar without using DP just when an element occuring in the pattern appeared we just incremented array position j corresponding to it we now have to update all this characters' occurences in the pattern which will lead to a complexity of O(N*maxium frequency of a charater)

#define f(i,x,y) for(long long i = (x);i < (y);++i)



int main()
{
long long T;
cin >> T;
while(T--)
{
    string txt,pat;
    cin >> txt >> pat;
    long long n = txt.size(),m = pat.size();
    long long arr[m];
    map<char,vector<long long> > mp;
    map<char,vector<long long> > ::iterator it;
    f(i,0,m)
    {
        mp[pat[i]].push_back(i);
        arr[i] = 0;
    }

    for(long long i = (n-1);i > -1;i--)
    {
        char ch = txt[i];
        if(mp.find(ch) != mp.end())
        {
            f(k,0,mp[ch].size())
            {
                long long j = mp[ch][k];
                if(j == (m-1))
                    arr[j]++;
                else if(j < (m-1))
                    arr[j] += arr[j+1];
                else
                    {;}
                }
                }
                }
                cout <<arr[0] << endl;
                }
                 }

can be extended in similar way without DP in strings with repetitions but then complexity would be more O(MN)