For example, let the string be the first 10 digits of pi, 3141592653
, and the subsequence be 123
. Note that the sequence occurs twice:
3141592653
1 2 3
1 2 3
This was an interview question that I couldn't answer and I can't think of an efficient algorithm and it's bugging me. I feel like it should be possible to do with a simple regex, but ones like 1.*2.*3
don't return every subsequence. My naive implementation in Python (count the 3's for each 2 after each 1) has been running for an hour and it's not done.
output (instantly):
My quick attempt:
Edit: This one should be correct also if
1223 == 2
and more complicated cases:How to count all three-member sequences 1..2..3 in the array of digits.
Quickly and simply
Notice, we need not FIND all sequences, we need only COUNT them. So, all algorithms that search for sequences, are excessively complex.
That is all. The complexity is O(N). Really, for the normal line of digits, it will take about twice the time of the shortening of the source line.
If the sequence will be longer, of , say, M members, the procedure could be repeated M times. And complexity will be O(MN), where N already will be the length of the shortened source string.
This is a classical dynamic programming problem (and not typically solved using regular expressions).
That would be an exhaustive search approach which runs in exponential time. (I'm surprised it runs for hours though).
Here's a suggestion for a dynamic programming solution:
Outline for a recursive solution:
(Apologies for the long description, but each step is really simple so bear with me ;-)
If the subsequence is empty a match is found (no digits left to match!) and we return 1
If the input sequence is empty we've depleted our digits and can't possibly find a match thus we return 0
(Neither the sequence nor the subsequence are empty.)
(Assume that "abcdef" denotes the input sequence, and "xyz" denotes the subsequence.)
Set
result
to 0Add to the
result
the number of matches for bcdef and xyz (i.e., discard the first input digit and recurse)If the first two digits match, i.e., a = x
result
the number of matches for bcdef and yz (i.e., match the first subsequence digit and recurse on the remaining subsequence digits)Return
result
Example
Here's an illustration of the recursive calls for input 1221 / 12. (Subsequence in bold font, · represents empty string.)
Dynamic programming
If implemented naively, some (sub-)problems are solved multiple times (· / 2 for instance in the illustration above). Dynamic programming avoids such redundant computations by remembering the results from previously solved subproblems (usually in a lookup table).
In this particular case we set up a table with
The idea is that we should fill in the number of matches for 221 / 2 in the corresponding row / column. Once done, we should have the final solution in cell 1221 / 12.
We start populating the table with what we know immediately (the "base cases"):
When no sequence digits are left, we can't have any matches:
We then proceed by populating the table top-down / left-to-right according to the following rule:
In cell [row][col] write the value found at [row-1][col].
Intuitively this means "The number of matches for 221 / 2 includes all the matches for 21 / 2."
If sequence at row row and subseq at column col start with the same digit, add the value found at [row-1][col-1] to the value just written to [row][col].
Intuitively this means "The number of matches for 1221 / 12 also includes all the matches for 221 / 12."
The final result looks as follows:
and the value at the bottom right cell is indeed 2.
In Code
Not in Python, (my apologies).
Complexity
A bonus for this "fill-in-the-table" approach is that it is trivial to figure out complexity. A constant amount of work is done for each cell, and we have length-of-sequence rows and length-of-subsequence columns. Complexity is therefor O(MN) where M and N denote the lengths of the sequences.
psh. O(n) solutions are way better.
Think of it by building a tree:
iterate along the string if the character is '1', add a node to the the root of the tree. if the character is '2', add a child to each first level node. if the character is '3', add a child to each second level node.
return the number of third layer nodes.
this would be space inefficient so why don't we just store the number of nodes a each depth:
One way to do it would be with two lists. Call them
Ones
andOneTwos
.Go through the string, character by character.
1
, make an entry in theOnes
list.2
, go through theOnes
list and add an entry to theOneTwos
list.3
, go through theOneTwos
list and output a123
.In the general case that algorithm will be very fast, since it's a single pass through the string and multiple passes through what will normally be much smaller lists. Pathological cases will kill it, though. Imagine a string like
111111222222333333
, but with each digit repeated hundreds of times.