Given a sequence such as S = {1,8,2,1,4,1,2,9,1,8,4}, I need to find the minimal-length subsequence that contains all element of S (no duplicates, order does not matter). How do find this subsequence in an efficient way?
Note: There are 5 distinct elements in S: {1,2,4,8,9}. The minimum-length subsequence must contain all these 5 elements.
Algorithm:
First, determine the quantity of different elements in the array - this can be easily done in linear time. Let there be
k
different elements.Allocate an array
cur
of size 10^5, each showing how much of each element is used in current subsequence (see later).Hold a
cnt
variable showing how many different elements are there currently in the considered sequence. Now, take two indexes,begin
andend
and iterate them through the array the following way:cnt
andbegin
as0
,end
as-1
(to get0
after first increment). Then while possible perform follows:If
cnt != k
:2.1. increment
end
. Ifend
already is the end of array, then break. Ifcur[array[end]]
is zero, incrementcnt
. Incrementcur[array[end]]
.Else:
2.2 {
Try to increment the
begin
iterator: whilecur[array[begin]] > 1
, decrement it, and increment thebegin
(cur[array[begin]] > 1
means that we have another such element in our current subsequence). After all, compare the[begin, end]
interval with current answer and store it if it is better.}
After the further process becomes impossible, you got the answer. The complexity is
O(n)
- just passing two interators through the array.Implementation in C++:
I've got a O(N*M) algorithm where N is the length of S, and M is the number of elements (it tend to works better for small values of M, i.e : if there are very few duplicates, it may be a bad algorithm with quadratic cost) Edit : It seems that in fact, it's much closer to O(N) in practise. You get
O(N*M)
only in worst case scenariosStart by going through the sequence and record all the elements of S. Let's call this set E.
We're going to work with a dynamic subsequence of S. Create an empty
map
M where M associates to each element the number of times it is present in the subsequence.For example, if
subSequence = {1,8,2,1,4}
, andE = {1, 2, 4, 8, 9}
M[9]==0
M[2]==M[4]==M[8]==1
M[1]==2
You'll need two index, that will each point to an element of S. One of them will be called L because he's at the left of the subsequence formed by those two indexes. The other one will be called R as it's the index of the right part of the subsequence.
Begin by initializing
L=0
,R=0
andM[S[0]]++
The algorithm is :
To check if M contains all the elements of E, you can have a vector of booleans V.
V[i]==true
ifM[E[i]]>0
andV[i]==false
ifM[E[i]]==0
. So you begin by setting all the values of V atfalse
, and each time you doM[S[R]]++
, you can set V of this element totrue
, and each time you doM[S[L]]--
andM[S[L]]==0
then set V of this element tofalse
This can be solved by dynamic programming.
At each step
k
, we'll compute the shortest subsequence that ends at thek
-th position ofS
and that satisfies the requirement of containing all the unique elements ofS
.Given the solution to step
k
(hereinafter "the sequence"), computing the solution to stepk+1
is easy: append the(k+1)
-th element of S to the sequence and then remove, one by one, all elements at the start of the sequence that are contained in the extended sequence more than once.The solution to the overall problem is the shortest sequence found in any of the steps.
The initialization of the algorithm consists of two stages:
S
once, building the alphabet of unique values.S
; the last position of this sequence will be the initial value ofk
.All of the above can be done in
O(n logn)
worst-case time (let me know if this requires clarification).Here is a complete implementation of the above algorithm in Python:
Notes:
O(n)
in the worst case. If it's the worst case that you care about, replacing them with tree-based structures will give the overallO(n logn)
I've promised above;S
can be eliminated, making the algorithm suitable for streaming data;S
are small non-negative integers, as your comments indicate, thencount
can be flattened out into an integer array, bringing the overall complexity down toO(n)
.Here is an algorithm that requires O(N) time and O(N) space. It is similar to that one by Grigor Gevorgyan. It also uses an auxiliary O(N) array of flags. The algorithm finds the longest subsequence of unique elements. If
bestLength < numUnique
then there is no subsequence containing all unique elements. The algorithm assumes that the elements are positive numbers and that the maximal element is less than the length of the sequence.above solution is correct and java version of above code
If you need to do this quite often for the same sequence and different sets you can use inverted lists for this. You prepare the inverted lists for the sequence and then collect all the offsets. Then scan the results from the inverted lists for a sequence of m sequential numbers.
With
n
the length of the sequence andm
the size of the query the preparation would be inO(n)
. The response time for the query would be inO(m^2)
if I am not miscalculating the merge step.If you need more details have a look at the paper by Clausen/Kurth from 2004 on algebraic databases ("Content-Based Information Retrieval by Group Theoretical Methods"). This sketches out a general database framework that can be adapted to your task.