My problem is this: I have a large sequence of numbers. I know that, after some point, it becomes periodic - that is, there are k numbers at the beginning of the sequence, and then there are m more numbers that repeat for the rest of the sequence. As an example to make this more clear, the sequence might look like this: [1, 2, 5, 3, 4, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, ...], where k is 5 and m is 4, and the repeating block is then [2, 1, 1, 3]. As is clear from this example, I can have repeating bits inside of the larger block, so it doesn't help to just look for the first instances of repetition.
However, I do not know what k or m are - my goal is to take the sequence [a_1, a_2, ... , a_n] as an input and output the sequence [a_1, ... , a_k, [a_(k+1), ... , a_(k+m)]] - basically truncating the longer sequence by listing the majority of it as a repeating block.
Is there an efficient way to do this problem? Also, likely harder but more ideal computationally - is it possible to do this as I generate the sequence in question, so that I have to generate a minimal amount? I've looked at other, similar questions on this site, but they all seem to deal with sequences without the beginning non-repeating bit, and often without having to worry about internal repetition.
If it helps/would be useful, I can also get into why I am looking at this and what I will use it for.
Thanks!
EDITS: First, I should have mentioned that I do not know if the input sequence ends at exactly the end of a repeated block.
The real-world problem that I am attempting to work on is writing a nice, closed-form expression for continued fraction expansions (CFEs) of quadratic irrationals (actually, the negative CFE). It is very simple to generate partial quotients* for these CFEs to any degree of accuracy - however, at some point the tail of the CFE for a quadratic irrational becomes a repeating block. I need to work with the partial quotients in this repeating block.
My current thoughts are this: perhaps I can adapt some of the algorithms suggested that work from the right to work with one of these sequences. Alternatively, perhaps there is something in the proof of why quadratic irrationals are periodic that will help me see why they begin to repeat, which will help me come up with some easy criteria to check.
*If I am writing a continued fraction expansion as [a_0, a_1, ...], I refer to the a_i's as partial quotients.
Some background info can be found here for those interested: http://en.wikipedia.org/wiki/Periodic_continued_fraction
You can use a rolling hash to achieve linear time complexity and O(1) space complexity (I think this is the case, since I don't believe you can have an infinite repeating sequence with two frequencies which are not multiples of each other).
Algorithm: You just keep two rolling hashes which expand like this:
_______ _______ _______
/ \/ \/ \
...2038975623895769874883301010883301010883301010
. . . ||
. . . [][]
. . . [ ][ ]
. . .[ ][ ]
. . [. ][ ]
. . [ . ][ ]
. . [ .][ ]
. . [ ][ ]
. [ ][ ]
Keep on doing this for the entire sequence. The first pass will only detect repetitions repeated 2*n times for some value of n. However that's not our goal: our goal in the first pass is to detect all possible periods, which this does. As we go along the sequence performing this process, we also keep track of all relatively prime periods we will need to later check:
periods = Set(int)
periodsToFurthestReach = Map(int -> int)
for hash1,hash2 in expandedPairOfRollingHashes(sequence):
L = hash.length
if hash1==hash2:
if L is not a multiple of any period:
periods.add(L)
periodsToFurthestReach[L] = 2*L
else L is a multiple of some periods:
for all periods P for which L is a multiple:
periodsToFurthestReach[P] = 2*L
After this process, we have a list of all periods and how far they've reached. Our answer is probably the one with the furthest reach, but we check all other periods for repetition (fast because we know the periods we're checking for). If this is computationally difficult, we can optimize by pruning away periods (which stop repeating) as we're going through the list, very much like the sieve of Eratosthenes, by keeping a priority queue of when we next expect a period to repeat.
At the end, we double-check the result to make sure there was no hash collision (in unlikely even there is, blacklist and repeat).
Here I assumed your goal was to minimize non-repeating-length, and not give a repeating element which can be further factored; you can modify this algorithm to find all other compressions, if they exist.
So, ninjagecko provided a good working answer to the question I posed. Thanks very much! However, I ended up finding a more efficient, mathematically based way to do the specific case that I am looking at - that is, writing out a closed form expression for the continued fraction expansion of a quadratic irrational. Obviously this solution will only work for this specific case, rather than the general case that I asked about, but I thought it might be useful to put it here in case others have a similar question.
Basically, I remembered that a quadratic irrational is reduced if and only if its continued fraction expansion is purely periodic - as in, it repeats right from the beginning, without any leading terms.
When you work out the continued fraction expansion of a number x, you basically set x_0 to be x, and then you form your sequence [a_0; a_1, a_2, a_3, ... ] by defining a_n = floor(x_n) and x_(n+1) = 1/(x_n - a_n). Normally you would just continue this until you reach a desired precision. However, for our purposes, we just run this method until x_k is a reduced quadratic irrational (which occurs if it is bigger than 1 and its conjugate is between -1 and 0). Once this happens, we know that a_k is the first term of our repeating block. Then, when we find x_(k+m+1) equal to x_k, we know that a_(k+m) is the last term in our repeating block.
Search from the right:
- does a_n == a_n-1
- does (a_n,a_n-1) == (a_n-2,a_n-3)
- ...
This is clearly O(m^2). The only available bound appears to be that m<n/2, so it's O(n^2)
Is this acceptable for your application? (Are we doing your homework for you, or is there an actual real-world problem here?)
This page lists several good cycle-detection algorithms and gives an implementation of an algorithm in C.
Consider the sequence once it has repeated a number of times. It will end e.g. ...12341234123412341234. If you take the repeating part of the string up to just before the last cycle of repeats, and then slide it along by the length of that cycle, you will find that you have a long match between a substring at the end of the sequence and the same substring slid to the left a distance which is small compared with its length.
Conversely, if you have a string where a[x] = a[x + k] for a large number of x, then you also have a[x] = a[x + k] = a[x + 2k] = a[x + 3k]... so a string that matches itself when slid a short distance compared to its length must contain repeats.
If you look at http://en.wikipedia.org/wiki/Suffix_array, you will see that you can build the list of all suffixes of a string, in sorted order, in linear time, and also an array which tells you how many characters each suffix has in common with the previous suffix in sorted order. If you look for the entry with the largest value of this, this would be my candidate for a string going ..1234123412341234, and the distance between the starting points of the two suffixes would tell you the length at which the sequence repeats. (but in practice some sort of rolling hash search like http://en.wikipedia.org/wiki/Rabin-Karp might be quicker and easier, although there are quite codeable linear time Suffix Array algorithms, like "Simple Linear Work Suffix Array Construction" by Karkkainen and Sanders).
Suppose that you apply this algorithm when the number of characters available is 8, 16, 32, 64, .... 2^n, and you finally find a repeat at 2^p. How much time have you wasted in earlier stages? 2^(p-1) + 2^(p-2) + ..., which sums to about 2^p, so the repeated searches are only a constant overhead.