Fast Fibonacci computation

Posted 2019-03-21 05:07

Question:

I saw a comment on Google+ a few weeks ago in which someone demonstrated a straightforward computation of Fibonacci numbers which was not based on recursion and didn't use memoization. He effectively just remembered the last 2 numbers and kept adding them. This is an O(N) algorithm, but he implemented it very cleanly. So I quickly pointed out that a quicker way is to take advantage of the fact that they can be computed as powers of the matrix [[0,1],[1,1]], which requires only O(log N) matrix multiplications.

The problem, of course, is that this is far from optimal past a certain point. The method is efficient as long as the numbers are not too large, but the numbers grow in length at the rate of N*log(phi)/log(10), where N is the index of the Fibonacci number and phi is the golden ratio ((1+sqrt(5))/2 ≈ 1.618). As it turns out, log(phi)/log(10) is very close to 1/5, so the Nth Fibonacci number can be expected to have roughly N/5 digits.
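As a quick sanity check of that estimate (this sketch is mine, not from the original post), the standard digit-count formula digits(F(N)) = floor(N*log10(phi) - log10(sqrt(5))) + 1 indeed gives roughly N/5:

from math import log10, sqrt

phi = (1 + sqrt(5)) / 2
for N in (100000, 1000000):
    # number of decimal digits of F(N), valid for N >= 1
    print(N, int(N * log10(phi) - log10(sqrt(5))) + 1)  # prints 20899 and 208988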

Matrix multiplication, heck even plain number multiplication, gets very slow when the numbers start to have millions or billions of digits. So F(100,000) took about .03 seconds to compute (in Python), while F(1,000,000) took roughly 5 seconds. This is hardly O(log N) growth. My estimate was that this method, without improvements, only optimizes the computation to roughly O((log N)^2.5).

Computing the billionth Fibonacci number, at this rate, would be prohibitively slow (even though it would only have ~1,000,000,000 / 5 digits, so it easily fits within a 32-bit address space).

Does anyone know of an implementation or algorithm which would allow a faster computation? Perhaps something which would allow calculation of a trillionth Fibonacci number.

And just to be clear, I am not looking for an approximation. I am looking for the exact computation (to the last digit).

Edit 1: I am adding the Python code to show what I believe is an O((log N)^2.5) algorithm.

from operator import mul
from time import clock

class TwoByTwoMatrix:
    __slots__ = "rows"

    def __init__(self, m):
        self.rows = m

    def __imul__(self, other):
        self.rows = [[sum(map(mul, my_row, oth_col)) for oth_col in zip(*other.rows)] for my_row in self.rows]
        return self

    def intpow(self, i):
        i = int(i)
        result = TwoByTwoMatrix([[long(1), long(0)], [long(0), long(1)]])
        if i <= 0:
            return result
        # strip the trailing zero bits of i; the result is
        # squared that many times at the end instead
        k = 0
        while i % 2 == 0:
            k += 1
            i >>= 1
        multiplier = TwoByTwoMatrix(self.rows)
        while i > 0:
            if i & 1:
                result *= multiplier
            multiplier *= multiplier # square it
            i >>= 1
        for j in xrange(k):
            result *= result
        return result


m = TwoByTwoMatrix([[0,1],[1,1]])

t1 = clock()
print len(str(m.intpow(100000).rows[1][1]))
t2 = clock()
print t2 - t1

t1 = clock()
print len(str(m.intpow(1000000).rows[1][1]))
t2 = clock()
print t2 - t1

Edit 2: It looks like I didn't account for the fact that len(str(...)) makes a significant contribution to the overall runtime of the test (converting a big int to its decimal representation is itself quadratic in the number of digits in CPython). Changing the tests to

from math import log

t1 = clock()
print log(m.intpow(100000).rows[1][1])/log(10)
t2 = clock()
print t2 - t1

t1 = clock()
print log(m.intpow(1000000).rows[1][1])/log(10)
t2 = clock()
print t2 - t1

shortened the runtimes to .008 seconds and .31 seconds (from .03 seconds and 5 seconds when len(str(...)) was used).

Because M = [[0,1],[1,1]] raised to the power N is [[F(N-2), F(N-1)], [F(N-1), F(N)]], the other obvious source of inefficiency was calculating the (0,1) and (1,0) elements of the matrix as if they were distinct. The following symmetric-matrix version fixes that (I also switched to Python 3, but Python 2.7 times are similar):

class SymTwoByTwoMatrix:
    # elements (0,0), (0,1), (1,1) of a symmetric 2x2 matrix are a, b, c;
    # b is also the (1,0) element because the matrix is symmetric

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def __imul__(self, other):
        # this multiplication does work correctly because we 
        # are multiplying powers of the same symmetric matrix
        self.a, self.b, self.c = \
            self.a * other.a + self.b * other.b, \
            self.a * other.b + self.b * other.c, \
            self.b * other.b + self.c * other.c
        return self

    def intpow(self, i):
        i = int(i)
        result = SymTwoByTwoMatrix(1, 0, 1)
        if i <= 0:
            return result
        k = 0
        while i % 2 == 0:
            k += 1
            i >>= 1
        multiplier = SymTwoByTwoMatrix(self.a, self.b, self.c)
        while i > 0:
            if i & 1:
                result *= multiplier
            multiplier *= multiplier # square it
            i >>= 1
        for j in range(k):
            result *= result
        return result

calculated F(100,000) in .006 seconds, F(1,000,000) in .235 seconds, and F(10,000,000) in 9.51 seconds.

That is to be expected. It produces results 45% faster for the fastest test, and the gain should asymptotically approach phi/(1 + 2*phi + phi*phi) ≈ 23.6%.

The (0,0) element of M^N is actually the (N-2)nd Fibonacci number:

m = SymTwoByTwoMatrix(0, 1, 1)  # the Fibonacci matrix [[0,1],[1,1]] in symmetric form
for i in range(15):
    x = m.intpow(i)
    print([x.a, x.b, x.c])

gives

[1, 0, 1]
[0, 1, 1]
[1, 1, 2]
[1, 2, 3]
[2, 3, 5]
[3, 5, 8]
[5, 8, 13]
[8, 13, 21]
[13, 21, 34]
[21, 34, 55]
[34, 55, 89]
[55, 89, 144]
[89, 144, 233]
[144, 233, 377]
[233, 377, 610]

I would expect that not having to calculate element (0,0) would produce an additional speed-up of 1/(1 + phi + phi*phi) ≈ 19%. But the lru_cache-of-F(2N)-and-F(2N-1) solution given by Eli Korvigo below actually gives a speed-up of 4 times (i.e., 75%). So, while I have not worked out a formal explanation, I am tempted to think that it caches the spans of 1's within the binary expansion of N and does the minimum number of multiplications necessary, which obviates the need to find those ranges, precompute them, and then multiply them at the right point in the expansion of N. lru_cache allows for a top-to-bottom computation of what would have been a more complicated bottom-to-top computation.

Both SymTwoByTwoMatrix and the lru_cache-of-F(2N)-and-F(2N-1) solution take roughly 40 times longer to compute every time N grows 10 times. I think that's possibly due to Python's implementation of the multiplication of long ints. I think the multiplication of large numbers and their addition should be parallelizable, so a multi-threaded sub-O(N) solution should be possible even though (as Daniel Fisher states in the comments) the F(N) solution is Theta(n).

Answer 1:

Since the Fibonacci sequence is a linear recurrence, its members can be evaluated in closed form. This involves computing a power, which can be done in O(log n) time similarly to the matrix-multiplication solution, but the constant overhead should be lower. That's the fastest algorithm I know.
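(For reference, the closed form being alluded to is Binet's formula: F(n) = (phi^n - psi^n) / sqrt(5), where phi = (1 + sqrt(5))/2 and psi = (1 - sqrt(5))/2 = -1/phi.)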

EDIT

Sorry, I missed the "exact" part. Another exact O(log n) alternative to the matrix multiplication can be calculated as follows:

from functools import lru_cache

@lru_cache(None)
def fib(n):
    # note: the indexing here is shifted by one from the usual convention,
    # i.e. fib(n) equals F(n+1) with F(0) = 0, F(1) = 1
    if n in (0, 1):
        return 1
    if n & 1:  # if n is odd, it's faster than checking with modulo
        return fib((n+1)//2 - 1) * (2*fib((n+1)//2) - fib((n+1)//2 - 1))
    a, b = fib(n//2 - 1), fib(n//2)
    return a**2 + b**2

This is based on a derivation from a note by Prof. Edsger Dijkstra. The solution exploits the fact that to calculate both F(2N) and F(2N-1) you only need to know F(N) and F(N-1). Nevertheless, you are still dealing with long-number arithmetic, though the overhead should be smaller than that of the matrix-based solution. In Python you'd be better off rewriting this in imperative style due to the slow memoization and recursion, though I wrote it this way for the clarity of the functional formulation.
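For instance, an imperative fast-doubling version might look like this (a sketch of mine, not part of the original answer; note it uses the standard indexing F(0) = 0, F(1) = 1, so it is shifted by one relative to the memoized code above):

def fib_doubling(n):
    # walk the bits of n from most significant to least significant,
    # maintaining the pair (F(m), F(m+1)) for the bit prefix m read so far
    a, b = 0, 1  # F(0), F(1)
    for bit in bin(n)[2:]:
        c = a * (2 * b - a)  # F(2m)   = F(m) * (2*F(m+1) - F(m))
        d = a * a + b * b    # F(2m+1) = F(m)^2 + F(m+1)^2
        if bit == '1':
            a, b = d, c + d  # m -> 2m + 1
        else:
            a, b = c, d      # m -> 2m
    return a

assert fib_doubling(10) == 55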



Answer 2:

Using the "weird square rooty equation" from the other answer (the closed form), you CAN compute the kth Fibonacci number exactly. This is because the sqrt(5) falls out in the end. You just have to arrange your multiplication to keep track of it in the meantime.

def rootiply(a1, b1, a2, b2, c):
    ''' multiply a1+b1*sqrt(c) and a2+b2*sqrt(c)... returns the new a,b '''
    return a1*a2 + b1*b2*c, a1*b2 + a2*b1

def rootipower(a, b, c, n):
    ''' raise a + b*sqrt(c) to the nth power... returns the a,b of the result in the same format '''
    ar, br = 1, 0
    while n != 0:
        if n % 2:
            ar, br = rootiply(ar, br, a, b, c)
        a, b = rootiply(a, b, a, b, c)
        n //= 2  # integer division, so this also works on Python 3
    return ar, br

def fib(k):
    ''' the kth fibonacci number '''
    a1, b1 = rootipower(1, 1, 5, k)    # (1+sqrt(5))**k
    a2, b2 = rootipower(1, -1, 5, k)   # (1-sqrt(5))**k
    a = a1 - a2
    b = b1 - b2
    a, b = rootiply(0, 1, a, b, 5)     # multiply the difference by sqrt(5)
    # b should be 0!
    assert b == 0
    return a // 2**k // 5  # integer division keeps the result exact

if __name__ == "__main__":
    assert rootipower(1, 2, 3, 3) == (37, 30)  # (1+2*sqrt(3))**2 = 13+4*sqrt(3); (1+2*sqrt(3))**3 = 37+30*sqrt(3)
    assert fib(10) == 55


Answer 3:

From Wikipedia,

For all n ≥ 0, the number Fn is the closest integer to phi^n/sqrt(5), where phi is the golden ratio. Therefore, it can be found by rounding, that is, by the use of the nearest integer function.
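As a quick sketch (mine, not part of the quoted answer), this is what that looks like in Python. With double-precision floats it stops being exact somewhere around F(70), so it does not satisfy the exactness requirement of the question:

from math import sqrt

def fib_round(n):
    # exact only while phi**n / sqrt(5) fits in a 53-bit float mantissa
    phi = (1 + sqrt(5)) / 2
    return round(phi**n / sqrt(5))

assert fib_round(10) == 55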



Answer 4:

This is too long for a comment, so I'll leave an answer.

The answer by Aaron is correct, and I've upvoted it, as should you. I will provide the same answer, and explain why it is not only correct, but the best answer posted so far. The formula we're discussing is:

F(n) = round(phi^n / sqrt(5))

Computing phi is O(M(n)), where M(n) is the complexity of multiplication (currently a little over linearithmic) and n is the number of bits.

Then there's the power function, which can be expressed as a log (O(M(n)•log(n))), a multiply (O(M(n))), and an exp (O(M(n)•log(n))).

Then there's a square root (O(M(n))), a division (O(M(n))), and a final round (O(n)).

This makes this answer something like O(n•log^2(n)•log(log(n))) for n bits.


I haven't thoroughly analyzed the division algorithm, but if I'm reading this right, each bit might need a recursion (you need to divide the number log(2^n) = n times), and each recursion needs a multiply. Therefore it can't be better than O(M(n)•n), and that's exponentially worse.