Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

Posted 2019-01-03 19:06

I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang and Haskell. In order to get some higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.

The result is the following:

C:

lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320

real    0m11.074s
user    0m11.070s
sys 0m0.000s

Python:

lorenzo@enzo:~/erlang$ time ./euler12.py 
842161320

real    1m16.632s
user    1m16.370s
sys 0m0.250s

Python with PyPy:

lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 
842161320

real    0m13.082s
user    0m13.050s
sys 0m0.020s

Erlang:

lorenzo@enzo:~/erlang$ erlc euler12.erl 
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys 0m0.020s

Haskell:

lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx 
842161320

real    2m37.326s
user    2m37.240s
sys 0m0.080s

Summary:

  • C: 100%
  • Python: 692% (118% with PyPy)
  • Erlang: 436% (135% thanks to RichardC)
  • Haskell: 1421%

I suppose that C has a big advantage as it uses long for the calculations and not arbitrary-length integers as the other three do. Also, it doesn't need to load a runtime first (do the others?).

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

EDIT:

Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a. tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?

I really tried to implement the same algorithm as similarly as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.


Source codes used:

C:

#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}

Python:

#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)

Erlang:

-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
    end.

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
    if
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)  
    end.

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).

Haskell:

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

18 answers
够拽才男人
Answer 2 · 2019-01-03 19:31

Question 3: Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

The C implementation is suboptimal (as hinted at by Thomas M. DuBuisson): the version uses 64-bit integers (i.e. the long datatype). I'll investigate the assembly listing later, but as an educated guess, something in the compiled code makes the 64-bit arithmetic significantly slower, whether it is memory accesses or the generated code itself (you can fit fewer 64-bit ints in an SSE register, and rounding a double to a 64-bit integer is slower).

Here is the modified code (I simply replaced long with int and explicitly inlined factorCount, although I do not think the latter is necessary with gcc -O3):

#include <stdio.h>
#include <math.h>

static inline int factorCount(int n)
{
    double square = sqrt (n);
    int isquare = (int)square;
    int count = isquare == square ? -1 : 0;
    int candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    int triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index++;
        triangle += index;
    }
    printf ("%d\n", triangle);
}

Running + timing it gives:

$ gcc -O3 -lm -o euler12 euler12.c; time ./euler12
842161320
./euler12  2.95s user 0.00s system 99% cpu 2.956 total

For reference, the Haskell implementation by Thomas in the earlier answer gives:

$ ghc -O2 -fllvm -fforce-recomp euler12.hs; time ./euler12                                                                                      [9:40]
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12 ...
842161320
./euler12  9.43s user 0.13s system 99% cpu 9.602 total

Conclusion: taking nothing away from GHC, it's a great compiler, but gcc normally generates faster code.

叼着烟拽天下
Answer 3 · 2019-01-03 19:34

Take a look at this blog. Over the past year or so its author has done a few of the Project Euler problems in Haskell and Python, and he has generally found Haskell to be much faster. I think that between those languages it has more to do with your fluency and coding style.

When it comes to Python speed, you're using the wrong implementation! Try PyPy, and for things like this you'll find it to be much, much faster.

forever°为你锁心
Answer 4 · 2019-01-03 19:35

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

This is unlikely. I cannot say much about Erlang and Haskell (well, maybe a bit about Haskell below), but I can point out a lot of other bottlenecks in Python. Every time the program tries to execute an operation on some values in Python, it has to verify whether the values are of the proper type, and that costs a bit of time. Your factorCount function allocates a list with range (1, isquare + 1) over and over, and runtime, malloc-style memory allocation is way slower than iterating on a range with a counter as you do in C. Notably, factorCount() is called many times, so it allocates a lot of lists. Also, let us not forget that Python is interpreted and the CPython interpreter has no great focus on being optimized.

EDIT: oh, well, I note that you are using Python 3, so range() does not return a list but a lazy range object. In this case my point about allocating lists is half wrong: the function just allocates range objects, which are inefficient nonetheless, but not as inefficient as allocating a list with a lot of items.

Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

Are you using Hugs? Hugs is a considerably slow interpreter. If you are using it, maybe you can get a better time with GHC - but I am only speculating here; the kind of stuff a good Haskell compiler does under the hood is pretty fascinating and way beyond my comprehension :)

Question 3: Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

I'd say you are playing an unfunny game. The best part of knowing various languages is to use them in as different ways as possible :) But I digress; I just do not have any recommendation on this point. Sorry, I hope someone can help you in this case :)

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

As far as I remember, you just need to make sure that your recursive call is the last command before returning a value. In other words, a function like the one below could use such an optimization:

def factorial(n, acc=1):
    if n > 1:
        acc = acc * n
        n = n - 1
        return factorial(n, acc)
    else:
        return acc

However, you would not get such an optimization if your function were like the one below, because there is an operation (multiplication) after the recursive call:

def factorial2(n):
    if n > 1:
        f = factorial2(n-1)
        return f*n
    else:
        return 1

I separated the operations into local variables to make it clear which operations are executed. However, it is more usual to see these functions written as below; they are equivalent for the point I am making:

def factorial(n, acc=1):
    if n > 1:
        return factorial(n-1, acc*n)
    else:
        return acc

def factorial2(n):
    if n > 1:
        return n*factorial2(n-1)
    else:
        return 1

Note that it is up to the compiler/interpreter to decide whether it will perform tail-call optimization. For example, the Python interpreter does not, if I remember correctly (I used Python in my example only because of its fluent syntax). Anyway, if you find strange things such as factorial functions with two parameters (where one of the parameters is named acc, accumulator, etc.), now you know why people do it :)
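For what it is worth, Erlang does guarantee last-call optimization, so the tail-recursive factorCount/4 and nextTriangle/2 from the question will not grow the call stack; GHC compiles tail calls without pushing stack frames either, although a lazily accumulated count can still pile up unevaluated thunks, which is a different problem from stack growth.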

小情绪 Triste *
Answer 5 · 2019-01-03 19:36

C++11, < 20ms for me

I understand that you want tips to help improve your language-specific knowledge, but since that has been well covered here, I thought I would add some context for people who may have looked at the Mathematica comment on your question, etc., and wondered why this code was so much slower.

This answer is mainly to provide context to hopefully help people evaluate the code in your question / other answers more easily.

This code uses only a couple of (uglyish) optimisations, unrelated to the language used, based on:

  1. every triangle number is of the form n(n+1)/2
  2. n and n+1 are coprime
  3. the number of divisors is a multiplicative function
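A quick worked example of point 3: the 7th triangle number is 7*8/2 = 28; halving the even member gives 8/2 = 4, which has 3 divisors, while 7 has 2, and since gcd(4, 7) = 1 their product 3 * 2 = 6 is exactly the divisor count of 28 (1, 2, 4, 7, 14, 28), with no need to factorise 28 itself.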

#include <iostream>
#include <cmath>
#include <tuple>
#include <chrono>

using namespace std;

// Calculates the divisors of an integer by determining its prime factorisation.

int get_divisors(long long n)
{
    int divisors_count = 1;

    for(long long i = 2;
        i <= sqrt(n);
        /* empty */)
    {
        int divisions = 0;
        while(n % i == 0)
        {
            n /= i;
            divisions++;
        }

        divisors_count *= (divisions + 1);

        //here, we try to iterate more efficiently by skipping
        //obvious non-primes like 4, 6, etc
        if(i == 2)
            i++;
        else
            i += 2;
    }

    if(n != 1) //n is a prime
        return divisors_count * 2;
    else
        return divisors_count;
}

long long euler12()
{
    //n and n + 1
    long long n, n_p_1;

    n = 1; n_p_1 = 2;

    // divisors_x will store either the divisors of x or x/2
    // (the latter iff x is divisible by two)
    long long divisors_n = 1;
    long long divisors_n_p_1 = 2;

    for(;;)
    {
        /* This loop has been unwound, so two iterations are completed at a time
         * n and n + 1 have no prime factors in common and therefore we can
         * calculate their divisors separately
         */

        long long total_divisors;                 //the divisors of the triangle number
                                                  // n(n+1)/2

        //the first (unwound) iteration

        divisors_n_p_1 = get_divisors(n_p_1 / 2); //here n+1 is even, so we halve it before counting divisors

        total_divisors =
                  divisors_n
                * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1);   //n_p_1 is now odd!

        //now the second (unwound) iteration

        total_divisors =
                  divisors_n
                * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1 / 2);   //n_p_1 is now even!
    }

    return (n * n_p_1) / 2;
}

int main()
{
    for(int i = 0; i < 1000; i++)
    {
        using namespace std::chrono;
        auto start = high_resolution_clock::now();
        auto result = euler12();
        auto end = high_resolution_clock::now();

        double time_elapsed = duration_cast<milliseconds>(end - start).count();

        cout << result << " " << time_elapsed << '\n';
    }
    return 0;
}

That takes around 19ms on average for my desktop and 80ms for my laptop, a far cry from most of the other code I've seen here. And there are, no doubt, many optimisations still available.

神经病院院长
Answer 6 · 2019-01-03 19:37

Looking at your Erlang implementation: the timing includes starting up the entire virtual machine, running your program and halting the virtual machine. I am pretty sure that setting up and halting the Erlang VM takes some time.

If the timing were done within the Erlang virtual machine itself, the results would be different, because then we would measure only the program in question. As it stands, the total time reported by your method includes starting and loading the Erlang VM plus halting it (as you do in your program). Consider using Erlang's own timing facilities, timer:tc/1, timer:tc/2 or timer:tc/3, which are what we use when we want to time a program from within the virtual machine itself; that way the Erlang result will exclude the time taken to start and stop/kill/halt the VM. That is my reasoning; think about it, and then try your benchmark again.
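As a minimal sketch of what that could look like (my own addition, not part of the question's module): a hypothetical benchmark/0 helper added to euler12.erl, relying on the module's -compile(export_all) so that nextTriangle/2 is callable, and on timer:tc/3 returning {Microseconds, Result}:

%% Hypothetical helper: times nextTriangle/2 inside the already-running VM,
%% so emulator start-up and shutdown are excluded from the measurement.
%% timer:tc(Module, Function, Args) returns {Microseconds, Result}.
benchmark() ->
    {Micros, Triangle} = timer:tc(?MODULE, nextTriangle, [1, 1]),
    io:format("~p (found in ~.3f s)~n", [Triangle, Micros / 1000000]).

Compile with c(euler12). in the shell and call euler12:benchmark(). to get a figure that covers only the computation, not the emulator boot and shutdown.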

I actually suggest that we time the program (for languages that have a runtime) within the runtime of those languages in order to get a precise value. C, for example, has no overhead of starting and shutting down a runtime system, as Erlang, Python and Haskell do (98% sure of this - I stand to be corrected). So, based on this reasoning, I conclude that this benchmark was not precise/fair enough for languages running on top of a runtime system. Let's do it again with these changes.

EDIT: besides, even if all the languages had runtime systems, the overhead of starting and halting each would differ, so I suggest that we time from within the runtime systems (for the languages to which this applies). The Erlang VM is known to have considerable overhead at start-up!

Root(大扎)
Answer 7 · 2019-01-03 19:39

Just for fun. The following is a more 'native' Haskell implementation:

import Control.Applicative
import Control.Monad
import Data.Either
import Math.NumberTheory.Powers.Squares

isInt :: RealFrac c => c -> Bool
isInt = (==) <$> id <*> fromInteger . round

intSqrt :: (Integral a) => a -> Int
--intSqrt = fromIntegral . floor . sqrt . fromIntegral
intSqrt = fromIntegral . integerSquareRoot'

factorize :: Int -> [Int]
factorize 1 = []
factorize n = first : factorize (quot n first)
  where first = (!! 0) $ [a | a <- [2..intSqrt n], rem n a == 0] ++ [n]

factorize2 :: Int -> [(Int,Int)]
factorize2 = foldl (\ls@((val,freq):xs) y -> if val == y then (val,freq+1):xs else (y,1):ls) [(0,0)] . factorize

numDivisors :: Int -> Int
numDivisors = foldl (\acc (_,y) -> acc * (y+1)) 1 <$> factorize2

nextTriangleNumber :: (Int,Int) -> (Int,Int)
nextTriangleNumber (n,acc) = (n+1,acc+n+1)

forward :: Int -> (Int, Int) -> Either (Int, Int) (Int, Int)
forward k val@(n,acc) = if numDivisors acc > k then Left val else Right (nextTriangleNumber val)

problem12 :: Int -> (Int, Int)
problem12 n = (!!0) . lefts . scanl (>>=) (forward n (1,1)) . repeat . forward $ n

main = do
  let (n,val) = problem12 1000
  print val

Using ghc -O3, this consistently runs in 0.55-0.58 seconds on my machine (1.73GHz Core i7).

A more efficient factorCount function for the C version:

int factorCount (int n)
{
  int count = 1;
  int candidate,tmpCount;
  while (n % 2 == 0) {
    count++;
    n /= 2;
  }
  for (candidate = 3; candidate * candidate <= n; candidate += 2)
    if (n % candidate == 0) {
      tmpCount = 1;
      do {
        tmpCount++;
        n /= candidate;
      } while (n % candidate == 0);
      count *= tmpCount;
    }
  if (n > 1)
    count *= 2;
  return count;
}

Changing longs to ints in main, using gcc -O3 -lm, this consistently runs in 0.31-0.35 seconds.

Both can be made to run even faster if you take advantage of the fact that the nth triangle number = n*(n+1)/2, and n and (n+1) have completely disparate prime factorizations, so the number of factors of each half can be multiplied to find the number of factors of the whole. The following:

int main ()
{
  int triangle = 0,count1,count2 = 1;
  do {
    count1 = count2;
    count2 = ++triangle % 2 == 0 ? factorCount(triangle+1) : factorCount((triangle+1)/2);
  } while (count1*count2 < 1001);
  printf ("%lld\n", ((long long)triangle)*(triangle+1)/2);
}

will reduce the C code run time to 0.17-0.19 seconds, and it can handle much larger searches -- greater than 10000 factors takes about 43 seconds on my machine. I leave a similar Haskell speedup to the interested reader.
