I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang and Haskell. In order to get some higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.
The result is the following:
C:
lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320
real 0m11.074s
user 0m11.070s
sys 0m0.000s
Python:
lorenzo@enzo:~/erlang$ time ./euler12.py
842161320
real 1m16.632s
user 1m16.370s
sys 0m0.250s
Python with PyPy:
lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py
842161320
real 0m13.082s
user 0m13.050s
sys 0m0.020s
Erlang:
lorenzo@enzo:~/erlang$ erlc euler12.erl
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.7.4 (abort with ^G)
1> 842161320
real 0m48.259s
user 0m48.070s
sys 0m0.020s
Haskell:
lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx
842161320
real 2m37.326s
user 2m37.240s
sys 0m0.080s
Summary:
- C: 100%
- Python: 692% (118% with PyPy)
- Erlang: 436% (135% thanks to RichardC)
- Haskell: 1421%
I suppose that C has a big advantage as it uses long for the calculations and not arbitrary length integers as the other three. Also it doesn't need to load a runtime first (Do the others?).
Question 1:
Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT
?
Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)
Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
EDIT:
Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?
I really tried to implement the same algorithm as similar as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.
Source codes used:
#include <stdio.h>
#include <math.h>
int factorCount (long n)
{
double square = sqrt (n);
int isquare = (int) square;
int count = isquare == square ? -1 : 0;
long candidate;
for (candidate = 1; candidate <= isquare; candidate ++)
if (0 == n % candidate) count += 2;
return count;
}
int main ()
{
long triangle = 1;
int index = 1;
while (factorCount (triangle) < 1001)
{
index ++;
triangle += index;
}
printf ("%ld\n", triangle);
}
#! /usr/bin/env python3.2
import math
def factorCount (n):
square = math.sqrt (n)
isquare = int (square)
count = -1 if isquare == square else 0
for candidate in range (1, isquare + 1):
if not n % candidate: count += 2
return count
triangle = 1
index = 1
while factorCount (triangle) < 1001:
index += 1
triangle += index
print (triangle)
-module (euler12).
-compile (export_all).
factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).
factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;
factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
factorCount (Number, Sqrt, Candidate, Count) ->
case Number rem Candidate of
0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
_ -> factorCount (Number, Sqrt, Candidate + 1, Count)
end.
nextTriangle (Index, Triangle) ->
Count = factorCount (Triangle),
if
Count > 1000 -> Triangle;
true -> nextTriangle (Index + 1, Triangle + Index + 1)
end.
solve () ->
io:format ("~p~n", [nextTriangle (1, 1) ] ),
halt (0).
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
where square = sqrt $ fromIntegral number
isquare = floor square
factorCount' number sqrt candidate count
| fromIntegral candidate > sqrt = count
| number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
| otherwise = factorCount' number sqrt (candidate + 1) count
nextTriangle index triangle
| factorCount triangle > 1000 = triangle
| otherwise = nextTriangle (index + 1) (triangle + index + 1)
main = print $ nextTriangle 1 1
Using
GHC 7.0.3
,gcc 4.4.6
,Linux 2.6.29
on an x86_64 Core2 Duo (2.5GHz) machine, compiling usingghc -O2 -fllvm -fforce-recomp
for Haskell andgcc -O3 -lm
for C.-O3
)-O2
flag)factorCount'
code isn't explicitly typed and defaulting toInteger
(thanks to Daniel for correcting my misdiagnosis here!). Giving an explicit type signature (which is standard practice anyway) usingInt
and the time changes to 11.1 secondsfactorCount'
you have needlessly calledfromIntegral
. A fix results in no change though (the compiler is smart, lucky for you).mod
whererem
is faster and sufficient. This changes the time to 8.5 seconds.factorCount'
is constantly applying two extra arguments that never change (number
,sqrt
). A worker/wrapper transformation gives us:That's right, 7.95 seconds. Consistently half a second faster than the C solution. Without the
-fllvm
flag I'm still getting8.182 seconds
, so the NCG backend is doing well in this case too.Conclusion: Haskell is awesome.
Resulting Code
EDIT: So now that we've explored that, lets address the questions
In Haskell, using
Integer
is slower thanInt
but how much slower depends on the computations performed. Luckily (for 64 bit machines)Int
is sufficient. For portability sake you should probably rewrite my code to useInt64
orWord64
(C isn't the only language with along
).That was what I answered above. The answer was
-O2
rem
notmod
(a frequently forgotten optimization) andYes, that wasn't the issue. Good work and glad you considered this.
gcc -lm -Ofast euler.c
time ./a.out
2.79s user 0.00s system 99% cpu 2.794 total
In regards to Python optimization, in addition to using PyPy (for pretty impressive speed-ups with zero change to your code), you could use PyPy's translation toolchain to compile an RPython-compliant version, or Cython to build an extension module, both of which are faster than the C version in my testing, with the Cython module nearly twice as fast. For reference I include C and PyPy benchmark results as well:
C (compiled with
gcc -O3 -lm
)PyPy 1.5
RPython (using latest PyPy revision,
c2f583445aee
)Cython 0.15
The RPython version has a couple of key changes. To translate into a standalone program you need to define your
target
, which in this case is themain
function. It's expected to acceptsys.argv
as it's only argument, and is required to return an int. You can translate it by using translate.py,% translate.py euler12-rpython.py
which translates to C and compiles it for you.The Cython version was rewritten as an extension module
_euler12.pyx
, which I import and call from a normal python file. The_euler12.pyx
is essentially the same as your version, with some additional static type declarations. The setup.py has the normal boilerplate to build the extension, usingpython setup.py build_ext --inplace
.I honestly have very little experience with either RPython or Cython, and was pleasantly surprised at the results. If you are using CPython, writing your CPU-intensive bits of code in a Cython extension module seems like a really easy way to optimize your program.
Trying GO:
I get:
original c version: 9.1690 100%
go: 8.2520 111%
But using:
I get:
original c version: 9.1690 100%
thaumkid's c version: 0.1060 8650%
first go version: 8.2520 111%
second go version: 0.0230 39865%
I also tried Python3.6 and pypy3.3-5.5-alpha:
original c version: 8.629 100%
thaumkid's c version: 0.109 7916%
Python3.6: 54.795 16%
pypy3.3-5.5-alpha: 13.291 65%
and then with following code I got:
original c version: 8.629 100%
thaumkid's c version: 0.109 8650%
Python3.6: 1.489 580%
pypy3.3-5.5-alpha: 0.582 1483%
Some more numbers and explanations for the C version. Apparently noone did it during all those years. Remember to upvote this answer so it can get on top for everyone to see and learn.
Step One: Benchmark of the author's programs
Laptop Specifications:
Commands:
.
Filenames are:
integertype_architecture_compiler.exe
Step Two: Investigate, Improve and Benchmark Again
VS is 250% faster than gcc. The two compilers should give a similar speed. Obviously, something is wrong with the code or the compiler options. Let's investigate!
The first point of interest is the integer types. Conversions can be expensive and consistency is important for better code generation & optimizations. All integers should be the same type.
It's a mixed mess of
int
andlong
right now. We're going to improve that. What type to use? The fastest. Gotta benchmark them'all!Integer types are
int
long
int32_t
uint32_t
int64_t
anduint64_t
from#include <stdint.h>
There are LOTS of integer types in C, plus some signed/unsigned to play with, plus the choice to compile as x86 or x64 (not to be confused with the actual integer size). That is a lot of versions to compile and run ^^
Step Three: Understanding the Numbers
Definitive conclusions:
Trick question: "What are the sizes of int and long in C?"
The right answer is: The size of int and long in C are not well-defined!
From the C spec:
From the gcc man page (-m32 and -m64 flags):
From MSDN documentation (Data Type Ranges) https://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.110%29.aspx :
To Conclude This: Lessons Learned
32 bits integers are faster than 64 bits integers.
Standard integers types are not well defined in C nor C++, they vary depending on compilers and architectures. When you need consistency and predictability, use the
uint32_t
integer family from#include <stdint.h>
.Speed issues solved. All other languages are back hundreds percent behind, C & C++ win again ! They always do. The next improvement will be multithreading using OpenMP :D
Change:
case (divisor(T,round(math:sqrt(T))) > 500) of
To:
case (divisor(T,round(math:sqrt(T))) > 1000) of
This will produce the correct answer for the Erlang multi-process example.