I implemented this in the simplest possible way, not using templates at all, and it works:
constexpr bool isPrimeLoop(int i, int k) {
    return (k * k > i) ? true : (i % k == 0) ? false : isPrimeLoop(i, k + 1);
}
constexpr bool isPrime(int i) {
    return isPrimeLoop(i, 2);
}
constexpr int nextPrime(int k) {
    return isPrime(k) ? k : nextPrime(k + 1);
}
constexpr int getPrimeLoop(int i, int k) {
    // i - nr of primes to advance
    // k - some starting prime
    return (i == 0) ? k : getPrimeLoop(i - 1, nextPrime(k + 1));
}
constexpr int getPrime(int i) {
    return getPrimeLoop(i, 2);
}
static_assert(getPrime(511) == 3671, "computed incorrectly");
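Because getPrime is constexpr, its result can be used anywhere a constant expression is required, for instance as an array bound. The std::array usage below is my own illustration, not part of the original exercise:

```cpp
#include <array>

constexpr bool isPrimeLoop(int i, int k) {
    return (k * k > i) ? true : (i % k == 0) ? false : isPrimeLoop(i, k + 1);
}
constexpr bool isPrime(int i) { return isPrimeLoop(i, 2); }
constexpr int nextPrime(int k) { return isPrime(k) ? k : nextPrime(k + 1); }
constexpr int getPrimeLoop(int i, int k) {
    return (i == 0) ? k : getPrimeLoop(i - 1, nextPrime(k + 1));
}
constexpr int getPrime(int i) { return getPrimeLoop(i, 2); }

// Primes are indexed from 0: getPrime(0) == 2, so getPrime(5) == 13.
static_assert(getPrime(5) == 13, "unexpected prime");

// The compile-time result serves directly as an array bound.
std::array<int, getPrime(5)> buffer{};  // a 13-element array
```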
It needs the constexpr depth increased a bit, but it fits in time easily:
$ time g++ -c -std=c++11 vec.cpp -fconstexpr-depth=600
real 0m0.093s
user 0m0.080s
sys 0m0.008s
The following trick reduces the recursion depth of getPrimeLoop
to logarithmic, so g++ can complete with the default depth (without measurable time penalty):
constexpr int getPrimeLoop(int i, int k) {
    return (i == 0) ? k :
           (i % 2) ? getPrimeLoop(i - 1, nextPrime(k + 1)) :
                     getPrimeLoop(i / 2, getPrimeLoop(i / 2, k));
}
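The halving works because advancing by i primes can be split into two advances of i/2 primes each, so the recursion depth is O(log i). A self-contained check of the logarithmic variant (the test value 31, the prime at zero-based index 10, is my own choice, not from the original):

```cpp
constexpr bool isPrimeLoop(int i, int k) {
    return (k * k > i) ? true : (i % k == 0) ? false : isPrimeLoop(i, k + 1);
}
constexpr bool isPrime(int i) { return isPrimeLoop(i, 2); }
constexpr int nextPrime(int k) { return isPrime(k) ? k : nextPrime(k + 1); }

// Logarithmic-depth version: an even advance of i primes becomes two
// nested advances of i/2 primes each.
constexpr int getPrimeLoop(int i, int k) {
    return (i == 0) ? k :
           (i % 2) ? getPrimeLoop(i - 1, nextPrime(k + 1)) :
                     getPrimeLoop(i / 2, getPrimeLoop(i / 2, k));
}
constexpr int getPrime(int i) { return getPrimeLoop(i, 2); }

// 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 - index 10 is 31.
static_assert(getPrime(10) == 31, "computed incorrectly");
static_assert(getPrime(0) == 2, "computed incorrectly");
```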
I doubt that your 1-second goal is within reach of any hardware that
doesn't have security guards. However, I believe the following
meta-program can close in on it a lot:
#include <type_traits>
template<unsigned N>
using boolean = std::integral_constant<bool, N>;

template<unsigned N>
constexpr bool is_co_prime()
{
    return true;
}

template<unsigned N, unsigned D>
constexpr bool is_co_prime()
{
    return N % D != 0;
}

template<unsigned N, unsigned D0, unsigned D1, unsigned ...Di>
constexpr bool is_co_prime()
{
    typedef typename std::conditional<
        is_co_prime<N,D1>(),
        typename std::conditional<
            is_co_prime<N,Di...>(),
            boolean<is_co_prime<N,D0>()>,
            std::false_type
        >::type,
        std::false_type
    >::type type;
    return type::value;
}
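A minimal sketch of how is_co_prime<>() is meant to be invoked: the divisor list is the primes found so far, largest first. The test values are my own, and I give the boolean alias a bool parameter here for strictness; the rest follows the listing above:

```cpp
#include <type_traits>

template<bool B>
using boolean = std::integral_constant<bool, B>;

// Base case: no divisors left, so N is coprime to all of them.
template<unsigned N>
constexpr bool is_co_prime() { return true; }

// One divisor: a single trial division.
template<unsigned N, unsigned D>
constexpr bool is_co_prime() { return N % D != 0; }

// Two or more divisors: try D1 (the smaller of the pair) first, then
// the rest of the pack, and D0 (the larger) only last.
template<unsigned N, unsigned D0, unsigned D1, unsigned... Di>
constexpr bool is_co_prime()
{
    typedef typename std::conditional<
        is_co_prime<N, D1>(),
        typename std::conditional<
            is_co_prime<N, Di...>(),
            boolean<is_co_prime<N, D0>()>,
            std::false_type
        >::type,
        std::false_type
    >::type type;
    return type::value;
}

// 11 is coprime to every prime below it, so it is the next prime...
static_assert(is_co_prime<11, 7, 5, 3, 2>(), "11 should be prime");
// ...whereas 9 is divisible by 3 and is rejected.
static_assert(!is_co_prime<9, 7, 5, 3, 2>(), "9 = 3 * 3");
```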
template<unsigned N>
constexpr unsigned inc()
{
    return N == 2 ? 3 : N + 2;
}

template<unsigned Counter, unsigned Candidate, unsigned ...Primes>
struct nth_prime;

template<unsigned Candidate, unsigned Prime, unsigned ...Primes>
struct nth_prime<0,Candidate,Prime,Primes...>
{
    static const unsigned value = Prime;
};

template<unsigned Counter, unsigned Candidate = 2, unsigned ...Primes>
struct nth_prime
{
    typedef typename std::conditional<
        is_co_prime<Candidate,Primes...>(),
        nth_prime<Counter - 1,inc<Candidate>(),Candidate,Primes...>,
        nth_prime<Counter,inc<Candidate>(),Primes...>
    >::type type;
    static const unsigned value = type::value;
};
#include <iostream>
using namespace std;
int main()
{
    cout << nth_prime<512>::value << endl;
    return 0;
}
I'll call this meta-program MyNthPrime and yours YourNthPrime.
You seem to have much heftier hardware than mine, and certainly more memory. I
have a routine Lenovo ThinkPad T420, 4-core i5, 8GB RAM, 8GB swap, running Linux
Mint 14, kernel 3.5.0. You report it takes you 3 mins. to build YourNthPrime.
Measured by the time command, it takes me 35 mins. to build YourNthPrime,
with no apps running but the terminal and system monitor.
The compiler is GCC 4.7.2 and the command-line options are:
-Wall -O0 -std=c++11 -ftemplate-depth=1200
That elapsed time breaks down as:
real 34m2.534s
user 3m40.414s
sys 0m33.450s
It takes me 1.5 mins to build MyNthPrime, with:
-Wall -O0 -std=c++11 -ftemplate-depth=2100
and the elapsed time breaks down as:
real 1m27.832s
user 1m22.205s
sys 0m2.612s
That -ftemplate-depth=2100 is not a transposition typo. More on this shortly.
MyNthPrime isn't fairly and squarely 23 times faster than YourNthPrime. The
broken-down timings suggest that MyNthPrime is actually around
2.75 times as fast as YourNthPrime on user time. But they also show
that YourNthPrime's build really lost the farm on real time. What was it
doing for the fruitless 9/10ths of its existence? Swapping, of course.
Both builds scoffed 95% of my 8GB system RAM within 45 secs., but MyNthPrime
topped out around there and didn't swap. YourNthPrime went on eating swap
space up to a peak 3.9GB, and long before that all my CPUs were dozing.
This point is notable when you take it with the fact that MyNthPrime
needs getting on for twice as much -ftemplate-depth
as YourNthPrime.
Folk wisdom is that an extravagant -ftemplate-depth
is the road to
ruin for a meta-program build time, because it equates to extravagant
memory consumption, which only has to slide into heavy swapping and
you're watching paint dry. But the run-off of YourNthPrime and MyNthPrime does
not bear this out - quite the opposite. The lesson I take is that
the depth to which you must instantiate recursive templates is not
always a good measure of the quantity of template instantiating you
must do, and it is the quantity that matters for your memory resources.
Although they don't look superficially similar, MyNthPrime and YourNthPrime
both implement the trial-division algorithm for prime generation. MyNthPrime
is that much faster solely because it is rather more shrewdly coded to
be sparing of recursive template instantiations, and of the memory they gobble up.
YourNthPrime fields 4 recursive templates for the computation, all
consuming the same recursive variadic template argument list.
MyNthPrime fields 2: it's just giving the compiler about half as
many enormous instantiations to do.
YourNthPrime (as I read it) perceives the potential efficiency of
doing its trial-divisions in ascending order of the primes in hand - because the chances of
successful division increase toward the smaller primes; and once a prime
in hand exceeds 1/2 of the candidate number being divided, the chances are 0.
Hit the most likely divisors first and you optimize your prospect of a quick
verdict and exit. But an obstacle to exploiting this efficiency is the fact that
the primes in hand are represented by a variadic template argument list with the largest
always at its head. To surmount this obstacle YourNthPrime deploys the recursive
variadic template function lastArg<>(), effectively to reverse the
order in which the primes in hand are consumed in the divisions.
lastArg<>() presents the primes in hand to the template function:
template<unsigned i, unsigned head, unsigned... tail>
constexpr bool isPrime() {
return i % head && isPrime<i, tail...>();
}
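To make that snippet self-contained for experimentation, the base case from the full listing later in this thread can be added; the test values are my own:

```cpp
// Base case: no divisors left to try, so i survived every trial division.
template<unsigned i>
constexpr bool isPrime() { return true; }

// Recursive case: i is prime with respect to head, tail... iff head does
// not divide it and neither does anything in tail... .
template<unsigned i, unsigned head, unsigned... tail>
constexpr bool isPrime() { return i % head && isPrime<i, tail...>(); }

static_assert(isPrime<7, 2, 3>(), "7 has no prime divisor");
static_assert(!isPrime<9, 2, 3>(), "9 is divisible by 3");
```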
in ascending order for recursive trial division of the next candidate i
by the primes head, tail.... It's here, I think, you look to lastArg<>()
to pay its way by ensuring head is always the next-best prospect for
getting you out of the costly righthand side of that &&.
But to achieve this, lastArg<>() itself recursively traverses the whole
list of primes in hand at each invocation, before you even get the
opportunity for a verdict on i. So it would be cheaper just to let
isPrime<>() traverse the primes in hand as far as it needs to, testing i
as it goes, dispense with lastArg<>(), and save all the recursive
instantiations thereof.
The job done by isPrime<>() in YourNthPrime - the recursive trial
division - is done in MyNthPrime by:
template<unsigned N, unsigned D0, unsigned D1, unsigned ...Di>
constexpr bool is_co_prime()
{
    typedef typename std::conditional<
        is_co_prime<N,D1>(),
        typename std::conditional<
            is_co_prime<N,Di...>(),
            boolean<is_co_prime<N,D0>()>,
            std::false_type
        >::type,
        std::false_type
    >::type type;
    return type::value;
}
is_co_prime<>() takes 10 lines to do what isPrime<>() does in one.
I could have done it in one line also:
return is_co_prime<N,D0>() && is_co_prime<N,D1,Di...>();
could be the body of the function. But the ugly trounces the pretty for
efficiency here. Every time isPrime<>() has to recurse into the tail,
that tail is only one prime shorter than it was before. Every time
is_co_prime<>() has to do the same thing, the tail is two primes
shorter than it was before. Its tail recursions are shallower in the
worst case than those of isPrime<>(), and there can be only half as many.
is_co_prime<>() divides the candidate number N by the righthand - the
smallest and likeliest - of any pair of available divisors first,
and returns no-prime on success; otherwise it recurses to any divisors
still to the right of that pair and continues trial division of the
next-but-one until success or exhaustion. Only at exhaustion does it
resort to trial division by the larger of the original pair - the least
likely divisor of any it has tried. And likewise within each of
the intervening smaller recursions, the least likely divisor gets
tried last.
Though this algorithm can be seen to get speedily to the smaller and
likelier divisors of N, an itch will persist to get straight
to the smallest of them first and try them in true descending likelihood,
as per lastArg<>(). We have to quell this itch with the recognition
that any way of "getting straight to the smallest", when it is at the
tail of the list, will be a meta-function that must itself recurse over
the whole list before we even get started with trial divisions.
The implementation of is_co_prime<>() gets us down the list
"two steps at a time", doing the trial division as it does so.
True, sometimes it will unluckily "step over" the largest divisor of N
first time, and then won't find it again until and unless it bottoms out
and recurses backwards up the list. But once N is big enough to make it
matter, there will mostly be at least one smaller divisor further on
to the right, and it would be really unlucky to miss all of them. So the
risk of skipping the largest on the way down is no big deal. Remember
too, there aren't going to be any divisors on the way down till we
reach the N/2 point. That means the worst-case waste of recursions by
skipping the one-and-only divisor of some N on the way down is confined
to the tail of the list from that point.
You speculate that a Sieve of Eratosthenes meta-program might
compile faster, and you are right. As a prime generator, the Sieve has better
theoretical complexity than trial division. An elegant template
meta-program implementation by Peter Simons is
here, dating from 2006 or before. (And
as Peter Wood commented, a non-terminating Sieve of Eratosthenes
meta-program broke the news that the C++ template system is Turing-complete.)
With C++11 facilities Simons' meta-program can be much abbreviated, but
I don't think made much more efficient. Just as it stands, Simons'
Sieve can generate at compile time all the primes up to the 512th in
under 9 secs. on my ThinkPad. It needs -ftemplate-depth=4000 or
thereabouts to do it, but only about 0.5GB of memory - an even more
arresting counterexample to the folk wisdom that big template depth ==
big memory usage.
So Eratosthenes' Sieve leaves trial division floundering in the dust.
Sadly, for the purpose of your exercise, the Sieve is no use. It's called
the sieve because we have to input an integer upper limit U, and it
sieves out the composite numbers between 2 and U, leaving the primes.
Thus to apply it for finding exactly the Nth prime = Pn, and not
possibly any other, you must anyhow compute Pn in some other way,
give the Sieve an upper limit of Pn + 1 (or Pn + 2, for Pn > 2), then
throw away all the Pi, 2 <= Pi < Pn, that it returns to you, and keep
just the Pn that you had already computed. This is a no-op.
Several of the commenters make the point that the identity of any Nth prime
you could possibly generate by compile-time meta-programming will be
known beforehand, or calculable beforehand by much simpler and vastly
faster means. I can't disagree, but I support your general point that
with C++11 facilities, TMP takes a huge stride toward real-world utility
that is well worth exploring - the more so as whatever takes a minute
to compile right now will take a second within a decade.
Meanwhile, without even leaving our uncannily sophisticated C++ compilers,
with TMP problems like this we can still experience the nature of programming
an early computer, with clock speed and memory in the K's, in a "language"
modelled tightly - but within arcane constraints! - on classical recursive
function theory. That's why you've really gotta love them!
I've given this a try myself, and written the following implementation:
template<unsigned... args> constexpr unsigned countArgs();
template<> constexpr unsigned countArgs() { return 0; }
template<unsigned head, unsigned... tail>
constexpr unsigned countArgs() { return 1 + countArgs<tail...>(); }
template<unsigned last>
constexpr unsigned lastArg() { return last; }
template<unsigned head, unsigned next, unsigned... tail>
constexpr unsigned lastArg() { return lastArg<next, tail...>(); }
template<unsigned i> constexpr bool isPrime() { return true; }
template<unsigned i, unsigned head, unsigned... tail>
constexpr bool isPrime()
{ return i % head && isPrime<i, tail...>(); }
template<bool found, unsigned i, unsigned... primesSoFar> struct nextPrime
{ static constexpr unsigned val =
nextPrime<isPrime<i + 2, primesSoFar...>(), i + 2, primesSoFar...>::val; };
template<unsigned i, unsigned... primesSoFar> struct
nextPrime<true, i, primesSoFar...> { static constexpr unsigned val = i; };
template<unsigned n, unsigned... primesSoFar> struct nthPrimeImpl
{ static constexpr unsigned val = nthPrimeImpl<n - 1, primesSoFar...,
nextPrime<false, lastArg<primesSoFar...>(), primesSoFar...>::val>::val; };
template<unsigned... primesSoFar> struct nthPrimeImpl<0, primesSoFar...>
{ static constexpr unsigned val = lastArg<primesSoFar...>(); };
template<unsigned n>
constexpr unsigned nthPrime() {
return n == 1 ? 2 : nthPrimeImpl<n - 2, 3>::val;
}
constexpr unsigned p512 = nthPrime<512>();
static_assert(p512 == 3671, "computed incorrectly");
This requires increasing the maximum template depth of gcc to more than
the default of 900 (in my gcc 4.7.2), e.g. by passing -ftemplate-depth=1200.
And it is way too slow: it takes about 3 minutes on my hardware. So I very
much hope for some more efficient code in a different answer.
In terms of computation method, the above does something like trial division. A sieve of Eratosthenes might perform better, but so far I couldn't think of a way to write that in a constexpr-compatible fashion.
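For what it's worth, the C++14 relaxation of constexpr (loops and local variables allowed inside constexpr functions), which was not yet available when this was written, makes a constexpr sieve straightforward. The following is my own sketch under that assumption; the limit 4000 is chosen because the 512th prime is 3671:

```cpp
// Requires C++14 or later: constexpr functions may contain loops and mutation.
template<unsigned Limit>
constexpr unsigned nthPrimeSieve(unsigned n) {
    bool composite[Limit + 1] = {};  // value-initialized to all-false
    unsigned count = 0;
    for (unsigned i = 2; i <= Limit; ++i) {
        if (!composite[i]) {
            if (++count == n)
                return i;  // i is the n-th prime (1-based)
            for (unsigned j = 2 * i; j <= Limit; j += i)
                composite[j] = true;  // strike out the multiples of i
        }
    }
    return 0;  // Limit was too small to reach the n-th prime
}

static_assert(nthPrimeSieve<4000>(512) == 3671, "computed incorrectly");
```

This needs neither an increased template depth nor an increased constexpr depth, since the only recursion left is the template instantiation of nthPrimeSieve itself.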
The answer by Mike Kinghan got me thinking along lines I hadn't thought of before. If template instantiation is such a problem, causing such severe memory consumption, how can we reduce it? I eventually came up with a scheme where, instead of an argument pack for all primes found so far, I use a chain of types, each one referencing the one before, and a chain of static functions in these types which can use the ones from the types before.
The result I'll paste below is still a lot slower than the one suggested by zch, but I find it interesting enough to share, since it might be a useful approach for other applications.
template<unsigned N> struct NthPrime {
    typedef NthPrime<N - 1> previous;
    static constexpr unsigned prime = previous::nextPrime();
    static constexpr unsigned nextPrime() { return nextCoprime(prime + 2); }
    static constexpr unsigned nextCoprime(unsigned x) {
        // x is a candidate. We recurse to obtain a number which is
        // coprime to all smaller primes, then check that value against
        // the current prime.
        return checkCoprime(previous::nextCoprime(x));
    }
    static constexpr unsigned checkCoprime(unsigned x) {
        // If x is coprime to the current prime as well, then it is the
        // next prime. Otherwise we have to try the next candidate.
        return (x % prime) ? x : nextCoprime(x + 2);
    }
};

template<> struct NthPrime<1> {
    static constexpr unsigned prime = 2;
    static constexpr unsigned nextPrime() {
        return 3;
    }
    static constexpr unsigned nextCoprime(unsigned x) {
        return x; // x is guaranteed to be odd, so no need to check anything.
    }
};

template<unsigned n>
constexpr unsigned nthPrime() {
    return NthPrime<n>::prime;
}
constexpr unsigned p512 = nthPrime<512>();
static_assert(p512 == 3671, "computed incorrectly");
The beast above requires modifications both to the constexpr depth and the template depth. The following values are sharp bounds for my compiler.
time g++-4.7.2 -c -fconstexpr-depth=519 -ftemplate-depth=2042 -std=c++11 foo.cc
real 0m0.397s
user 0m0.368s
sys 0m0.025s