I implemented this in the simplest possible way, not using templates at all, and it works:
constexpr bool isPrimeLoop(int i, int k) {
    return (k * k > i) ? true : (i % k == 0) ? false : isPrimeLoop(i, k + 1);
}
constexpr bool isPrime(int i) {
    return isPrimeLoop(i, 2);
}
constexpr int nextPrime(int k) {
    return isPrime(k) ? k : nextPrime(k + 1);
}
constexpr int getPrimeLoop(int i, int k) {
    // i - nr of primes to advance
    // k - some starting prime
    return (i == 0) ? k : getPrimeLoop(i - 1, nextPrime(k + 1));
}
constexpr int getPrime(int i) {
    return getPrimeLoop(i, 2);
}
static_assert(getPrime(511) == 3671, "computed incorrectly");
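Because getPrime is constexpr, its result can be used anywhere a constant expression is required, for instance as an array bound. The std::array usage below is my own illustration, not part of the original exercise:

```cpp
#include <array>

constexpr bool isPrimeLoop(int i, int k) {
    return (k * k > i) ? true : (i % k == 0) ? false : isPrimeLoop(i, k + 1);
}
constexpr bool isPrime(int i) { return isPrimeLoop(i, 2); }
constexpr int nextPrime(int k) { return isPrime(k) ? k : nextPrime(k + 1); }
constexpr int getPrimeLoop(int i, int k) {
    return (i == 0) ? k : getPrimeLoop(i - 1, nextPrime(k + 1));
}
constexpr int getPrime(int i) { return getPrimeLoop(i, 2); }

// Primes are indexed from 0: getPrime(0) == 2, so getPrime(5) == 13.
static_assert(getPrime(5) == 13, "unexpected prime");

// The compile-time result serves directly as an array bound.
std::array<int, getPrime(5)> buffer{};  // a 13-element array
```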
It needs the constexpr depth increased a bit, but it fits in time easily:
$ time g++ -c -std=c++11 vec.cpp -fconstexpr-depth=600
real 0m0.093s
user 0m0.080s
sys 0m0.008s
The following trick reduces the recursion depth of getPrimeLoop
to logarithmic, so g++ can complete with the default depth (without measurable time penalty):
constexpr int getPrimeLoop(int i, int k) {
    return (i == 0) ? k :
           (i % 2) ? getPrimeLoop(i - 1, nextPrime(k + 1)) :
                     getPrimeLoop(i / 2, getPrimeLoop(i / 2, k));
}
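The halving works because advancing by i primes can be split into two advances of i/2 primes each, so the recursion depth is O(log i). A self-contained check of the logarithmic variant (the test value 31, the prime at zero-based index 10, is my own choice, not from the original):

```cpp
constexpr bool isPrimeLoop(int i, int k) {
    return (k * k > i) ? true : (i % k == 0) ? false : isPrimeLoop(i, k + 1);
}
constexpr bool isPrime(int i) { return isPrimeLoop(i, 2); }
constexpr int nextPrime(int k) { return isPrime(k) ? k : nextPrime(k + 1); }

// Logarithmic-depth version: an even advance of i primes becomes two
// nested advances of i/2 primes each.
constexpr int getPrimeLoop(int i, int k) {
    return (i == 0) ? k :
           (i % 2) ? getPrimeLoop(i - 1, nextPrime(k + 1)) :
                     getPrimeLoop(i / 2, getPrimeLoop(i / 2, k));
}
constexpr int getPrime(int i) { return getPrimeLoop(i, 2); }

// 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 - index 10 is 31.
static_assert(getPrime(10) == 31, "computed incorrectly");
static_assert(getPrime(0) == 2, "computed incorrectly");
```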
I doubt that your 1-second goal is within reach of any hardware that
doesn't have security guards. However, I believe the following
meta-program can close in on it a lot:
#include <type_traits>
template<unsigned N>
using boolean = std::integral_constant<bool, N>;

template<unsigned N>
constexpr bool is_co_prime()
{
    return true;
}

template<unsigned N, unsigned D>
constexpr bool is_co_prime()
{
    return N % D != 0;
}

template<unsigned N, unsigned D0, unsigned D1, unsigned ...Di>
constexpr bool is_co_prime()
{
    typedef typename std::conditional<
        is_co_prime<N,D1>(),
        typename std::conditional<
            is_co_prime<N,Di...>(),
            boolean<is_co_prime<N,D0>()>,
            std::false_type
        >::type,
        std::false_type
    >::type type;
    return type::value;
}
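A minimal sketch of how is_co_prime<>() is meant to be invoked: the divisor list is the primes found so far, largest first. The test values are my own, and I give the boolean alias a bool parameter here for strictness; the rest follows the listing above:

```cpp
#include <type_traits>

template<bool B>
using boolean = std::integral_constant<bool, B>;

// Base case: no divisors left, so N is coprime to all of them.
template<unsigned N>
constexpr bool is_co_prime() { return true; }

// One divisor: a single trial division.
template<unsigned N, unsigned D>
constexpr bool is_co_prime() { return N % D != 0; }

// Two or more divisors: try D1 (the smaller of the pair) first, then
// the rest of the pack, and D0 (the larger) only last.
template<unsigned N, unsigned D0, unsigned D1, unsigned... Di>
constexpr bool is_co_prime()
{
    typedef typename std::conditional<
        is_co_prime<N, D1>(),
        typename std::conditional<
            is_co_prime<N, Di...>(),
            boolean<is_co_prime<N, D0>()>,
            std::false_type
        >::type,
        std::false_type
    >::type type;
    return type::value;
}

// 11 is coprime to every prime below it, so it is the next prime...
static_assert(is_co_prime<11, 7, 5, 3, 2>(), "11 should be prime");
// ...whereas 9 is divisible by 3 and is rejected.
static_assert(!is_co_prime<9, 7, 5, 3, 2>(), "9 = 3 * 3");
```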
template<unsigned N>
constexpr unsigned inc()
{
    return N == 2 ? 3 : N + 2;
}

template<unsigned Counter, unsigned Candidate, unsigned ...Primes>
struct nth_prime;

template<unsigned Candidate, unsigned Prime, unsigned ...Primes>
struct nth_prime<0,Candidate,Prime,Primes...>
{
    static const unsigned value = Prime;
};

template<unsigned Counter, unsigned Candidate = 2, unsigned ...Primes>
struct nth_prime
{
    typedef typename std::conditional<
        is_co_prime<Candidate,Primes...>(),
        nth_prime<Counter - 1,inc<Candidate>(),Candidate,Primes...>,
        nth_prime<Counter,inc<Candidate>(),Primes...>
    >::type type;
    static const unsigned value = type::value;
};
#include <iostream>
using namespace std;
int main()
{
    cout << nth_prime<512>::value << endl;
    return 0;
}
I'll call this meta-program MyNthPrime and yours YourNthPrime.
You seem to have much heftier hardware than mine, and certainly more memory. I
have a routine Lenovo ThinkPad T420, 4-core i5, 8GB RAM, 8GB swap, running Linux
Mint 14, kernel 3.5.0. You report it takes you 3 mins. to build YourNthPrime.
Measured by the time command, it takes me 35 mins. to build YourNthPrime,
with no apps running but the terminal and system monitor.
The compiler is GCC 4.7.2 and the command-line options are:
-Wall -O0 -std=c++11 -ftemplate-depth=1200
That elapsed time breaks down as:
real 34m2.534s
user 3m40.414s
sys 0m33.450s
It takes me 1.5 mins to build MyNthPrime, with:
-Wall -O0 -std=c++11 -ftemplate-depth=2100
and the elapsed time breaks down as:
real 1m27.832s
user 1m22.205s
sys 0m2.612s
That -ftemplate-depth=2100 is not a transposition typo. More on this shortly.
MyNthPrime isn't fairly and squarely 23 times faster than YourNthPrime. The
broken-down timings suggest that MyNthPrime is actually around
2.75 times as fast as YourNthPrime on user time. But they also show
that YourNthPrime's build really lost the farm on real time. What was it
doing for the fruitless 9/10ths of its existence? Swapping, of course.
Both builds scoffed 95% of my 8GB system RAM within 45 secs., but MyNthPrime
topped out around there and didn't swap. YourNthPrime went on eating swap
space up to a peak 3.9GB, and long before that all my CPUs were dozing.
This point is notable when you take it with the fact that MyNthPrime
needs getting on for twice as much -ftemplate-depth
as YourNthPrime.
Folk wisdom is that an extravagant -ftemplate-depth
is the road to
ruin for a meta-program build time, because it equates to extravagant
memory consumption, which only has to slide into heavy swapping and
you're watching paint dry. But the run-off of YourNthPrime and MyNthPrime does
not bear this out - quite the opposite. The lesson I take is that
the depth to which you must instantiate recursive templates is not
always a good measure of the quantity of template instantiating you
must do, and it is the quantity that matters for your memory resources.
Although they don't look superficially similar, MyNthPrime and YourNthPrime
both implement the trial-division algorithm for prime generation. MyNthPrime
is that much faster solely because it is rather more shrewdly coded to
be sparing of recursive template instantiations, and of the memory they gobble up.
YourNthPrime fields 4 recursive templates for the computation, all
consuming the same recursive variadic template argument list.
MyNthPrime fields 2: it's just giving the compiler about half as
many enormous instantiations to do.
YourNthPrime (as I read it) perceives the potential efficiency of
doing its trial-divisions in ascending order of the primes in hand - because the chances of
successful division increase toward the smaller primes; and once a prime
in hand exceeds 1/2 of the candidate number being divided, the chances are 0.
Hit the most likely divisors first and you optimize your prospect of a quick
verdict and exit. But an obstacle to exploiting this efficiency is the fact that
the primes in hand are represented by a variadic template argument list with the largest
always at its head. To surmount this obstacle YourNthPrime deploys the recursive
variadic template function lastArg<>(), effectively to reverse the
order in which the primes in hand are consumed in the divisions.
lastArg<>() presents the primes in hand to the template function:
template<unsigned i, unsigned head, unsigned... tail>
constexpr bool isPrime() {
return i % head && isPrime<i, tail...>();
}
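To make that snippet self-contained for experimentation, the base case from the full listing later in this thread can be added; the test values are my own:

```cpp
// Base case: no divisors left to try, so i survived every trial division.
template<unsigned i>
constexpr bool isPrime() { return true; }

// Recursive case: i is prime with respect to head, tail... iff head does
// not divide it and neither does anything in tail... .
template<unsigned i, unsigned head, unsigned... tail>
constexpr bool isPrime() { return i % head && isPrime<i, tail...>(); }

static_assert(isPrime<7, 2, 3>(), "7 has no prime divisor");
static_assert(!isPrime<9, 2, 3>(), "9 is divisible by 3");
```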
in ascending order for recursive trial division of the next candidate i
by the primes head, tail.... It's here, I think, you look to lastArg<>()
to pay its way by ensuring head is always the next-best prospect for
getting you out of the costly righthand side of that &&.
But to achieve this, lastArg<>() itself recursively traverses the whole
list of primes in hand at each invocation, before you even get the
opportunity for a verdict on i. So it would be cheaper just to let
isPrime<>() traverse the primes in hand as far as it needs to, testing i
as it goes, dispense with lastArg<>(), and save all the recursive
instantiations thereof.
The job done by isPrime<>() in YourNthPrime - the recursive trial
division - is done in MyNthPrime by:
template<unsigned N, unsigned D0, unsigned D1, unsigned ...Di>
constexpr bool is_co_prime()
{
    typedef typename std::conditional<
        is_co_prime<N,D1>(),
        typename std::conditional<
            is_co_prime<N,Di...>(),
            boolean<is_co_prime<N,D0>()>,
            std::false_type
        >::type,
        std::false_type
    >::type type;
    return type::value;
}
is_co_prime<>() takes 10 lines to do what isPrime<>() does in one.
I could have done it in one line also:
return is_co_prime<N,D0>() && is_co_prime<N,D1,Di...>();
could be the body of the function. But the ugly trounces the pretty for
efficiency here. Every time isPrime<>() has to recurse into the tail,
that tail is only one prime shorter than it was before. Every time
is_co_prime<>() has to do the same thing, the tail is two primes
shorter than it was before. Its tail recursions are shallower in the
worst case than those of isPrime<>(), and there can be only half as many.
is_co_prime<>() divides the candidate number N by the righthand - the
smallest and likeliest - of any pair of available divisors first,
and returns no-prime on success; otherwise it recurses to any divisors
still to the right of that pair and continues trial division of the
next-but-one until success or exhaustion. Only at exhaustion does it
resort to trial division by the larger of the original pair - the least
likely divisor of any it has tried. And likewise within each of
the intervening smaller recursions, the least likely divisor gets
tried last.
Though this algorithm can be seen to get speedily to the smaller and
likelier divisors of N, an itch will persist to get straight
to the smallest of them first and try them in true descending likelihood,
as per lastArg<>(). We have to quell this itch with the recognition
that any way of "getting straight to the smallest", when it is at the
tail of the list, will be a meta-function that must itself recurse over
the whole list before we even get started with trial divisions.
The implementation of is_co_prime<>() gets us down the list
"two steps at a time", doing the trial division as it does so.
True, sometimes it will unluckily "step over" the largest divisor of N
first time, and then won't find it again until and unless it bottoms out
and recurses backwards up the list. But once N is big enough to make it
matter, there will mostly be at least one smaller divisor further on
to the right, and it would be really unlucky to miss all of them. So the
risk of skipping the largest on the way down is no big deal. Remember
too, there aren't going to be any divisors on the way down till we
reach the N/2 point. That means the worst-case waste of recursions by
skipping the one-and-only divisor of some N on the way down is confined
to the tail of the list from that point.
You speculate that a Sieve of Eratosthenes meta-program might
compile faster, and you are right. As a prime generator, the Sieve has better
theoretical complexity than trial division. An elegant template
meta-program implementation by Peter Simons is
here, dating from 2006 or before. (And
as Peter Wood commented, a non-terminating Sieve of Eratosthenes
meta-program broke the news that the C++ template system is Turing-complete.)
With C++11 facilities Simons' meta-program can be much abbreviated, but
I don't think made much more efficient. Just as it stands, Simons'
Sieve can generate at compile time all the primes up to the 512th in
under 9 secs. on my ThinkPad. It needs -ftemplate-depth=4000 or
thereabouts to do it, but only about 0.5GB of memory - an even more
arresting counterexample to the folk wisdom that big template depth ==
big memory usage.
So Eratosthenes' Sieve leaves trial division floundering in the dust.
Sadly, for the purpose of your exercise, the Sieve is no use. It's called
the sieve because we have to input an integer upper limit U, and it
sieves out the composite numbers between 2 and U, leaving the primes.
Thus to apply it for finding exactly the Nth prime = Pn, and not
possibly any other, you must anyhow compute Pn in some other way,
give the Sieve an upper limit of Pn + 1 (or Pn + 2, for Pn > 2), then
throw away all the Pi, 2 <= Pi < Pn, that it returns to you, and keep
just the Pn that you had already computed. This is a no-op.
Several of the commenters make the point that the identity of any Nth prime
you could possibly generate by compile-time meta-programming will be
known beforehand, or calculable beforehand by much simpler and vastly
faster means. I can't disagree, but I support your general point that
with C++11 facilities, TMP takes a huge stride toward real-world utility
that is well worth exploring - the more so as whatever takes a minute
to compile right now will take a second within a decade.
Meanwhile, without even leaving our uncannily sophisticated C++ compilers,
with TMP problems like this we can still experience the nature of programming
an early computer, with clock speed and memory in the K's, in a "language"
modelled tightly - but within arcane constraints! - on classical recursive
function theory. That's why you've really gotta love them!
I've given this a try myself, and written the following implementation:
template<unsigned... args> constexpr unsigned countArgs();
template<> constexpr unsigned countArgs() { return 0; }
template<unsigned head, unsigned... tail>
constexpr unsigned countArgs() { return 1 + countArgs<tail...>(); }
template<unsigned last>
constexpr unsigned lastArg() { return last; }
template<unsigned head, unsigned next, unsigned... tail>
constexpr unsigned lastArg() { return lastArg<next, tail...>(); }
template<unsigned i> constexpr bool isPrime() { return true; }
template<unsigned i, unsigned head, unsigned... tail>
constexpr bool isPrime()
{ return i % head && isPrime<i, tail...>(); }
template<bool found, unsigned i, unsigned... primesSoFar> struct nextPrime
{ static constexpr unsigned val =
nextPrime<isPrime<i + 2, primesSoFar...>(), i + 2, primesSoFar...>::val; };
template<unsigned i, unsigned... primesSoFar> struct
nextPrime<true, i, primesSoFar...> { static constexpr unsigned val = i; };
template<unsigned n, unsigned... primesSoFar> struct nthPrimeImpl
{ static constexpr unsigned val = nthPrimeImpl<n - 1, primesSoFar...,
nextPrime<false, lastArg<primesSoFar...>(), primesSoFar...>::val>::val; };
template<unsigned... primesSoFar> struct nthPrimeImpl<0, primesSoFar...>
{ static constexpr unsigned val = lastArg<primesSoFar...>(); };
template<unsigned n>
constexpr unsigned nthPrime() {
return n == 1 ? 2 : nthPrimeImpl<n - 2, 3>::val;
}
constexpr unsigned p512 = nthPrime<512>();
static_assert(p512 == 3671, "computed incorrectly");
This requires increasing the maximum template depth of gcc to more than
the default of 900 (in my gcc 4.7.2), e.g. by passing -ftemplate-depth=1200.
And it is way too slow: it takes about 3 minutes on my hardware. So I very
much hope for some more efficient code in a different answer.
In terms of computation method, the above does something like trial division. A sieve of Eratosthenes might perform better, but so far I couldn't think of a way to write that in a constexpr-compatible fashion.
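For what it's worth, the C++14 relaxation of constexpr (loops and local variables allowed inside constexpr functions), which was not yet available when this was written, makes a constexpr sieve straightforward. The following is my own sketch under that assumption; the limit 4000 is chosen because the 512th prime is 3671:

```cpp
// Requires C++14 or later: constexpr functions may contain loops and mutation.
template<unsigned Limit>
constexpr unsigned nthPrimeSieve(unsigned n) {
    bool composite[Limit + 1] = {};  // value-initialized to all-false
    unsigned count = 0;
    for (unsigned i = 2; i <= Limit; ++i) {
        if (!composite[i]) {
            if (++count == n)
                return i;  // i is the n-th prime (1-based)
            for (unsigned j = 2 * i; j <= Limit; j += i)
                composite[j] = true;  // strike out the multiples of i
        }
    }
    return 0;  // Limit was too small to reach the n-th prime
}

static_assert(nthPrimeSieve<4000>(512) == 3671, "computed incorrectly");
```

This needs neither an increased template depth nor an increased constexpr depth, since the only recursion left is the template instantiation of nthPrimeSieve itself.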
The answer by Mike Kinghan got me thinking along lines I hadn't thought of before. If template instantiation is such a problem, causing such severe memory consumption, how can we reduce it? I eventually came up with a scheme where, instead of an argument pack for all primes found so far, I use a chain of types, each one referencing the one before, and a chain of static functions in these types which can use the ones from the types before.
The result I'll paste below is still a lot slower than the one suggested by zch, but I find it interesting enough to share, since it might be a useful approach for other applications.
template<unsigned N> struct NthPrime {
    typedef NthPrime<N - 1> previous;
    static constexpr unsigned prime = previous::nextPrime();
    static constexpr unsigned nextPrime() { return nextCoprime(prime + 2); }
    static constexpr unsigned nextCoprime(unsigned x) {
        // x is a candidate. We recurse to obtain a number which is
        // coprime to all smaller primes, then check that value against
        // the current prime.
        return checkCoprime(previous::nextCoprime(x));
    }
    static constexpr unsigned checkCoprime(unsigned x) {
        // If x is coprime to the current prime as well, then it is the
        // next prime. Otherwise we have to try the next candidate.
        return (x % prime) ? x : nextCoprime(x + 2);
    }
};

template<> struct NthPrime<1> {
    static constexpr unsigned prime = 2;
    static constexpr unsigned nextPrime() {
        return 3;
    }
    static constexpr unsigned nextCoprime(unsigned x) {
        return x; // x is guaranteed to be odd, so no need to check anything.
    }
};

template<unsigned n>
constexpr unsigned nthPrime() {
    return NthPrime<n>::prime;
}
constexpr unsigned p512 = nthPrime<512>();
static_assert(p512 == 3671, "computed incorrectly");
The beast above requires modifications both to the constexpr depth and the template depth. The following values are sharp bounds for my compiler.
time g++-4.7.2 -c -fconstexpr-depth=519 -ftemplate-depth=2042 -std=c++11 foo.cc
real 0m0.397s
user 0m0.368s
sys 0m0.025s