Clang performance drop for specific C++ random num

Using C++11's random module, I encountered an odd performance drop when using std::mt19937 (32 and 64bit versions) in combination with a uniform_real_distribution (float or double, doesn't matter). Compared to a g++ compile, it's more than an order of magnitude slower!

The culprit isn't just the mt generator, as it's fast with a uniform_int_distribution. And it isn't a general flaw in the uniform_real_distribution since that's fast with other generators like default_random_engine. Just that specific combination is oddly slow.

I'm not very familiar with the intrinsics, but the Mersenne Twister algorithm is more or less strictly defined, so a difference in implementation couldn't account for this difference I guess? measure Program is following, but here are my results for clang 3.4 and gcc 4.8.1 on a 64bit linux machine:

gcc 4.8.1
runtime_int_default: 185.6
runtime_int_mt: 179.198
runtime_int_mt_64: 175.195
runtime_float_default: 45.375
runtime_float_mt: 58.144
runtime_float_mt_64: 94.188

clang 3.4
runtime_int_default: 215.096
runtime_int_mt: 201.064
runtime_int_mt_64: 199.836
runtime_float_default: 55.143
runtime_float_mt: 744.072  <--- this and
runtime_float_mt_64: 783.293 <- this is slow

Program to generate this and try out yourself:

#include <iostream>
#include <vector>
#include <chrono>
#include <random>

template< typename T_rng, typename T_dist>
double time_rngs(T_rng& rng, T_dist& dist, int n){
    std::vector< typename T_dist::result_type > vec(n, 0);
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)
        vec[i] = dist(rng);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto runtime = std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()/1000.0;
    auto sum = vec[0]; //access to avoid compiler skipping
    return runtime;
}

int main(){
    const int n = 10000000;
    unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
    std::default_random_engine rng_default(seed);
    std::mt19937 rng_mt (seed);
    std::mt19937_64 rng_mt_64 (seed);
    std::uniform_int_distribution<int> dist_int(0,1000);
    std::uniform_real_distribution<float> dist_float(0.0, 1.0);

    // print max values
    std::cout << "rng_default_random.max(): " << rng_default.max() << std::endl;
    std::cout << "rng_mt.max(): " << rng_mt.max() << std::endl;
    std::cout << "rng_mt_64.max(): " << rng_mt_64.max() << std::endl << std::endl;

    std::cout << "runtime_int_default: " << time_rngs(rng_default, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt_64: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;
    std::cout << "runtime_float_default: " << time_rngs(rng_default, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt: " << time_rngs(rng_mt, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt_64: " << time_rngs(rng_mt_64, dist_float, n) << std::endl;
}

compile via clang++ -O3 -std=c++11 random.cpp or g++ respectively. Any ideas?

edit: Finally, Matthieu M. had a great idea: The culprit is inlining, or rather a lack thereof. Increasing the clang inlining limit eliminated the performance penalty. That actually solved a number of performance oddities I encountered. Thanks, I learned something new.

As already stated in the comments, the problem is caused by the fact that gcc inlines more aggressive than clang. If we make clang inline very aggressively, the effect disappears:

Compiling your code with g++ -O3 yields

runtime_int_default: 3000.32
runtime_int_mt: 3112.11
runtime_int_mt_64: 3069.48
runtime_float_default: 859.14
runtime_float_mt: 1027.05
runtime_float_mt_64: 1777.48

while clang++ -O3 -mllvm -inline-threshold=10000 yields

runtime_int_default: 3623.89
runtime_int_mt: 751.484
runtime_int_mt_64: 751.132
runtime_float_default: 1072.53
runtime_float_mt: 968.967
runtime_float_mt_64: 1781.34

Apparently, clang now out-inlines gcc in the int_mt cases, but all of the other runtimes are now in the same order of magnitude. I used gcc 4.8.3 and clang 3.4 on Fedora 20 64 bit.