what is the performance impact of using int64_t in

Posted 2019-01-16 12:16

Question:

Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year-2038 problem in some places. So I'm thinking about completely switching to a single Time class with an underlying int64_t value, to replace the time_t value in all places.
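As a rough sketch of what such a single Time class could look like: the class name comes from the question, but the microsecond resolution, the member names and the from_time_t/to_time_t helpers are assumptions made here for illustration, not part of the actual library.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical wrapper: microseconds since the Unix epoch in one int64_t.
class Time {
public:
    static const int64_t kMicrosPerSecond = 1000000;

    Time() : micros_(0) {}
    explicit Time(int64_t micros) : micros_(micros) {}

    // Widen an existing time_t (whole seconds) to the 64-bit representation.
    static Time from_time_t(std::time_t t) {
        return Time(static_cast<int64_t>(t) * kMicrosPerSecond);
    }

    // Truncates toward zero; only meaningful while the result fits in time_t.
    std::time_t to_time_t() const {
        return static_cast<std::time_t>(micros_ / kMicrosPerSecond);
    }

    int64_t micros() const { return micros_; }

    Time operator+(Time rhs) const { return Time(micros_ + rhs.micros_); }
    Time operator-(Time rhs) const { return Time(micros_ - rhs.micros_); }
    bool operator<(Time rhs) const { return micros_ < rhs.micros_; }
    bool operator==(Time rhs) const { return micros_ == rhs.micros_; }

private:
    int64_t micros_;
};
```

With this shape, every arithmetic operation on a time value is a single 64-bit add, subtract or compare, plus one 64-bit multiply or divide at the time_t boundary, which is where the cost on a 32-bit target would show up.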

Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. IIUC the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to use a more differentiated way for dealing with time values, which might make the software more difficult to maintain.

What I'm interested in:

  • which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
  • which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
  • are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
  • does anyone have first-hand experience with this performance impact?

I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems; but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).

Answer 1:

which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well?

Mostly the processor architecture (and model; wherever I mention processor architecture in this section, read it as including the model). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.

The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler which changes what the compiler does" in some cases - but that's probably a small effect).

Will a normal 32-bit system use the 64-bit registers of modern CPUs?

This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system; the extra 32 bits of the registers are completely invisible, just as they would be if the system were actually a "true 32-bit system".

which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?

Addition and subtraction get worse, because each has to be done as a sequence of two operations, and the second operation requires the first to have completed. That is not the case when the compiler is just producing two add operations on independent data.

Multiplication will get a lot worse if the input parameters are actually 64 bits wide, so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is because the processor can do a 32 x 32 bit multiply into a 64-bit result pretty well, in some 5-10 clock cycles, but a 64 x 64 bit multiply requires a fair bit of extra code and so takes longer.

Division is a similar problem to multiplication, but here it's OK to take a 64-bit input on one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work, the 64-bit divide is probably nearly always slow.

The data will also take twice as much cache space, which may affect the results. As a similar consequence, general assignment and passing data around will take at least twice as long, since there is twice as much data to operate on.

The compiler will also need to use more registers.

are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?

Probably, but I'm not aware of any. And even if there are, it would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.

If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter if Benchmark X gives 5%, 25% or 103% slower results, if your code is some completely different amount slower or faster under the same circumstances.
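One portable way to time a representative piece of your own code is std::chrono, sketched below. The function name timed_sum and its parameters are invented for this example, and the summing loop is only a stand-in for whatever your application really does with its time values.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums the vector `repeats` times and reports the elapsed wall time in
// nanoseconds through `ns_out`. Returns the last computed sum.
template <typename T>
T timed_sum(const std::vector<T>& v, int repeats, long long& ns_out) {
    typedef std::chrono::steady_clock clock;
    volatile T sink = 0;  // volatile store keeps the result from being discarded
    clock::time_point start = clock::now();
    for (int r = 0; r < repeats; ++r) {
        T sum = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            sum += v[i];
        sink = sum;
    }
    clock::time_point stop = clock::now();
    ns_out = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return sink;
}
```

Instantiating it once with uint32_t and once with uint64_t, and building with both -m32 and -m64, gives a first impression; an optimizer may still simplify such a small loop, so the numbers are only indicative.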

does anyone have first-hand experience with this performance impact?

I've recompiled some code that uses 64-bit integers for a 64-bit architecture, and found that performance improved by a substantial amount, as much as 25% on some bits of code.

Perhaps changing your OS to a 64-bit version of the same OS would help?

Edit:

Because I like to find out what the difference is in this sort of thing, I have written a bit of code, with some primitive templates (still learning that bit; templates aren't exactly my strongest topic, I must say. Give me bit fiddling and pointer arithmetic, and I'll (usually) get it right...)

Here's the code I wrote, trying to replicate a few common functions:

#include <iostream>
#include <cstdint>
#include <ctime>

using namespace std;

static __inline__ uint64_t rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}

template<typename T>
static T add_numbers(const T *v, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i];
    return sum;
}


template<typename T, const int size>
static T add_matrix(const T v[size][size])
{
    T sum[size] = {};
    for(int i = 0; i < size; i++)
    {
        for(int j = 0; j < size; j++)
            sum[i] += v[i][j];
    }
    T tsum=0;
    for(int i = 0; i < size; i++)
        tsum += sum[i];
    return tsum;
}



template<typename T>
static T add_mul_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] * mul;
    return sum;
}

template<typename T>
static T add_div_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] / mul;
    return sum;
}


template<typename T> 
void fill_array(T *v, const int size)
{
    for(int i = 0; i < size; i++)
        v[i] = i;
}

template<typename T, const int size> 
void fill_array(T v[size][size])
{
    for(int i = 0; i < size; i++)
        for(int j = 0; j < size; j++)
            v[i][j] = i + size * j;
}




uint32_t bench_add_numbers(const uint32_t v[], const int size)
{
    uint32_t res = add_numbers(v, size);
    return res;
}

uint64_t bench_add_numbers(const uint64_t v[], const int size)
{
    uint64_t res = add_numbers(v, size);
    return res;
}

uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_mul_numbers(v, c, size);
    return res;
}

uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_mul_numbers(v, c, size);
    return res;
}

uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_div_numbers(v, c, size);
    return res;
}

uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_div_numbers(v, c, size);
    return res;
}


template<const int size>
uint32_t bench_matrix(const uint32_t v[size][size])
{
    uint32_t res = add_matrix(v);
    return res;
}
template<const int size>
uint64_t bench_matrix(const uint64_t v[size][size])
{
    uint64_t res = add_matrix(v);
    return res;
}


template<typename T>
void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
{
    fill_array(v, size);

    uint64_t t = rdtsc();
    T res = func(v, size);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}

template<typename T, const int size>
void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
{
    fill_array(v);

    uint64_t t = rdtsc();
    T res = func(v);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}


int main()
{
    // spin up CPU to full speed...
    time_t t = time(NULL);
    while(t == time(NULL)) ;

    const int vsize=10000;

    uint32_t v32[vsize];
    uint64_t v64[vsize];

    uint32_t m32[100][100];
    uint64_t m64[100][100];


    runbench(bench_add_numbers, "Add 32", v32, vsize);
    runbench(bench_add_numbers, "Add 64", v64, vsize);

    runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
    runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);

    runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
    runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);

    runbench2(bench_matrix, "Matrix 32", m32);
    runbench2(bench_matrix, "Matrix 64", m64);
}

Compiled with:

g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x

Note: see the 2016 results below. The results here are slightly optimistic, because SSE instructions are used in 64-bit mode but not in 32-bit mode.

And the results are:

result = 49995000
Add 32 time in clocks 20784
result = 49995000
Add 64 time in clocks 30358
result = 349965000
Add Mul 32 time in clocks 30182
result = 349965000
Add Mul 64 time in clocks 79081
result = 7137858
Add Div 32 time in clocks 60167
result = 7137858
Add Div 64 time in clocks 457116
result = 49995000
Matrix 32 time in clocks 22831
result = 49995000
Matrix 64 time in clocks 23823

As you can see, addition and multiplication aren't that much worse. Division gets really bad. Interestingly, the matrix addition shows hardly any difference at all.

And is it faster on 64-bit, I hear some of you ask? Using the same compiler options, just -m64 instead of -m32: yup, a lot faster:

result = 49995000
Add 32 time in clocks 8366
result = 49995000
Add 64 time in clocks 16188
result = 349965000
Add Mul 32 time in clocks 15943
result = 349965000
Add Mul 64 time in clocks 35828
result = 7137858
Add Div 32 time in clocks 50176
result = 7137858
Add Div 64 time in clocks 50472
result = 49995000
Matrix 32 time in clocks 12294
result = 49995000
Matrix 64 time in clocks 14733

Edit, update for 2016: four variants, with and without SSE, in 32- and 64-bit mode of the compiler.

I'm typically using clang++ as my compiler these days. I also tried compiling with g++ (it would have been a different version than above anyway, as I've updated my machine, and I have a different CPU too), but since g++ failed to compile the no-SSE version in 64-bit mode, I didn't see the point in pursuing it. (g++ gives similar results anyway.)

As a short table:

Test name      | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
----------------------------------------------------------
Add uint32_t   |   20837   |   10221   |   3701 |   3017 |
----------------------------------------------------------
Add uint64_t   |   18633   |   11270   |   9328 |   9180 |
----------------------------------------------------------
Add Mul 32     |   26785   |   18342   |  11510 |  11562 |
----------------------------------------------------------
Add Mul 64     |   44701   |   17693   |  29213 |  16159 |
----------------------------------------------------------
Add Div 32     |   44570   |   47695   |  17713 |  17523 |
----------------------------------------------------------
Add Div 64     |  405258   |   52875   | 405150 |  47043 |
----------------------------------------------------------
Matrix 32      |   41470   |   15811   |  21542 |   8622 |
----------------------------------------------------------
Matrix 64      |   22184   |   15168   |  13757 |  12448 |

Full results with compile options.

$ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 20837
result = 49995000
Add 64 time in clocks 18633
result = 349965000
Add Mul 32 time in clocks 26785
result = 349965000
Add Mul 64 time in clocks 44701
result = 7137858
Add Div 32 time in clocks 44570
result = 7137858
Add Div 64 time in clocks 405258
result = 49995000
Matrix 32 time in clocks 41470
result = 49995000
Matrix 64 time in clocks 22184

$ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3701
result = 49995000
Add 64 time in clocks 9328
result = 349965000
Add Mul 32 time in clocks 11510
result = 349965000
Add Mul 64 time in clocks 29213
result = 7137858
Add Div 32 time in clocks 17713
result = 7137858
Add Div 64 time in clocks 405150
result = 49995000
Matrix 32 time in clocks 21542
result = 49995000
Matrix 64 time in clocks 13757


$ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3017
result = 49995000
Add 64 time in clocks 9180
result = 349965000
Add Mul 32 time in clocks 11562
result = 349965000
Add Mul 64 time in clocks 16159
result = 7137858
Add Div 32 time in clocks 17523
result = 7137858
Add Div 64 time in clocks 47043
result = 49995000
Matrix 32 time in clocks 8622
result = 49995000
Matrix 64 time in clocks 12448


$ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 10221
result = 49995000
Add 64 time in clocks 11270
result = 349965000
Add Mul 32 time in clocks 18342
result = 349965000
Add Mul 64 time in clocks 17693
result = 7137858
Add Div 32 time in clocks 47695
result = 7137858
Add Div 64 time in clocks 52875
result = 49995000
Matrix 32 time in clocks 15811
result = 49995000
Matrix 64 time in clocks 15168


Answer 2:

More than you ever wanted to know about doing 64-bit math in 32-bit mode...

When you use 64-bit numbers in 32-bit mode (even on a 64-bit CPU, if the code is compiled for 32-bit), they are stored as two separate 32-bit numbers, one holding the higher bits of the number and the other holding the lower bits. The impact of this depends on the instruction. (tl;dr: in theory, 64-bit math on a 32-bit CPU is about 2 times slower as long as you don't divide or take a modulo. In practice the difference is going to be smaller (1.3x would be my guess), because programs usually don't do only math on 64-bit integers, and because of pipelining the difference may be smaller still in your program.)

Addition/subtraction

Many architectures support a so-called carry flag. It is set when the result of an addition overflows. For subtraction the convention varies: x86 sets the flag when the subtraction borrows (underflows), while ARM, like the notation used below, sets it when the subtraction does not underflow. The behaviour of this bit can be shown with long addition and long subtraction. C in this example shows either a bit higher than the highest representable bit (during the operation) or the carry flag (after the operation).

  C 7 6 5 4 3 2 1 0      C 7 6 5 4 3 2 1 0
  0 1 1 1 1 1 1 1 1      1 0 0 0 0 0 0 0 0
+   0 0 0 0 0 0 0 1    -   0 0 0 0 0 0 0 1
= 1 0 0 0 0 0 0 0 0    = 0 1 1 1 1 1 1 1 1

Why is the carry flag relevant? Well, it just so happens that CPUs usually have two separate addition and subtraction operations. On x86, the addition operations are called add and adc. add stands for addition, while adc stands for addition with carry. The difference between them is that adc takes the carry bit into account: if it is set, adc adds one to the result.

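Similarly, subtraction with carry subtracts an extra 1 from the result if the carry bit is not set. That is the convention used in the example above; x86's sbb uses the opposite convention and subtracts the extra 1 when the carry flag is set.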

This behaviour makes it easy to implement addition and subtraction on integers of arbitrary size. The result of adding x and y (assuming those are 8-bit) is never bigger than 0x1FE. If you add 1 for an incoming carry, you get 0x1FF. Nine bits are therefore enough to represent the result of any 8-bit addition. If you start the addition with add, and then add every chunk of bits beyond the initial one with adc, you can do addition on data of any size you like.

Addition of two 64-bit values on a 32-bit CPU works as follows.

  1. Add the lower 32 bits of b to the lower 32 bits of a.
  2. Add with carry the higher 32 bits of b to the higher 32 bits of a.

Subtraction works analogously.

This gives 2 instructions. However, because of instruction pipelining, it may be slower than that: the second calculation depends on the first one finishing, so if the CPU has nothing to do other than the 64-bit addition, it may have to wait for the first addition to complete.
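The two-step scheme can be imitated in portable C++ to make the dependency visible. This is only a sketch of what the add/adc and sub/sbb pairs compute; the type U64Parts and the function names are invented for the example, and a real compiler emits the flag-based instructions directly rather than a comparison.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit value held as two 32-bit halves, as a 32-bit target keeps it.
struct U64Parts {
    uint32_t lo;
    uint32_t hi;
};

U64Parts add64(U64Parts a, U64Parts b) {
    U64Parts r;
    r.lo = a.lo + b.lo;                        // "add": wraps modulo 2^32
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // a wrapped sum is smaller than its operand
    r.hi = a.hi + b.hi + carry;                // "adc": needs the carry from the low half
    return r;
}

U64Parts sub64(U64Parts a, U64Parts b) {
    U64Parts r;
    uint32_t borrow = (a.lo < b.lo) ? 1u : 0u; // low half underflows
    r.lo = a.lo - b.lo;                        // "sub"
    r.hi = a.hi - b.hi - borrow;               // "sbb": needs the borrow from the low half
    return r;
}

// Helpers to convert to and from a native 64-bit integer for checking.
U64Parts split(uint64_t v) {
    U64Parts p = { static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) };
    return p;
}

uint64_t join(U64Parts p) {
    return (static_cast<uint64_t>(p.hi) << 32) | p.lo;
}
```

In both functions the high half cannot be computed until the low half is known, which is the serial dependency described above.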

Multiplication

It so happens that on x86 the one-operand forms of imul and mul store the overflow, that is the upper 32 bits of the product, in the edx register. Therefore, multiplying two 32-bit values to get a 64-bit value is really easy. Such a multiplication is one instruction, but to make use of it, one of the values being multiplied must be stored in eax.

Anyway, for the more general case of multiplying two 64-bit values, the result can be calculated as follows (assume function r removes bits beyond 32 bits).

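First of all, it's easy to see that the lower 32 bits of the result are the lower 32 bits of the product of the lower 32 bits of the two operands. This is due to the congruence relation (here with n = 2^32):

$$a_1 \equiv b_1 \pmod{n}, \quad a_2 \equiv b_2 \pmod{n} \;\Longrightarrow\; a_1 a_2 \equiv b_1 b_2 \pmod{n}$$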

Therefore, the task is reduced to determining the higher 32 bits. To calculate the higher 32 bits of the result, the following values should be added together.

  • Higher 32 bits of the product of both lower 32 bits (the overflow which the CPU can store in edx)
  • Higher 32 bits of the first variable multiplied by the lower 32 bits of the second variable (only the lower 32 bits of this product matter)
  • Lower 32 bits of the first variable multiplied by the higher 32 bits of the second variable (again, only the lower 32 bits of this product matter)

This gives about 5 instructions. However, because of the relatively limited number of registers on x86 (ignoring extensions to the architecture), they cannot take much advantage of pipelining. Enable SSE if you want to improve the speed of multiplication, as this increases the number of registers.
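The three terms listed above can be written out in portable C++ as a sketch. The function name mul64_from_halves is made up for this example, and the code only mirrors the arithmetic; it does not claim to match the exact instruction sequence a particular compiler emits.

```cpp
#include <cassert>
#include <cstdint>

// Low 64 bits of a * b, built only from 32-bit multiplications.
uint64_t mul64_from_halves(uint64_t a, uint64_t b) {
    uint32_t a_lo = static_cast<uint32_t>(a);
    uint32_t a_hi = static_cast<uint32_t>(a >> 32);
    uint32_t b_lo = static_cast<uint32_t>(b);
    uint32_t b_hi = static_cast<uint32_t>(b >> 32);

    // Full 64-bit product of the low halves (edx:eax after a 32-bit mul on x86).
    uint64_t low_product = static_cast<uint64_t>(a_lo) * b_lo;

    // Cross products: only their low 32 bits can reach the upper half of the
    // result, so plain wrapping 32-bit multiplications are enough.
    uint32_t cross = a_hi * b_lo + a_lo * b_hi;

    uint32_t result_lo = static_cast<uint32_t>(low_product);
    uint32_t result_hi = static_cast<uint32_t>(low_product >> 32) + cross;
    return (static_cast<uint64_t>(result_hi) << 32) | result_lo;
}
```

The product of the two high halves never appears, because it only affects bits 64 and above, which a 64-bit result discards.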

Division/Modulo (both are similar in implementation)

I don't know how it works, but it's much more complex than addition, subtraction or even multiplication. It's likely to be ten times slower than division on a 64-bit CPU, however. Check "Art of Computer Programming, Volume 2: Seminumerical Algorithms", page 257 for more details if you can understand it (unfortunately, I don't understand it well enough to explain it).

If you divide by a power of 2, please refer to the shifting section, because that is essentially what the compiler can optimize the division to (plus adding the most significant bit before shifting for signed numbers).
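A sketch of the shift-based replacement for signed division by a power of two follows. The function name div_pow2 is invented here, and the code assumes that >> on a negative value is an arithmetic shift, which mainstream compilers provide but which the language only guarantees from C++20 on. For a divisor of 2 the bias is exactly the most significant bit; for 2^k it generalizes to 2^k - 1.

```cpp
#include <cassert>
#include <cstdint>

// Signed division by 2^k, rounding toward zero like the C++ `/` operator.
int64_t div_pow2(int64_t x, unsigned k) {
    assert(k < 63);
    // Negative values need a bias of 2^k - 1 so the shift rounds toward zero
    // instead of toward negative infinity. For k == 1 this is the sign bit.
    int64_t bias = (x < 0) ? ((static_cast<int64_t>(1) << k) - 1) : 0;
    // Assumption: arithmetic right shift for negative operands.
    return (x + bias) >> k;
}
```

On a 32-bit target the remaining 64-bit shift is then handled as described in the shifting section.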

Or/And/Xor

Since these are purely bitwise operations, nothing special happens here: the bitwise operation is simply done twice, once for each half.

Shifting left/right

Interestingly, x86 actually has an instruction that helps with a 64-bit left shift, called shld. Instead of replacing the least significant bits of a value with zeros, it replaces them with the most significant bits of a different register. The same applies to right shifts with the shrd instruction. This easily makes 64-bit shifting a two-instruction operation.

However, that's only the case for constant shifts. When the shift is not constant, things get trickier, as the x86 architecture only supports shift counts of 0-31 for 32-bit operands: the hardware uses only the low 5 bits of the count, so in effect a bitwise and with 0x1F is performed on it. Therefore, when the shift count is higher than 31, one of the two halves is cleared entirely (for a left shift that's the lower half, for a right shift that's the higher half). The other half receives the value that was in the cleared half, and then the shift operation is performed on it. As a result, this depends on the branch predictor making good predictions, and it is a bit slower because the count needs to be checked.
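The case distinction for a variable left shift can be sketched in portable C++ as follows. The type Halves and the function shl64 are names invented for this example; the n < 32 branch corresponds to the shld plus shl pair, and the other branch to moving the lower half up and clearing it.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit value held as two 32-bit halves.
struct Halves {
    uint32_t lo;
    uint32_t hi;
};

// Left shift by n bits, for 0 <= n < 64.
Halves shl64(Halves v, unsigned n) {
    assert(n < 64);
    Halves r;
    if (n == 0) {
        r = v;                                    // avoids a 32-bit shift by 32 below
    } else if (n < 32) {
        r.hi = (v.hi << n) | (v.lo >> (32 - n));  // shld: bits leaving lo enter hi
        r.lo = v.lo << n;
    } else {
        r.hi = v.lo << (n - 32);                  // lower half moves up entirely
        r.lo = 0;                                 // and is cleared
    }
    return r;
}
```

The branch on n is the check mentioned above; a right shift is the mirror image, with the higher half being cleared.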

__builtin_popcount[ll]

__builtin_popcount(lower) + __builtin_popcount(higher)
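Spelled out as a function, the expression above looks like this. __builtin_popcount is a GCC/Clang builtin taking an unsigned int; the wrapper name popcount64_split is made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Population count of a 64-bit value using two 32-bit popcounts (GCC/Clang only).
int popcount64_split(uint64_t v) {
    uint32_t lower = static_cast<uint32_t>(v);
    uint32_t higher = static_cast<uint32_t>(v >> 32);
    return __builtin_popcount(lower) + __builtin_popcount(higher);
}
```

The two counts are independent of each other, so unlike addition there is no serial dependency between the halves.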

Other builtins

I'm too lazy to finish the answer at this point. Does anyone even use those?

Unsigned vs signed

Addition, subtraction, multiplication, or, and, xor and shift left generate exactly the same code. Shift right uses only slightly different code (arithmetic shift vs logical shift), but structurally it's the same. Division, however, is likely to generate different code, and signed division is likely to be slower than unsigned division.

Benchmarks

Benchmarks? They are mostly meaningless, as instruction pipelining will usually make things faster when you don't constantly repeat the same operation. Feel free to consider division slow, but nothing else really is, and when you get outside of benchmarks, you may notice that because of pipelining, doing 64-bit operations on a 32-bit CPU is not slow at all.

Benchmark your own application, don't trust micro-benchmarks that don't do what your application does. Modern CPUs are quite tricky, so unrelated benchmarks can and will lie.



Answer 3:

Your question sounds rather odd given its context. You use time_t, which takes up 32 bits. You need additional information, which means more bits, so you are forced to use something bigger than int32. The performance then hardly matters, right? The choice is between using, say, just 40 bits or going all the way to int64. Unless millions of instances of it must be stored, the latter is the sensible choice.

As others pointed out, the only way to know the true performance is to measure it with a profiler (for coarse measurements a simple clock will do), so just go ahead and measure. It should not be hard to globally replace your time_t usage with a typedef, redefine it as 64-bit, and patch up the few places where a real time_t was expected.

My bet would be on "unmeasurable difference" unless your current time_t instances take up at least a few megabytes of memory. On current Intel-like platforms the cores spend most of their time waiting for external memory to get into the cache. A single cache miss stalls for a hundred or more cycles, which makes calculating 1-tick differences between instructions pointless. Your real performance may drop because of things like your current structure just fitting a cache line while the bigger one needs two. And if you have never measured your current performance, you might discover that you could gain an extreme speedup in some functions just by adding some alignment or exchanging the order of some members in a structure. Or by using pack(1) on the structure instead of the default layout...
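The effect of member order on structure size can be checked with sizeof. The two record types below are hypothetical, with member names invented for the example, and the exact sizes depend on the ABI, so treat this as a sketch rather than a statement about your structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record with a 64-bit time value between two small members.
struct RecordPadded {
    char    flag;   // 1 byte, then padding up to the alignment of int64_t
    int64_t when;
    char    kind;   // 1 byte, then tail padding
};

// Same members, widest first.
struct RecordReordered {
    int64_t when;
    char    flag;
    char    kind;   // both chars share a single padded tail
};

// Bytes lost to padding in a record whose payload is one int64_t and two chars.
template <typename T>
std::size_t padding_bytes() {
    return sizeof(T) - (sizeof(int64_t) + 2 * sizeof(char));
}
```

On a typical x86-64 ABI the first layout takes 24 bytes and the second 16; 32-bit ABIs may align int64_t differently, so it is worth printing sizeof on each target you care about.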



Answer 4:

Addition and subtraction basically become two cycles each; multiplication and division depend on the actual CPU. The general performance impact will be rather low.

Note that Intel Core 2 supports EM64T.