Hyper-threading Performance Comparison

2019-05-16 01:21发布

问题:

I have written a project, which uses some basic functions in openssl such as RAND_bytes and des_ecb_encrypt.

My computer has i7-2600(4 cores and 8 logic CPU). When I run my project with 4 threads, it will costs 10 seconds. When I run it with 8 threads, it also costs 10 seconds.

What I mean is that hyper-threading doesn't give me any performance improvement. In Linux, the experiment result is same.

I found here tells me that hyper-threading doesn't give me some improvement in some situations. Also, I found here give me some intuitive results.

However, I have tried to write some simple tests and found some simple examples which will show hyper-threading won't give me apparent improvement. Sadly, I don't find it.

So, my questions is that whether there are some simple tests shows the hyper-threading won't give me any performance improvement.

回答1:

You may find that hyperthreading helps more on code that is using large amounts of memory, so that the processor is regularly blocked on fetching from memory.

In my experience, it's quite hard to find "simple code" that shows benefits from hyperthreading. It tends to be more complex examples that show the benefit. Still, the benefit will most likely not be 2x that of "no hyperthreading". Count on getting perhaps 20-30% improvement.



回答2:

Hyper threading takes advantage of the fact that the CPU has many components and when one is used, when there's no hyper threading, the others just sit there idle. You can try writing two types of threads, one doing integer calculations (that will hopefully use the ALU) and one doing floating point arithmetic (that will hopefully use the FPU).

I did not try this myself but it seems that in such a scenario hyper threading should improve the performance.

To show the opposite you can use only one type of the threads (either threads only doing integer operations or threads only doing floating point operations).

It may also be that your test is flawed, but in order to know if that is the case we'll need more information about that test.



回答3:

I have written a project, which use some basic functions in openssl such as RAND_bytes and des_ecb_encrypt... My computer has i7-2600(4 cores and 8 logic CPU). When I run my project with 4 threads, it will costs 10 seconds. When I run it with 8 threads, it also costs 10 seconds.

When using RDRAND (which RAND_bytes will do in this case), the bus us the limiting factor. You should peak at around 800MB/sec. It does not matter how many threads you have - the bus cannot transfer data fast enough. See Intel rdrand instruction revisited.

If you used AES, then you might see a better speedup over the DES/3DES observations. Your Ivy Bridge has AES-NI and it can achieve almost 1.3 cycle/byte, and that should be about double or triple AES is software. To ensure you are using the AES-NI instructions, you have to use the EVP_* interfaces.


I found here tells me that hyper-threading doesn't give me some improvement in some situations. Also, I found here give me some intuitive results.

I think @selalerer and @Mats Petersson answered your question. The problem does not scale linearly and there's a maximum speedup you will encounter. Intel states its about 30%.

Intel's newest architecture favors of Out-Of-Order execution over Hyper-threading execution because its supposed to be more efficient. Read about the Silvermont processor cores.

But if you want a formal deep dive, then see a book on computer engineering. Here's the book we used when I studied it in college: Computer Organization and Design (its probably a bit dated now).


However, I have tried to write some simple tests and found some simple examples which will show hyper-threading won't give me apparent improvement.

OpenSSL also has a benchmarking app. See the source code in <openssl source>/apps/speed.c.

Also, benchmarking apps have their own personalities. An encryption stress test may not reveal the differences as predominantly as you hope to see them. See, for example, Benchmarking Tools.



回答4:

Following are details and results of my MP benchmarks for Linux and Windows, that can behave differently. Not much HT but Linux tests include Atom (1 core 2 threads) and Windows has Core i7 results (4+4).

http://www.roylongbottom.org.uk/linux%20multithreading%20benchmarks.htm

http://www.roylongbottom.org.uk/quad%20core%208%20thread.htm

Take your pick, depending what you want to prove whether HT provides better or worse performance. Following are RandMem results on i7 (Linux seems better using this test). For such as i7, you also need to consider Turbo Boost that might be lower with multiple threads.

             CPUs          MBytes Per Second Using Threads        Gain At Threads
             /HTs         1       2       4       6       8     2     4     6     8
 Serial RD
 Core i7     4/8 L1   11458   22661   37039   43717   46374   2.0   3.2   3.8   4.0
 930             L2   10380   20832   32853   41711   42839   2.0   3.2   4.0   4.1
 #### MHz        L3    8828   17743   29610   38414   40330   2.0   3.4   4.4   4.6
 Win 764        RAM    4266    8712   17347   24946   25589   2.0   4.1   5.8   6.0

 Serial RW
 Core i7     4/8 L1   15282   13724   16240   16209   18379   0.9   1.1   1.1   1.2
 930             L2   12223   18216   25326   28104   27047   1.5   2.1   2.3   2.2
 #### MHz        L3   10234   19266   21931   24450   26351   1.9   2.1   2.4   2.6
 Win 764        RAM    4533    7656   13876   14543   13390   1.7   3.1   3.2   3.0

 Random RD
 Core i7     4/8 L1   11266   22548   38174   45592   47141   2.0   3.4   4.0   4.2
 930             L2    6233   12463   20059   24986   25667   2.0   3.2   4.0   4.1
 #### MHz        L3    3499    6915    9211   10002    9531   2.0   2.6   2.9   2.7
 Win 764        RAM     459     909    1241    1398    1364   2.0   2.7   3.0   3.0

 Random RW
 Core i7     4/8 L1   14375    3027    2780    2901    3297   0.2   0.2   0.2   0.2
 930             L2    5887    4555    6117    6693    7281   0.8   1.0   1.1   1.2
 #### MHz        L3    3104    4604    4721    5047    4933   1.5   1.5   1.6   1.6
 Win 764        RAM     428     860     899     948    1026   2.0   2.1   2.2   2.4

 #### 2.8 GHz running at up to 3.06 GHz via Turbo Boost, dual channel 1066 MHz DDR3 RAM 

Then the MP Whetstone benchmark that shows real gains

                      MWIPS  MFLOP  MFLOP  MFLOP   COS    EXP   FIXPT   IF    EQUAL
CPU              MHz            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

Core i7 1 Thrd  ####   3115   1065    886    738   79.3   39.7   2447   2936   1154

Core i7 Win7    ####  21690   8676   7621   5844    531    291  16643  12027   5034
Quad Core Thread 1            1091   1027    728   66.4   36.5   2050   1501    629
Plus HT   Thread 2            1089   1037    742   66.0   36.5   2090   1507    630
          Thread 3            1090    946    742   66.8   36.5   2069   1534    631
          Thread 4            1092   1037    727   66.6   36.6   2031   1501    630
          Thread 5            1042    959    736   66.4   36.5   1912   1483    630
          Thread 6            1091    874    723   66.6   36.1   2049   1507    629
          Thread 7            1090    867    725   65.6   36.3   2094   1516    631
          Thread 8            1091    874    722   66.3   36.3   2350   1476    624

Gain %                  696    815    860    792    670    733    680    410    436