Benchmarking CPU-bound algorithms/implementations

Let's say I'm writing my own StringBuilder in a compiled language (e.g. C++).

What is the best way to measure the performance of various implementations? Simply timing a few hundred thousand runs yields highly inconsistent results: the timings from one batch to the other can differ by as much as 15%, making it impossible to accurately assess potential performance improvements that yield performance gains smaller than that.

I've done the following:

Disable SpeedStep
Use RDTSC for timing
Run the process with realtime priority
Set the affinity to a single CPU core

This stabilizied the results somewhat. Any other ideas?

标签： performance benchmarking cpu

3条回答

劳资没心，怎么记你

2楼-- · 2019-03-31 07:13

I have achieved 100% consistent results in this manner:

Set up Bochs with MS-DOS.
Set up your toolchain to target MS-DOS
— or —
1. Set up your toolchain to target 32-bit Windows
2. Install the HX-DOS extender in Bochs.
3. If necessary, hack your toolkit's standard library / runtime and stub out/remove features requiring Windows APIs not implemented in HX-DOS. The extender will print a list of unimplemented APIs when you attempt to run the program.
Reduce the number of cycles in your benchmark by a few orders of magnitude.
Wrap the benchmark code with assembler cli / sti instructions (note that the binary won't run on modern OSes after this change).
If you haven't already, make your benchmark use rdtsc deltas for timing. The samples should be within the cli…sti instructions.
Run it in the Bochs!

Bochs screenshot

The result seems to be completely deterministic, but is not an accurate assessment of overall performance (see the discussion under Osman Turan's answer for details).

As a bonus tip, here's an easy way to share files with Bochs (so you don't have to unmount/rebuild/remount the floppy image every time):

On Windows, Bochs will lock the floppy image file, but the file is still opened in shared-write mode. This means that you can't overwrite the file, but you can write to it. (I think *nix OSes might cause overwriting to create a new file, as far as file descriptors are concerned.) The trick is to use dd. I had the following batch script set up:

... benchmark build commands here ...
copy /Y C:\Path\To\Benchmark\Project\test2dos.exe floppy\test2.exe
bfi -t=288 -f=floppysrc.img floppy
dd if=floppysrc.img of=floppy.img

bfi is Bart's Build Floppy Image.

Then, just mount floppy.img in Bochs.

Bonus tip #2: To avoid having to manually start the benchmark every time in Bochs, put an empty go.txt file in the floppy directory, and run this batch in Bochs:

@echo off
A:
:loop
choice /T:y,1 > nul
if not exist go.txt goto loop
del go.txt
echo ---------------------------------------------------
test2
goto loop

It will start the test program every time it detects a fresh floppy image. This way, you can automate a benchmark run in a single script.

Update: this method is not very reliable. Sometimes the timings would change as much as by 200% just by reordering some tests (these timing changes were not observed when ran on real hardware, using the methods described in the original question).

0人赞添加讨论(0) 举报

beautiful°

3楼-- · 2019-03-31 07:14

It's really hard to precisely measure a piece of code. For such requirements, I recommend you to have look at Agner Fog's test suite. By using it, you can measure clock cycles and collect some important factors (such as cache misses, branch mispredictions etc.).

Also, I recommend you to have look at PDF document from Agner's site. It's a really invaluable document to make possible such micro-optimization.

As a side note, actual performance is not a function of "clock cycles". Cache misses can change everything for each run within a real application. So, I would optimize cache misses first. Simply running a piece of code several times for same memory portion, decreases cache miss dramatically. So, it makes it hard to measure precisely. Whole application tuning is usually better idea IMO. Intel VTune and other tools are really good for such usages.

0人赞添加讨论(0) 举报

叼着烟拽天下

4楼-- · 2019-03-31 07:15

I have been concerned about this issue a lot in the past, and I have come to the realization that there is only one ~~perfect~~ ideal solution, which though requires a lot of work, (preparation mostly,) so I never actually did it this way.

The solution is to run your code using a 386 emulator which will tell you exactly how many operations were executed. You should be able to find an open-source 386 emulator out there. It will be accurate to the instruction, and it will require a single run of your test. If you do it, please post how you did it!

0人赞添加讨论(0) 举报

Benchmarking CPU-bound algorithms/implementations

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间