What can cause a program to run much faster the second time?

Posted 2019-02-04 02:19

Something I've noticed when testing code I write is that long-running operations tend to run much longer the first time a program is run than on subsequent runs, sometimes by a factor of 10 or more. Obviously there's some sort of cold cache/warm cache issue here, but I can't seem to figure out what it is.

It's not the CPU cache, since these long-running operations tend to be loops that I feed a lot of data to, and they should be fully loaded after the first iteration. (Plus, unloading and reloading the program should clear the cache.)

Also, it's not the disc cache. I've ruled that out by loading all data from disc up-front and processing it afterwards, and it's the actual CPU-bound data processing that's going slowly.

So what can cause my program to run slow the first time I run it, but then if I close it and run it again, it runs dramatically faster? I've seen this in several different programs that do very different things, so it seems to be a general issue.

EDIT: For clarification, I'm writing in Delphi, though I don't really think this is a Delphi-specific issue. But that means that whatever the problem is, it's not related to JIT issues, garbage collection issues, or any of the other baggage that managed code brings with it. And I'm not dealing with network connections. This is pure CPU-bound processing.

One example: a script compiler. It runs like this:

  • Load entire file into memory from disc
  • Lex the entire file into a queue of tokens
  • Parse the queue into a tree
  • Run codegen on the tree to produce bytecode

If I feed it an enormous script file (~100k lines), after loading the entire thing from disc into memory, the lex step takes about 15 seconds the first time I run it, and 2 seconds on subsequent runs. (And yes, I know that's still a long time. I'm working on that...) I'd like to know where that slowdown is coming from and what I can do about it.

9 answers
贪生不怕死
#2 · 2019-02-04 02:26

I'd guess it's all your libraries/DLLs. These are usually loaded on-demand at run-time, so the first time your program runs the OS will have to read them all from disk. Once read, though, they'll stay loaded unless your system starts running low on memory. So if you run the same program several times in succession, the first run takes the brunt of the load time, and the other runs benefit from the pre-loaded libraries.

神经病院院长
#3 · 2019-02-04 02:32

Just a random guess...

Does your processor support adaptive frequency scaling? Maybe the processor doesn't have time to ramp up its frequency during the first run, and is already running at full speed on the second one.
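One rough way to check for this kind of ramp-up is to time identical CPU-bound batches back to back inside a single process, so that disk and library loading are out of the picture. The sketch below is only an illustration: the function name `time_batches`, the batch count, and the arithmetic in the loop are arbitrary choices, and a slow first batch can also come from other warm-up effects, not only from clock scaling.

```python
import time


def time_batches(batches=10, iterations=200_000):
    """Time identical CPU-bound batches back to back in one process."""
    durations = []
    for _ in range(batches):
        start = time.perf_counter()
        acc = 0
        for i in range(iterations):
            # Fixed integer work: no I/O and no growing allocations.
            acc = (acc * 31 + i) % 1_000_003
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    durations = time_batches()
    for n, d in enumerate(durations):
        print(f"batch {n}: {d * 1000:.1f} ms")
    # A ratio well above 1 means the early batches ran slower than the late ones.
    print(f"first/last ratio: {durations[0] / durations[-1]:.2f}")
```

If every batch takes about the same time, frequency scaling is unlikely to explain a 15-second versus 2-second difference.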

甜甜的少女心
#4 · 2019-02-04 02:35

Three things to try:

  • Run it in a sampling profiler, including a "cold" run (first thing after a reboot). Should usually be enough.
  • Check memory usage: does it grow so high (even transiently) that the OS would have to swap things out of RAM to make room for your app? That alone could explain what you're seeing. Also look at how much free RAM you have when you start your app.
  • Enable system performance tools and check the I/O counters or file accesses, and make sure under FileMon / Process Explorer that you don't have some file or network accesses you've forgotten about (leftover log/test code).
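The memory check in the list above can also be done from inside the program by counting page faults around the suspect phase. The sketch below assumes a Unix-like system, because Python's `resource` module is not available on Windows (there, Process Explorer or the performance counters mentioned above play that role). `measure_faults` and `touch_memory` are hypothetical helper names, and the 4096-byte step is an assumed page size.

```python
import resource  # Unix-only standard library module


def fault_counts():
    """Return (minor, major) page fault counts for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def measure_faults(work):
    """Run work() and return (result, minor_faults, major_faults) incurred meanwhile."""
    min0, maj0 = fault_counts()
    result = work()
    min1, maj1 = fault_counts()
    return result, min1 - min0, maj1 - maj0


def touch_memory(size=32 * 1024 * 1024):
    """Stand-in workload: allocate a buffer and write to each assumed 4 KiB page."""
    buf = bytearray(size)
    for offset in range(0, size, 4096):
        buf[offset] = 1
    return len(buf)


if __name__ == "__main__":
    _, minor, major = measure_faults(touch_memory)
    print(f"minor faults: {minor}, major faults: {major}")
```

Major faults are the ones that had to go to disk, so a large count during a phase that is supposed to be purely CPU-bound would point at paging rather than computation.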
仙女界的扛把子
#5 · 2019-02-04 02:35

where that slowdown is coming from and what I can do about it.

I would say the quicker execution on later runs can come from caching:

  • Disk internal cache (8MB or more)
  • Windows cache for the application, its dependencies (such as DLLs), and core components
  • CPU L3 cache (or L2, if some loops are small enough)

So the first time, you do not benefit from these caching systems.

男人必须洒脱
#6 · 2019-02-04 02:36

I have usually experienced the contrary: for computation-intensive work (if antivirus is not running), I only see a 5-10% difference between runs. For instance, the 6,000,000 regression tests run for our framework have a very constant running time, and it's very disk- and CPU-intensive work.

I really don't believe it's a CPU cache or pipelining / branch prediction issue either, since both the processed data and the code seem to be consistent, as you wrote. If antivirus is off, it may be about OS thread settings: did you try changing the process CPU affinity and priority?

This should be very specific to the process you are running. Without any actual source code to reproduce it, it's almost impossible to tell what's happening on your side. How many threads are there? What is the HW configuration (is there any Intel CPU boost involved, are you using a laptop, and what are your power settings)? Is it using CPU/FPU/MMX/SSE2 (e.g. MMX and FPU do not mix)? Does it move a lot of data, or process some existing data? Does your SW depend on external libraries (even some Windows libraries may need some time to initialize)? How do you use memory (did you try to pre-allocate the memory or, in a multi-threaded application, did you try using a scaling MM instead of FastMM4)?

I think using a sampling profiler may not help so much, since it will change the general CPU core use, but it's worth trying in any case. I'd rather rely on logging-based profiling (see e.g. this class), or you may write your own timestamps to find where the timing changes in your app.
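A minimal sketch of the "write your own timestamps" approach follows, in Python rather than Delphi purely for illustration. The `timed` helper and the `lex` and `parse` functions are hypothetical stand-ins for the real phases, not anything from the original program.

```python
import time
from contextlib import contextmanager

timings = []  # (label, seconds) pairs, in order of completion


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((label, time.perf_counter() - start))


# Hypothetical stand-ins for the real lexer and parser.
def lex(text):
    return text.split()


def parse(tokens):
    return [tokens[i:i + 3] for i in range(0, len(tokens), 3)]


def run(text):
    with timed("lex"):
        tokens = lex(text)
    with timed("parse"):
        tree = parse(tokens)
    return tree


if __name__ == "__main__":
    run("a b c d e f g " * 1000)
    for label, seconds in timings:
        print(f"{label}: {seconds * 1000:.2f} ms")
```

Comparing the logged times of a slow first run against a fast second run shows which phase absorbs the difference, without the overhead a profiler adds.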

AFAIK it has always been written that, when benchmarking, the first run of an application should never be taken into account. Computer systems are so complex nowadays that, the first time, all the internal (SW and HW) plumbing has to be purged, just as you should not drink the first water coming out of your tap when you come back from a month of travel. ;)
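To see how much the first run differs on a given machine, one option is to launch the same program several times as separate processes and compare the first duration with the rest. This is only a sketch: `time_runs` and `summarize` are hypothetical helper names, and the first measured run is truly "cold" only if nothing has loaded the program since boot.

```python
import statistics
import subprocess
import sys
import time


def time_runs(command, runs=5):
    """Launch the command `runs` times as separate processes and time each launch."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Compare the first run with the median of the later runs."""
    first, rest = durations[0], durations[1:]
    median_rest = statistics.median(rest)
    return {"first": first, "median_rest": median_rest, "ratio": first / median_rest}


if __name__ == "__main__":
    # Placeholder command; substitute the program under test.
    print(summarize(time_runs([sys.executable, "-c", "pass"])))
```

Reporting the first run separately, rather than folding it into an average, keeps the cold-start cost visible without letting it distort the steady-state number.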

Animai°情兽
#7 · 2019-02-04 02:40

Even (especially) for very small command-line programs, the issue can be the time it takes to load the process, link to dynamically-linked libraries, etc. I believe modern operating systems avoid repeating a lot of this work if the same program is run twice at once, or repeatedly.

I wouldn't dismiss the CPU cache so easily, either. L1 cache is very relevant for inner loops, but much less so for a second run of the same application. On my cheap Athlon 2 X4 645 system, there's 64K + 64K (data + instruction) of L1 cache per core, which is not exactly a huge amount of memory. L2 cache is IIRC 512K per core, so it is less likely to be dirtied to complete irrelevance by the O/S code needed to start up a new run of the program, calls to operating system services and standard libraries, etc. L3 cache (on CPUs that have it; my Athlon 2 doesn't, IIRC) is larger still, and there may be some even higher-level and larger cache provided by the motherboard/chipset.

There's at least one other kind of cache: branch prediction tables. Though I'd have thought they'd be dirtied to irrelevance even quicker than the L1 cache.

I generally find that unit test programs run many times slower the first time. However, the larger and more complex the program, the less significant the effect.

For some time now, performance of applications has often been considered non-deterministic. Although it isn't strictly true, the performance is determined by so many hard-to-predict factors that it's a good model. For example, if the CPU is a bit warm, the clock speed may be reduced to prevent overheating. And the temperature varies at different parts of the chip, with changes conducting across the chip in complex ways. As changes in clock speed and the different demands of different pieces of code alter the patterns of changing temperature, there's a clear potential for chaotic (as in chaos theory) behaviour.

On some platforms, I wouldn't be surprised if the first run of the program got the processor to run in its "fast" (rather than cool/quiet) mode, and that meant that the beginning of the second run benefitted from that speed boost as well as the end. However, this would be a tricky one: it would have to be a CPU-intensive program, and if your cooling is inadequate, the processor may then slow down again to avoid overheating.
