Performance optimization strategies of last resort

2018-12-30 22:48发布

There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.

Let's assume:

  • the code already is working correctly
  • the algorithms chosen are already optimal for the circumstances of the problem
  • the code has been measured, and the offending routines have been isolated
  • all attempts to optimize will also be measured to ensure they do not make matters worse

What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.

Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.

I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.

30条回答
公子世无双
2楼-- · 2018-12-30 23:09

I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:

  • Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.
  • Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
  • Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
  • Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
  • Sequential floating-point ops. Make these SIMD.

And one more thing I like to do:

  • Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.
查看更多
笑指拈花
3楼-- · 2018-12-30 23:10

The single most important limiting factor today is the limited memory bandwitdh. Multicores are just making this worse, as the bandwidth is shared betwen cores. Also, the limited chip area devoted to implementing caches is also divided among the cores and threads, worsening this problem even more. Finally, the inter-chip signalling needed to keep the different caches coherent also increase with an increased number of cores. This also adds a penalty.

These are the effects that you need to manage. Sometimes through micro managing your code, but sometimes through careful consideration and refactoring.

A lot of comments already mention cache friendly code. There are at least two distinct flavors of this:

  • Avoid memory fetch latencies.
  • Lower memory bus pressure (bandwidth).

The first problem specifically has to do with making your data access patterns more regular, allowing the hardware prefetcher to work efficiently. Avoid dynamic memory allocation which spreads your data objects around in memory. Use linear containers instead of linked lists, hashes and trees.

The second problem has to do with improving data reuse. Alter your algorithms to work on subsets of your data that do fit in available cache, and reuse that data as much as possible while it is still in the cache.

Packing data tighter and making sure you use all data in cache lines in the hot loops, will help avoid these other effects, and allow fitting more useful data in the cache.

查看更多
临风纵饮
4楼-- · 2018-12-30 23:11

Did you know that a CAT6 cable is capable of 10x better shielding off extrenal inteferences than a default Cat5e UTP cable?

For any non-offline projects, while having best software and best hardware, if your throughoutput is weak, then that thin line is going to squeeze data and give you delays, albeit in milliseconds... but if you are talking about the last drops, that's a some drops gained, 24/7 for any packge sent or received.

查看更多
路过你的时光
5楼-- · 2018-12-30 23:11

If better hardware is an option then definitely go for that. Otherwise

  • Check you are using the best compiler and linker options.
  • If hotspot routine in different library to frequent caller, consider moving or cloning it to the callers module. Eliminates some of the call overhead and may improve cache hits (cf how AIX links strcpy() statically into separately linked shared objects). This could of course decrease cache hits also, which is why one measure.
  • See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
  • Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
  • Consider: are you really using the best algorithm? Is it the best algorithm for your input size?
查看更多
浪荡孟婆
6楼-- · 2018-12-30 23:14

Although I like Mike Dunlavey's answer, in fact it is a great answer indeed with supporting example, I think it could be expressed very simply thus:

Find out what takes the largest amounts of time first, and understand why.

It is the identification process of the time hogs that helps you understand where you must refine your algorithm. This is the only all-encompassing language agnostic answer I can find to a problem that's already supposed to be fully optimised. Also presuming you want to be architecture independent in your quest for speed.

So while the algorithm may be optimised, the implementation of it may not be. The identification allows you to know which part is which: algorithm or implementation. So whichever hogs the time the most is your prime candidate for review. But since you say you want to squeeze the last few % out, you might want to also examine the lesser parts, the parts that you have not examined that closely at first.

Lastly a bit of trial and error with performance figures on different ways to implement the same solution, or potentially different algorithms, can bring insights that help identify time wasters and time savers.

HPH, asoudmove.

查看更多
忆尘夕之涩
7楼-- · 2018-12-30 23:15

Since many of the performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.

Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record by record processing. This includes not reusing a single record stored procedure when you want to insert 1,000,000 records at once.

Never use select *, only return the fields you actually need. This is especially true if there are any joins as the join fields will be repeated and thus cause unnecesary load on both the server and the network.

Avoid the use of correlated subqueries. Use joins (including joins to derived tables where possible) (I know this is true for Microsoft SQL Server, but test the advice when using a differnt backend).

Index, index, index. And get those stats updated if applicable to your database.

Make the query sargable. Meaning avoid things which make it impossible to use the indexes such as using a wildcard in the first character of a like clause or a function in the join or as the left part of a where statement.

Use correct data types. It is faster to do date math on a date field than to have to try to convert a string datatype to a date datatype, then do the calculation.

Never put a loop of any kind into a trigger!

Most databases have a way to check how the query execution will be done. In Microsoft SQL Server this is called an execution plan. Check those first to see where problem areas lie.

Consider how often the query runs as well as how long it takes to run when determining what needs to be optimized. Sometimes you can gain more perfomance from a slight tweak to a query that runs millions of times a day than you can from wiping time off a long_running query that only runs once a month.

Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time in the past where we couldn't figure out why the page was so slow to load when the stored procedure was fast and found out through profiling that the webpage was asking for the query many many times instead of once.

The profiler will also help you to find who are blocking who. Some queries that execute quickly while running alone may become really slow due to locks from other queries.

查看更多
登录 后发表回答