Buffer growth strategy

Posted 2019-02-08 10:45

Question:

I have a generic growing buffer intended to accumulate "random" string pieces and then fetch the result. The code that handles the buffer is written in plain C.

Pseudocode API:

void write(buffer_t * buf, const unsigned char * bytes, size_t len);  /* appends */
const unsigned char * buffer(buffer_t * buf);  /* returns accumulated data */
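For concreteness, here is a minimal sketch of one way such a buffer_t might look with fixed-increment growth; the struct layout, the 64-byte step, and the omitted error handling are illustrative assumptions, not the actual implementation:

#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char * data;   /* accumulated bytes */
    size_t len;             /* bytes written so far */
    size_t cap;             /* bytes allocated */
} buffer_t;

void write(buffer_t * buf, const unsigned char * bytes, size_t len)
{
    if (buf->len + len > buf->cap) {
        size_t new_cap = buf->cap;
        while (new_cap < buf->len + len)
            new_cap += 64;                       /* fixed-size increment */
        buf->data = realloc(buf->data, new_cap); /* error handling omitted */
        buf->cap = new_cap;
    }
    memcpy(buf->data + buf->len, bytes, len);
    buf->len += len;
}

const unsigned char * buffer(buffer_t * buf)
{
    return buf->data;
}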

I'm thinking about the growth strategy I should pick for that buffer.

I do not know whether my users would prefer memory or speed, or what the nature of their data will be.

I've seen two strategies in the wild: grow the buffer by fixed-size increments (that is what I've currently implemented) or grow it exponentially. (There is also the strategy of allocating exactly the amount of memory needed, but that is not very interesting in my case.)

Perhaps I should let the user pick the strategy... but that would make the code a bit more complex...

Once upon a time, Herb Sutter wrote (referencing Andrew Koenig) that the best strategy is probably exponential growth with a factor of 1.5 (search for "Growth Strategy"). Is this still the best choice?

Any advice? What does your experience say?

Answer 1:

Unless you have a good reason to do otherwise, exponential growth is probably the best choice. Using 1.5 as the growth factor isn't really magical, and in fact that's not what Andrew Koenig originally said. What he originally said was that the growth factor should be less than (1+sqrt(5))/2 (~1.6).

Pete Becker says that when he was at Dinkumware, P.J. Plauger, the owner of Dinkumware, told him they had done some testing and found that 1.5 worked well. When you allocate a block of memory, the allocator will usually allocate a block that's at least slightly larger than you requested, to give it room for a little book-keeping information. My guess (though unconfirmed by any testing) is that reducing the factor a little lets the real block size still fit within that limit.
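For illustration, a minimal sketch of a 1.5x grow step in C; the cap + cap / 2 form and the floor of 16 are just one way to write it:

#include <stddef.h>

size_t grow_capacity(size_t cap, size_t needed)
{
    /* 1.5x growth with a small floor; real code should also guard
       against size_t overflow */
    size_t new_cap = cap ? cap + cap / 2 : 16;
    return new_cap > needed ? new_cap : needed;
}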

References: I believe Andrew originally published this in a magazine (the Journal of Object Oriented Programming, IIRC) which hasn't been published in years now, so getting a re-print would probably be quite difficult.

Andrew Koenig's Usenet post, and P.J. Plauger's Usenet post.



Answer 2:

The exponential growth strategy is used throughout the STL and it seems to work fine. I'd say stick with that, at least until you find a definite case where it won't work.



Answer 3:

I usually use a combination of adding a small fixed amount and multiplying by 1.5, because it is efficient to implement and leads to reasonable step widths that are bigger at first and more memory-friendly as the buffer grows. As the fixed offset I usually use the initial size of the buffer, and I start with rather small initial sizes:

new_size = old_size + ( old_size >> 1 ) + initial_size;

As the initial_size I use 4 for collection types; 8, 12, or 16 for string types; and 128 to 4096 for input/output buffers, depending on the context.

In the early steps this grows much faster than multiplying by 1.5 alone. For example, starting with 100 you would need 6 increases to accommodate 3000 elements, while multiplying by 1.5 alone would need 9.

At larger sizes the influence of the addition becomes negligible, so both approaches then scale equally well by a factor of about 1.5. These are the effective growth factors if you use the initial size as the fixed amount for the addition:

2.5
1.9
1.7
1.62
1.57
1.54
1.53
1.52
1.51
1.5
...
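To check the arithmetic, a tiny illustrative program that applies this rule starting from 100, as in the example above:

#include <stdio.h>

int main(void)
{
    const size_t initial_size = 100;
    size_t cap = initial_size;
    int steps = 0;

    while (cap < 3000) {
        cap = cap + (cap >> 1) + initial_size;   /* old + 50% + fixed offset */
        steps++;
        printf("step %d: capacity %zu\n", steps, cap);
    }
    return 0;
}

It reaches 3000 after 6 steps, while plain 1.5x growth from 100 takes 9.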


Answer 4:

The key point is that the exponential growth strategy lets you avoid expensive copies of the buffer contents whenever you hit the current size, at the cost of some wasted memory. The article you link has the numbers for the trade-off.



Answer 5:

The answer, as always, is "it depends".

The idea behind exponential growth, i.e. allocating a new buffer that is x times the current size, is that as you require more buffer you'll keep needing more, and the chances are you'll need much more than a small fixed increment provides.

So, if you have an 8-byte buffer and need more, allocating an extra 8 bytes is OK, but allocating an additional 16 bytes is probably a better idea; someone with a 16-byte buffer is unlikely to need just 1 extra byte. And if they do, all that happens is that a little memory is wasted.

I thought the best growth factor was 2, i.e. doubling your buffer, but if Koenig/Sutter say 1.5 is optimal, then I agree with them. You may want to tweak your growth rate after getting some usage statistics, though.

So exponential growth is a good trade-off between performance and keeping memory usage low.



Answer 6:

  • Double the size until a threshold (~100 MB?) and then lower the growth factor toward 1.5, ..., 1.3 (see the sketch below).
  • Another option would be to make the default buffer size configurable at runtime.
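A rough sketch of such a capped strategy; the 100 MB threshold and the two factors are placeholders to tune, and this version only uses two tiers rather than a gradual slide toward 1.3:

#include <stddef.h>

/* double small buffers, grow big ones more conservatively */
size_t next_capacity(size_t cap)
{
    const size_t threshold = (size_t)100 * 1024 * 1024;   /* ~100 MB cutoff */

    if (cap < threshold)
        return cap * 2;          /* factor 2 while the buffer is small */
    return cap + cap / 2;        /* factor ~1.5 once it is large */
}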


Answer 7:

There's no way anyone can give good advice without knowing something about the allocations, runtime environment, execution characteristics, etc., etc.

Code that works is far more important than highly optimized code that is still under development. Choose some algorithm, any workable algorithm, and try it! If it proves suboptimal, then change the strategy. Putting this in the control of the library user often does them no favors. But if you already have some option scheme in place, adding it could be useful, unless you hit on a good algorithm (and exponential growth with a factor around 1.5 is a pretty good one).


Also, using a function named write in C (not C++) conflicts with the write() declared in <unistd.h> (POSIX) or <io.h> (Windows). It's fine as long as nothing includes those headers, but it makes them hard to add later. Best to use a more descriptive name.



Answer 8:

The point of using exponential growth (whether the factor is 1.5 or 2) is to avoid copies. Each time you realloc the array, you can trigger an implicit copy of the contents, which of course gets more expensive the larger the buffer gets. By using exponential growth you get an amortized constant number of recopies per element, i.e. you rarely end up copying.
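As a back-of-the-envelope illustration (not from the answer itself), the following toy program adds up how many bytes the reallocs would copy, in the worst case, while growing to one megabyte under each strategy; doubling copies roughly one buffer's worth in total, while a fixed 4 KB increment copies over a hundred times as much:

#include <stdio.h>

int main(void)
{
    const size_t target = 1000000;   /* grow until the buffer holds ~1 MB */
    size_t cap, copied;

    /* exponential growth, factor 2: each realloc copies at most cap bytes */
    copied = 0;
    for (cap = 16; cap < target; cap *= 2)
        copied += cap;
    printf("doubling:        ~%zu bytes copied\n", copied);

    /* fixed increment of 4096 bytes */
    copied = 0;
    for (cap = 4096; cap < target; cap += 4096)
        copied += cap;
    printf("fixed 4096 step: ~%zu bytes copied\n", copied);

    return 0;
}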

As long as you're running on a desktop computer of some kind, you can expect an essentially unlimited amount of memory, so time is probably the right side of that tradeoff. For hard real-time systems, you would probably want to find a way to avoid the copies altogether -- a linked list comes to mind.



Answer 9:

As a wild idea, for this specific case, you could change the API to require the caller to allocate the memory for each chunk, and then remember the chunks instead of copying the data.

Then, when it's time to actually produce the result, you know exactly how much memory is going to be needed and can allocate exactly that.

This has the benefit that the caller will need to allocate memory for the chunks anyway, and so you might as well make use of that. This also avoids copying data more than once.

It has the disadvantage that the caller has to dynamically allocate each chunk. To get around that, you could allocate memory for each chunk yourself and remember those, rather than keeping one large buffer that gets resized when it fills up. This way you copy the data twice (once into the chunk you allocate, once into the resulting string) but no more, whereas a buffer that has to be resized several times may end up copying the data more than twice.

Further, really large areas of free memory may be difficult for the memory allocator to find, and allocating smaller chunks may well be easier. There might not be space for a single one-gigabyte block, but there might be space for a thousand one-megabyte chunks.
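A minimal sketch of this chunk-recording approach, following the caller-allocates variant described first; the chunk_list_t type and function names are invented for illustration and error handling is omitted:

#include <stdlib.h>
#include <string.h>

typedef struct {
    const unsigned char * bytes;   /* caller-owned data */
    size_t len;
} chunk_t;

typedef struct {
    chunk_t * chunks;
    size_t count;
    size_t cap;
    size_t total_len;              /* running total, so the final size is known */
} chunk_list_t;

/* remember a chunk; the caller keeps ownership of the bytes */
void chunk_list_append(chunk_list_t * cl, const unsigned char * bytes, size_t len)
{
    if (cl->count == cl->cap) {
        cl->cap = cl->cap ? cl->cap + cl->cap / 2 : 8;   /* the small table of chunk records grows by ~1.5x */
        cl->chunks = realloc(cl->chunks, cl->cap * sizeof *cl->chunks);   /* error handling omitted */
    }
    cl->chunks[cl->count].bytes = bytes;
    cl->chunks[cl->count].len = len;
    cl->count++;
    cl->total_len += len;
}

/* one exact-size allocation and one copy per chunk */
unsigned char * chunk_list_join(const chunk_list_t * cl)
{
    unsigned char * out = malloc(cl->total_len);
    size_t off = 0, i;

    for (i = 0; i < cl->count; i++) {
        memcpy(out + off, cl->chunks[i].bytes, cl->chunks[i].len);
        off += cl->chunks[i].len;
    }
    return out;
}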



Tags: c buffer