I'm trying to unpack a binary vector of 140 million bits into a list. I'm checking the memory usage of this function, and it looks strange: the memory usage rises to 35 GB (GB, not MB). How can I reduce the memory usage?
sub bin2list {
# This sub translates a binary vector to a list of "1","0"
my $vector = shift;
my @unpacked = split //, (unpack "B*", $vector );
return @unpacked;
}
A single integer value in Perl is going to be stored in an SVt_IV or SVt_UV scalar, whose size will be four machine-sized words - so on a 32bit machine, 16 bytes. An array of 140 million of those, therefore, is going to consume 2.2 billion bytes, presuming it is densely packed together. Add to that the SV * pointers in the AvARRAY used to reference them and we're now at 2.8 billion bytes. Now double that, because you copied the array when you returned it, and we're now at 5.6 billion bytes.
That of course was on a 32bit machine - on a 64bit machine we're at double again, so 11.2 billion bytes. This presumes totally dense packing inside the memory - in practice this will be allocated in stages and chunks, so RAM fragmentation will further add to this. I could imagine a total size around the 35 billion byte mark for this. It doesn't sound outlandishly unreasonable.
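The arithmetic above can be reproduced with a short script. This is only a rough sketch: the sizes are the figures assumed in this answer (four machine words per scalar, one word per array slot), not measurements, and estimate_bytes is a hypothetical helper name.

```perl
use strict;
use warnings;

# Rough estimate only: sizes are the assumptions stated in the text
# (four machine words per scalar, one pointer per array slot).
sub estimate_bytes {
    my ($count, $word_bytes, $copies) = @_;
    my $scalar_bytes  = 4 * $word_bytes;   # one SVt_IV/SVt_UV scalar
    my $pointer_bytes = $word_bytes;       # one SV * slot in AvARRAY
    return $count * ($scalar_bytes + $pointer_bytes) * $copies;
}

my $count = 140_000_000;
for my $word_bytes (4, 8) {
    printf "%d-bit: one copy %.1f billion bytes, two copies %.1f billion bytes\n",
        $word_bytes * 8,
        estimate_bytes($count, $word_bytes, 1) / 1e9,
        estimate_bytes($count, $word_bytes, 2) / 1e9;
}
```

Fragmentation and allocator overhead are deliberately left out, so the real figure will be higher.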
For a very easy way to massively reduce the memory usage (not to mention CPU time required), rather than returning the array itself as a list, return a reference to it. Then a single reference is returned rather than a huge list of 140 million SVs; this avoids a second copy also.
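A minimal sketch of that change, assuming the same unpacking logic as the question; bin2list_ref is a hypothetical name for the modified sub, and the one-byte input is only there to make the example runnable.

```perl
use strict;
use warnings;

# Same unpacking as bin2list, but returns an array reference so the
# caller does not receive a second copy of every element.
sub bin2list_ref {
    my $vector   = shift;
    my @unpacked = split //, unpack("B*", $vector);
    return \@unpacked;
}

# Tiny illustrative input: a single byte with the bit pattern 10110000.
my $bits = bin2list_ref(pack("B*", "10110000"));
print scalar(@$bits), " bits, first is $bits->[0]\n";
```

The caller then works through the reference ($bits->[$i], or @$bits for the whole list) instead of holding its own copy of the array.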
Scalars contain a lot of information.
In order to keep them as small as possible, a scalar consists of two memory blocks[1], a fixed-sized head, and a body that can be "upgraded" to contain more information.
The smallest type of scalar that can contain a string (such as the ones returned by split) is an SVt_PV. (It's usually called PV, but PV can also refer to the name of the field that points to the string buffer, so I'll go with the name of the constant.)
The first block is the head.
ANY is a pointer to the body.
REFCNT is a reference count that allows Perl to know when the scalar can be deallocated.
FLAGS contains information about what the scalar actually contains. (e.g. SVf_POK means the scalar contains a string.)
TYPE contains information about the type of scalar (what kind of information it can contain).
In an SVt_PV, the last field points to the string buffer.
The second block is the body. The body of an SVt_PV has the following fields:
STASH is not used in the scalars in question since they're not objects.
MAGIC is not used for the scalars in question. Magic allows code to be called when the variable is accessed.
CUR is the length of the string in the buffer.
LEN is the length of the string buffer. Perl over-allocates to speed up concatenation.
The block on the right is the string buffer. As you might have noticed, Perl over-allocates. This speeds up concatenation.
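The CUR and LEN fields can be inspected from Perl itself. The sketch below uses the core B module; cur_and_len is a hypothetical helper name, and the exact LEN you see depends on the Perl version and build.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: returns (CUR, LEN) for the string scalar
# that $ref points to, read through the core B module.
sub cur_and_len {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return ($sv->CUR, $sv->LEN);
}

# One of the single-character strings that split produces.
my ($first) = split //, "10";
my ($cur, $len) = cur_and_len(\$first);
print "CUR=$cur LEN=$len\n";
```

For a one-character string CUR is 1, while LEN is normally larger, which is the over-allocation described above.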
Ignore the block on the bottom. It's an alternative to the string buffer format for special strings (e.g. hash keys).
How much does that add up to?
That's just for the scalar itself. It doesn't take into account the overhead in the memory allocation system of three memory blocks.
These scalars are in an array. An array is really just a scalar.
So an array has overhead.
That's an empty array. You have 140 million of the scalars in yours, so it needs a buffer that can contain 140 million pointers. (In this particular case, the array won't be over-allocated, at least.) Each pointer is 4 bytes on a 32-bit system, 8 on a 64.
That brings the total up to:
That doesn't factor in the memory allocation overhead, but it's still very different from the numbers you gave. Why? Well, the scalars returned by split are actually different than the scalars inside the array. So for a moment, you actually have 280,000,000 scalars in memory!
The rest of the memory is probably held by lexical variables in subs that aren't currently executing. Lexical variables aren't normally freed on scope exit since it's expected that the sub will need the memory the next time it's called. That means bin2list continues to use up 140MB of memory after it exits.
Footnotes
[1] SVt_PV stores the pointer to the string buffer.
The images are from illguts. They are protected by copyright.