How does PHP memory actually work

2020-02-08 15:57发布

问题:

I've always heard and searched for new php 'good writing practice', for example: It's better (for performance) to check if array key exists than search in array, but also it seems better for memory too:

Assuming we have:

$array = array
(
    'one'   => 1,
    'two'   => 2,
    'three' => 3,
    'four'  => 4,
);

this allocates 1040 bytes of memory,

and

$array = array
(
    1 => 'one',
    2 => 'two',
    3 => 'three',
    4 => 'four',
);

requires 1136 bytes

I understand that the key and value surely will have different storing mechanism, but please can you actually point me to the principle how does it work?

Example 2 (for @teuneboon):

$array = array
(
    'one'   => '1',
    'two'   => '2',
    'three' => '3',
    'four'  => '4',
);

1168 bytes

$array = array
(
    '1' => 'one',
    '2' => 'two',
    '3' => 'three',
    '4' => 'four',
);

1136 bytes

consuming same memory:

  • 4 => 'four',
  • '4' => 'four',

回答1:

Note, answer below is applicable for PHP prior to version 7 as in PHP 7 major changes were introduced which also involve values structures.

TL;DR

Your question is not actually about "how memory works in PHP" (here, I assume, you meant "memory allocation"), but about "how arrays work in PHP" - and these two questions are different. To summarize what's written below:

  • PHP arrays aren't "arrays" in classical sense. They are hash-maps
  • Hash-map for PHP array has specific structure and uses many additional storage things, such as internal links pointers
  • Hash-map items for PHP hash-map also use additional fields to store information. And - yes, not only string/integer keys matters, but also what are strings themselves, which are used for your keys.
  • Option with string keys in your case will "win" in terms of memory amount because both options will be hashed into ulong (unsigned long) keys hash-map, so real difference will be in values, where string-keys option has integer (fixed-length) values, while integer-keys option has strings (chars-dependent length) values. But that may not always will be true due to possible collisions.
  • "String-numeric" keys, such as '4', will be treated as integer keys and translated into integer hash result as it was integer key. Thus, '4'=>'foo' and 4 => 'foo' are same things.

Also, important note: the graphics here are copyright of PHP internals book

Hash-map for PHP arrays

PHP arrays and C arrays

You should realize one very important thing: PHP is written on C, where such things as "associative array" simply does not exist. So, in C "array" is exactly what "array" is - i.e. it's just a consecutive area in memory which can be accessed by a consecutive offset. Your "keys" may be only numeric, integer and only consecutive, starting from zero. You can't have, for instance, 3,-6,'foo' as your "keys" there.

So to implement arrays, which are in PHP, there's hash-map option, it uses hash-function to hash your keys and transform them to integers, which can be used for C-arrays. That function, however, will never be able to create a bijection between string keys and their integer hashed results. And it's easy to understand why: because cardinality of strings set is much, much larger that cardinality of integer set. Let's illustrate with example: we'll recount all strings, up to length 10, which have only alphanumeric symbols (so, 0-9, a-z and A-Z, total 62): it's 6210 total strings possible. It's around 8.39E+17. Compare it with around 4E+9 which we have for unsigned integer (long integer, 32-bits) type and you'll get the idea - there will be collisions.

PHP hash-map keys & collisions

Now, to resolve collisions, PHP will just place items, which have same hash-function result, into one linked list. So, hash-map would not be just "list of hashed elements", but instead it will store pointers to lists of elements (each element in certain list will have same hash-function key). And this is where you have point to how it will affect memory allocation: if your array has string keys, which did not result in collisions, then no additional pointers inside those list would be needed, so memory amount will be reduced (actually, it's a very small overhead, but, since we're talking about precise memory allocation, this should be taken to account). And, same way, if your string keys will result into many collisions, then more additional pointers would be created, so total memory amount will be a bit more.

To illustrate those relations within those lists, here's a graphic:

Above there is how PHP will resolve collisions after applying hash-function. So one of your question parts lies here, pointers inside collision-resolution lists. Also, elements of linked lists are usually called buckets and the array, which contains pointers to heads of those lists is internally called arBuckets. Due to structure optimization (so, to make such things as element deletion, faster), real list element has two pointers, previous element and next element - but that's only will make difference in memory amount for non-collision/collision arrays little wider, but won't change concept itself.

One more list: order

To fully support arrays as they are in PHP, it's also needed to maintain order, so that is achieved with another internal list. Each element of arrays is a member of that list too. It won't make difference in terms of memory allocation, since in both options this list should be maintained, but for full picture, I'm mentioning this list. Here's the graphic:

In addition to pListLast and pListNext, pointers to order-list head and tail are stored. Again, it's not directly related to your question, but further I'll dump internal bucket structure, where these pointers are present.

Array element from inside

Now we're ready to look into: what is array element, so, bucket:

typedef struct bucket {
    ulong h;
    uint nKeyLength;
    void *pData;
    void *pDataPtr;
    struct bucket *pListNext;
    struct bucket *pListLast;
    struct bucket *pNext;
    struct bucket *pLast;
    char *arKey;
} Bucket;

Here we are:

  • h is an integer (ulong) value of key, it's a result of hash-function. For integer keys it is just same as key itself (hash-function returns itself)
  • pNext / pLast are pointers inside collision-resolution linked list
  • pListNext/pListLast are pointers inside order-resolution linked list
  • pData is a pointer to the stored value. Actually, value isn't same as inserted at array creation, it's copy, but, to avoid unnecessary overhead, PHP uses pDataPtr (so pData = &pDataPtr)

From this viewpoint, you may get next thing to where difference is: since string key will be hashed (thus, h is always ulong and, therefore, same size), it will be a matter of what is stored in values. So for your string-keys array there will be integer values, while for integer-keys array there will be string values, and that makes difference. However - no, it isn't a magic: you can't "save memory" with storing string keys such way all the times, because if your keys would be large and there will be many of them, it will cause collisions overhead (well, with very high probability, but, of course, not guaranteed). It will "work" only for arbitrary short strings, which won't cause many collisions.

Hash-table itself

It's already been spoken about elements (buckets) and their structure, but there's also hash-table itself, which is, in fact, array data-structure. So, it's called _hashtable:

typedef struct _hashtable {
    uint nTableSize;
    uint nTableMask;
    uint nNumOfElements;
    ulong nNextFreeElement;
    Bucket *pInternalPointer;   /* Used for element traversal */
    Bucket *pListHead;
    Bucket *pListTail;
    Bucket **arBuckets;
    dtor_func_t pDestructor;
    zend_bool persistent;
    unsigned char nApplyCount;
    zend_bool bApplyProtection;
#if ZEND_DEBUG
    int inconsistent;
#endif
} HashTable;

I won't describe all the fields, since I've already provided much info, which is only related to the question, but I'll describe this structure briefly:

  • arBuckets is what was described above, the buckets storage,
  • pListHead/pListTail are pointers to order-resolution list
  • nTableSize determines size of hash-table. And this is directly related to memory allocation: nTableSize is always power of 2. Thus, it's no matter if you'll have 13 or 14 elements in array: actual size will be 16. Take that to account when you want to estimate array size.

Conclusion

It's really difficult to predict, will one array be larger than another in your case. Yes, there are guidelines which are following from internal structure, but if string keys are comparable by their length to integer values (like 'four', 'one' in your sample) - real difference will be in such things as - how many collisions occurred, how many bytes were allocated to save the value.

But choosing proper structure should be matter of sense, not memory. If your intention is to build the corresponding indexed data, then choice always be obvious. Post above is only about one goal: to show how arrays actually work in PHP and where you can find the difference in memory allocation in your sample.

You may also check article about arrays & hash-tables in PHP: it's Hash-tables in PHP by PHP internals book: I've used some graphics from there. Also, to realize, how values are allocated in PHP, check zval Structure article, it may help you to understand, what will be differences between strings & integers allocation for values of your arrays. I didn't include explanations from it here, since much more important point for me - is to show array data structure and what may be difference in context of string keys/integer keys for your question.



回答2:

Although both arrays are accessed in a different way (i.e. via string or integer value), the memory pattern is mostly similar.

This is because the string allocation either happens as part of the zval creation or when a new array key needs to be allocated; the small difference being that numeric indices don't require a whole zval structure, because they're stored as an (unsigned) long.

The observed differences in memory allocation are so minimal that they can be largely attributed to either the inaccuracy of memory_get_usage() or allocations due to additional bucket creation.

Conclusion

How you want to use your array must be the guiding principle in choosing how it should be indexed; memory should only become an exception to this rule when you run out of it.



回答3:

From PHP manual Garbage Collection http://php.net/manual/en/features.gc.php

gc_enable(); // Enable Garbage Collector
var_dump(gc_enabled()); // true
var_dump(gc_collect_cycles()); // # of elements cleaned up
gc_disable(); // Disable Garbage Collector

PHP does not return released memory very well; Its primary usage online does not require it and effective garbage collection takes time away from providing the output; When the script ends the memory is going to be returned anyway.

Garbage collection happens.

  1. When you tell it to

    int gc_collect_cycles ( void )

  2. When you leave a function

  3. When the script ends

Better understanding of PHP's Garbage collection from a web host, (no affiliation). http://www.sitepoint.com/better-understanding-phps-garbage-collection/

If you are considering byte by byte how the data is set in memory. Different ports are going to effect those values. 64bit CPUs performance is best when data sits on the first bit of a 64bit word. For the max performance a specific binary they would allocate the start of a block of memory on the first bit, leaving up to 7 bytes unused. This CPU specific stuff depends on what compiler was used to compile the PHP.exe. I can not offer any way to predict exact memory usage, given that it will be determined differently by different compilers.

Alma Do, post goes to the specifics of the source which is sent to the compiler. What the PHP source requests and the compiler optimizes.

Looking at the specific examples you posted. When the key is a ascii letter they are taking 4 bytes (64 bits) more per entry ... this suggests to me, (assuming no garbage or memory holes, ect), that the ascii keys are greater than 64 bits, but the numeric keys are fit in a 64bit word. It suggests to me your using a 64bit computer and your PHP.exe is compiled for 64bit CPUs.



回答4:

Arrays in PHP are implemented as hashmaps. Hence the length of the value you use for the key has little impact on the data requirement. In older versions of PHP there was a significant performance degradation with large arrays as the hash size was fixed at array creation - when collisions starting occurring then increasing numbers of hash values would map to linked lists of values which then had to be further searched (with an O(n) algorithm) instead of a single value, but more recently the hash appears to either use a much larger default size or is resized dynamically (it just works - I can't really be bothered reading the source code).

Saving 4 bytes from your scripts is not going to cause Google any sleepless nights. If you are writing code which uses large arrays (where the savings may be more significant) you're probably doing it wrong - the time and resource taken to fill up the array could be better spent elsewhere (like indexed storage).