How much memory does this byte string actually tak

2019-07-09 08:43发布

问题:

My understanding is that os.urandom(size) outputs a random string of bytes of the given "size", but then:

import os
import sys

print(sys.getsizeof(os.urandom(42)))

>>>
75

Why is this not 42?

And a related question:

import base64
import binascii


print(sys.getsizeof(base64.b64encode(os.urandom(42))))
print(sys.getsizeof(binascii.hexlify(os.urandom(42))))

>>>
89
117

Why are these so different? Which encoding would be the most memory efficient way to store a string of bytes such as that given by os.urandom?

Edit: It seems like quite a stretch to say that this question is a duplicate of What is the difference between len() and sys.getsizeof() methods in python? My question is not about the difference between len() and getsizeof(). I was confused about the memory used by Python objects in general, which the answer to this question has clarified for me.

回答1:

Python byte string objects are more than just the characters that comprise them. They are fully fledged objects. As such they require more space to accommodate the object's components such as the type pointer (needed to identify what kind of object the bytestring even is) and the length (needed for efficiency and because Python bytestrings can contain null bytes).

The simplest object, an object instance, requires space:

>>> sys.getsizeof(object())
16

The second part of your question is simply because the strings produced by b64encode() and hexlify() have different lengths; the latter being 28 characters longer which, unsurprisingly, is the difference in the values reported by sys.getsizeof().

>>> s1 = base64.b64encode(os.urandom(42))
>>> s1
b'CtlMjDM9q7zp+pGogQci8gr0igJsyZVjSP4oWmMj2A8diawJctV/8sTa'
>>> s2 = binascii.hexlify(os.urandom(42))
>>> s2
b'c82d35f717507d6f5ffc5eda1ee1bfd50a62689c08ba12055a5c39f95b93292ddf4544751fbc79564345'

>>> len(s2) - len(s1)
28
>>> sys.getsizeof(s2) - sys.getsizeof(s1)
28

Unless you use some form of compression, there is no encoding that will be more efficient than the binary string that you already have, and this is particularly true in this case because the data is random which is inherently incompressible.