Calculating the real size of a Python string

Posted 2019-09-01 15:41

Question:

First of all, these are my computer specs:

Memory - https://gist.github.com/vyscond/6425304

CPU - https://gist.github.com/vyscond/6425322

This morning I tested the following two code snippets:

code A

a = 'a' * 1000000000

and code B

a = 'a' * 10000000000

Code A works fine, but code B gives me this error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

So I started researching ways to measure the size of data in Python.

The first thing I found is the classic built-in function len().

For code A, len() returned 1000000000, but for code B the same MemoryError was raised.

After this I wanted more precise numbers, so I looked for another tool and found a function in the sys module called getsizeof(). With this function I ran the same test on code A:

import sys
sys.getsizeof('a' * 1000000000)

The result returned is 1000000037 (in bytes).

  • question 1 - does this mean 0.9313226090744 gigabytes?

Then I checked the number of bytes used by a string with a single character 'a':

sys.getsizeof( 'a' )

The result returned is 38 (in bytes).

  • question 2 - does this mean that a string composed of 1000000000 'a' characters will take 38 * 1000000000 = 38,000,000,000 bytes?

  • question 3 - does this mean we need 35.390257835388 gigabytes to hold a string like this?

I would like to know where the error in this reasoning is, because this does not make any sense to me '-'

Answer 1:

Python objects have a minimal size, the overhead of keeping several pieces of bookkeeping data attached to the object.
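As a quick illustration of that fixed overhead, the sketch below asks sys.getsizeof() for the size of a few empty or minimal built-in objects. It assumes CPython 3; the exact numbers depend on the Python version and platform, so none are shown here, and the choice of sample types is only for illustration.

```python
import sys

# Empty or minimal objects of a few built-in types; none of them is free.
samples = {"int": 0, "float": 0.0, "tuple": (), "list": [], "dict": {}}
sizes = {name: sys.getsizeof(obj) for name, obj in samples.items()}

for name in sorted(sizes):
    print(name, sizes[name], "bytes")
```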

A Python str object is no exception. Take a look at the difference between a string with no, one, two and three characters:

>>> import sys
>>> sys.getsizeof('')
37
>>> sys.getsizeof('a')
38
>>> sys.getsizeof('aa')
39
>>> sys.getsizeof('aaa')
40

The Python str object overhead is 37 bytes on my machine, but each character in the string only takes one byte over the fixed overhead.
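The same measurement can be made without hard-coding any numbers. This is a minimal sketch assuming CPython 3 and ASCII-only strings; the overhead it prints will likely differ from the 37 bytes above, since that value depends on the Python version and build.

```python
import sys

# Fixed overhead: the size of an empty string (varies by version and platform).
overhead = sys.getsizeof('')

# Per-character cost: how much the size grows for each extra ASCII character.
sizes = [sys.getsizeof('a' * n) for n in range(1, 6)]
per_char = [b - a for a, b in zip(sizes, sizes[1:])]

print("overhead:", overhead, "bytes")
print("growth per extra character:", per_char)
```

The growth list should show one byte per added character, which is the point: the overhead is paid once, not once per character.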

Thus, a str value with 1000 million characters requires 1000 million bytes plus 37 bytes of overhead. That is indeed about 0.931 gigabytes (counting a gigabyte as 2^30 bytes).
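That arithmetic can be written down as a small estimate that never allocates the big string. The helper name estimated_str_size and its bytes_per_char parameter are hypothetical, introduced only for this sketch, which assumes CPython 3 and a string of ASCII characters (one byte each).

```python
import sys

GIB = 1024.0 ** 3  # bytes per gigabyte, counted as 2**30


def estimated_str_size(n_chars, bytes_per_char=1):
    """Estimate the bytes needed for a str of n_chars without allocating it."""
    # Hypothetical helper: fixed per-object overhead plus one byte per character.
    return sys.getsizeof('') + n_chars * bytes_per_char


size_a = estimated_str_size(10 ** 9)
print(size_a, "bytes is about", round(size_a / GIB, 3), "GiB")
```

For small strings the estimate can be checked against the real thing, for example by comparing estimated_str_size(1000) with sys.getsizeof('a' * 1000).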

Your sample code 'B' created ten times more characters, so you needed nearly 10 gigabytes of memory just to hold that one string, not counting the rest of Python, the OS, and whatever else might be running on that machine.
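To make the error in the original reasoning concrete, the sketch below puts the two calculations side by side for both snippets: multiplying the size of a one-character string by the length (which counts the overhead once per character) versus adding the overhead once. It assumes CPython 3 and ASCII characters, and it only does arithmetic, so it is safe to run on a machine that cannot hold the strings themselves.

```python
import sys

GIB = 1024.0 ** 3                  # bytes per gigabyte, counted as 2**30
overhead = sys.getsizeof('')       # paid once per string object
one_char = sys.getsizeof('a')      # overhead plus one character

results = {}
for label, n in (("code A", 10 ** 9), ("code B", 10 ** 10)):
    wrong = one_char * n           # counts the overhead once per character
    right = overhead + n           # counts the overhead once per string
    results[label] = (wrong, right)
    print("%s: wrong %.2f GiB, right %.2f GiB" % (label, wrong / GIB, right / GIB))
```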