How much memory will a list with one million eleme

Posted 2019-01-20 10:17

Question:

There are more than a million subreddits on Reddit, according to redditmetrics.com.

I wrote a script that repeatedly queries this Reddit API endpoint until all the subreddits are stored in an array, all_subs:

all_subs = []
for sub in <repeated request here>:
    all_subs.append({"name": display_name, "subscribers": subscriber_count})

The script has been running for close to ten hours, and it's about halfway done (it gets rate-limited every three or four requests). When it's finished, I expect an array like this:

[
    { "name": "AskReddit", "subscribers": 16751677 },
    { "name": "news", "subscribers": 13860169 },
    { "name": "politics", "subscribers": 3350326 },
    ... # plus one million more entries
]

Approximately how much space in memory will this list take up?

Answer 1:

This depends on your Python version and your system, but I can help you work out roughly how much memory it will take. First things first: sys.getsizeof only returns the memory used by the object representing the container, not by all the elements in the container. From the documentation for sys.getsizeof:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

If given, default will be returned if the object does not provide means to retrieve the size. Otherwise a TypeError will be raised.

getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

So, I've loaded up that recipe (it defines total_size, used further down) in an interactive interpreter session.
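The recipe itself isn't reproduced in this answer. As a rough stand-in, here is a simplified sketch of the same idea: walk the container recursively, add up sys.getsizeof for each object, and count every object only once by its id. This total_size is my own minimal version, not the recipe's code, and it only descends into dicts, lists, tuples, sets and frozensets, so treat its numbers as approximate.

```python
import sys

def total_size(obj, seen=None):
    """Approximate deep size of obj in bytes (simplified stand-in for the recipe)."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        # Already counted: shared objects contribute only once.
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

s = "x" * 100
lst = [s, s]  # the same string object referenced twice
print(sys.getsizeof(lst), total_size(lst))
```

Because s appears twice but is a single object, the deep size is the list's own size plus the string's size once.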

A CPython list is a heterogeneous, resizable array list. The underlying array contains only pointers to PyObjects, and a pointer takes up one machine word: on a 64-bit system that is 64 bits, or 8 bytes. So, just for the container, a list of 1,000,000 elements will take up roughly 8 million bytes, or 8 megabytes. Building a list with 1,000,000 entries bears that out:

In [5]: x = []

In [6]: for i in range(1000000):
   ...:     x.append([])
   ...:

In [7]: import sys

In [8]: sys.getsizeof(x)
Out[8]: 8697464

The extra memory is accounted for by the overhead of the Python list object itself and by the spare space that the underlying array leaves at the end to allow for efficient .append operations.
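If you want to see that over-allocation for yourself, one way (a small sketch, not part of the session above) is to record sys.getsizeof after every append and count how often the reported size actually changes. struct.calcsize("P") gives the pointer size of the running interpreter; the 1000-element list is just an arbitrary small example.

```python
import struct
import sys

pointer_size = struct.calcsize("P")  # bytes per pointer: 8 on a 64-bit build

x = []
sizes = []
for _ in range(1000):
    x.append(None)
    sizes.append(sys.getsizeof(x))

# Count how often the reported size changed between consecutive appends.
resizes = sum(1 for before, after in zip(sizes, sizes[1:]) if before != after)

print("pointer size:", pointer_size)
print("appends: 1000, size changes:", resizes)
print("bytes per slot at the end:", sizes[-1] / len(x))
```

The size changes far less often than once per append, because each reallocation reserves room for several future elements.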

Now, a dictionary is rather heavy-weight in Python. Just the container:

In [10]: sys.getsizeof({})
Out[10]: 288

So a lower bound on the size of 1 million dicts is 288,000,000 bytes. Adding the 8 bytes per list slot gives a rough lower bound for the whole structure:

In [12]: 1000000*288 + 1000000*8
Out[12]: 296000000

In [13]: 296000000 * 1e-9 # gigabytes
Out[13]: 0.29600000000000004
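The 288 above is what this particular interpreter reports; other CPython versions report different dict sizes, so it's worth redoing the arithmetic locally. A small sketch, assuming a two-key dict shaped like the ones in the question (the sample values are only illustrative):

```python
import struct
import sys

n = 1_000_000
pointer_size = struct.calcsize("P")  # one list slot per record

# A sample record shaped like the question's entries (illustrative values).
sample = {"name": "AskReddit", "subscribers": 16751677}
per_dict = sys.getsizeof(sample)  # container only, not the str/int it holds

lower_bound = n * (per_dict + pointer_size)
print("bytes per dict:", per_dict)
print("lower bound in bytes:", lower_bound)
print("lower bound in gigabytes:", lower_bound * 1e-9)
```

This still ignores the strings and integers stored in each dict, so the real figure will be higher.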

So you can expect about 0.3 gigabytes' worth of memory. Using the recipe and a more realistic dict:

In [16]: x = []
    ...: for i in range(1000000):
    ...:     x.append(dict(name="my name is what", subscribers=23456644))
    ...:

In [17]: total_size(x)
Out[17]: 296697669
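One caveat about that number: the loop stores the same string and integer constants in every dict, and a deep-size walk counts a shared object only once, so the values add almost nothing to the total. Real subreddit data has a different name and count per record. The sketch below compares the two cases on a smaller list; deep_size is a hypothetical simplified helper (not the recipe's total_size), and the generated names are made up.

```python
import sys

def deep_size(obj, seen):
    """Sum getsizeof over obj and its contents, counting each object once."""
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(deep_size(item, seen) for item in obj)
    return size

n = 1000
# Every dict points at the same two constant objects.
shared = [{"name": "my name is what", "subscribers": 23456644}
          for _ in range(n)]
# Every dict gets its own string and its own int object.
distinct = [{"name": "subreddit_%d" % i, "subscribers": 10_000_000 + i}
            for i in range(n)]

shared_total = deep_size(shared, set())
distinct_total = deep_size(distinct, set())
print("shared values:  ", shared_total)
print("distinct values:", distinct_total)
print("extra bytes per record:", (distinct_total - shared_total) / n)
```

The extra bytes per record give a feel for how much to add on top of the 0.3 gigabyte estimate for real data.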


So, about 0.3 gigs. That's not a lot on a modern system, but if you want to save space, use a tuple or, even better, a namedtuple:

In [24]: from collections import namedtuple

In [25]: Record = namedtuple('Record', "name subscribers")

In [26]: x = []
    ...: for i in range(1000000):
    ...:     x.append(Record(name="my name is what", subscribers=23456644))
    ...:

In [27]: total_size(x)
Out[27]: 72697556

Or, in gigabytes:

In [29]: total_size(x)*1e-9
Out[29]: 0.07269755600000001
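To see where the saving comes from, you can compare the container size of a single record in each representation. A quick sketch (the exact byte counts depend on your Python version, so none are shown here; the record values are taken from the question):

```python
import sys
from collections import namedtuple

Record = namedtuple("Record", "name subscribers")

name, subscribers = "AskReddit", 16751677
as_dict = {"name": name, "subscribers": subscribers}
as_tuple = (name, subscribers)
as_record = Record(name=name, subscribers=subscribers)

# Container sizes only; the str and int are the same objects in all three.
sizes = {
    "dict": sys.getsizeof(as_dict),
    "tuple": sys.getsizeof(as_tuple),
    "namedtuple": sys.getsizeof(as_record),
}
for kind, size in sizes.items():
    print(kind, size)
```

A namedtuple instance reports the same size as a plain tuple, since the field names live on the class rather than on each instance.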

namedtuple works just like a tuple, but you can access the fields with names:

In [30]: r = x[0]

In [31]: r.name
Out[31]: 'my name is what'

In [32]: r.subscribers
Out[32]: 23456644
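Since a namedtuple is still a tuple, the usual tuple operations keep working alongside the named access. A short sketch with a record from the question:

```python
from collections import namedtuple

Record = namedtuple("Record", "name subscribers")
r = Record(name="AskReddit", subscribers=16751677)

# Indexing and unpacking work exactly as with a plain tuple.
first = r[0]
name, subscribers = r

# _asdict() converts back to a mapping if you need one (e.g. for JSON).
as_dict = dict(r._asdict())

# Like any tuple, a record is immutable: assigning to a field fails.
try:
    r.subscribers = 0
    mutable = True
except AttributeError:
    mutable = False

print(first, name, subscribers, as_dict, mutable)
```

If you need to change a field, r._replace(subscribers=...) returns a new record instead of modifying the existing one.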