I'm trying to create a matrix containing 2 708 000 000 elements. When I try to create a numpy array of this size, I get a ValueError. Is there any way I can increase the maximum array size?
import numpy as np

a = np.arange(2708000000)
ValueError Traceback (most recent call last)
ValueError: Maximum allowed size exceeded
A ValueError indicates that the size is too big for allocation, not that there is not enough memory. On my laptop, using 64-bit Python, I can allocate it if I reduce the number of bits per element:
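A minimal sketch of that allocation, assuming roughly 3 GB of free memory (np.zeros is used here instead of arange so the 1-byte values don't overflow; the point is only the size of the buffer):

import numpy as np

# Same element count, but 1 byte per element instead of 8
a = np.zeros(2708000000, dtype=np.int8)
print(a.nbytes / 1e9)  # about 2.7 GB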
In your case, arange defaults to int64, which takes 8 bytes per element: 8 times more than the int8 allocation above, or around 21.7 GB. A 32-bit process can only address around 4 GB of memory. The underlying reason is the size of the pointers used to access data and how many distinct addresses you can represent with that many bits:
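Roughly, the number of distinct addresses each pointer width can express (plain Python arithmetic, nothing numpy-specific):

print(2**32)  # 4294967296            -> about 4 GB addressable
print(2**64)  # 18446744073709551616  -> about 16 exbibytes addressable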
Note that I can replicate your ValueError if I try to create an absurdly large array:
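For instance, something like the following fails immediately, before any memory is touched, because the requested length cannot be represented by the platform's index type (the exact message may vary slightly between NumPy versions):

import numpy as np

np.arange(1e20)
# ValueError: Maximum allowed size exceeded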
If your machine has a lot of memory, as you said, it will be 64-bit, so you should install 64-bit Python to be able to access it. On the other hand, for such big datasets, you should consider the possibility of using out-of-core computation, which keeps most of the data on disk and loads it in chunks.
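For example, numpy's own memmap keeps the data in a file and only pages in the parts you touch. A rough sketch (the filename is made up, and the file it creates is about 10 GB, sparse on most filesystems):

import numpy as np

# 'big_array.dat' is a hypothetical filename; mode='w+' creates the file
big = np.memmap('big_array.dat', dtype=np.float32, mode='w+',
                shape=(2708000000,))
big[:10] = np.arange(10)  # only the touched pages are materialized in RAM
big.flush()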
I was able to create an array with a size of 6 billion elements that ate up 45 GB of memory. By default, numpy created the array with a dtype of float64. By dropping the precision, I was able to save a lot of memory:
default == float64
np.float64 -- 45.7 GB
np.float32 -- 22.9 GB
np.int8 -- 5.7 GB
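Those figures line up with what you can compute from the itemsize alone, without allocating anything (GiB here; the measured numbers above were read off the process's memory use, so they differ slightly):

import numpy as np

n = 6_000_000_000
for dt in (np.float64, np.float32, np.int8):
    print(np.dtype(dt).name, n * np.dtype(dt).itemsize / 2**30, 'GiB')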
Obviously an 8-bit integer can't store a value of 6 billion, but the memory savings are the point here. I'm sure a maximum size exists at some point, but I suspect it's far past anything possible in 2016. Interestingly, "Python Blaze" allows you to create numpy arrays on disk. I recall playing with it some time ago and creating an extremely large array that took up 1 TB of disk.
It is indeed related to the system's maximum address length; to put it simply, whether it is a 32-bit or a 64-bit system. Here is an explanation of these questions, originally from Mark Dickinson:
Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 GB of overhead for the size of array you describe. On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.
Longer explanation: you should be aware that Python objects themselves can be quite large, even simple objects like ints, floats and strings. In your code you're ending up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that's pointing to this string object in memory. So that's already 48 bytes per entry, or around 8.7 GB. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes per entry (I don't know how long your strings are).
Possible solutions:
(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.
(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each (a rough comparison follows below).
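A rough comparison of the three options in current Python 3 (exact numbers depend on the interpreter version and platform; a million floats is just an illustrative size):

import sys
from array import array
import numpy as np

n = 10**6  # illustrative element count
values = [float(i) for i in range(n)]

# list: the list's pointer array plus one boxed float object per entry
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# array and numpy: raw 8-byte doubles in a single contiguous buffer
array_bytes = sys.getsizeof(array('d', values))
numpy_bytes = np.asarray(values, dtype=np.float64).nbytes

print(list_bytes / n, array_bytes / n, numpy_bytes / n)  # bytes per entry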
Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container), the returned size doesn't include the sizes of the contained objects, only of the structure used to hold them. Here are some values on my machine:
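For example (treat these as probes to run yourself rather than fixed values, since the results vary with Python version and platform):

import sys

print(sys.getsizeof(0))          # a small int object
print(sys.getsizeof(1.0))        # a float object
print(sys.getsizeof(''))         # an empty str
print(sys.getsizeof([]))         # an empty list
print(sys.getsizeof([1, 2, 3]))  # the list structure only, not the contained ints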
You're trying to create an array with 2.7 billion entries. If you're running 64-bit numpy, at 8 bytes per entry, that would be about 20 GiB in all.
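The arithmetic, for reference:

n = 2708000000
print(n * 8 / 1e9)    # about 21.7 GB in decimal units
print(n * 8 / 2**30)  # about 20.2 GiB, which is what most OS memory readouts show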
So almost certainly you just ran out of memory on your machine. Beyond the addressing limits discussed above, there is no general maximum array size in numpy.