I have a huge (30GB) ndarray memory-mapped:
arr = numpy.memmap(afile, dtype=numpy.float32, mode="w+", shape=(n, n,))
After filling it in with some values (which goes very fine - max memory usage is below 1GB) I want to calculate standard deviation:
print('stdev: {0:4.4f}\n'.format(numpy.std(arr)))
This line fails with a MemoryError.

I am not sure why it fails. I would be grateful for tips on how to calculate the standard deviation in a memory-efficient way.
Environment: venv + Python3.6.2 + NumPy 1.13.1
Indeed, NumPy's implementations of std and mean make full copies of the array and are horribly memory-inefficient. Here is a better implementation:
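One memory-efficient approach is a two-pass computation over fixed-size chunks: one pass for the mean, one for the sum of squared deviations. The sketch below (function name and chunk size are my own choices, not from the original answer) keeps peak memory proportional to the chunk size rather than the array size, and accumulates in float64 to avoid precision loss on a 30GB float32 array:

```python
import numpy as np

def chunked_std(arr, chunk_size=1_000_000):
    """Standard deviation of a large (possibly memory-mapped) array,
    computed in two streaming passes over flat chunks."""
    flat = arr.reshape(-1)  # a view for contiguous memmaps, not a copy
    n = flat.size

    # Pass 1: mean, accumulated in float64 for numerical stability.
    total = 0.0
    for start in range(0, n, chunk_size):
        total += flat[start:start + chunk_size].sum(dtype=np.float64)
    mean = total / n

    # Pass 2: sum of squared deviations from the mean.
    sq_sum = 0.0
    for start in range(0, n, chunk_size):
        chunk = flat[start:start + chunk_size].astype(np.float64)
        sq_sum += ((chunk - mean) ** 2).sum()

    # Population standard deviation (ddof=0), matching numpy.std's default.
    return np.sqrt(sq_sum / n)
```

Only one chunk is materialized in memory at a time, so this works on a memmap far larger than RAM; numpy.std, by contrast, builds a full-size temporary when it subtracts the mean.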