Memory-Mapping Slows Down Over Time, Alternatives?

Posted 2019-04-05 21:41

Question:

I have around 700 matrices stored on disk, each with around 70k rows and 300 columns.

I have to load parts of these matrices relatively quickly, around 1k rows per matrix, into another matrix I have in memory. The fastest way I have found to do this is with memory maps; initially I am able to load the 1k rows in around 0.02 seconds. However, performance is not consistent at all, and sometimes loading takes up to 1 second per matrix!

My code looks like this roughly:

import os
import numpy as np

target = np.zeros((7000, 300))
target.fill(-1)  # allocate memory

for fname in os.listdir(folder_with_memmaps):
    path = os.path.join(folder_with_memmaps, fname)  # listdir yields bare filenames
    X = np.memmap(path, dtype=_DTYPE_MEMMAPS, mode='r', shape=(70000, 300))
    indices_in_target = ...  # some magic
    indices_in_X = ...       # some magic
    target[indices_in_target, :] = X[indices_in_X, :]

With line-by-line timing I determined that it is definitely the last line that slows down over time.
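
For reference, a minimal sketch (not the original measurement code) of how each per-file copy can be timed, reusing the placeholder names from the snippet above:

import time

load_times = []
for fname in os.listdir(folder_with_memmaps):
    path = os.path.join(folder_with_memmaps, fname)
    X = np.memmap(path, dtype=_DTYPE_MEMMAPS, mode='r', shape=(70000, 300))
    indices_in_target = ...  # some magic, as above
    indices_in_X = ...       # some magic, as above
    t0 = time.perf_counter()
    target[indices_in_target, :] = X[indices_in_X, :]  # time only the copy
    load_times.append(time.perf_counter() - t0)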


Update: Plotting the load times gives different results across runs. One time it looked like this, i.e. the degradation was not gradual but instead jumped after precisely 400 files. Could this be some OS limit?

But another time it looked completely different:

After a few more test runs, it seems that the second plot is rather typical of how the performance develops.


Also, I tried to del X after the loop, without any impact. Closing the underlying Python mmap via X._mmap.close() did not help either.
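
For reference, a minimal sketch of the cleanup that was attempted (same placeholder names as above); neither variant changed the timings:

# attempt 1: drop the reference to the memmap
del X

# attempt 2 (in a separate run): close the underlying Python mmap object explicitly
X._mmap.close()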


Any ideas why the performance is so inconsistent? Are there any faster alternatives for storing and retrieving these matrices?

Answer 1:

HDDs are poor at "serving more than one master" -- the slowdown can be much larger than one might expect. To demonstrate, I used this code to read the backup files (about 50 MB each) on the HDD of my Ubuntu 12.04 machine:

import os, random, time

bdir = '/hdd/backup/'
fns = os.listdir(bdir)

while True:
    fn = random.choice(fns)
    if not fn.startswith("duplicity-full."):
        continue
    ts = time.time()
    with open(os.path.join(bdir, fn), 'rb') as f:
        c = f.read()
    print("MB/s: %.1f" % (len(c) / (1000000 * (time.time() - ts))))

Running one of these "processes" gives me decent read performance:

MB/s: 148.6
MB/s: 169.1
MB/s: 184.1
MB/s: 188.1
MB/s: 185.3
MB/s: 146.2

Adding a second such process in parallel slows things down by more than an order of magnitude:

MB/s: 14.3
MB/s: 11.6
MB/s: 12.7
MB/s: 8.7
MB/s: 8.2
MB/s: 15.9

My guess is that this (i.e., other concurrent use of the HDD) is the reason for your inconsistent performance. My hunch is that an SSD would do significantly better. On my machine, for large files on an SSD, the slowdown due to a parallel reader process was only twofold, from about 440 MB/s to about 220 MB/s. (See my comment.)
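
The answer does not show how the second reader was launched (presumably the same script in a second shell). A hedged sketch of reproducing the two-reader case in one script with multiprocessing, under the same directory and filename-prefix assumptions:

import multiprocessing as mp
import os, random, time

BDIR = '/hdd/backup/'

def reader(label, n_reads=6):
    fns = [fn for fn in os.listdir(BDIR) if fn.startswith("duplicity-full.")]
    for _ in range(n_reads):
        fn = random.choice(fns)
        ts = time.time()
        with open(os.path.join(BDIR, fn), 'rb') as f:
            c = f.read()
        print("%s MB/s: %.1f" % (label, len(c) / (1000000 * (time.time() - ts))))

if __name__ == '__main__':
    readers = [mp.Process(target=reader, args=("reader-%d" % i,)) for i in range(2)]
    for p in readers:
        p.start()
    for p in readers:
        p.join()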



Answer 2:

You might consider using bcolz. It compresses numerical data on disk and in memory to speed things up. You may have to transpose the matrices in order to get a sparse read, since bcolz stores data by column rather than by row.
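
A minimal sketch of that approach (path, shape, and row selection are hypothetical, and whether transposing pays off depends on your access pattern): write each matrix into an on-disk carray once, then open it read-only and pull out the needed rows. If fancy indexing with an integer list is not available in your bcolz version, the rows can be gathered with per-row slices instead.

import bcolz
import numpy as np

# one-time conversion: store a 70000 x 300 matrix as a compressed, chunked carray
X = np.random.rand(70000, 300)
ca = bcolz.carray(X, rootdir='/tmp/matrix0.bcolz', mode='w')
ca.flush()

# later: open read-only and read ~1k rows
ca = bcolz.open('/tmp/matrix0.bcolz', mode='r')
rows = sorted(np.random.choice(70000, size=1000, replace=False).tolist())
subset = ca[rows]      # should only decompress the chunks containing these rows
print(subset.shape)    # (1000, 300)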