Performance issue with reading integers from a bin

Posted 2019-07-21 10:52

Question:

I have a file with integers stored in binary, and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code, but it's terribly slow compared to the F# version I wrote before.

import os, struct

def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index * 4, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            values.append(v)
    return values

For comparison here is the F# version:

open System
open System.IO

let readValue (reader:BinaryReader) cellIndex = 
    // set stream to correct location
    reader.BaseStream.Position <- cellIndex*4L
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)

let readValues fileName indices = 
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = List.map (readValue reader) (List.ofSeq indices)
    values

Any tips on how to improve the performance of the python version, e.g. by usage of numpy ?

Update

HDF5 works very well (from 5 seconds down to 0.8 seconds on my test file):

import tables
def read_values_hdf5(filename, indices):
    with tables.open_file(filename) as f:
        dset = f.root.raster
        return dset[indices]

Update 2

I went with the np.memmap because the performance is similar to hdf5 and I already have numpy in production.

Answer 1:

Depending heavily on your index file size, you might want to read it completely into a numpy array. If the file is not large, a complete sequential read may be faster than a large number of seeks.

One problem with the seek operations is that Python operates on buffered input. If the program were written in some lower-level language, the use of unbuffered IO would be a good idea, as you only need a few values.
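The unbuffered-IO idea above can actually be done from Python too. Here is a minimal sketch, assuming a POSIX system with Python 3.3+ (`os.pread` does a positioned read on a raw file descriptor, bypassing Python's buffering); the function name and the little-endian `'<i'` format are my own illustrative choices:

```python
import os
import struct

def read_values_unbuffered(filename, indices):
    # open a raw file descriptor: no Python-level buffering involved
    fd = os.open(filename, os.O_RDONLY)
    try:
        # os.pread reads at a byte offset without moving the file
        # position, so no separate seek call is needed per value
        return [struct.unpack('<i', os.pread(fd, 4, 4 * i))[0]
                for i in indices]
    finally:
        os.close(fd)
```

Whether this beats buffered `seek`/`read` depends on the access pattern; it mainly saves the buffered layer's bookkeeping per call.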

import numpy as np

# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
values = index_array[indices]

If you would read almost all pages anyway (i.e. your indices are random and occur at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file and only want to pick a few indices, it is not so fast.

Then there is one more possibility, which might be the fastest: the Python mmap module. With it the file is memory-mapped, and only the pages actually required are accessed.

It should be something like this:

import mmap
import struct

with open("my_index", "rb") as f:
    memory_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i in indices:
        # the value at position i (struct.unpack returns a 1-tuple)
        idx_value = struct.unpack('I', memory_map[4*i:4*i+4])[0]

(Note, I did not actually test that one, so there may be typos. Also, I did not care about endianness, so please check it is correct.)

Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:

import numpy as np

index_arr = np.memmap(filename, dtype='uint32', mode='r')
values = index_arr[indices]

I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.


EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.

What is mmap?

Memory-mapped files are not something uniquely pythonic; memory mapping is defined in the POSIX standard. It is a way to use devices or files as if they were just areas of memory.

File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.

The delay in random file access is mostly due to the physical rotation of the disk (SSDs are another story). On average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms plus any data-handling delay. The overhead introduced by using Python instead of a compiled language is negligible compared to this delay.

If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient alternative, because blocks are loaded exactly when you need them and remain in the cache for further use. (This could in principle happen with fseek as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is in any case some overhead as the call wanders through the operating system.)

mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
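As a small illustration of the no-file case mentioned above, here is a minimal POSIX-only sketch (it uses `os.fork`, so it will not run on Windows): passing `-1` as the file descriptor creates an anonymous shared mapping, so a write made by the child process is visible to the parent:

```python
import mmap
import os
import struct

# anonymous mapping: fileno -1 means no file backs the map,
# and the default flags make it shared across fork()
buf = mmap.mmap(-1, 4)

pid = os.fork()
if pid == 0:
    # child: write a value into the shared map and exit
    buf[0:4] = struct.pack('<i', 42)
    os._exit(0)

os.waitpid(pid, 0)
# parent: sees the child's write because the mapping is shared
value = struct.unpack('<i', buf[0:4])[0]
```

Real code would normally use a synchronization primitive rather than `waitpid` to coordinate the two processes; this only shows the sharing itself.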

mmap is not very well known despite its usefulness and relative ease of use. It has, however, one important gotcha: the file size has to remain constant. If it changes while the file is mapped, odd things may happen.



Answer 2:

Is the indices list sorted? I think you could get better performance if the list were sorted, as you would make far fewer disk seeks.
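The suggestion can be sketched as follows: visit the file in ascending offset order, then put the results back in the caller's order. This is my own illustrative variant of the question's `read_values`, not code from the thread:

```python
import os
import struct

def read_values_sorted(filename, indices):
    # permutation that visits indices in ascending file order
    order = sorted(range(len(indices)), key=lambda k: indices[k])
    out = [None] * len(indices)
    with open(filename, 'rb') as f:
        for k in order:
            # monotonically increasing seeks: the disk head (or the
            # OS read-ahead cache) never has to jump backwards
            f.seek(4 * indices[k], os.SEEK_SET)
            out[k] = struct.unpack('<i', f.read(4))[0]
    return out
```

The OP notes the indices are already sorted and unique, in which case the permutation step is unnecessary; it only matters for arbitrary input order.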