可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have got a problem (with my RAM) here: it's not able to hold the data I want to plot. I do have sufficient HD space. Is there any solution to avoid that "shadowing" of my data-set?

Concretely I deal with Digital Signal Processing and I have to use a high sample-rate. My framework (GNU Radio) saves the values (to avoid using too much disk space) in binary. I unpack it. Afterwards I need to plot. I need the plot zoomable, and interactive. And that is an issue.

Is there any optimization potential to this, or another software/programming language (like R or so) which can handle larger data-sets? Actually I want much more data in my plots. But I have no experience with other software. GNUplot fails, with a similar approach to the following. I don't know R (jet).

import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
import struct

"""
plots a cfile

cfile - IEEE single-precision (4-byte) floats, IQ pairs, binary
txt - index,in-phase,quadrature in plaintext

note: directly plotting with numpy results into shadowed functions
"""

# unpacking the cfile dataset
def unpack_set(input_filename, output_filename):
    index = 0   # index of the samples
    output_filename = open(output_filename, 'wb')

    with open(input_filename, "rb") as f:

        byte = f.read(4)    # read 1. column of the vector

        while byte != "":
        # stored Bit Values
            floati = struct.unpack('f', byte)   # write value of 1. column to a variable
            byte = f.read(4)            # read 2. column of the vector
            floatq = struct.unpack('f', byte)   # write value of 2. column to a variable
            byte = f.read(4)            # next row of the vector and read 1. column
            # delimeter format for matplotlib 
            lines = ["%d," % index, format(floati), ",",  format(floatq), "\n"]
            output_filename.writelines(lines)
            index = index + 1
    output_filename.close
    return output_filename.name

# reformats output (precision configuration here)
def format(value):
    return "%.8f" % value            

# start
def main():

    # specify path
    unpacked_file = unpack_set("test01.cfile", "test01.txt")
    # pass file reference to matplotlib
    fname = str(unpacked_file)
    plt.plotfile(fname, cols=(0,1)) # index vs. in-phase

    # optional
    # plt.axes([0, 0.5, 0, 100000]) # for 100k samples
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")

    plt.show();

if __name__ == "__main__":
    main()

Something like plt.swap_on_disk() could cache the stuff on my SSD ;)

回答1:

So your data isn't that big, and the fact that you're having trouble plotting it points to issues with the tools. Matplotlib.... isn't that good. It has lots of options and the output is fine, but it's a huge memory hog and it fundamentally assumes your data is small. But there are other options out there.

So as an example, I generated a 20M data-point file 'bigdata.bin' using the following:

#!/usr/bin/env python
import numpy
import scipy.io.numpyio

npts=20000000
filename='bigdata.bin'

def main():
    data = (numpy.random.uniform(0,1,(npts,3))).astype(numpy.float32)
    data[:,2] = 0.1*data[:,2]+numpy.exp(-((data[:,1]-0.5)**2.)/(0.25**2))
    fd = open(filename,'wb')
    scipy.io.numpyio.fwrite(fd,data.size,data)
    fd.close()

if __name__ == "__main__":
    main()

This generates a file of size ~229MB, which isn't all that big; but you've expressed that you'd like to go to even larger files, so you'll hit memory limits eventually.

Let's concentrate on non-interactive plots first. The first thing to realize is that vector plots with glyphs at each point are going to be a disaster -- for each of the 20 M points, most of which are going to overlap anyway, trying to render little crosses or circles or something is going to be a diaster, generating huge files and taking tonnes of time. This, I think is what is sinking matplotlib by default.

Gnuplot has no trouble dealing with this:

gnuplot> set term png
gnuplot> set output 'foo.png'
gnuplot> plot 'bigdata.bin' binary format="%3float32" using 2:3 with dots

And even Matplotlib can be made to behave with some caution (choosing a raster back end, and using pixels to mark points):

#!/usr/bin/env python
import numpy
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

datatype=[('index',numpy.float32), ('floati',numpy.float32), 
        ('floatq',numpy.float32)]
filename='bigdata.bin'

def main():
    data = numpy.memmap(filename, datatype, 'r') 
    plt.plot(data['floati'],data['floatq'],'r,')
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")
    plt.savefig('foo2.png')

if __name__ == "__main__":
    main()

Now, if you want interactive, you're going to have to bin the data to plot, and zoom in on the fly. I don't know of any python tools that will help you do this offhand.

On the other hand, plotting-big-data is a pretty common task, and there are tools that are up for the job. Paraview is my personal favourite, and VisIt is another one. They both are mainly for 3D data, but Paraview in particular does 2d as well, and is very interactive (and even has a Python scripting interface). The only trick will be to write the data into a file format that Paraview can easily read.

回答2:

You can certainly optimize the reading of your file: you could directly read it into a NumPy array, so as to leverage the raw speed of NumPy. You have a few options. If RAM is an issue, you can use memmap, which keeps most of the file on disk (instead of in RAM):

# Each data point is a sequence of three 32-bit floats:
data = np.memmap(filename, mode='r', dtype=[('index', 'float32'), ('floati','float32'), ('floatq', 'float32')])

If RAM is not an issue, you can put the whole array in RAM with fromfile:

data = np.fromfile(filename, dtype=[('index', 'float32'), ('floati','float32'), ('floatq', 'float32')])

Plotting can then be done with Matplotlib's usual plot(*data) function, possibly through the "zoom in" method proposed in another solution.

回答3:

A more recent project has strong potential for large data sets: Bokeh, which was created with exactly this in mind.

In fact, only the data that's relevant at the scale of the plot is sent to the display backend. This approach is much faster than the Matplotlib approach.

回答4:

I would suggest something a bit complex but that should work : build your graph at different resolutions, for different ranges.

Think of Google Earth, for example. If you unzoom at maximum level to cover the whole planet, the resolution is the lowest. When you zoom, the pictures change for more detailed ones, but just on the region you're zooming on.

So basically for your plot (is it 2D ? 3D ? I'll assume it's 2D), I suggest you build one big graph that covers the whole [0, n] range with low resolution, 2 smaller graphs that cover [0, n/2] and [n/2 + 1, n] with twice the resolution of the big one, 4 smaller graphs that cover [0, n/4] ... [3 * n / 4 + 1, n] with twice the resolution of the 2 above, and so on.

Not sure my explanation is really clear. Also, I don't know if this kind of multi-resolution graph is handled by any existing plot program.

回答5:

I wonder if there's a win to be had by speeding up lookup of your points? (I've been intrigued by R* (r star) trees for a while.)

I wonder if using something like an r* tree in this case could be the way to go. (when zoomed out, higher up nodes in the tree could contain information about the coarser, zoomed out rendering, nodes further towards the leaves contain the individual samples)

maybe even memory map the tree (or whatever structure you end up using) into memory to keep your performance up and your RAM usage low. (you offload the task of memory management to the kernel)

hope that makes sense.. rambling a bit. it's late!