I have two directories, each containing about 50,000 images, most of which are 240x180 pixels.
I want to pickle their pixel data as training, validation, and test sets,
but this apparently turns out to be very, very large, and eventually causes the computer to either freeze or run out of disk space.
When the computer froze, the partially generated pkl file was already 28GB.
If the images are RGB, each one is about 130KB raw (240 x 180 x 3 bytes), so 100,000 of them should come to roughly 13GB; I don't see why the pickle would be more than twice that.
Am I doing something wrong? Or is there a more efficient way to do this?
from PIL import Image
import pickle
import os

indir1 = 'Positive'
indir2 = 'Negative'

trainimage = []
trainpixels = []
trainlabels = []
validimage = []
validpixels = []
validlabels = []
testimage = []
testpixels = []
testlabels = []

i = 0
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root, f))
            if i < 40000:                       # first 40,000 images -> training set
                trainpixels.append(im.tostring())
                trainlabels.append(0)
            elif i < 45000:                     # next 5,000 -> validation set
                validpixels.append(im.tostring())
                validlabels.append(0)
            else:                               # remainder -> test set
                testpixels.append(im.tostring())
                testlabels.append(0)
            print str(i) + '\t' + str(f)
            i += 1
        except IOError:
            continue

i = 0
for (root, dirs, filenames) in os.walk(indir2):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root, f))
            if i < 40000:
                trainpixels.append(im.tostring())
                trainlabels.append(1)
            elif i < 45000:
                validpixels.append(im.tostring())
                validlabels.append(1)
            else:
                testpixels.append(im.tostring())
                testlabels.append(1)
            print str(i) + '\t' + str(f)
            i += 1
        except IOError:
            continue

# Pair each pixel list with its label list.
trainimage.append(trainpixels)
trainimage.append(trainlabels)
validimage.append(validpixels)
validimage.append(validlabels)
testimage.append(testpixels)
testimage.append(testlabels)

output = open('data.pkl', 'wb')
pickle.dump(trainimage, output)
pickle.dump(validimage, output)
pickle.dump(testimage, output)
output.close()
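
In case it helps frame answers: one thing I've been wondering is whether the default pickle protocol (protocol 0, which is ASCII-based in Python 2) is inflating the raw pixel strings, and whether writing one record at a time would keep memory flat instead of holding 100,000 strings in lists. Below is a rough sketch of what I mean; dump_images and load_records are just names I made up, not anything from a library.

    from PIL import Image
    import pickle
    import os

    def dump_images(indir, label, out_file):
        # Stream one (pixels, label) record per image into an already-open
        # file instead of accumulating everything in RAM.
        for root, dirs, filenames in os.walk(indir):
            for f in filenames:
                try:
                    im = Image.open(os.path.join(root, f))
                except IOError:
                    continue
                # HIGHEST_PROTOCOL is binary; the default (protocol 0)
                # escapes the raw pixel bytes as ASCII and bloats the file.
                pickle.dump((im.tostring(), label), out_file,
                            pickle.HIGHEST_PROTOCOL)

    def load_records(path):
        # Read the records back one at a time until end-of-file.
        with open(path, 'rb') as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break

    with open('data.pkl', 'wb') as out:
        dump_images('Positive', 0, out)
        dump_images('Negative', 1, out)

With something like this, I suppose the train/validation/test split could be done on record indices at load time rather than by building three giant lists up front, but I don't know if that's the right approach either.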