Generating batches of images in dask

Published 2019-07-09 20:37

Question:

I just started using dask because of the parallel processing power it offers. I have around 40,000 images on disk that I am going to use to build a classifier with some DL library, say Keras or TF. I collected the meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:

    img_path     labels
0   data/1.JPG   1
1   data/2.JPG   1
2   data/3.JPG   5
...     

Now here is my simple task: use dask to read the images and corresponding labels lazily, do some processing on the images, and pass batches of size 32 to the classifier.

  1. Define functions for reading and preprocessing:

    import cv2

    def read_data(idx):
        # df is the metadata dataframe shown above
        img = cv2.imread(df['img_path'].iloc[idx])
        label = df['labels'].iloc[idx]
        return img, label

    def img_resize(img):
        return cv2.resize(img, (224, 224))
    
  2. Get delayed dask arrays:

    import dask
    import dask.array as da
    import numpy as np

    delayed_pairs = [dask.delayed(read_data)(idx) for idx in range(len(df))]
    images = [d[0] for d in delayed_pairs]
    labels = [d[1] for d in delayed_pairs]
    resized_images = [dask.delayed(img_resize)(img) for img in images]
    # cv2 returns uint8 images, so declare that dtype here
    resized_images = [da.from_delayed(x, shape=(224, 224, 3), dtype=np.uint8) for x in resized_images]
    

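For reference, here is a self-contained toy version of the pipeline above. The read and resize functions are stand-ins of my own invention (synthetic arrays instead of `cv2.imread`, a crop instead of `cv2.resize`), so it runs without any image files on disk:

```python
import numpy as np
import dask
import dask.array as da

n_images = 100

def fake_read(idx):
    # Stand-in for read_data: returns a dummy uint8 image and a dummy label.
    rng = np.random.default_rng(idx)
    img = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
    return img, idx % 5

def fake_resize(img):
    # Stand-in for img_resize: a crop, to avoid the cv2 dependency.
    return img[:224, :224, :]

pairs = [dask.delayed(fake_read)(i) for i in range(n_images)]
imgs = [dask.delayed(fake_resize)(p[0]) for p in pairs]
arrays = [da.from_delayed(x, shape=(224, 224, 3), dtype=np.uint8) for x in imgs]
stacked = da.stack(arrays, axis=0)  # one lazy (100, 224, 224, 3) dask array
print(stacked.shape)                # (100, 224, 224, 3)
```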
Now here are my questions:

Q1. How do I get batches of data, with batch_size=32, from this array? Is it now equivalent to a lazy generator? If not, can it be made to behave like one?
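For context, one direction I have experimented with (not sure it is the idiomatic pattern, hence the question) is stacking the per-image arrays into one dask array and rechunking along the first axis so that each chunk is one batch; the dummy zero-filled arrays here stand in for my real delayed images:

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical stand-in for the real delayed image arrays above.
arrays = [da.from_delayed(dask.delayed(np.zeros)((224, 224, 3), dtype=np.float32),
                          shape=(224, 224, 3), dtype=np.float32)
          for _ in range(100)]

stacked = da.stack(arrays, axis=0)            # (100, 224, 224, 3)
batched = stacked.rechunk((32, 224, 224, 3))  # one chunk per batch of 32

# Iterate lazily: each block is itself a dask array and only
# materializes when .compute() is called.
for i in range(batched.numblocks[0]):
    batch = batched.blocks[i].compute()
    # batch.shape is (32, 224, 224, 3), except the last block, which is (4, 224, 224, 3)
```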

Q2. How do I choose an effective chunk size for better batch generation? For example, if I have 4 cores and the images are of size (224, 224, 3), how can I make my batch processing efficient?
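For what it's worth, my current back-of-the-envelope reasoning (my own estimate, not from any dask documentation): a float32 (224, 224, 3) image is roughly 0.6 MiB, so a 32-image chunk is around 18 MiB, which seems comfortably small per worker:

```python
import numpy as np

# Rough per-chunk memory estimate for choosing a chunk/batch size.
h, w, c = 224, 224, 3
bytes_per_image = h * w * c * np.dtype(np.float32).itemsize
batch_size = 32
bytes_per_chunk = bytes_per_image * batch_size

print(bytes_per_image)          # 602112 bytes, ~0.57 MiB per image
print(bytes_per_chunk / 2**20)  # 18.375 MiB per 32-image chunk
```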