Generating batches of images in dask

Published 2019-07-09 20:37

Question:

I just started using dask because of the parallel processing power it offers. I have around 40,000 images on disk that I am going to use to build a classifier with some DL library, say Keras or TF. I collected the meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:

    img_path     labels
0   data/1.JPG   1
1   data/2.JPG   1
2   data/3.JPG   5
...     

Now here is my simple task: use dask to read the images and corresponding labels lazily, do some processing on the images, and pass batches of size 32 to the classifier.

  1. Define functions for reading and preprocessing:

    import cv2

    def read_data(idx):
        # df is the metadata dataframe shown above
        img = cv2.imread(df['img_path'].iloc[idx])
        label = df['labels'].iloc[idx]
        return img, label

    def img_resize(img):
        return cv2.resize(img, (224, 224))
    
  2. Get delayed dask arrays:

    import dask
    import dask.array as da
    import numpy as np

    delayed_pairs = [dask.delayed(read_data)(idx) for idx in range(len(df))]
    images = [d[0] for d in delayed_pairs]
    labels = [d[1] for d in delayed_pairs]
    resized_images = [dask.delayed(img_resize)(img) for img in images]
    # cv2 returns uint8 images, so declare that dtype here
    resized_images = [da.from_delayed(x, shape=(224, 224, 3), dtype=np.uint8) for x in resized_images]
    

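For reference, here is a self-contained toy version of the pipeline above. The read and resize functions are stand-ins of my own invention (synthetic arrays instead of `cv2.imread`, a crop instead of `cv2.resize`), so it runs without any image files on disk:

```python
import numpy as np
import dask
import dask.array as da

n_images = 100

def fake_read(idx):
    # Stand-in for read_data: returns a dummy uint8 image and a dummy label.
    rng = np.random.default_rng(idx)
    img = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
    return img, idx % 5

def fake_resize(img):
    # Stand-in for img_resize: a crop, to avoid the cv2 dependency.
    return img[:224, :224, :]

pairs = [dask.delayed(fake_read)(i) for i in range(n_images)]
imgs = [dask.delayed(fake_resize)(p[0]) for p in pairs]
arrays = [da.from_delayed(x, shape=(224, 224, 3), dtype=np.uint8) for x in imgs]
stacked = da.stack(arrays, axis=0)  # one lazy (100, 224, 224, 3) dask array
print(stacked.shape)                # (100, 224, 224, 3)
```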
Now here are my questions:

Q1. How do I get batches of data, with batch_size=32, from this array? Is it now equivalent to a lazy generator? If not, can it be made to behave like one?
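For context, one direction I have experimented with (not sure it is the idiomatic pattern, hence the question) is stacking the per-image arrays into one dask array and rechunking along the first axis so that each chunk is one batch; the dummy zero-filled arrays here stand in for my real delayed images:

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical stand-in for the real delayed image arrays above.
arrays = [da.from_delayed(dask.delayed(np.zeros)((224, 224, 3), dtype=np.float32),
                          shape=(224, 224, 3), dtype=np.float32)
          for _ in range(100)]

stacked = da.stack(arrays, axis=0)            # (100, 224, 224, 3)
batched = stacked.rechunk((32, 224, 224, 3))  # one chunk per batch of 32

# Iterate lazily: each block is itself a dask array and only
# materializes when .compute() is called.
for i in range(batched.numblocks[0]):
    batch = batched.blocks[i].compute()
    # batch.shape is (32, 224, 224, 3), except the last block, which is (4, 224, 224, 3)
```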

Q2. How do I choose an effective chunk size for better batch generation? For example, if I have 4 cores and the images are of size (224, 224, 3), how can I make my batch processing efficient?
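For what it's worth, my current back-of-the-envelope reasoning (my own estimate, not from any dask documentation): a float32 (224, 224, 3) image is roughly 0.6 MiB, so a 32-image chunk is around 18 MiB, which seems comfortably small per worker:

```python
import numpy as np

# Rough per-chunk memory estimate for choosing a chunk/batch size.
h, w, c = 224, 224, 3
bytes_per_image = h * w * c * np.dtype(np.float32).itemsize
batch_size = 32
bytes_per_chunk = bytes_per_image * batch_size

print(bytes_per_image)          # 602112 bytes, ~0.57 MiB per image
print(bytes_per_chunk / 2**20)  # 18.375 MiB per 32-image chunk
```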