I just started with dask because it offers great parallel-processing power. I have around 40,000 images on disk which I am going to use to build a classifier with some DL library, say Keras or TF. I collected this meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:
         img_path  labels
    0  data/1.JPG       1
    1  data/2.JPG       1
    2  data/3.JPG       5
    ...
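For completeness, here is roughly how such a dataframe could be assembled; the glob pattern and the placeholder label column are hypothetical, just to make the example self-contained:

    import glob
    import pandas as pd

    # Hypothetical reconstruction of the meta-info dataframe; in my real
    # setup the labels come from elsewhere.
    df = pd.DataFrame({'img_path': sorted(glob.glob('data/*.JPG'))})
    df['labels'] = 1  # placeholder label column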
Now here is my simple task: use dask to read the images and their corresponding labels lazily, do some processing on the images, and pass them to the classifier in batches of 32.
Define functions for reading and preprocessing:
    import cv2
    import numpy as np
    import dask
    import dask.array as da

    def read_data(idx):
        # df is the pandas dataframe above (renamed from `data` so it is
        # not shadowed by the list of delayed objects built below)
        img = cv2.imread(df['img_path'].iloc[idx])
        label = df['labels'].iloc[idx]
        return img, label

    def img_resize(img):
        # cast to float32 so the result matches the dtype declared in
        # from_delayed below
        return cv2.resize(img, (224, 224)).astype(np.float32)
Get delayed dask arrays:
    delayed_data = [dask.delayed(read_data)(idx) for idx in range(len(df))]
    images = [d[0] for d in delayed_data]
    labels = [d[1] for d in delayed_data]
    resized_images = [dask.delayed(img_resize)(img) for img in images]
    resized_images = [da.from_delayed(x, shape=(224, 224, 3), dtype=np.float32)
                      for x in resized_images]
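I believe I can then stack these per-image arrays into a single lazy array of shape (N, 224, 224, 3); a minimal sketch, assuming da.stack is the right tool here and with `stacked` being my own name:

    # stack the list of (224, 224, 3) dask arrays into one (N, 224, 224, 3) array
    stacked = da.stack(resized_images, axis=0)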
Now here are my questions:
Q1. How do I get a batch of data with batch_size=32 from this array? Is this equivalent to a lazy generator now? If not, can it be made to behave like one?
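For concreteness, this is the kind of thing I have in mind; a minimal sketch, where batch_iter is my own helper, not dask API:

    def batch_iter(arr, batch_size=32):
        # walk the leading axis in steps of batch_size; each .compute()
        # materializes only that slice, everything else stays lazy
        for i in range(0, arr.shape[0], batch_size):
            yield arr[i:i + batch_size].compute()

    # usage (Keras-style, assuming some `model` exists):
    # for batch in batch_iter(stacked):
    #     model.train_on_batch(batch, ...)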
Q2. How do I choose an effective chunksize for better batch generation? For example, if I have 4 cores and the images have size (224, 224, 3), how can I make my batch processing efficient?
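Is rechunking along the first axis to match the batch size the right idea here? A minimal sketch of what I mean, with `batched` and `first_batch` being my own names:

    # one chunk per batch: 32 images along the first axis, a whole image per chunk
    batched = stacked.rechunk((32, 224, 224, 3))
    # with 4 cores, the threaded scheduler could then read/resize up to 4
    # chunks (i.e. batches) in parallel
    first_batch = batched[:32].compute()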