I would like to classify the pixels of an image as "is street" or "is not street". I have some training data from the KITTI dataset and I have seen that Caffe has an IMAGE_DATA layer type.
The labels are available as images of the same size as the input image.
Apart from Caffe, my first idea for solving this problem was to feed in image patches around the pixel to be classified (e.g. 20 pixels to the top / left / right / bottom, resulting in 41×41=1681 features per pixel I want to classify).
However, if I could tell Caffe how to use the labels without having to create those image patches manually (and the layer type IMAGE_DATA seems to suggest that this is possible), I would prefer that.
Can Caffe classify pixels of an image directly? What would such a prototxt network definition look like? How do I give Caffe the information about the labels?
I guess the input layer would be something like:
layers {
  name: "data"
  type: IMAGE_DATA
  top: "data"
  top: "label"
  image_data_param {
    source: "path/to/file_list.txt"
    mean_file: "path/to/imagenet_mean.binaryproto"
    batch_size: 4
    crop_size: 41
    mirror: false
    new_height: 256
    new_width: 256
  }
}
However, I am not sure what crop_size exactly means. Is the crop really centered? How does Caffe deal with the corner pixels? What are new_height and new_width good for?
Can Caffe classify pixels? In theory, I think the answer is yes. I haven't tried it myself, but I don't think anything is stopping you from doing so.
Inputs:
You need two IMAGE_DATA layers: one that loads the RGB image and another that loads the corresponding label-mask image. Note that if you use the convert_imageset utility, you cannot shuffle each set independently, or you won't be able to match an image to its label-mask.
An IMAGE_DATA layer has two "tops": one for "data" and one for "label". I suggest you set the "label"s of both input layers to the index of the image/label-mask and add a utility layer that verifies that the indices always match. This will prevent you from training on the wrong label-masks ;)
Example:
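A minimal sketch of this setup, written in the newer layer { type: "ImageData" } syntax rather than the layers { type: IMAGE_DATA } form used in the question. The list-file paths, the blob names "data-idx" and "label-idx", and the convention that the integer column of each list file holds the running index of the image are all assumptions made for illustration:

```prototxt
# First input: the RGB images.
# Each line of the (hypothetical) list file is "<image path> <index>".
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "data-idx"
  image_data_param {
    source: "path/to/image_list.txt"
    batch_size: 4
    shuffle: false    # keep both lists in the same order
  }
}
# Second input: the label masks, listed in the same order with the same indices.
layer {
  name: "label-mask"
  type: "ImageData"
  top: "label-mask"
  top: "label-idx"
  image_data_param {
    source: "path/to/mask_list.txt"
    batch_size: 4
    shuffle: false
    is_color: false   # load the mask as a single channel
  }
}
# Utility check: the two index blobs should be identical in every batch.
layer {
  name: "assert-idx"
  type: "EuclideanLoss"
  bottom: "data-idx"
  bottom: "label-idx"
  top: "this-must-always-be-zero"
}
```

With both lists in the same order and shuffling disabled, the two index blobs should always agree, so the output of the check layer should stay at zero. A non-zero value would indicate that images and masks have drifted out of sync.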
Loss layer:
Now, you can do whatever you like to the input data, but to get pixel-wise labeling you eventually need a pixel-wise loss. Therefore, your last layer (before the loss) must produce a prediction with the same width and height as the "label-mask".
Not all loss layers know how to handle multiple labels, but "EuclideanLoss" (for example) can. Therefore, you should have a loss layer something like:
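A sketch of such a loss layer, assuming the last layer of the net writes to a blob named "prediction" and the mask input writes to a blob named "label-mask" (both names are placeholders, not fixed by Caffe):

```prototxt
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "prediction"   # assumed to have the same shape as the label mask
  bottom: "label-mask"
  top: "loss"
}
```

"EuclideanLoss" compares its two inputs element by element, so with a single-channel mask the prediction would also need a single channel of the same height and width.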
"SoftmaxWithLoss"
has a newer version that can be used in this scenario, but you'll have to check it out yourself. In that case, "prediction" should be of shape 2-by-h-by-w (since you have 2 labels).
Additional notes:
Once you set the input size in the parameters of the "ImageData" layer, you fix the sizes of all blobs of the net. You must set the label size to the same size. You must carefully consider how you are going to deal with images of different shapes and sizes.
It seems you can try fully convolutional networks for semantic segmentation.
Caffe was cited in this paper: https://github.com/BVLC/caffe/wiki/Publications
Also here is the model: https://github.com/BVLC/caffe/wiki/Model-Zoo#fully-convolutional-semantic-segmentation-models-fcn-xs
Also, this presentation may be helpful: http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-pixels.pdf
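To make the fully convolutional idea concrete, here is a deliberately tiny sketch. It is not one of the FCN-xs models linked above. It assumes that input blobs named "data" and "label-mask" already exist, that the mask holds the class indices 0 and 1 (not 0 and 255), and that the installed "SoftmaxWithLoss" accepts one label per pixel. Layer names and sizes are arbitrary:

```prototxt
# Convolution with padding, so height and width stay unchanged.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 16
    kernel_size: 5
    pad: 2
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}
# 1x1 convolution producing one score map per class (street / not street),
# i.e. a blob of shape N x 2 x h x w.
layer {
  name: "prediction"
  type: "Convolution"
  bottom: "conv1"
  top: "prediction"
  convolution_param {
    num_output: 2
    kernel_size: 1
    weight_filler { type: "xavier" }
  }
}
# Per-pixel softmax loss over the two score maps.
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "prediction"
  bottom: "label-mask"
  top: "loss"
}
```

Because every layer here is convolutional and padded to preserve the spatial size, the prediction has the same height and width as the input, and no image patches need to be cut out by hand.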