Can Caffe classify pixels of an image directly?

Posted 2020-02-26 00:15

Question:

I would like to classify the pixels of an image as "is street" or "is not street". I have some training data from the KITTI dataset and I have seen that Caffe has an IMAGE_DATA layer type. The labels are available as images of the same size as the input image.

Apart from Caffe, my first idea for solving this problem was to feed in image patches around the pixel to be classified (e.g. 20 pixels to the top / left / right / bottom, resulting in 41×41=1681 features per pixel I want to classify).
However, if I could tell Caffe how to use the labels without having to create those image patches manually (and the layer type IMAGE_DATA seems to suggest that this is possible), I would prefer that.
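
For reference, the manual patch approach described above could be sketched in NumPy as follows. This is only an illustration: the function name `extract_patch`, the single-channel stand-in image, and the choice of reflection padding for border pixels are assumptions, not something prescribed by Caffe or KITTI.

```python
import numpy as np

def extract_patch(image, row, col, radius=20):
    """Return the (2*radius+1) x (2*radius+1) patch centered on (row, col).

    `image` is assumed to be a 2-D (single-channel) array. It is padded by
    reflection so that border pixels also get a full-sized patch; this is
    an arbitrary choice, zero padding would work as well.
    """
    padded = np.pad(image, radius, mode="reflect")
    # After padding, pixel (row, col) sits at (row + radius, col + radius),
    # so the patch starts at (row, col) in padded coordinates.
    size = 2 * radius + 1
    return padded[row:row + size, col:col + size]

# Stand-in grayscale image, not real KITTI data.
image = np.arange(100 * 120, dtype=np.float32).reshape(100, 120)

patch = extract_patch(image, row=0, col=0)  # a corner pixel still yields 41x41
features = patch.reshape(-1)                # flattened to 41 * 41 = 1681 values
```

Doing this for every pixel recomputes heavily overlapping regions, which is one reason to look for a way to label all pixels in a single pass.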

Can Caffe classify pixels of an image directly? What would such a prototxt network definition look like? How do I give Caffe the information about the labels?

I guess the input layer would be something like

layers {
  name: "data"
  type: IMAGE_DATA
  top: "data"
  top: "label"
  image_data_param {
    source: "path/to/file_list.txt"
    mean_file: "path/to/imagenet_mean.binaryproto"
    batch_size: 4
    crop_size: 41
    mirror: false
    new_height: 256
    new_width: 256
  }
}

However, I am not sure what exactly crop_size means. Is the crop really centered? How does Caffe deal with the corner pixels? What are new_height and new_width good for?

Answer 1:

It seems you can try fully convolutional networks for semantic segmentation.

Caffe was cited in this paper: https://github.com/BVLC/caffe/wiki/Publications

Also here is the model: https://github.com/BVLC/caffe/wiki/Model-Zoo#fully-convolutional-semantic-segmentation-models-fcn-xs

This presentation may also be helpful: http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-pixels.pdf



Answer 2:

Can Caffe classify pixels? In theory, I think the answer is yes. I haven't tried it myself, but I don't think there is anything stopping you from doing so.

Inputs:
You need two IMAGE_DATA layers: one that loads the RGB image and another that loads the corresponding label-mask image. Note that if you use the convert_imageset utility you cannot shuffle each set independently, because you would no longer be able to match an image to its label-mask.

An IMAGE_DATA layer has two "tops": one for "data" and one for "label". I suggest you set the "label"s of both input layers to the index of the image/label-mask and add a utility layer that verifies that the indices always match. This will prevent you from training on the wrong label-masks ;)

Example:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "data-idx"
  # parameters...
}
layer {
  name: "label-mask"
  type: "ImageData"
  top: "label-mask"
  top: "label-idx"
  # parameters...
}
layer {
  name: "assert-idx"
  type: "EuclideanLoss"
  bottom: "data-idx"
  bottom: "label-idx"
  top: "this-must-always-be-zero"
}
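
The index trick above relies on the two source lists being written in the same order, with the pair index stored in the integer "label" column of each line. A minimal sketch of generating such lists is shown here; the helper name `write_index_lists` and all file names are hypothetical, and the check only makes sense if neither ImageData layer shuffles its list.

```python
import os
import tempfile

def write_index_lists(pairs, image_list_path, mask_list_path):
    """Write two ImageData source files whose 'label' column is the pair index.

    Each output line has the form "<path> <index>", so image i and mask i
    carry the same integer and can be compared inside the net.
    """
    with open(image_list_path, "w") as img_f, open(mask_list_path, "w") as mask_f:
        for idx, (image_path, mask_path) in enumerate(pairs):
            img_f.write("%s %d\n" % (image_path, idx))
            mask_f.write("%s %d\n" % (mask_path, idx))

# Hypothetical file names, for illustration only.
pairs = [
    ("images/0000.png", "masks/0000.png"),
    ("images/0001.png", "masks/0001.png"),
]

out_dir = tempfile.mkdtemp()
image_list = os.path.join(out_dir, "image_list.txt")
mask_list = os.path.join(out_dir, "mask_list.txt")
write_index_lists(pairs, image_list, mask_list)
```

The two resulting files would then be used as the `source` of the "data" and "label-mask" layers respectively.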

Loss layer:
Now, you can do whatever you like to the input data, but to get pixel-wise labeling you eventually need a pixel-wise loss. Therefore, the last layer before the loss must produce a prediction with the same width and height as the "label-mask". Not all loss layers know how to handle multiple labels, but "EuclideanLoss" (for example) can, so your loss layer should look something like

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "prediction" # same size as the image
  bottom: "label-mask"
  top: "loss"
}

I think "SoftmaxWithLoss" has a newer version that can be used in this scenario, but you'll have to check it out yourself. In that case "prediction" should be of shape 2-by-h-by-w (since you have 2 labels).
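
To make the 2-by-h-by-w shape concrete, here is a NumPy sketch of what a pixel-wise softmax loss computes: a softmax over the class axis at every pixel, followed by the mean cross-entropy against an h-by-w label mask. It is a plain re-implementation for illustration, not Caffe's code, and the function name `pixelwise_softmax_loss` is made up.

```python
import numpy as np

def pixelwise_softmax_loss(scores, labels):
    """Mean cross-entropy over all pixels.

    scores: float array of shape (num_classes, h, w), one score map per class.
    labels: int array of shape (h, w) with values in [0, num_classes).
    """
    # Subtract the per-pixel maximum for numerical stability.
    shifted = scores - scores.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pick, at every pixel, the log-probability of that pixel's true class.
    rows, cols = np.indices(labels.shape)
    picked = log_probs[labels, rows, cols]
    return -picked.mean()

# Two classes ("is not street" = 0, "is street" = 1) on a tiny 4x5 image.
scores = np.zeros((2, 4, 5))            # equal scores: undecided everywhere
labels = np.zeros((4, 5), dtype=int)
loss = pixelwise_softmax_loss(scores, labels)
```

With equal scores for both classes, every pixel gets probability 0.5 for its true class, so the loss equals log(2).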

Additional notes:
Once you set the input size in the parameters of the "ImageData" layer, you fix the sizes of all blobs of the net. You must set the label size to the same size, and you must carefully consider how you are going to deal with images of different shapes and sizes.
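
One possible way to handle differently sized inputs is to resize every image and its label-mask to a common size before training. The sketch below assumes this preprocessing is done offline, and it uses nearest-neighbor sampling so that the mask keeps only its original label values instead of getting interpolated in-between values. The function name `resize_nearest` is an illustrative choice.

```python
import numpy as np

def resize_nearest(array, new_height, new_width):
    """Nearest-neighbor resize of an (h, w) or (h, w, c) array."""
    h, w = array.shape[:2]
    # For each output row/column, pick the source row/column it maps to.
    rows = np.arange(new_height) * h // new_height
    cols = np.arange(new_width) * w // new_width
    return array[rows[:, None], cols]

# Stand-in data: a 6x8 binary mask and a matching RGB image.
mask = np.zeros((6, 8), dtype=np.uint8)
mask[3:, :] = 1                          # lower half is "street"
image = np.zeros((6, 8, 3), dtype=np.uint8)

small_mask = resize_nearest(mask, 3, 4)
small_image = resize_nearest(image, 3, 4)
```

The image itself could be resized with a smoother interpolation; nearest-neighbor is used for both here only to keep the sketch short.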