I have followed all the steps and code formats (cross-checked multiple times to make sure they are correct) and prepared the required data for training custom objects with the TensorFlow Object Detection API. I tried the ssd_mobilenet_v1_coco, faster_rcnn_resnet101_coco, and faster_rcnn_inception_v2_coco models and still haven't gotten a good result. All I get is misclassified objects or no bounding boxes at all.
I am training to detect a single object class, with around 250 training images and 63 validation images. The images vary in size, mostly around 300 x 300 pixels or smaller. I train the models until they roughly converge (not fully). I judge this from the eval performance: past 15000 steps, the loss gradually decreases (to < 0.04) but still fluctuates. I then stop training and export the graph. My question is:
I have serious doubts about the test video I have been given for the object detection task. The video frames are quite large, 1370 x 786 pixels, and the object I need to detect is quite small compared to the frame size. Is this causing the problem, since my training images are small (300 x 300 and smaller) whereas my test video frames are so much larger? I tried training several times with each model and failed every time, and I am at the point where I want to give up.
Can somebody shed some light on what is happening here? Should I train for more steps? Or should I train on images with dimensions similar to the test frames? Will this help?
Following is the code of the config file and labelmap.pbtxt I used.
Config File:
fine_tune_checkpoint: ".../ssd_mobilenet_v1_coco_2017_11_17/model.ckpt"
from_detection_checkpoint: true
num_steps: 200000
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
ssd_random_crop {
}
}
}
train_input_reader: {
tf_record_input_reader {
input_path: ".../train.record"
}
label_map_path: ".../labelmap.pbtxt"
}
eval_config: {
num_examples: 63
labelmap.pbtxt:
item {
id: 1
name: 'tomato'
}
Training on images with dimensions similar to your test frames is exactly what you need to do, given what you explained.
You should not expect a network trained on 300x300 images to work as intended on 1370x786 images, especially if the object is already small in the big images.
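To put rough numbers on the mismatch, here is a small sketch. It assumes the SSD config resizes every input frame to a fixed 300 x 300 (common for ssd_mobilenet_v1_coco, but check the image_resizer block in your own config), and the 40-pixel object size is a made-up figure for illustration only.

```python
def resized_object_size(frame_w, frame_h, obj_w, obj_h, target_w=300, target_h=300):
    """Object size in pixels after the whole frame is squashed to target_w x target_h.

    Assumes a fixed-shape resize that does not preserve aspect ratio.
    """
    scale_x = target_w / frame_w
    scale_y = target_h / frame_h
    return obj_w * scale_x, obj_h * scale_y


# Hypothetical 40 x 40 pixel object inside a 1370 x 786 test frame.
w, h = resized_object_size(1370, 786, 40, 40)
print(f"{w:.1f} x {h:.1f}")  # 8.8 x 15.3
```

Under these assumptions the object shrinks to roughly 9 x 15 pixels and is also distorted, which is very different from what the network saw in 300x300 training images where the object fills a good part of the frame.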
Your training data has to be as similar as possible to your eval data, without entering the dangerous overfitting zone. At the very least, the images have to be of similar size and aspect ratio and come from the same domain.
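One way to sanity-check the size and aspect ratio part is sketched below. The function names, the sample sizes, and the 25% tolerance are all hypothetical choices for illustration; in practice you would collect the (width, height) pairs from your own training images and test frames.

```python
from statistics import median


def size_summary(sizes):
    """Median width, height, and aspect ratio for a list of (width, height) pairs."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    ratios = [w / h for w, h in sizes]
    return median(widths), median(heights), median(ratios)


def is_similar(train_sizes, test_size, tolerance=0.25):
    """True if the test frame is within `tolerance` (relative difference)
    of the training medians in width, height, and aspect ratio.

    The 0.25 default is an arbitrary threshold, not a recommended value.
    """
    med_w, med_h, med_ratio = size_summary(train_sizes)
    test_w, test_h = test_size
    checks = [
        (test_w, med_w),
        (test_h, med_h),
        (test_w / test_h, med_ratio),
    ]
    return all(abs(actual - ref) / ref <= tolerance for actual, ref in checks)


# Hypothetical sample of training image sizes, compared with the test frame size.
train_sizes = [(300, 300), (280, 260), (300, 225)]
print(is_similar(train_sizes, (1370, 786)))  # False
```

A result of False here only flags the size mismatch; it says nothing about whether the images come from the same domain, which still has to be judged by eye.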
Once you have solved this, keep in mind that small objects are really hard to detect, so you will probably need to modify the default model configuration.
If you don't have real-time constraints, I would recommend starting with a Faster-RCNN and setting the output_stride parameter to 8 instead of 16.
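The reason a smaller stride helps can be seen with some back-of-the-envelope arithmetic. The sketch below assumes the stride is the total downsampling factor between input pixels and the feature map on which region proposals are computed; the 24-pixel object is a hypothetical figure. As far as I know, the Faster R-CNN sample configs expose this setting as first_stage_features_stride inside the feature_extractor block, but treat that name as an assumption and check your own config.

```python
def feature_cells(obj_pixels, stride):
    """Approximate number of feature-map cells an object spans along one axis."""
    return obj_pixels / stride


# Hypothetical object that is 24 pixels wide in the input image.
for stride in (16, 8):
    print(stride, feature_cells(24, stride))
# 16 1.5
# 8 3.0
```

With a stride of 16 the object covers barely more than one feature cell per axis, while a stride of 8 doubles that, at the cost of more computation and memory.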