Training darknet finishes immediately

Posted 2020-06-17 04:36

Question:

I would like to use the yolo architecture for object detection. Before training the network with my custom data, I followed these steps to train it on the Pascal VOC data: https://pjreddie.com/darknet/yolo/

The instructions are very clear. But after the final step

./darknet detector train cfg/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23

darknet immediately stops training and announces that weights have been written to the backups/ directory.

At first I thought the pretrained weights were simply so good that the stopping criterion was met at once. So I ran the ./darknet detect command with these weights on one of the test images, data/dog. Nothing was detected.

If I don't use any pretrained weights, the network does train. I've edited cfg/yolo-voc.cfg to use

# Testing
#batch=1
#subdivisions=1
# Training
batch=32
subdivisions=8

Now the training process has been running for many hours and is keeping my GPU warm.

Is this the intended way to train darknet? How can I use pretrained weights correctly, without training breaking off immediately?

Is there any setting to create checkpoints, or to get an idea of the progress?

Answer 1:

This is an old question so I hope you have your answer by now, but here is mine just in case it helps.

After working with darknet for about a month, I've run into most of the roadblocks that people have asked/posted about on forums. In your case, I'm pretty certain it's because the weights have been trained for the max number of batches already, and when the pre-trained weights were read in darknet assumed training was done.

Relevant personal experience: when I used one of the pretrained weights files, it started from iteration 40101 and ran until 40200 before cutting off.
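The behaviour described above can be modelled with a small sketch. It assumes that the weights file carries a count of images already seen, and that the training loop keeps running only while the iteration count derived from it is below max_batches. The struct and function names are hypothetical, and the batch size of 64 in the usage note is illustrative; this is not darknet's actual code.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for darknet's training state. */
typedef struct {
    size_t seen;      /* images processed so far, restored from the weights file */
    int batch;        /* images consumed per iteration (assumed: batch= in the cfg) */
    int max_batches;  /* iteration at which training stops (max_batches= in the cfg) */
} train_state;

/* Iterations already completed, derived from the image counter. */
size_t current_batch(const train_state *s) {
    return s->seen / (size_t)s->batch;
}

/* Iterations the training loop would still run before it exits. */
size_t remaining_batches(const train_state *s) {
    size_t done = current_batch(s);
    size_t limit = (size_t)s->max_batches;
    return done >= limit ? 0 : limit - done;
}
```

With weights that have already seen 40200 * 64 images and a limit of 40200, remaining_batches returns 0, so training would end at once. With the numbers from the experience above (starting at iteration 40101, and assuming the limit was 40200), 99 iterations remain.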

I would stick to training from scratch if you have custom data, but if you want to try the pre-trained weights again, you might find that raising the max_batches value in the cfg file helps.



Answer 2:

Adding -clear 1 at the end of your training command will clear the stats of how many images this model has seen in previous training. Then you can fine-tune your model on new data(set).

You can find more info about the usage in the function signature void train_detector(char *datacfg, char *cfgfile, char *weightfile, int *gpus, int ngpus, int clear) at https://github.com/pjreddie/darknet/blob/b13f67bfdd87434e141af532cdb5dc1b8369aa3b/examples/detector.c

I doubt that increasing the maximum number of iterations is a good idea, because the learning rate is usually tied to the current iteration number. We usually raise the maximum only when we want to resume a run that stopped because it reached the limit, and we believe that more iterations will give better results.

FYI, when you have a small dataset, training on it from scratch or from a classification network may not be a great idea. You may still want to reuse the weights from a detection network trained on a large dataset such as COCO or ImageNet.
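To illustrate why the iteration number matters for the learning rate, here is a minimal sketch of a step-style schedule, assuming the rate starts at a base value and is multiplied by a scale each time a threshold iteration is reached. The struct, the function name and the numbers in the usage note are hypothetical and not taken from any particular cfg file.

```c
/* Hypothetical step schedule; any values used with it are illustrative. */
typedef struct {
    double base_rate;      /* learning rate before any step is reached */
    int num_steps;         /* number of entries in steps and scales */
    const int *steps;      /* iteration thresholds, in ascending order */
    const double *scales;  /* multiplier applied once each threshold is reached */
} step_schedule;

/* Learning rate in effect at a given iteration. */
double rate_at(const step_schedule *s, int iteration) {
    double rate = s->base_rate;
    for (int i = 0; i < s->num_steps; ++i) {
        if (s->steps[i] > iteration) break;  /* threshold not reached yet */
        rate *= s->scales[i];
    }
    return rate;
}
```

With a base rate of 0.001, thresholds at 40000 and 60000 and scales of 0.1 each, every iteration past 60000 runs at roughly 0.00001. Under these assumptions, raising only the iteration limit adds iterations at that smallest rate, whereas resetting the counter restarts the schedule from the base rate.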



Answer 3:

Also, if you are using AlexeyAB/darknet, there may be a problem with the -clear option. In detector.c:

if (clear) *nets[k].seen = 0;

should really be:

if (clear) { *nets[k].seen = 0; *nets[k].cur_iteration = 0; }

otherwise the training loop will exit immediately.
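The difference between the two snippets can be shown with a small model. It assumes, as this answer implies, that the training loop checks a separate iteration counter against the limit rather than the image counter. The struct and function names are hypothetical, and this is a sketch, not the fork's actual code.

```c
#include <stddef.h>

/* Hypothetical model of a network that keeps two progress counters. */
typedef struct {
    size_t seen;        /* images processed so far */
    int cur_iteration;  /* iteration counter checked by the training loop (assumed) */
    int max_batches;    /* iteration limit from the cfg */
} net_state;

/* Reset only the image counter, as in the first snippet. */
void clear_seen_only(net_state *n) {
    n->seen = 0;
}

/* Reset both counters, as in the suggested fix. */
void clear_both(net_state *n) {
    n->seen = 0;
    n->cur_iteration = 0;
}

/* Nonzero when the assumed loop condition cur_iteration < max_batches holds. */
int would_train(const net_state *n) {
    return n->cur_iteration < n->max_batches;
}
```

Starting from a state whose cur_iteration already equals max_batches, would_train stays 0 after clear_seen_only and becomes 1 after clear_both, which matches the immediate exit described here.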