How to measure overfitting when train and validation sample is small

Posted 2020-08-23 04:03

Question:

I have the following plot:

The model is created with the following number of samples:

                class1     class2
train             20         20
validate          21         13

In my understanding, the plot shows there is no overfitting. But since the sample size is very small, I'm not confident that the model generalizes well.

Is there any other way to measure overfitting besides the plot above?

This is my complete code:

library(keras)
library(tidyverse)


train_dir <- "data/train/"
validation_dir <- "data/validate/"



# Making model ------------------------------------------------------------


conv_base <- application_vgg16(
  weights = "imagenet",
  include_top = FALSE,
  input_shape = c(150, 150, 3)
)

# VGG16 based model -------------------------------------------------------

# Works better with regularizer
model <- keras_model_sequential() %>%
  conv_base() %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu", kernel_regularizer = regularizer_l1(l = 0.01)) %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)

length(model$trainable_weights)
freeze_weights(conv_base)
length(model$trainable_weights)


# Train model -------------------------------------------------------------
desired_batch_size <- 20 

train_datagen <- image_data_generator(
  rescale = 1 / 255,
  rotation_range = 40,
  width_shift_range = 0.2,
  height_shift_range = 0.2,
  shear_range = 0.2,
  zoom_range = 0.2,
  horizontal_flip = TRUE,
  fill_mode = "nearest"
)

# Note that the validation data shouldn't be augmented!
test_datagen <- image_data_generator(rescale = 1 / 255)


train_generator <- flow_images_from_directory(
  train_dir, # Target directory
  train_datagen, # Data generator
  target_size = c(150, 150), # Resizes all images to 150 × 150
  shuffle = TRUE,
  seed = 1,
  batch_size = desired_batch_size, # was 20
  class_mode = "binary" # binary_crossentropy loss for binary labels
)

validation_generator <- flow_images_from_directory(
  validation_dir,
  test_datagen,
  target_size = c(150, 150),
  shuffle = TRUE,
  seed = 1,
  batch_size = desired_batch_size,
  class_mode = "binary"
)

# Fine tuning -------------------------------------------------------------

unfreeze_weights(conv_base, from = "block3_conv1")

# Compile model -----------------------------------------------------------



model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 2e-5),
  metrics = c("accuracy")
)


# Evaluate  by epochs  ---------------------------------------------------------------


# This records accuracy per epoch for plotting (slow)
history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 15, # was 50
  validation_data = validation_generator,
  validation_steps = 50
)

plot(history)

Answer 1:

So two things here:

  1. Stratify your data with respect to classes - your validation data has a completely different class distribution than your training set (the training set is balanced, whereas the validation set is not). This can distort your loss and metric values. It is better to stratify your split so that the class ratio is the same in both sets.

  2. With so few data points, use more thorough validation schemes - as you can see, you have only 74 images in total. In this case it is not a problem to load all images into a numpy.array (you can still apply data augmentation using the flow function) and use validation schemes that are hard to set up when your data sits in folders. The schemes (from sklearn) which I advise you to use are:

    • stratified k-fold cross-validation - you divide your data into k chunks, and for each selection of k - 1 chunks you train your model on those k - 1 chunks and then compute metrics on the one left out for validation. The final result is the mean of the results obtained on the validation chunks. You could, of course, check not only the mean but also other statistics of the loss distribution (e.g. min, max, median), and compare them with the results obtained on the training set for each fold (a minimal R sketch of this scheme follows the list).
    • leave-one-out - a special case of the previous scheme, where the number of chunks / folds equals the number of examples in your dataset. It is considered the most thorough way of measuring model performance. It is rarely used in deep learning because training is usually too slow and datasets too big to finish the computations in a reasonable time.
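
A minimal R sketch of stratified k-fold cross-validation along these lines (not from the original answer): it assumes the 74 images have already been loaded into an array x of shape (74, 150, 150, 3) with a binary label vector y, and that build_model() is a hypothetical helper that rebuilds and compiles the model from the question from scratch. caret::createFolds() stratifies the folds by class when given a factor.

library(keras)
library(caret)

k <- 5
folds <- createFolds(factor(y), k = k)  # list of held-out index vectors, stratified by class

fold_results <- lapply(folds, function(val_idx) {
  model <- build_model()  # fresh, untrained model for every fold (hypothetical helper)

  model %>% fit(
    x[-val_idx, , , , drop = FALSE], y[-val_idx],
    epochs = 15,
    batch_size = 20,
    verbose = 0
  )

  # loss and accuracy on the held-out chunk
  model %>% evaluate(x[val_idx, , , , drop = FALSE], y[val_idx], verbose = 0)
})

# mean and spread of the validation accuracy across folds
# (the metric may be named "acc" instead of "accuracy" in older keras versions)
val_acc <- sapply(fold_results, function(res) res[["accuracy"]])
c(mean = mean(val_acc), sd = sd(val_acc), min = min(val_acc), max = max(val_acc))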


Answer 2:

I recommend looking at the predictions as the next step.

For example, judging from the top plot and the number of provided samples, your validation accuracy fluctuates between two values, and the difference between them corresponds to exactly one sample guessed right (with 34 validation samples, one sample is a jump of 1/34, roughly 3%, in accuracy).

So your model predicts more or less the same results (plus or minus one observation) regardless of the fitting. This is a bad sign.

Also, the number of features and trainable parameters (weights) is far too high for the number of samples provided. All those weights simply have no chance of being trained properly.
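
To put a rough number on that claim (a quick sketch, not part of the original answer; it assumes the model object from the question is still in scope):

# total number of weights in the model vs. number of training samples;
# count_params() counts every weight, trainable or not, but after the partial
# unfreeze above the large majority of them are trainable
total_weights <- count_params(model)
n_train_samples <- 20 + 20  # from the class table in the question
total_weights / n_train_samples  # on the order of hundreds of thousands of weights per sample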



Answer 3:

Your validation loss is consistently lower than the training loss. I would be quite suspicious of your results; if you look at the validation accuracy, it just shouldn't behave like that.
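
One way to check that gap numerically rather than by eye (a small sketch, not part of the original answer; it assumes the history object from the question and keras' as.data.frame() method for training histories, whose metric names may be "acc"/"val_acc" in older versions):

# average training vs. validation metrics over the last few epochs
df <- as.data.frame(history)  # long format: one row per epoch, metric and data split
last_epochs <- subset(df, epoch > max(df$epoch) - 3)
aggregate(value ~ metric + data, data = last_epochs, FUN = mean)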

The less data you have, the less confidence you can have in anything, so you are right to be unsure about overfitting. The only thing that really helps here is gathering more data, either through data augmentation or by combining your data with another dataset.



Answer 4:

If you want to measure how much your current model overfits, you could repeatedly evaluate it on resampled versions of your small validation set: each time, draw 34 samples from the validation set using the sample function with replace = TRUE. By sampling with replacement you create more "extreme" datasets, and therefore get a better estimate of how much the predictions could vary given your available data. This resampling technique is the bootstrap (bagging, or bootstrap aggregating, applies the same idea to training an ensemble of models). A minimal sketch follows.
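
A minimal bootstrap sketch along these lines (not from the original answer): it assumes the validation images and labels have already been loaded into val_x (an array of shape (34, 150, 150, 3)) and val_y (a binary vector of length 34), and that the trained model from the question is in scope.

set.seed(1)

# predicted classes for the whole validation set, computed once
val_prob <- model %>% predict(val_x)
val_pred <- as.integer(val_prob > 0.5)

# resample the validation cases with replacement many times and record accuracy
n_boot <- 1000
boot_acc <- replicate(n_boot, {
  idx <- sample(length(val_y), replace = TRUE)
  mean(val_pred[idx] == val_y[idx])
})

# spread of the accuracy estimate given only 34 validation images
quantile(boot_acc, c(0.025, 0.5, 0.975))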