Using Conv2DTranspose to output double the size of its input

Posted 2020-07-22 09:19

Question:

I'm a newbie with Python 3.7.7 and Tensorflow 2.1.0 and I'm trying to understand Conv2DTranspose. I have tried this code:

# Imports used by the snippets in this question
from tensorflow import keras
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose, MaxPooling2D, UpSampling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def vgg16_decoder(input_size = (7, 7, 512)):
    inputs = Input(input_size, name = 'input')

    conv1 = Conv2DTranspose(512, (2, 2), dilation_rate = 2, name = 'conv1')(inputs)

    model = Model(inputs = inputs, outputs = conv1, name = 'vgg-16_decoder')

    opt = Adam(lr=0.001)
    model.compile(optimizer=opt, loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])

    return model

And this is its summary:

Model: "vgg-16_decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 7, 7, 512)         0
_________________________________________________________________
conv1 (Conv2DTranspose)      (None, 9, 9, 512)         1049088
=================================================================
Total params: 1,049,088
Trainable params: 1,049,088
Non-trainable params: 0
_________________________________________________________________

But I want an output of (None, 14, 14, 512) from conv1.

I have changed the kernel size to (3, 3), keeping dilation_rate = 2, and I get this summary:

Model: "vgg-16_decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 7, 7, 512)         0
_________________________________________________________________
conv1 (Conv2DTranspose)      (None, 11, 11, 512)       2359808
=================================================================
Total params: 2,359,808
Trainable params: 2,359,808
Non-trainable params: 0
_________________________________________________________________

I'm trying to reproduce this using Conv2DTranspose:

# A piece of code from U-NET implementation

up6 = Conv2D(512, 2, activation = 'relu', padding = 'same', kernel_initializer = 'he_normal', name = 'up6')(UpSampling2D(size = (2,2), name = 'upsp1')(drop5))

And its summary:

drop5 (Dropout)                 (None, 16, 16, 1024) 0           conv5_2[0][0]
__________________________________________________________________________________________________
upsp1 (UpSampling2D)            (None, 32, 32, 1024) 0           drop5[0][0]
__________________________________________________________________________________________________
up6 (Conv2D)                    (None, 32, 32, 512)  2097664     upsp1[0][0]
__________________________________________________________________________________________________

It upsamples its input by 2 and changes the number of filters.

How can I do that with Conv2DTranspose?

UPDATE:

I think, or at least I suppose, I did it, but I don't understand what I did:

conv1 = Conv2DTranspose(512, (2, 2), strides = 2, name = 'conv1')(inputs)

With the previous statement, I get this summary:

Model: "vgg-16_decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 7, 7, 512)         0
_________________________________________________________________
conv1 (Conv2DTranspose)      (None, 14, 14, 512)       1049088
=================================================================
Total params: 1,049,088
Trainable params: 1,049,088
Non-trainable params: 0
_________________________________________________________________

If you want to correct me or explain what I have done here, you are welcome to.

UPDATE 2:

By the way, I'm trying to create a VGG-16 decoder. This is the code for my VGG-16 encoder:

def vgg16_encoder(input_size = (224,224,3)):
    inputs = Input(input_size, name = 'input')

    conv1 = Conv2D(64, (3, 3), activation = 'relu', padding = 'same', name ='conv1_1')(inputs)
    conv1 = Conv2D(64, (3, 3), activation = 'relu', padding = 'same', name ='conv1_2')(conv1)
    pool1 = MaxPooling2D(pool_size = (2,2), strides = (2,2), name = 'pool_1')(conv1)

    conv2 = Conv2D(128, (3, 3), activation = 'relu', padding = 'same', name ='conv2_1')(pool1)
    conv2 = Conv2D(128, (3, 3), activation = 'relu', padding = 'same', name ='conv2_2')(conv2)
    pool2 = MaxPooling2D(pool_size = (2,2), strides = (2,2), name = 'pool_2')(conv2)

    conv3 = Conv2D(256, (3, 3), activation = 'relu', padding = 'same', name ='conv3_1')(pool2)
    conv3 = Conv2D(256, (3, 3), activation = 'relu', padding = 'same', name ='conv3_2')(conv3)
    conv3 = Conv2D(256, (3, 3), activation = 'relu', padding = 'same', name ='conv3_3')(conv3)
    pool3 = MaxPooling2D(pool_size = (2,2), strides = (2,2), name = 'pool_3')(conv3)

    conv4 = Conv2D(512, (3, 3), activation = 'relu', padding = 'same', name ='conv4_1')(pool3)
    conv4 = Conv2D(512, (3, 3), activation = 'relu', padding = 'same', name ='conv4_2')(conv4)
    conv4 = Conv2D(512, (3, 3), activation = 'relu', padding = 'same', name ='conv4_3')(conv4)
    pool4 = MaxPooling2D(pool_size = (2,2), strides = (2,2), name = 'pool_4')(conv4)

    conv5 = Conv2D(512, (3, 3), activation = 'relu', padding = 'same', name ='conv5_1')(pool4)
    conv5 = Conv2D(512, (3, 3), activation = 'relu', padding = 'same', name ='conv5_2')(conv5)
    conv5 = Conv2D(512, (3, 3), activation = 'relu', padding = 'same', name ='conv5_3')(conv5)
    pool5 = MaxPooling2D(pool_size = (2,2), strides = (2,2), name = 'pool_5')(conv5)

    opt = Adam(lr=0.001)

    model = Model(inputs = inputs, outputs = pool5, name = 'vgg-16_encoder')

    model.compile(optimizer=opt, loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])

    return model
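
For reference, each of the five pooling layers halves the spatial dimensions, so this encoder maps (224, 224, 3) down through 224 → 112 → 56 → 28 → 14 → 7 to the (7, 7, 512) feature map the decoder has to start from. A quick check:

model = vgg16_encoder()
print(model.output_shape)  # (None, 7, 7, 512)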

Answer 1:

When we design an encoder-decoder architecture, we need operations that reverse the operations already done. Say the encoder contains Conv2D and pooling layers (common in architectures like VGG). We use Conv2DTranspose (which can be thought of as the reverse operation of Conv2D) and UpSampling2D (the reverse of pooling, at least in terms of shape; pooling itself is irreversible, since information is lost).

N.B.: You don't want to do all the upsampling of your feature maps with Conv2DTranspose alone (you can, but for VGG I don't think it will give you the upsampled feature maps the way you want in your decoder); it isn't designed that way. It does learn to upsample, but it learns the best upsampling parameters, which is slightly different. You would end up with really large kernels and a completely different network from the VGG encoder you're talking about.
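
To see what that means concretely, here is a hypothetical single-shot sketch (not from the original post; it assumes the tensorflow.keras imports below): going from (7, 7, 512) to (224, 224, 3) in one Conv2DTranspose forces a 32 × 32 kernel, against VGG's 3 × 3 kernels:

ip = Input((7, 7, 512))
# one-shot 32x upsample: (7 - 1) * 32 + 32 = 224 with the default padding='valid'
single_shot = Conv2DTranspose(3, (32, 32), strides = 32, name = 'one_shot_up')(ip)
Model(ip, single_shot).summary()  # output shape: (None, 224, 224, 3)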

from tensorflow.keras.layers import *
from tensorflow.keras.models import *

def encoder_decoder_conv(input_size = (224,224,3)):
    ip = Input((224,224,3))
    # encoder
    conv = Conv2D(512, (3,3))(ip) # note that the default 'valid' padding is used
    # decoder
    inv_conv = Conv2DTranspose(3, (3,3))(conv)
    # simple model
    model = Model(ip, inv_conv)
    return model

model1 = encoder_decoder_conv()
model1.summary()

def encoder_decoder_pooling(input_size = (224,224,3)):
    ip = Input((224,224,3))
    # encoder
    pool = MaxPool2D((2,2))(ip) # note that the default 'valid' padding is used
    # decoder
    inv_pool = UpSampling2D((2,2))(pool)
    # simple model
    model = Model(ip, inv_pool)
    return model

model2 = encoder_decoder_pooling()
model2.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 222, 222, 512)     14336     
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 224, 224, 3)       13827     
=================================================================
Total params: 28,163
Trainable params: 28,163
Non-trainable params: 0
_________________________________________________________________
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 112, 112, 3)       0         
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 224, 224, 3)       0         
=================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0

As you can see, in the first model Conv2DTranspose reverses the convolution and we get back exactly the input shape, (224, 224, 3).
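
The shape arithmetic behind this: with the default 'valid' padding, the (3, 3) Conv2D shrinks each spatial dimension from 224 to 224 − 3 + 1 = 222, and the (3, 3) Conv2DTranspose grows it back to 222 + 3 − 1 = 224.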

For model2, we reverse the pooling operation (in terms of feature-map shape) with UpSampling2D.

So, as you're trying to build a VGG decoder, and VGG consists mostly of Conv2D and MaxPooling2D, all you have to do is reverse those operations with Conv2DTranspose and UpSampling2D so that you get back the input shape (224, 224, 3) from the (7, 7, 512) feature map.

Finally, there are some variations of the decoder part, but I think you're looking for this VGG-16 decoder.

def vgg16_decoder(input_size = (7,7,512)):
    inputs = Input(input_size, name = 'input')

    pool5 = UpSampling2D((2,2), name = 'pool_5')(inputs)
    conv5 = Conv2DTranspose(512, (3, 3), activation = 'relu', padding = 'same', name ='conv5_3')(pool5)
    conv5 = Conv2DTranspose(512, (3, 3), activation = 'relu', padding = 'same', name ='conv5_2')(conv5)
    conv5 = Conv2DTranspose(512, (3, 3), activation = 'relu', padding = 'same', name ='conv5_1')(conv5)

    pool4 = UpSampling2D((2,2), name = 'pool_4')(conv5)
    conv4 = Conv2DTranspose(512, (3, 3), activation = 'relu', padding = 'same', name ='conv4_3')(pool4)
    conv4 = Conv2DTranspose(512, (3, 3), activation = 'relu', padding = 'same', name ='conv4_2')(conv4)
    conv4 = Conv2DTranspose(512, (3, 3), activation = 'relu', padding = 'same', name ='conv4_1')(conv4)

    pool3 = UpSampling2D((2,2), name = 'pool_3')(conv4)
    conv3 = Conv2DTranspose(256, (3, 3), activation = 'relu', padding = 'same', name ='conv3_3')(pool3)
    conv3 = Conv2DTranspose(256, (3, 3), activation = 'relu', padding = 'same', name ='conv3_2')(conv3)
    conv3 = Conv2DTranspose(256, (3, 3), activation = 'relu', padding = 'same', name ='conv3_1')(conv3)

    pool2 = UpSampling2D((2,2), name = 'pool_2')(conv3)
    conv2 = Conv2DTranspose(128, (3, 3), activation = 'relu', padding = 'same', name ='conv2_2')(pool2)
    conv2 = Conv2DTranspose(128, (3, 3), activation = 'relu', padding = 'same', name ='conv2_1')(conv2)

    pool1 = UpSampling2D((2,2), name = 'pool_1')(conv2)
    conv1 = Conv2DTranspose(64, (3, 3), activation = 'relu', padding = 'same', name ='conv1_2')(pool1)
    conv1 = Conv2DTranspose(3, (3, 3), activation = 'relu', padding = 'same', name ='conv1_1')(conv1) # 3 filters to get 3 output channels

    model = Model(inputs = inputs, outputs = conv1, name = 'vgg-16_decoder')
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    return model

model = vgg16_decoder()
model.summary()
Model: "vgg-16_encoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           [(None, 7, 7, 512)]       0         
_________________________________________________________________
pool_5 (UpSampling2D)        (None, 14, 14, 512)       0         
_________________________________________________________________
conv5_3 (Conv2DTranspose)    (None, 14, 14, 512)       2359808   
_________________________________________________________________
conv5_2 (Conv2DTranspose)    (None, 14, 14, 512)       2359808   
_________________________________________________________________
conv5_1 (Conv2DTranspose)    (None, 14, 14, 512)       2359808   
_________________________________________________________________
pool_4 (UpSampling2D)        (None, 28, 28, 512)       0         
_________________________________________________________________
conv4_3 (Conv2DTranspose)    (None, 28, 28, 512)       2359808   
_________________________________________________________________
conv4_2 (Conv2DTranspose)    (None, 28, 28, 512)       2359808   
_________________________________________________________________
conv4_1 (Conv2DTranspose)    (None, 28, 28, 512)       2359808   
_________________________________________________________________
pool_3 (UpSampling2D)        (None, 56, 56, 512)       0         
_________________________________________________________________
conv3_3 (Conv2DTranspose)    (None, 56, 56, 256)       1179904   
_________________________________________________________________
conv3_2 (Conv2DTranspose)    (None, 56, 56, 256)       590080    
_________________________________________________________________
conv3_1 (Conv2DTranspose)    (None, 56, 56, 256)       590080    
_________________________________________________________________
pool_2 (UpSampling2D)        (None, 112, 112, 256)     0         
_________________________________________________________________
conv2_2 (Conv2DTranspose)    (None, 112, 112, 128)     295040    
_________________________________________________________________
conv2_1 (Conv2DTranspose)    (None, 112, 112, 128)     147584    
_________________________________________________________________
pool_1 (UpSampling2D)        (None, 224, 224, 128)     0         
_________________________________________________________________
conv1_2 (Conv2DTranspose)    (None, 224, 224, 64)      73792     
_________________________________________________________________
conv1_1 (Conv2DTranspose)    (None, 224, 224, 3)       1731      
=================================================================
Total params: 17,037,059
Trainable params: 17,037,059
Non-trainable params: 0

It takes the (7, 7, 512) feature map and reconstructs the original image dimensions, (224, 224, 3).
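
If you want to check the pair end to end, one option is to chain them into an autoencoder (a sketch, reusing the vgg16_encoder and vgg16_decoder functions from this thread; the mse loss is just a placeholder choice for image reconstruction):

encoder = vgg16_encoder()  # (224, 224, 3) -> (7, 7, 512)
decoder = vgg16_decoder()  # (7, 7, 512) -> (224, 224, 3)

ip = Input((224, 224, 3))
autoencoder = Model(ip, decoder(encoder(ip)), name = 'vgg16_autoencoder')
autoencoder.compile(optimizer = 'adam', loss = 'mse')
autoencoder.summary()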

In summary, the mechanical way of designing a decoder is to walk through the encoder in the opposite direction, undoing each operation. If you want to understand Conv2DTranspose and UpSampling2D in more depth:

https://cs231n.github.io/convolutional-networks/

https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers

https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf



Answer 2:

To get the desired shape you need:

conv1 = Conv2DTranspose(512, (8, 8), strides = 1, name = 'conv1')(inputs)
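
This works because, with the default padding='valid', a transposed convolution outputs (input − 1) × strides + kernel_size per spatial dimension: (7 − 1) × 1 + 8 = 14. Note that it is much more expensive than the strides = 2 version: an (8, 8) kernel holds 8 × 8 × 512 × 512 weights versus 2 × 2 × 512 × 512 for the (2, 2) kernel.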

You may find this post on the transposed convolution operation useful: https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d



Answer 3:

conv1 = Conv2DTranspose(512, (2, 2), strides = 2, name = 'conv1')(inputs) works because you are using a stride of 2. In a normal convolution, a stride of 2 means applying the filter only every two steps (skipping one position each time), which produces an output half the size of the input. In a transposed convolution things are essentially reversed, so a stride of 2 gives you double the output size: it effectively inserts holes into the input before convolving, spacing [a, b, c] out to [a, 0, b, 0, c].

The first snippet (conv1 = Conv2DTranspose(512, (2, 2), dilation_rate = 2, name = 'conv1')(inputs)) doesn't work because you are specifying a dilation of 2, not a stride, which is completely different. Dilation inserts "holes" into your filter: a filter that looks like [x1 x2 x3] becomes [x1 0 x2 0 x3] with a dilation of 2. This filter-with-holes is then applied to the input as normal.

Why does the output size change even when using dilation? This is due to padding. Normally, the output of a convolution is smaller than the input if no padding is done; in a transposed convolution, it is larger instead. You can avoid this by using padding='same'.
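
Plugging in your numbers: with stride 1 and 'valid' padding, a transposed convolution grows each spatial dimension by dilation_rate × (kernel_size − 1), so 7 + 2 × (2 − 1) = 9 for your (2, 2) kernel and 7 + 2 × (3 − 1) = 11 for the (3, 3) kernel, exactly the shapes in your summaries.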

tl;dr: You can double your image size by using Conv2DTranspose(n_filters, kernel_size, strides = 2, padding = 'same').
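
As a quick numeric check of all the variants in this thread, here is a small helper (a sketch; it implements the 'valid'-padding shape rule out = (in − 1) × stride + dilation × (kernel − 1) + 1, which holds whenever the effective kernel is at least as large as the stride, as in all the cases here):

def deconv_out(size, kernel, stride = 1, dilation = 1):
    # Output length of one spatial dimension of Conv2DTranspose with padding='valid'
    return (size - 1) * stride + dilation * (kernel - 1) + 1

print(deconv_out(7, 2, dilation = 2))  # 9  -> the dilation_rate = 2, (2, 2) attempt
print(deconv_out(7, 3, dilation = 2))  # 11 -> the (3, 3) attempt
print(deconv_out(7, 2, stride = 2))    # 14 -> strides = 2 (the update)
print(deconv_out(7, 8))                # 14 -> the (8, 8) kernel from answer 2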