What's the use of dilated convolutions?

Posted 2019-02-03 22:02

Question:

I refer to Multi-Scale Context Aggregation by Dilated Convolutions.

  • A 2x2 kernel would have holes in it such that it becomes a 3x3 kernel.
  • A 3x3 kernel would have holes in it such that it becomes a 5x5 kernel.
  • The above assumes an interval of 1 (a single hole between weights), of course.

I can clearly see that this lets you use only 4 parameters yet have a 3x3 receptive field, or 9 parameters yet have a 5x5 receptive field.

Is the point of dilated convolutions simply to save on parameters while reaping the benefit of a larger receptive field, and thus to save memory and computation?
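For concreteness, here is a small sketch of what I mean (PyTorch, purely for illustration; the variable names are my own):

```python
# Sketch: a 2-dilated 3x3 convolution "sees" a 5x5 area with only 9 weights,
# whereas a dense 5x5 convolution needs 25 weights.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

# Dense 5x5 convolution: 25 weights, 5x5 receptive field per output unit.
dense = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)

# 2-dilated 3x3 convolution: only 9 weights, but it also covers a 5x5 area.
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)

print(dense.weight.numel(), dense(x).shape)      # 25 torch.Size([1, 1, 32, 32])
print(dilated.weight.numel(), dilated(x).shape)  # 9  torch.Size([1, 1, 32, 32])
```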

Answer 1:

TLDR

  1. Dilated convolutions have generally improved performance (see the better semantic segmentation results in Multi-Scale Context Aggregation by Dilated Convolutions)
  2. The more important point is that the architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage.

  3. They allow a larger receptive field at the same computation and memory cost while also preserving resolution (see the sketch after this list).

  4. Pooling and strided convolutions also enlarge the receptive field, but both reduce the resolution.
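As a quick illustration of points 3 and 4, here is a minimal PyTorch sketch of my own (not code from either paper): dilation keeps the output the same size as the input, while a strided convolution or pooling shrinks it.

```python
# Dilation enlarges the receptive field while preserving resolution;
# strided convolution and pooling reduce resolution.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)

dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)  # resolution preserved
strided = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)    # resolution halved
pooled  = nn.MaxPool2d(kernel_size=2)                            # resolution halved

print(dilated(x).shape)  # torch.Size([1, 1, 64, 64])
print(strided(x).shape)  # torch.Size([1, 1, 32, 32])
print(pooled(x).shape)   # torch.Size([1, 1, 32, 32])
```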

@Rahul referenced WaveNet, which puts this very succinctly in Section 2.1, Dilated Causal Convolutions. It is also worth looking at the figure in Multi-Scale Context Aggregation by Dilated Convolutions; I break it down further here:

  • Figure (a) is a 1-dilated 3x3 convolution filter. In other words, it's a standard 3x3 convolution filter.
  • Figure (b) is a 2-dilated 3x3 convolution filter. The red dots are where the weights are and everywhere else is 0. In other words, it's a 5x5 convolution filter with 9 non-zero weights and everywhere else 0, as mentioned in the question. The receptive field in this case is 7x7 because each unit in the previous output has a receptive field of 3x3. The highlighted portions in blue show the receptive field and NOT the convolution filter (you could see it as a convolution filter if you wanted to but it's not helpful).
  • Figure (c) is a 4-dilated 3x3 convolution filter. It's a 9x9 convolution filter with 9 non-zero weights and everything else 0. From (b), each unit in the previous layer already has a 7x7 receptive field, so you see a 7x7 blue portion around each red dot, and the overall receptive field is 15x15 (the snippet below checks this arithmetic).
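If it helps, the receptive-field arithmetic behind the figure can be verified with a few lines of plain Python (my own sketch; `receptive_field` is just a helper name I made up):

```python
# Receptive field of a stack of stride-1 convolutions: each layer adds
# (kernel_size - 1) * dilation to the previous receptive field.
def receptive_field(dilations, kernel_size=3):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(receptive_field([1]))        # 3   -> Figure (a)
print(receptive_field([1, 2]))     # 7   -> Figure (b) on top of (a)
print(receptive_field([1, 2, 4]))  # 15  -> Figure (c) on top of (b)
```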

To draw an explicit contrast, consider this:

  • If we use 3 successive layers of 3x3 convolution filters with a stride of 1, the effective receptive field at the end is only 7x7. With the same computation and memory cost, dilated convolutions achieve 15x15. Both operations preserve resolution (the gradient-based check after this list makes the 7x7 vs 15x15 comparison concrete).
  • If we instead use 3 successive layers of 3x3 convolution filters whose stride grows exponentially, at the same rate as the dilation in the paper, we also end up with a 15x15 receptive field, but coverage is eventually lost as the stride grows. This loss of coverage means the effective receptive field is no longer the contiguous region shown above: some parts stop overlapping.
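The 7x7 vs 15x15 contrast can also be checked empirically with a rough gradient-based probe (PyTorch, my own sketch; `receptive_extent` is a name I made up): backpropagate from the centre output unit and count which input pixels receive a non-zero gradient.

```python
# Rough empirical receptive-field check (illustrative, not from the paper).
import torch
import torch.nn as nn

def receptive_extent(layers, size=31):
    # Backprop from the centre output unit and count how many input rows get a
    # non-zero gradient: the side length of the square region that unit depends on.
    # (Random weights, so an exactly-zero gradient on a reachable pixel is practically impossible.)
    x = torch.zeros(1, 1, size, size, requires_grad=True)
    y = nn.Sequential(*layers)(x)
    y[0, 0, y.shape[2] // 2, y.shape[3] // 2].backward()
    return int((x.grad[0, 0].abs().sum(dim=1) > 0).sum())

# Three 3x3 layers with stride 1 vs. three 3x3 layers with dilations 1, 2, 4.
plain   = [nn.Conv2d(1, 1, 3, padding=1) for _ in range(3)]
dilated = [nn.Conv2d(1, 1, 3, padding=d, dilation=d) for d in (1, 2, 4)]

print(receptive_extent(plain))    # 7
print(receptive_extent(dilated))  # 15
```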


Answer 2:

In addition to the benefits you already mentioned, such as a larger receptive field, efficient computation and lower memory consumption, dilated causal convolutions also have the following benefits:

  • They preserve the resolution/dimensions of the data at the output layer. This is because the layers use dilation instead of pooling, hence the "dilated" part of the name.
  • They maintain the ordering of the data. For example, in a 1D dilated causal convolution, where each output depends only on previous inputs, the causal structure of the convolution preserves the temporal ordering (a small sketch follows below).
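For intuition, here is a minimal sketch of a 1D dilated causal convolution layer (PyTorch; my own illustration, not the actual WaveNet code, and `DilatedCausalConv1d` is just a name I chose). Causality is enforced by padding on the left only, so output t never looks at inputs after t, and the output keeps the same length and ordering as the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Pad only on the left (the past), so the convolution stays causal.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))  # zeros prepended along the time axis
        return self.conv(x)

x = torch.randn(1, 1, 16)
layer = DilatedCausalConv1d(channels=1, kernel_size=2, dilation=4)
print(layer(x).shape)  # torch.Size([1, 1, 16]) -- same length, ordering preserved
```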

I'd refer you to the excellent WaveNet paper, which applies dilated causal convolutions to raw audio waveforms to generate speech and music, and even to recognize speech from the raw waveform.

I hope you find this answer helpful.