
machine learning - Is there a fundamental limit on how accurately location information is encoded in CNNs

Each layer in a CNN reduces the size of the input via convolution and max-pooling operations. Convolution is translation equivariant, but max-pooling is translation invariant. Correct me if this is wrong: each time max-pooling is applied, some precision about a feature's location is lost. So the feature maps of the final conv layer in a very deep CNN will have a large receptive field (w.r.t. the original image), but the location of a feature in the original image is not discernible from that feature map alone.
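
To make that concrete, here is a small illustration (my own addition, not part of the original question): two inputs whose single active pixel sits at different positions inside the same 2x2 pooling window produce identical max-pooled outputs, so the one-pixel shift is unrecoverable downstream.

    import tensorflow as tf

    # Two 4x4 inputs whose lone "feature" pixel lands in different
    # positions of the same 2x2 pooling window.
    a = tf.constant([[0., 0., 0., 0.],
                     [0., 9., 0., 0.],
                     [0., 0., 0., 0.],
                     [0., 0., 0., 0.]])
    b = tf.constant([[9., 0., 0., 0.],
                     [0., 0., 0., 0.],
                     [0., 0., 0., 0.],
                     [0., 0., 0., 0.]])

    def pool(t):
        # max_pool2d expects NHWC, so add batch and channel dims
        return tf.nn.max_pool2d(t[None, :, :, None], ksize=2, strides=2, padding="VALID")

    print(tf.reduce_all(pool(a) == pool(b)).numpy())  # True: the shift is gone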

If this is true, how can bounding-box predictions be so accurate when we do localisation with a deep CNN? I understand how classification works, but making accurate bounding box predictions confuses me.

Perhaps a toy example will clarify my confusion:

Say we have a dataset of images with dimension 256x256x1, and we want to predict whether a cat is present, and if so, where it is, so our target is something like [sigmoid_cat_present, cat_location].

Our vanilla CNN (let's assume something like VGG) will take in the image and transform it into something like 16x16x256 at the last convolutional layer. Each pixel in this final 16x16 feature map can be influenced by a much larger region in the original image. So if we determine a cat is present, how can [cat_location] be refined to a value more granular than this effective receptive field?
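
As a rough sanity check (my own back-of-the-envelope sketch, using an illustrative VGG-like layer list rather than the exact VGG configuration), the effective receptive field of such a stack can be computed like this:

    # (kernel_size, stride) per layer; four blocks of
    # [conv3x3, conv3x3, maxpool2x2] give 16x downsampling: 256 -> 16
    def receptive_field(layer_specs):
        rf, jump = 1, 1
        for k, s in layer_specs:
            rf += (k - 1) * jump  # each layer widens the field by (k-1) input strides
            jump *= s             # strides compound the spacing between outputs
        return rf

    vgg_like = [(3, 1), (3, 1), (2, 2)] * 4
    print(receptive_field(vgg_like))  # 76: each output pixel sees a 76x76 input patch

So each of the 16x16 output positions indeed mixes information from a large patch of the input image.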

question from: https://stackoverflow.com/questions/65920778/is-there-a-fundamental-limit-on-how-accurate-location-information-is-encoded-in

1 Answer

To add to your question: how about pixel-perfect accuracy of the segmentation boundary?

Your intuition regarding down-sampling via max-pooling is correct: ordinary CNNs do have that limit. However, there have been improvements in recent years to overcome it.

The breakthrough came in 2015-2016 in the form of U-Net and the atrous (dilated) convolutions introduced in DeepLab.

Dilated (atrous) convolutions, previously described for wavelet analysis without signal decimation, expand the kernel's window size without increasing the number of weights, by inserting zero-valued gaps between the kernel elements. They have been shown to reduce blurring in semantic segmentation maps, and are thought to work at least in part by extracting long-range information without the need for pooling.

U-Net architectures are another approach that seeks to retain high-spatial-frequency information, by adding skip connections directly between early and late layers. In other words, down-sampling followed by up-sampling, with the skip connections carrying fine spatial detail from the encoder across to the decoder.
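
A minimal sketch of that idea in Keras (the layer sizes are illustrative, not the original U-Net's):

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = layers.Input(shape=(256, 256, 1))

    enc = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)  # 256x256, fine detail
    down = layers.MaxPooling2D(2)(enc)                                     # 128x128, location blurred
    mid = layers.Conv2D(64, 3, padding="same", activation="relu")(down)

    up = layers.UpSampling2D(2)(mid)            # back to 256x256, but still blurry
    merged = layers.Concatenate()([up, enc])    # skip connection re-injects fine detail
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(merged)  # per-pixel prediction

    model = tf.keras.Model(inputs, outputs)

The concatenation is what lets the decoder localise more precisely than its own receptive field would otherwise allow.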

In TensorFlow, atrous convolutions are implemented by the function tf.nn.atrous_conv2d (equivalently, via the dilation_rate argument of tf.keras.layers.Conv2D).
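
A quick sketch of its use (the shapes are my own illustrative choices): a 3x3 kernel with rate=2 covers a 5x5 input region while still holding only nine weights, and the spatial resolution is preserved.

    import tensorflow as tf

    x = tf.random.normal([1, 64, 64, 3])        # NHWC input
    kernel = tf.random.normal([3, 3, 3, 16])    # 3x3 kernel, 3 in-channels, 16 out

    # rate=2 inserts one "hole" between kernel taps (atrous = "with holes"),
    # growing the receptive field without pooling or extra weights.
    y = tf.nn.atrous_conv2d(x, kernel, rate=2, padding="SAME")
    print(y.shape)  # (1, 64, 64, 16): no loss of spatial resolution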

There are many more methods and this is an ongoing research area.

