To add to your question - how about pixel perfect accuracy of segmentation boundary !!
Your intuition regarding down-sampling via max-pooling is correct. Normal CNNs have that limit. However, there have been some improvements recently to overcome it.
The breakthrough to this problem came in 2015-6 in the form of U-net and atrous/dilated convolution introduced in DeepLab.
Dilated convolutions or atrous convolutions, previously described for wavelet analysis without signal decimation, expands window size without increasing the number of weights by inserting zero-values into convolution kernels. Dilated convolutions have been shown to decrease blurring in semantic segmentation maps, and are purported to work at least in part by extracting long range information without the need for pooling.
Using U-Net architectures is another method that seeks to retain high spatial frequency information by directly adding skip connections between early and late layers. In other words, up-sampling followed by down-sampling.
In TensorFlow, atrous convolutions are implemented with function:
tf.nn.atrous_conv2d
There are many more methods and this is an ongoing research area.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…