Deep learning has revolutionized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on hand-crafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We handle the large number of potential pairs by combining stochastic sampling of the training set with an aggressive mining strategy biased towards patches that are hard to classify.
By using the L2 distance during both training and testing we develop 128-D descriptors whose Euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and show that the descriptors generalize well to scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.
We propose to learn patch descriptors with a Siamese model of two CNNs sharing weights. The CNN is optimized on pairs of corresponding or non-corresponding patches. We propagate the patches through the model to extract the descriptors and then compute the L2 distance between them, a standard similarity measure for image descriptors, and optimize the hinge embedding loss on that distance. The objective is to learn a descriptor that places non-corresponding patches far apart and corresponding patches close together. We target floating-point descriptors of size 128, the same as SIFT. We study multiple architectures and units (available in the supplemental material) and use our best performing architecture in the paper, a CNN with three convolutional layers, hyperbolic tangent units, and subtractive normalization.
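The loss described above can be sketched in a few lines. This is a minimal illustration of a hinge embedding loss on the L2 distance between two descriptors, not the released implementation: the function names and the default margin value are assumptions made for the example, and the descriptors are plain Python lists rather than CNN outputs.

```python
import math


def l2_distance(desc_a, desc_b):
    """Euclidean (L2) distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(desc_a, desc_b)))


def hinge_embedding_loss(desc_a, desc_b, corresponding, margin=1.0):
    """Hinge embedding loss for one pair of descriptors.

    Corresponding pairs are penalized by their distance, which pulls them
    together. Non-corresponding pairs are penalized only while they are
    closer than `margin`, which pushes them apart.
    NOTE: the default margin is a placeholder, not a value from the text.
    """
    distance = l2_distance(desc_a, desc_b)
    if corresponding:
        return distance
    return max(0.0, margin - distance)
```

With this form, a non-corresponding pair that is already farther apart than the margin contributes zero loss, so only pairs that the descriptor still confuses drive the update.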
We rely on the Multi-View Stereo dataset of Brown et al. to train our models; it contains corresponding patches recovered from Structure From Motion (SFM). A problem in patch retrieval is the large number of samples: in this case we have about 10^6 positives and 10^12 negatives, a prohibitive number to explore exhaustively, so we must resort to random sampling. However, randomly selected pairs are easy to separate. We propose to address this with a mining strategy: we propagate randomly sampled pairs through the network and use only the subset with the largest loss for learning. We demonstrate that this strategy allows us to train discriminative models with Siamese networks for patch retrieval.
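The mining step can be illustrated as follows. This is a sketch under stated assumptions, not the training code: `describe` stands in for a forward pass through the network, `keep` is a hypothetical parameter for the size of the retained subset, and the margin default is a placeholder.

```python
import math


def pair_loss(desc_a, desc_b, corresponding, margin=1.0):
    """Hinge embedding loss on the L2 distance (margin is a placeholder)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(desc_a, desc_b)))
    return distance if corresponding else max(0.0, margin - distance)


def mine_hard_pairs(pairs, describe, keep, margin=1.0):
    """Keep the `keep` sampled pairs with the largest loss.

    pairs    : list of (patch_a, patch_b, corresponding) tuples,
               assumed to be randomly sampled beforehand
    describe : callable mapping a patch to its descriptor
               (hypothetical stand-in for the CNN forward pass)
    """
    scored = [
        (pair_loss(describe(a), describe(b), label, margin), (a, b, label))
        for a, b, label in pairs
    ]
    # Sort by loss only, hardest pairs first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:keep]]
```

Only the returned subset would then be used for the backward pass, so the forward pass over easy pairs serves purely as a filter.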
We benchmark our algorithm against state-of-the-art descriptors. We demonstrate that our approach outperforms both traditional descriptors and modern, learned descriptors. It generalizes very well to other datasets, including wide-baseline stereo, illumination changes and non-rigid deformation.
Note that while we require a Siamese network for training, we can discard it at deployment and use a single CNN to extract descriptors. Notably, in contrast to existing approaches, we do not rely on learning the metric. Instead, our descriptors are compared with the L2 distance and can thus be seen as a drop-in replacement for traditional descriptors, e.g. SIFT.
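Because comparison uses only the L2 distance, matching works the same way as with SIFT. The sketch below shows brute-force nearest-neighbor matching between two sets of already extracted descriptors; the function names are illustrative assumptions, and a real pipeline would typically use an optimized nearest-neighbor search instead.

```python
import math


def l2_distance(desc_a, desc_b):
    """Euclidean (L2) distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(desc_a, desc_b)))


def match_descriptors(query, reference):
    """Match each query descriptor to its nearest reference descriptor.

    Returns one (reference_index, distance) tuple per query descriptor.
    No learned metric is involved: the plain L2 distance decides the match.
    """
    matches = []
    for q in query:
        distances = [l2_distance(q, r) for r in reference]
        best = min(range(len(reference)), key=distances.__getitem__)
        matches.append((best, distances[best]))
    return matches
```

Any descriptor of fixed length can be passed in, which is what makes a 128-D learned descriptor interchangeable with SIFT at this stage.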
Code & Datasets
The code and pre-trained models to extract descriptors can be downloaded from the following repository: