Abstract
This paper focuses on learning local image region representations with deep neural networks. Existing works mainly learn from matched corresponding image patches, so the learned feature is overly sensitive to individual patch matching results and cannot handle aggregation-based tasks such as image-level retrieval. We therefore propose to use both the matched corresponding image patches and the clustering results as labels for network training. To resolve the inconsistency between the matched correspondences and the clustering results, we propose a semi-supervised iterative training scheme together with a dual margins loss. Moreover, a jointly learned spatial transform prediction network is used to improve the spatial transform invariance of the learned local features. Using SIFT as the label initializer, experimental results show performance comparable to or even better than the hand-crafted feature, which sheds light on learning local feature representations in an unsupervised or weakly supervised manner.
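The abstract does not give the form of the dual margins loss, so the following is only an illustrative sketch of how two margins could reconcile two label sources, not the paper's formulation. It assumes a hinge-style loss over descriptor-pair distances in which pairs that both matching and clustering mark as similar are pulled inside a small margin, pairs that both mark as dissimilar are pushed beyond a larger margin, and pairs where the two sources disagree are ignored. The function name `dual_margin_loss`, its parameters, and the margin values are hypothetical.

```python
def dual_margin_loss(dists, matched, same_cluster, m_pos=0.5, m_neg=1.5):
    """Hypothetical two-margin hinge loss over descriptor-pair distances.

    dists        -- distance between the two descriptors of each pair
    matched      -- True if the pair is a matched correspondence
    same_cluster -- True if both patches fall in the same cluster
    m_pos, m_neg -- assumed margins (illustrative values, not from the paper)
    """
    total = 0.0
    for d, is_match, is_same in zip(dists, matched, same_cluster):
        if is_match and is_same:
            # Both label sources agree the pair is similar: pull within m_pos.
            total += max(0.0, d - m_pos)
        elif not is_match and not is_same:
            # Both agree the pair is dissimilar: push beyond m_neg.
            total += max(0.0, m_neg - d)
        # Pairs on which the two sources disagree contribute nothing here.
    return total / max(1, len(dists))


loss = dual_margin_loss(
    dists=[0.2, 1.0, 1.0, 2.0],
    matched=[True, True, False, False],
    same_cluster=[True, True, False, False],
)
```

Leaving a gap between `m_pos` and `m_neg` means that pairs with intermediate distances are not penalized from either side, which is one plausible way to tolerate noisy or conflicting labels.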
| Original language | English |
|---|---|
| Article number | 102601 |
| Number of pages | 10 |
| Journal | Journal of Visual Communication and Image Representation |
| Volume | 63 |
| DOIs | |
| Publication status | Published - Aug 2019 |
| Externally published | Yes |
Keywords
- Convolutional Neural Network (CNN)
- Local feature learning
- Local image representation
- Semi-supervised learning
- Spatial transform