Pixel-level segmentation has been widely used to improve object detection. Most of the existing methods refine detection features by adding the constraint of the segmentation branch or by simply embedding high-level segmentation features into detection features within the local receptive field. However, noisy segmentation features are unavoidable in real-word applications and can easily cause false positives. To address this problem, we propose a novel hierarchical context embedding module to effectively embed segmentation features into detection features. The idea of this module is to capture hierarchical context information that includes local objects or parts and nonlocal context features by learning multiple attention maps, and subsequently utilize interdependencies between features to recalibrate noisy segmentation features. Furthermore, we use this module in the proposed gated encoder-decoder network that adaptively aggregates feature maps of different resolutions based on the gate mechanism so that we can embed multiscale segmentation feature maps into detection features for more accurate detection of objects of all sizes. Experimental results demonstrate the effectiveness of the proposed method on the Pascal VOC 2012Seg dataset, the Pascal VOC dataset and the MS COCO dataset.
- gated encoder-decoder network
- hierarchical context embedding module
- object detection
- Segmentation features