Modality and component aware feature fusion for RGB-D scene classification

Anran Wang, Jianfei Cai, Jiwen Lu, Tat-Jen Cham

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

30 Citations (Scopus)

Abstract

While convolutional neural networks (CNN) have been excellent for object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this paper, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV) encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities of RGB, HHA and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity - that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal non-sparsity - that within these discriminative components, all modalities make important contributions. In our framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUN RGB-D Dataset and NYU Depth Dataset V2.
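
The two penalties named above can be sketched concretely. The following is a minimal illustration, not the authors' implementation: it assumes the per-class weight vector is partitioned first by FV GMM component and then by modality, and every name, shape and coefficient in it (group_lasso_over_components, exclusive_group_lasso_over_modalities, the toy sizes and the 0.1/0.05 weights) is a hypothetical choice for exposition.

import numpy as np

# Minimal sketch of the two structured-sparsity penalties described in the
# abstract. All names, shapes and groupings here are illustrative
# assumptions, not the authors' code.

def group_lasso_over_components(w, n_components, comp_dim):
    # Group lasso: sum of per-group L2 norms, one group per FV GMM
    # component. Whole non-discriminative components are driven to zero,
    # matching the "component sparsity" postulate.
    blocks = w.reshape(n_components, comp_dim)
    return np.linalg.norm(blocks, axis=1).sum()

def exclusive_group_lasso_over_modalities(w, n_components, n_modalities, mod_dim):
    # Exclusive group lasso: sum of squared per-group L1 norms, one group
    # per modality within each component. The L1 inside each group allows
    # sparsity there, while the squaring spreads weight across groups, so
    # every modality retains a contribution ("modal non-sparsity").
    blocks = w.reshape(n_components, n_modalities, mod_dim)
    return (np.abs(blocks).sum(axis=2) ** 2).sum()

# Toy usage: 4 GMM components x 3 modalities (RGB, HHA, normals) x 8 dims.
w = np.random.randn(4 * 3 * 8)
penalty = 0.1 * group_lasso_over_components(w, 4, 3 * 8) \
        + 0.05 * exclusive_group_lasso_over_modalities(w, 4, 3, 8)

In training, terms of this form would be added to the regression loss for the proposal-based FV features; the paper's actual grouping, coefficients and optimizer may differ.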

Original language: English
Title of host publication: Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Editors: Lourdes Agapito, Tamara Berg, Jana Kosecka, Lihi Zelnik-Manor
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 5995-6004
Number of pages: 10
ISBN (Electronic): 9781467388504, 9781467388511
ISBN (Print): 9781467388528
DOI: https://doi.org/10.1109/CVPR.2016.645
Publication status: Published - 2016
Externally published: Yes
Event: IEEE Conference on Computer Vision and Pattern Recognition 2016 - Las Vegas, United States of America
Duration: 27 Jun 2016 - 30 Jun 2016
Conference number: 29th
http://cvpr2016.thecvf.com/

Conference

Conference: IEEE Conference on Computer Vision and Pattern Recognition 2016
Abbreviated title: CVPR 2016
Country: United States of America
City: Las Vegas
Period: 27/06/16 - 30/06/16
Internet address: http://cvpr2016.thecvf.com/

Cite this

Wang, A., Cai, J., Lu, J., & Cham, T-J. (2016). Modality and component aware feature fusion for RGB-D scene classification. In L. Agapito, T. Berg, J. Kosecka, & L. Zelnik-Manor (Eds.), Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 (pp. 5995-6004). Piscataway NJ USA: IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/CVPR.2016.645

@inproceedings{df9acccec5fd4d36a9eba4ea639025c7,
  title     = "Modality and component aware feature fusion for RGB-D scene classification",
  author    = "Anran Wang and Jianfei Cai and Jiwen Lu and Tat-Jen Cham",
  editor    = "Lourdes Agapito and Tamara Berg and Jana Kosecka and Lihi Zelnik-Manor",
  booktitle = "Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016",
  publisher = "IEEE, Institute of Electrical and Electronics Engineers",
  address   = "Piscataway NJ USA",
  year      = "2016",
  pages     = "5995--6004",
  doi       = "10.1109/CVPR.2016.645",
  isbn      = "9781467388528",
  language  = "English",
}
