TY - JOUR
T1 - Memory transformation networks for weakly supervised visual classification
AU - Liu, Huan
AU - Zheng, Qinghua
AU - Luo, Minnan
AU - Chang, Xiaojun
AU - Yan, Caixia
AU - Yao, Lina
PY - 2020/12/27
Y1 - 2020/12/27
N2 - The lack of labeled exemplars makes supervised video classification with neural networks challenging. Utilizing external memory that contains task-related knowledge is an effective way to learn a category from a handful of samples; however, most existing memory-augmented neural networks still struggle to handle multi-modal external data because of its high dimensionality and massive volume. To address this, we propose a Memory Transformation Network (MTN) that converts external knowledge into embedded and concentrated memories so that it can be leveraged feasibly for video classification with weak supervision. Specifically, we employ a multi-modal deep autoencoder to project external visual and textual information into a shared space, producing a joint embedded memory that captures the correlation among modalities and enhances expressiveness. The autoencoder's inherent dimensionality reduction also alleviates the curse of dimensionality. In addition, an attention-based compression mechanism generates a concentrated memory that records only the information relevant to a specific task; this concentrated memory is lightweight, which mitigates the cost of content-based addressing over a large memory. Our model outperforms state-of-the-art methods by 5.44% and 1.81% on average on two metrics across three real-world video datasets, demonstrating its effectiveness for visual classification with limited labeled exemplars.
KW - Embedded/concentrated memory
KW - Incomplete supervision
KW - Knowledge-based neural networks
KW - Visual classification
UR - http://www.scopus.com/inward/record.url?scp=85092437934&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2020.106432
DO - 10.1016/j.knosys.2020.106432
M3 - Article
AN - SCOPUS:85092437934
SN - 0950-7051
VL - 210
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 106432
ER -
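
Note: the abstract describes two concrete transformations, a multi-modal autoencoder that embeds visual and textual knowledge into one shared space (the joint embedded memory), and an attention-based compression step that distills a small task-specific memory to speed up content-based addressing. The following is a minimal PyTorch sketch of those two steps for illustration only; every module name, dimension, and the top-k selection rule here are assumptions of this sketch, not the authors' published implementation.

# Hypothetical sketch of the two memory transformations named in the
# abstract; names, dimensions, and the top-k rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalAutoencoder(nn.Module):
    """Projects visual and textual memory slots into one shared space
    (a 'joint embedded memory') and reconstructs both modalities."""
    def __init__(self, vis_dim=2048, txt_dim=300, shared_dim=256):
        super().__init__()
        self.enc_vis = nn.Linear(vis_dim, shared_dim)
        self.enc_txt = nn.Linear(txt_dim, shared_dim)
        self.dec_vis = nn.Linear(shared_dim, vis_dim)
        self.dec_txt = nn.Linear(shared_dim, txt_dim)

    def forward(self, vis, txt):
        # Shared embedding: fuse the two modality projections so one slot
        # captures cross-modal correlation while reducing dimensionality.
        z = 0.5 * (torch.tanh(self.enc_vis(vis)) + torch.tanh(self.enc_txt(txt)))
        return z, self.dec_vis(z), self.dec_txt(z)

def concentrate_memory(embedded_memory, query, k=32):
    """Attention-based compression into a task-specific memory.
    embedded_memory: (N, D) joint embedded slots; query: (D,) task vector.
    Returns a (k, D) 'concentrated memory' of the top-k slots by attention
    weight, so content-based addressing later scans k rows instead of N."""
    scores = embedded_memory @ query                  # (N,) relevance scores
    weights = F.softmax(scores, dim=0)                # attention over slots
    top = torch.topk(weights, k).indices
    return embedded_memory[top]

if __name__ == "__main__":
    ae = MultiModalAutoencoder()
    vis = torch.randn(1000, 2048)   # e.g. CNN features of external images
    txt = torch.randn(1000, 300)    # e.g. word embeddings of their labels
    z, vis_hat, txt_hat = ae(vis, txt)
    recon_loss = F.mse_loss(vis_hat, vis) + F.mse_loss(txt_hat, txt)
    memory = concentrate_memory(z.detach(), query=torch.randn(256), k=32)
    print(z.shape, memory.shape)    # (1000, 256) -> concentrated (32, 256)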