The scarcity of labeled exemplars makes video classification with supervised neural networks challenging. Utilizing external memory that contains task-related knowledge is a beneficial way to learn a category from a handful of samples; however, most existing memory-augmented neural networks still struggle to handle multi-modal external data because of its high dimensionality and massive volume. In light of this, we propose a Memory Transformation Network (MTN) that converts external knowledge into embedded and concentrated memories, so that it can be leveraged feasibly for video classification with weak supervision. Specifically, we employ a multi-modal deep autoencoder to project external visual and textual information onto a shared space to produce a joint embedded memory, which captures the correlation among different modalities and thereby enhances expressive ability. The curse of dimensionality is also alleviated owing to the inherent dimension-reduction ability of the autoencoder. In addition, an attention-based compression mechanism is employed to generate a concentrated memory, which records useful information related to a specific task. The resulting concentrated memory is relatively lightweight, which mitigates the cost of time-consuming content-based addressing over large-volume memory. Our model outperforms state-of-the-art methods by 5.44% and 1.81% on average in two metrics over three real-world video datasets, demonstrating its effectiveness for visual classification with limited labeled exemplars.
- Embedded/concentrated memory
- Incomplete supervision
- Knowledge-based neural networks
- Visual classification
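
To make the idea of a concentrated memory more concrete, the following is a minimal sketch of attention-based compression followed by content-based addressing. It is an illustration under stated assumptions, not the MTN implementation: the function names (`compress_memory`, `address`), the use of scaled dot-product attention with slot queries, the cosine-similarity read, and all sizes (1000 embedded rows, 16 slots, 64 dimensions) are hypothetical choices, and random arrays stand in for both the autoencoder output and any learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_memory(embedded, slot_queries):
    """Attention-pool N embedded memory rows into K concentrated slots.

    embedded:     (N, d) joint embedded memory (assumed autoencoder output)
    slot_queries: (K, d) hypothetical slot queries, random here
    returns:      (K, d) concentrated memory
    """
    d = embedded.shape[1]
    scores = slot_queries @ embedded.T / np.sqrt(d)  # (K, N)
    weights = softmax(scores, axis=1)                # each row sums to 1
    return weights @ embedded                        # convex combinations


def address(query, memory):
    """Content-based read: softmax over cosine similarities to each row."""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    weights = softmax(m @ q)                         # (rows,)
    return weights @ memory                          # (d,)


rng = np.random.default_rng(0)
embedded = rng.normal(size=(1000, 64))     # stand-in for embedded memory
slot_queries = rng.normal(size=(16, 64))   # stand-in for learned queries
concentrated = compress_memory(embedded, slot_queries)
read = address(rng.normal(size=64), concentrated)
print(concentrated.shape, read.shape)
```

In this sketch, addressing the concentrated memory compares the query against 16 rows rather than 1000, which is the kind of saving the abstract attributes to the lightweight concentrated memory.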