Familial Clustering for Weakly-Labeled Android Malware Using Hybrid Representation Learning

Yanxin Zhang, Yulei Sui, Shirui Pan, Zheng Zheng, Baodi Ning, Ivor Tsang, Wanlei Zhou

Research output: Contribution to journal › Article › Research › peer-review

45 Citations (Scopus)


Malware labeling, or malware clustering, is important for identifying new security threats, triaging, and building reference datasets. State-of-the-art Android malware clustering approaches rely heavily on the raw labels from commercial AntiVirus (AV) vendors. This causes misclustering for a substantial number of weakly-labeled malware, because the labels reported by these closed-source AV engines are inconsistent, incomplete and overly generic; the engines' capabilities vary greatly and their internal mechanisms are opaque (i.e., intermediate detection results are unavailable for clustering). The raw labels are thus often used as the only important source of information for clustering. To address the limitations of the existing approaches, this paper presents Andre, a new ANDroid Hybrid REpresentation Learning approach to clustering weakly-labeled Android malware. Andre preserves heterogeneous information from multiple sources (including the results of static code analysis, the meta-information of an app, and the raw labels of the AV vendors) to jointly learn a hybrid representation for accurate clustering. The learned representation is then fed into our outlier-aware clustering to partition the weakly-labeled malware into known and unknown families. Malware whose malicious behaviours are close to those of the existing families on the network is further classified using a three-layer Deep Neural Network (DNN). The unknown malware is clustered using a standard density-based clustering algorithm. We have evaluated our approach using 5,416 ground-truth malware from Drebin and 9,000 malware from VirusShare (uploaded between Mar. 2017 and Feb. 2018), including 3,324 weakly-labeled malware. The evaluation shows that Andre effectively clusters weakly-labeled malware that cannot be clustered by the state-of-the-art approaches, while achieving accuracy comparable to those approaches for clustering ground-truth samples.
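The two-stage flow described in the abstract (partition into known and unknown families, then density-based clustering of the unknown part) can be illustrated with a minimal sketch. This is not Andre's implementation: the distance-threshold split below merely stands in for the paper's outlier-aware clustering, the nearest-centroid assignment stands in for the three-layer DNN, and a plain DBSCAN is assumed as the "standard density-based clustering algorithm". All names (`partition_known_unknown`, `family_A`, `radius`, `eps`, `min_pts`) and the toy 2-D vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def partition_known_unknown(embeddings, family_centroids, radius):
    """Assumed stand-in for outlier-aware partitioning: a sample within
    `radius` of its nearest family centroid is 'known', otherwise 'unknown'."""
    known, unknown = [], []
    for idx, vec in enumerate(embeddings):
        family, dist = min(
            ((name, euclidean(vec, c)) for name, c in family_centroids.items()),
            key=lambda pair: pair[1],
        )
        if dist <= radius:
            known.append((idx, family))  # nearest centroid replaces the DNN here
        else:
            unknown.append(idx)
    return known, unknown

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] is not None:
                continue                 # already assigned to a cluster
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels

# Toy data: hypothetical learned representations of six samples.
family_centroids = {"family_A": [0.0, 0.0], "family_B": [10.0, 10.0]}
embeddings = [[0.1, 0.2], [9.8, 10.1], [5.0, 5.0],
              [5.1, 5.0], [5.0, 5.1], [20.0, 0.0]]

known, unknown = partition_known_unknown(embeddings, family_centroids, radius=1.0)
labels = dbscan([embeddings[i] for i in unknown], eps=0.5, min_pts=2)
```

In this toy run the first two samples fall near existing family centroids, three of the remaining samples form one new dense cluster, and the last sample is left as noise. In a realistic setting the vectors would come from the learned hybrid representation, and the thresholds would need tuning on labeled data.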

Original language: English
Pages (from-to): 3401-3414
Number of pages: 14
Journal: IEEE Transactions on Information Forensics and Security
Publication status: Published - 16 Oct 2020


  • Android malware
  • malware clustering
  • machine learning
  • neural network
