Scarcity-aware spam detection technique for big data ecosystem

Woo Hyun Park, Isma Farah Siddiqui, Chinmay Chakraborty, Nawab Muhammad Faseeh Qureshi, Dong Ryeol Shin

Research output: Contribution to journal › Article › Research › peer-review

15 Citations (Scopus)

Abstract

To expand their business, companies use the big data ecosystem to handle enormous amounts of information. For this purpose, text data must be analyzed while ensuring data security and using spam filters to retain authenticated and valuable data. Several text-representation methods are available, such as Word2Vec, bag-of-words, BERT, and term frequency-inverse document frequency (TF-IDF). However, none of these resolves the data scarcity issue, which may leave collected documents with incomplete information. Solving this problem effectively requires a technique that groups documents by subject and applies statistical approximation. This study proposes a natural language processing-based technique for spam detection that alters topics using a least-squares model and uses gradient-descent and altering-least-squares (AMALS) models to estimate missing data through TF-IDF and a uniform distribution. A performance evaluation demonstrates that the proposed technique outperforms the existing industrial TF-IDF model in predicting spam in big data ecosystems, with a reported figure of 98%.
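The idea of estimating missing TF-IDF entries with a least-squares factorization can be illustrated with a minimal sketch. This is not the authors' AMALS implementation: it assumes a standard masked, regularized alternating-least-squares factorization, and the function names `tfidf` and `als_complete`, the `rank` and `reg` parameters, and the uniform random initialization are all hypothetical choices made for illustration only.

```python
import numpy as np

def tfidf(docs):
    """Plain TF-IDF: tf = count / document length, idf = log(N / df).
    Assumes every document is a non-empty whitespace-separated string."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            counts[i, index[w]] += 1
    tf = counts / counts.sum(axis=1, keepdims=True)
    df = (counts > 0).sum(axis=0)
    return tf * np.log(len(docs) / df), vocab

def als_complete(X, observed, rank=2, reg=0.01, iters=100, seed=0):
    """Fill unobserved entries of X with a low-rank estimate U @ V.T.
    `observed` is a boolean mask; only observed entries enter the fit.
    Illustrative assumption, not the paper's model."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Factors start from a uniform distribution (an assumption of this sketch).
    U = rng.uniform(0.0, 1.0, (n, rank))
    V = rng.uniform(0.0, 1.0, (m, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Fix V, solve a small ridge regression for each document row.
        for i in range(n):
            cols = observed[i]
            Vo = V[cols]
            U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ X[i, cols])
        # Fix U, solve a small ridge regression for each term column.
        for j in range(m):
            rows = observed[:, j]
            Uo = U[rows]
            V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ X[rows, j])
    return U @ V.T

if __name__ == "__main__":
    docs = ["win free prize now", "free prize claim now",
            "meeting agenda for monday", "monday meeting notes"]
    X, vocab = tfidf(docs)
    mask = np.ones_like(X, dtype=bool)
    mask[1, vocab.index("prize")] = False  # pretend this entry was lost
    estimate = als_complete(X, mask)
    print("true:", X[1, vocab.index("prize")],
          "estimated:", estimate[1, vocab.index("prize")])
```

The completed matrix could then feed any downstream spam classifier; which classifier the paper uses is not stated in the abstract.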

Original language: English
Pages (from-to): 67-75
Number of pages: 9
Journal: Pattern Recognition Letters
Volume: 157
Publication status: Published - May 2022
Externally published: Yes
