TY - JOUR
T1 - Scarcity-aware spam detection technique for big data ecosystem
AU - Park, Woo Hyun
AU - Siddiqui, Isma Farah
AU - Chakraborty, Chinmay
AU - Qureshi, Nawab Muhammad Faseeh
AU - Shin, Dong Ryeol
N1 - Publisher Copyright: © 2022 Elsevier B.V.
PY - 2022/5
Y1 - 2022/5
AB - To expand their business, companies use the big data ecosystem to handle enormous amounts of information. For this purpose, text data must be analyzed while ensuring data security and organizing authenticated and valuable data using spam filters. Several methods are available, such as Word2Vec, bag-of-words, BERT, and term frequency-inverse document frequency (TF-IDF). However, none of these resolves the data scarcity issue, which may leave collected documents with incomplete information. Solving this problem effectively requires a technique that groups documents by subject and applies approximation using statistical methods. This study proposes a natural language processing-based technique for spam detection that alters topics using a least-squares model and uses gradient-descent and altering-least-squares (AMALS) models to estimate missing data through TF-IDF and uniform distribution. A performance evaluation demonstrates that the proposed technique outperforms the existing industrial TF-IDF model in predicting spam in big data ecosystems, with a reported figure of 98%.
UR - http://www.scopus.com/inward/record.url?scp=85127486898&partnerID=8YFLogxK
U2 - 10.1016/j.patrec.2022.03.021
DO - 10.1016/j.patrec.2022.03.021
M3 - Article
AN - SCOPUS:85127486898
SN - 0167-8655
VL - 157
SP - 67
EP - 75
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
ER -