Bibliography | Ju, Huang Hua: Detecting inauthentic accounts on Twitter: A natural language approach applied on Blacklist. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 24 (2021). 89 pages, English.
|
Abstract | Effective detection of malicious accounts and fraudulent events has become an urgent issue for online social networks (OSNs), given their great and growing popularity among young people. The potential damage from malicious steering of public opinion on Facebook and Twitter is very high. In particular, the amplification of certain political messages and commercially biased views can sway people's judgment and undermine social trust. Accurate detection of fake accounts, which may be bots or spam accounts, can help prevent confusion and deception of the general public.
Researchers who study this issue typically extract existing features directly from users' profiles, such as the timeline, the number of friends, the location, and the profile description. They then quantify and characterize spam accounts based on pre-defined patterns.
Unlike previous approaches that focus on user profiles, this thesis considers blacklists and the actual tweet content. Using natural language processing (NLP), I build a detection system that combines an n-gram model with a fake-word library. The system relies on text-processing techniques to reach a higher, acceptable classification accuracy. I analyze a large dataset with over 20 thousand entries provided by MIB [MIB1] and use a series of NLP tools to extract the key text. I also study the characteristics of the malicious accounts and find that their interaction frequency follows a different trend from that of human accounts. In the final experiment, I sample roughly 1000 records at a time and split them into training and test data at a ratio of 70% to 30%. With n-gram models, the accuracy of detecting fake accounts exceeds 91%, while the original method without n-grams reaches only 89%.
Keywords: Machine Learning, Natural Language Processing, Data Classification, Social Network Analysis, Internet Fraud, Feature Extraction
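
The experimental setup described in the abstract (n-gram features from tweet text, a 70%/30% split into training and test data) can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not name the classifier, so a multinomial Naive Bayes model with add-one smoothing is assumed here, and the function names (`word_ngrams`, `train_naive_bayes`, `predict`, `split_70_30`) and the toy tweets are hypothetical.

```python
import math
import random
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from lowercased text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def train_naive_bayes(samples, n_max=2):
    """Count n-grams and documents per label; samples is a list of (text, label)."""
    gram_counts = {}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        gram_counts.setdefault(label, Counter()).update(word_ngrams(text, n_max))
    return gram_counts, doc_counts


def predict(model, text, n_max=2):
    """Return the label with the highest smoothed log-probability (assumed classifier)."""
    gram_counts, doc_counts = model
    vocab = set()
    for counts in gram_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(gram_counts):
        counts = gram_counts[label]
        total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for gram in word_ngrams(text, n_max):
            # Add-one smoothing so unseen n-grams do not zero out the score.
            score += math.log((counts[gram] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def split_70_30(samples, seed=0):
    """Shuffle reproducibly and split into 70% training and 30% test data."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Invented toy tweets for illustration only; not taken from the MIB dataset.
TOY_TWEETS = [
    ("win free money now", "fake"),
    ("free money click here", "fake"),
    ("had lunch with friends", "human"),
    ("watching the game tonight", "human"),
]

if __name__ == "__main__":
    model = train_naive_bayes(TOY_TWEETS)
    print(predict(model, "free money"))
    print(predict(model, "lunch with friends"))
```

Setting `n_max=1` reduces the features to single words, which gives a rough analogue of a baseline without n-grams; on a real sample, `split_70_30` would be applied first and accuracy measured on the held-out 30%.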
|