You may redistribute, republish, and mirror the LIMFADD dataset in any form. However, any use or redistribution of the data must include a citation to the LIMFADD paper in the following link: https://www.techrxiv.org/users/662760/articles/1290068-limfadd-llm-enabled-instagram-multi-class-fake-account-detection-dataset
Please click here to download the dataset
Social networks like Instagram help connect billions of people. Unfortunately, cyber-criminals exploit these platforms by creating fake accounts to target other users, such as children and older adults. Detecting these fake accounts is a non-trivial process, and state-of-the-art binary datasets oversimplify the identification of fake accounts. To address this, we propose the LLM-enabled Instagram Multi-Class Fake Account Detection Dataset (LIMFADD), which enables multi-class classification of fake accounts on Instagram: real users, spam accounts, scam accounts, and bot accounts. This allows us to gain additional insights into the distinct behavior types of fake accounts on Instagram. Through this work, we aim to continue developing artificial intelligence solutions that can keep social media platforms like Instagram safer.
Social networks have completely changed the way people communicate and share media. Every day, billions of people use these platforms to connect. But despite this popularity, serious security problems have also been introduced on social media platforms like Instagram. Today, cyber-criminals use platforms like Instagram to create fake accounts that defraud other users, send spam messages, and even steal identities. To combat these problems, artificial intelligence (AI) is being used to make platforms more secure. AI can analyze user behavior and quickly detect abnormal behavior patterns indicative of fake accounts.
However, a common problem with many state-of-the-art fake account detection datasets is oversimplification in the detection scheme, as they provide only binary classification. This collapses all malicious behavior patterns into a single label, limiting the insights that can be gained from distinguishing between them. Additionally, because the tactics employed by fake accounts on social networks continually evolve, it is difficult for static datasets to keep up with these behavioral changes. This becomes especially difficult if AI models have only a binary view of the behavior.
To address these limitations, we propose the Large Language Model-enabled Instagram Multi-Class Fake Account Detection Dataset (LIMFADD). This dataset is unique among state-of-the-art datasets in that it does not simply label accounts as fake or real; it classifies fake accounts into multiple categories: spam accounts, scam accounts, and bot accounts. This allows us to gain unique insights into individual types of malicious activity and their detection on Instagram. The dataset also harnesses the ability of large language models (LLMs) to generate dataset features that are similar in magnitude to the intended feature spaces of real accounts, spam accounts, scam accounts, and bot accounts. This provides us with a larger dataset for training AI models.
To ensure a credible baseline, we gathered data from real Instagram accounts. These accounts were carefully chosen by our team and consisted mainly of the primary profiles of known contacts such as friends, peers, co-workers, and family members. This data gave a true representation of how a genuine user behaves: natural follower-to-following ratios, authentic posts, and a corresponding level of activity. These accounts served as the baseline for comparing and separating out the fake behaviors.
Spam accounts were identified by observing repetitive text in comments sent to many users by automated bots. These profiles regularly posted messages that were identical or highly similar across different users' posts, promoting either external links or their services. Such comments were compiled through manual scanning of comment sections and direct message requests. They displayed the telltale hallmarks of spamming behavior, such as high-frequency messaging and abnormal engagement spikes.
Scam accounts typically try to trick users with phishing links or fraudulent promotions. These accounts frequently placed suspicious URLs in their bios or comment sections, accompanied by coaxing messages designed to lure users. We manually scoured comment threads and searched for profiles that employed dubious tactics such as fake giveaways or impersonation strategies.
Bot accounts are separated from regular accounts by the behavior they exhibit: liking, commenting, and following at a very high frequency. Some bots behave unnaturally, such as posting generic phrases as comments on many posts or having extremely high numbers of 'following' compared to followers. Most of these were found through our inbox or activity feeds and tagged manually. We recorded the patterns that signaled automation in our dataset.
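The ratio-based and frequency-based signals described above can be sketched as a simple heuristic. This is an illustrative example only, with hypothetical feature names and thresholds that are not taken from the LIMFADD dataset itself:

```python
# Hedged sketch: flag bot-like behavior from two signals mentioned in
# the text — an extreme following-to-follower ratio and very
# high-frequency activity. Thresholds here are illustrative assumptions.

def looks_bot_like(followers: int, following: int,
                   actions_per_day: float,
                   ratio_threshold: float = 10.0,
                   activity_threshold: float = 200.0) -> bool:
    """Return True if the account shows bot-like ratio or activity."""
    ratio = following / max(followers, 1)  # guard against zero followers
    return ratio >= ratio_threshold or actions_per_day >= activity_threshold

# Example: 50 followers, 4000 following, 350 actions/day
print(looks_bot_like(50, 4000, 350.0))  # True
```

In practice such heuristics only provide weak labels; the dataset's manual tagging step is what confirms each account's class.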
Base Dataset
After gathering initial examples from all four categories: real, spam, scam, and bot, we constructed a base dataset by equally selecting accounts from each class. This balanced selection ensured that no single class dominated the model's training process, thereby limiting the impact of any class imbalance on future AI models.
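The equal-per-class selection described above can be sketched as follows. Record fields and the `per_class` size are illustrative assumptions, not the dataset's actual schema:

```python
import random

def build_balanced_base(records, label_key="label", per_class=100, seed=42):
    """Select the same number of accounts from each class so that no
    single label dominates training. Field names are hypothetical."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    base = []
    for recs in by_class.values():
        rng.shuffle(recs)          # random draw within each class
        base.extend(recs[:per_class])
    rng.shuffle(base)              # mix classes before training
    return base
```

Capping every class at the same count trades away some real-account examples for a dataset where class imbalance cannot bias the model's priors.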
Data Augmentation using LLM
This initial base dataset was small because of the manual annotation involved. However, it provided a stable baseline from which to generate synthetic data samples whose magnitudes matched the feature spaces already established in the base dataset. To increase the size of the dataset within these established feature spaces, we utilized ChatGPT to assist with data generation. Using this LLM, we expanded all four classes from a relatively small size to a significantly larger version of the dataset, enabling additional insights into multi-class classification of fake Instagram accounts.
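One way to keep LLM-generated samples within the observed feature ranges is to embed those ranges in the generation prompt. The sketch below builds such a prompt; it is an assumption about how the augmentation could be framed, not the authors' exact prompt, and the feature names are hypothetical:

```python
def augmentation_prompt(label, feature_stats, n_samples=50):
    """Build a prompt asking an LLM for synthetic account records whose
    feature magnitudes stay within the ranges observed in the base
    dataset. feature_stats maps feature name -> (min, max)."""
    ranges = "\n".join(
        f"- {name}: between {lo} and {hi}"
        for name, (lo, hi) in feature_stats.items()
    )
    return (
        f"Generate {n_samples} synthetic Instagram '{label}' account "
        f"records as CSV rows. Keep each feature within these observed "
        f"ranges:\n{ranges}"
    )

# Hypothetical ranges measured from the base dataset's spam class
print(augmentation_prompt("spam", {"followers": (10, 500),
                                   "posts_per_day": (5, 40)}, 25))
```

The returned string would then be sent to the LLM (e.g., via the ChatGPT interface or API) and the CSV rows in the reply appended to the corresponding class.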
The following account types are provided as labels in the dataset:
Real: Account represents a true representation of how a genuine user behaves, possessing natural follower-to-following ratios, authentic posts, and a corresponding level of activity.
Spam: Account behaves like a spam account, exhibiting behaviors such as sending repetitive text in comments to many users, posting messages that are identical or highly similar across different users' posts, and promoting external links or services.
Scam: Account behaves like a scam account and includes behavior such as tricking users with phishing links or fraudulent promotions, and suspicious URLs in their bios or comment sections.
Bot: Account behaves like a bot account and includes behavior such as liking, commenting, and following at a very high frequency, and having extremely high numbers of 'following' compared to followers.
These account types were selected for this dataset because they are the fake account types most commonly observed on social media platforms like Instagram.
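For training a multi-class model, the four string labels above need integer ids. A minimal sketch, assuming the label strings `real`, `spam`, `scam`, and `bot` (the ordering is an arbitrary choice for illustration):

```python
# Map the dataset's four class names to integer ids for multi-class
# training. The ordering below is an illustrative assumption.
LABELS = ["real", "spam", "scam", "bot"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}

def encode_labels(label_column):
    """Convert a list of class-name strings to integer class ids."""
    return [LABEL_TO_ID[name] for name in label_column]

print(encode_labels(["real", "bot", "scam"]))  # [0, 3, 2]
```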