- Create and activate a conda environment named, for example, `group23`, with the dependencies specified in the file `environment.yml`:

  ```
  conda env create -n group23 --file environment.yml
  conda activate group23
  ```
- Create a file named `.env` and set the following environment variables:
  - `USER_NAME`: your Twitter usernames (e.g. `'username1, username2, username3'`)
  - `PASSWORD`: your Twitter account passwords (e.g. `'pass1, pass2, pass3'`)
  - `EMAIL`: the email addresses associated with the Twitter accounts (e.g. `'email1@gmail.com, email2@gmail.com, email3@gmail.com'`)
  - `EMAIL_PASWORD`: the passwords of your email accounts
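
  For example, a minimal `.env` might look like the following (all values are placeholders; list the values for multiple accounts as comma-separated strings, as in the examples above):

  ```
  USER_NAME='username1, username2, username3'
  PASSWORD='pass1, pass2, pass3'
  EMAIL='email1@gmail.com, email2@gmail.com, email3@gmail.com'
  EMAIL_PASWORD='emailpass1, emailpass2, emailpass3'
  ```
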
- Configure where to save the data and the log in the file `config.py` (a sketch of what these settings might look like follows this list).
- Run the script `twitter_scraper/crawler.py`.
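
The exact setting names are defined in `config.py` itself; the sketch below is only illustrative, and the variable names in it are assumptions rather than the repository's actual ones:

```python
# config.py -- illustrative sketch; the real setting names may differ.

# Where the crawler should write the scraped tweet data.
DATA_OUTPUT_DIR = "data/crawled_tweets"

# Where the crawler should write its log file.
LOG_FILE = "logs/crawler.log"
```
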
There are two datasets to preprocess:

- the data we crawl using the process above, which has no bot/human label
- data from BotRepository, with a human/bot label for each Twitter account, which we're going to use for training and testing our bot detection model
To run a data preprocessing job:

- Configure the input and output locations in the file `config.py`.
- Run the script `data_preprocessing/preprocess_our_data.py` or `data_preprocessing/preprocess_bot_repository_data.py`.
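
Assuming the scripts are invoked directly with Python from the repository root (they may instead need to be run as modules, depending on how imports are organized), a run would look like:

```
python data_preprocessing/preprocess_our_data.py
python data_preprocessing/preprocess_bot_repository_data.py
```
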
Following the paper "Scalable and Generalizable Social Bot Detection through Data Selection" (Yang et al., 2020), we implement a random forest that uses 19 account metadata features to predict whether an account is a human or a bot. A trained model is available at `bot_detection_model/model_storage`.
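
For reference, here is a minimal sketch of the kind of classifier described above. The actual feature set, hyperparameters, and training code live under `bot_detection_model/`; the paths, column names, and parameters below are placeholders, not the repository's real configuration:

```python
# Illustrative sketch of a random-forest bot classifier over account
# metadata features, in the spirit of Yang et al. (2020).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical preprocessed training data: one row per account,
# 19 metadata feature columns plus an "id" and a "label" column.
accounts = pd.read_parquet("preprocessed_bot_repository.parquet")

X = accounts.drop(columns=["id", "label"])  # the 19 metadata features
y = accounts["label"]                       # 1 = bot, 0 = human

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
```
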
To use the trained bot detection model on crawled tweets:

- Configure, in the file `config.py`, the location of the preprocessed data and the location for the model output ((id, prediction) rows in Parquet format).
- Run the script `bot_detection_model/detect_bot.py`.
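
The resulting predictions can then be inspected with pandas, for example (the path below is a placeholder for whatever output location is configured in `config.py`):

```python
import pandas as pd

# Hypothetical output path -- use the location set in config.py.
predictions = pd.read_parquet("output/bot_predictions.parquet")
print(predictions.head())  # rows of (id, prediction)
```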