Newspaper | The Sun | Daily Mail | The Guardian | The Times |
---|---|---|---|---|
type | tabloid | tabloid | broadsheet | broadsheet |
timespan | Sep 2022 - Feb 2023 | Sep 2022 - Feb 2023 | Sep 2022 - Feb 2023 | Sep 2022 - Feb 2023 |
articles collected | 1270 | 2622 | 2199 | 805 |
mean article length | 433 tokens | 678 tokens | 1304 tokens | 927 tokens |
vocabulary | 17499 lemmas | 24781 lemmas | 39283 lemmas | 22586 lemmas |
complete articles | yes | yes | yes | yes |
tokens complete | 549898 | 1777469 | 2608640 | 746153 |
without stopwords | 307902 | 998000 | 1473926 | 412010 |
without symbols | 271541 | 897411 | 1273947 | 362093 |
lemmas final | 225749 | 792397 | 1138170 | 301143 |
Crawlers were written to scrape articles published from September 2022 through February 2023.
Selenium to preload pages + Scrapy to crawl and scrape
Selenium to preload and login; Scrapy and Selenium to crawl and scrape
Start the crawler in a terminal with:
`scrapy runspider [crawler.py] -o [articles.json or articles.csv]`
Guardian API + BeautifulSoup4 to clean body html
run script in .py file
Most functions can be found in preprocessing.py and methods.py.
Preprocesses text data from JSON files for the four newspapers (The Times, The Sun, Daily Mail, and The Guardian): tokenisation; removal of stopwords, punctuation, rare tokens, and player names; part-of-speech tagging; and lemmatisation. Depending on the csv flag, it either returns a pandas DataFrame containing the preprocessed data or writes a CSV file; the rare flag toggles whether rare tokens (tokens appearing fewer than ten times) are kept.
dataframe = preprocess("sun", csv=False, rare=True)
preprocess("guardian", csv=True)
preprocess("times", csv=True, rare=True)
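The pipeline behind preprocess() can be illustrated with a minimal, stdlib-only sketch (the helper names and the tiny stopword set below are hypothetical; the real function presumably relies on a proper NLP library for POS tagging and lemmatisation, which are omitted here):

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the real pipeline uses a full stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def tokenise(text):
    """Lowercase and split on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def preprocess_text(texts, min_count=10):
    """Tokenise each text, drop stopwords, then drop rare tokens
    (tokens occurring fewer than min_count times across the corpus)."""
    docs = [[t for t in tokenise(doc) if t not in STOPWORDS] for doc in texts]
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]
```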
Converts a pandas DataFrame of preprocessed text data (obtained using preprocess() ) into a Document-Term Matrix (DTM) using a CountVectorizer.
dtm_dataframe = df_to_dtm(dataframe)
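df_to_dtm wraps scikit-learn's CountVectorizer; what the resulting matrix contains can be sketched with a stdlib-only toy version (hypothetical data and function name):

```python
from collections import Counter

def build_dtm(docs):
    """Build a document-term matrix from tokenised documents:
    one row per document, one column per vocabulary term (sorted)."""
    vocab = sorted({t for doc in docs for t in doc})
    rows = [[Counter(doc)[term] for term in vocab] for doc in docs]
    return vocab, rows
```

For example, `build_dtm([["cat", "sat", "cat"], ["dog", "sat"]])` yields the vocabulary `["cat", "dog", "sat"]` with counts `[[2, 0, 1], [0, 1, 1]]`.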
Converts a pandas DataFrame of preprocessed text data (obtained using preprocess() ) into a Term Frequency-Inverse Document Frequency (TF-IDF) matrix using a CountVectorizer and a TfidfTransformer.
df_tfidf = df_to_tfidf(dataframe)
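The TF-IDF weighting itself can be sketched in plain Python (a simplified version: raw counts for tf and unsmoothed idf; scikit-learn's TfidfTransformer uses a smoothed idf and l2-normalises each row by default):

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF over tokenised documents:
    tf = raw term count, idf = ln(N / document frequency)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]
```

A term appearing in every document (like "sat" in `[["cat", "sat"], ["dog", "sat"]]`) gets idf = ln(1) = 0, so it is weighted down to zero.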
Converts a CSV file of preprocessed text data (obtained using preprocess() ) into a Term Frequency-Inverse Document Frequency (TF-IDF) matrix using a CountVectorizer and a TfidfTransformer.
df_tfidf_guardian = csv_to_tfidf("guardian.csv")
Gets the column index of a given term in a TF-IDF matrix represented by a pandas DataFrame.
get_term_position(df_tfidf_guardian, "climate")
Compares the positions of a given term in the TF-IDF matrices of the four CSV files and writes the results to a new CSV file.
compare_term_position("climate")
Reads the CSV files for The Guardian, Daily Mail, The Times, and The Sun and returns them as dataframes.
Plots the development of the normalised TF-IDF score for a given term across four UK newspapers: The Times, Daily Mail, The Sun, and The Guardian, for the period between September 2022 and February 2023.
plot_tfidf("lgbt", save=True)
Reads a CSV file containing lemmas and returns a set of unique lemmas (the vocabulary).
times_vocab = get_vocab_from_csv("times_rare.csv")
Calculates the average length of tokens in a vocabulary (input vocabulary must be a set).
get_avg_token_length(times_vocab)
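These two helpers can be sketched with the stdlib csv module (the column name "lemma" is an assumption about the CSV layout, as are the function names):

```python
import csv

def get_vocab(path, column="lemma"):
    """Read one column of a CSV file and return the set of unique lemmas."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column] for row in csv.DictReader(f)}

def avg_token_length(vocab):
    """Mean character length of the types in a vocabulary (a set)."""
    return sum(len(t) for t in vocab) / len(vocab)
```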
Newspaper | The Sun | Daily Mail | The Guardian | The Times |
---|---|---|---|---|
types | 17499 | 24781 | 39283 | 22586 |
tokens | 271541 | 754718 | 1273947 | 362093 |
type-token ratio | 0.06444 | 0.03283 | 0.03083 | 0.06238 |
Newspaper | The Sun | Daily Mail | The Guardian | The Times |
---|---|---|---|---|
average token length | 7.11898 | 7.32034 | 7.44423 | 7.43956 |
Reads the CSV files and calculates the topics for each newspaper.
topics.R
Shows the terms that frequently occur together and the structure of the data.
cooccurrence.R
Reads the data and analyses it over time.
timeseries.R