This project classifies fake news articles using two machine learning models: a Multilayer Perceptron (MLP) and the k-Nearest Neighbors (kNN) algorithm. It involves preprocessing the data, extracting features, and building predictive models.
The data preprocessing pipeline includes the following steps:
- Text Cleaning: Removing unwanted characters, special symbols, and stopwords to retain meaningful content.
- Normalization: Converting all text to lowercase and applying stemming or lemmatization for consistency.
- Vectorization: Transforming text data into numerical features using techniques like Term Frequency-Inverse Document Frequency (TF-IDF).
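The three steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the `STOPWORDS` set and the `naive_stem` helper are hypothetical stand-ins for a full stopword list and a real stemmer or lemmatizer. Lowercasing is applied first here so that the cleaning pattern only has to match `a-z`.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "that"}

def naive_stem(token):
    """Crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Return the cleaned, normalized tokens of one article."""
    text = text.lower()                     # normalization: lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # cleaning: drop digits, punctuation, symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]

tokens = preprocess("The President is SHOCKED: 5 claims!!")
print(tokens)  # ['president', 'shock', 'claim']
```

The resulting token lists are what a vectorizer such as TF-IDF would then turn into numerical features.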
To enhance the model's understanding of the text data, the TF-IDF method is used, representing text in a high-dimensional vector space. The formula for TF-IDF is:

$$\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D), \qquad \text{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

where:
- $\text{tf}(t, d)$: Term frequency of term $t$ in document $d$,
- $\text{idf}(t, D)$: Inverse document frequency of term $t$, with $N$ being the total number of documents and $|\{d \in D : t \in d\}|$ the number of documents where the term $t$ appears.
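As a worked illustration, the sketch below computes TF-IDF weights as tf(t, d) multiplied by idf(t, D), with idf taken as the logarithm of N over the number of documents containing the term. It makes several assumptions that the text does not state: tf is the relative frequency of the term within the document, the logarithm is natural, and no smoothing or normalization is applied. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, L2-normalized variant by default, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # tf as relative frequency (assumption), idf as log(N / document frequency)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Made-up two-document corpus, already tokenized.
docs = [["fake", "news", "spreads"], ["real", "news", "reports"]]
scores = tf_idf(docs)
```

In this toy corpus "news" occurs in both documents, so its idf is log(2/2) = 0 and it carries no weight, while "fake" occurs in only one and receives (1/3) · log 2.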
To reduce the complexity of high-dimensional data and improve model efficiency, Principal Component Analysis (PCA) is used. PCA reduces the feature space while retaining the most important information by projecting data onto a set of orthogonal axes (principal components). This transformation helps:
- Minimize redundancy in the dataset.
- Speed up training and prediction times.
- Mitigate the risk of overfitting.
The number of principal components is chosen based on the explained variance ratio, ensuring that the reduced feature set captures a significant proportion of the original data's variance.
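One way to pick the number of components from the explained variance ratio is sketched below with NumPy. The 95% threshold, the function name `pca_by_variance`, and the synthetic data are assumptions for illustration only; the project's actual threshold is not stated here.

```python
import numpy as np

def pca_by_variance(X, target=0.95):
    """Project X onto the fewest principal components whose cumulative
    explained variance ratio reaches `target`."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal axes.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    # First position where the cumulative ratio reaches the target.
    n_components = int(np.searchsorted(cumulative, target)) + 1
    n_components = min(n_components, len(cumulative))
    return X_centered @ Vt[:n_components].T, explained_ratio[:n_components]

# Synthetic data: 10 observed features driven by 2 underlying factors plus small noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 2))
X = factors @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(100, 10))

Z, ratio = pca_by_variance(X, target=0.95)
```

Because the synthetic features are driven by only two factors, at most two components are needed to pass the threshold, and `Z` has far fewer columns than `X`.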
The project uses the following machine learning models to classify fake news:
- k-Nearest Neighbors: A simple algorithm that classifies data points based on the majority class of their nearest neighbors. The distance between points is calculated using metrics like Euclidean distance, and the optimal value of k is determined through cross-validation.
- Multilayer Perceptron: A feedforward neural network that uses backpropagation to adjust weights and biases for optimal performance. The MLP model is designed to learn complex patterns in the data through its hidden layers and non-linear activation functions.
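The kNN part can be illustrated with a short pure-Python sketch covering majority voting over Euclidean distances and the selection of k by cross-validation. It covers kNN only, not the MLP. Leave-one-out is used as the cross-validation scheme purely for brevity, and the toy points, labels, and candidate values of k are made up; the project's own scheme and data may differ.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

# Toy 2-D features standing in for reduced TF-IDF vectors (made-up values).
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
y = ["real", "real", "real", "fake", "fake", "fake"]

# Pick the candidate k with the highest cross-validated accuracy.
best_k = max((1, 3, 5), key=lambda k: loo_accuracy(X, y, k))
```

On this toy set, k = 5 fails on every held-out point because the vote then includes all three points of the opposite class, which shows why k has to be tuned rather than fixed in advance.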
The models are evaluated using metrics such as:
- Accuracy
- Precision
- Recall
- F1-score
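For reference, the four metrics can be computed from the confusion counts as in the sketch below. Treating "fake" as the positive class is an assumption made for this example, and the label lists are invented.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels: 3 fake and 5 real articles, with one error in each direction.
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "fake", "real", "real", "real", "real"]
metrics = binary_metrics(y_true, y_pred)
```

Here accuracy is 6/8 = 0.75, while precision, recall, and F1 are all 2/3, which illustrates how accuracy alone can look better than the per-class behaviour when the classes are unbalanced.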
The project provides:
- A summary of key metrics for each model
- Database analysis
- Visualisation of the split between real and fake news
- Visualisation of the most frequent words in titles and text
- Graph of the features extracted by PCA and TF-IDF, divided by label
- Visualisation of the most frequent words in fake news and real articles
- Graph of co-occurrence frequencies of words in titles
- Graph of kNN accuracy as a function of k
- Plot of the model's accuracy and loss