Welcome to the supporting page for the manuscript "Evaluating ML-Based Anomaly Detection Across Datasets of Varied Integrity: A Case Study." This repository contains the files, scripts, and data analyses that complement the findings and discussion presented in the paper.
This repository is organized to facilitate easy access to the datasets, scripts, and analytical results used in our study. Below is a guide to the repository's structure:
- 1-CICIDS-2017: This folder contains files and scripts related to the `CICIDS-2017` dataset analysis.
- 2-WTMC-2021: This folder contains materials associated with the `WTMC-2021` refinement of the `CICIDS-2017` dataset.
- 3-CRiSIS-2022: This folder includes files and scripts for the `CRiSIS-2022` refinement of the `CICIDS-2017` dataset.
- 4-NFS-2023: Contains materials for our refined versions of the `CICIDS-2017` dataset, namely `NFS-2023-nTE` and `NFS-2023-TE`.
- visual-comparison: This folder hosts Jupyter Notebooks and plots for a visual comparison of results, focusing on RF performance metrics (precision, recall, accuracy, F1 score, and AUC), confusion matrices, and feature importances.
- For insights into the feature importances in binary classification using RF, visit Binary Feature Importances. Additionally, the corresponding confusion matrices are available at Binary Confusion Matrices.
- For an overview of feature importances in RF multi-class classification, refer to Multi-Class Feature Importances. The confusion matrices for this classification can be explored at Multi-Class Confusion Matrices.
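
As a reminder of how the binary metrics named above relate to a confusion matrix, here is a minimal sketch in plain Python. It is not taken from the notebooks in this repository: the class and function names are hypothetical, the labels are made-up toy values, and treating `1` (attack) as the positive class is an assumption. AUC is left out because it needs prediction scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass
class BinaryConfusion:
    """Confusion-matrix counts; the positive class is assumed to be 'attack'."""
    tp: int
    fp: int
    fn: int
    tn: int

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


def confusion_from_labels(y_true, y_pred, positive=1) -> BinaryConfusion:
    """Tally the four confusion-matrix cells from paired label lists."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return BinaryConfusion(tp=tp, fp=fp, fn=fn, tn=tn)


# Toy labels for illustration only (1 = attack, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
cm = confusion_from_labels(y_true, y_pred)
print(cm, cm.accuracy(), cm.precision(), cm.recall(), cm.f1())
```

For multi-class results, the same quantities are usually computed per class (one class against the rest) and then averaged.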
Each dataset folder follows a specific naming convention for Jupyter Notebooks:
- `*-data-analysis*` notebooks provide a comprehensive analysis of the dataset, focusing on flow counts, label distributions, occurrences of negative and NaN values, and TCP FIN and RST flag counts.
- `*-without_feat_sel*` notebooks offer supporting material and analyses related to the manuscript.
- `*-with_feat_sel*` notebooks present extended analyses, comparing the performance of the decision tree (DT), random forest (RF), and naive Bayes (NB) algorithms with the top 15 features selected by the ExtraTrees algorithm.
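
The checks performed by the data-analysis notebooks (flow counts, label distribution, negative and NaN values, FIN and RST flag counts) can be sketched as follows. This is an illustrative stand-in using only the Python standard library, not the notebooks' own code. The column names `Label`, `FIN Flag Count`, and `RST Flag Count`, the function name `summarize_flows`, and the sample rows are all assumptions; the real CSV headers may differ.

```python
import csv
import io
import math
from collections import Counter

# Hypothetical column names; adjust to the headers of the actual CSV files.
LABEL_COL = "Label"
FLAG_COLS = ("FIN Flag Count", "RST Flag Count")


def summarize_flows(csv_file):
    """Return basic integrity statistics for a flow CSV (file-like object)."""
    summary = {
        "flows": 0,
        "labels": Counter(),    # label -> number of flows
        "negative": Counter(),  # column -> number of negative values
        "nan": Counter(),       # column -> number of NaN, empty, or non-numeric cells
        "flags": Counter(),     # flag column -> sum over all flows
    }
    for row in csv.DictReader(csv_file):
        summary["flows"] += 1
        summary["labels"][row[LABEL_COL]] += 1
        for col, raw in row.items():
            if col == LABEL_COL:
                continue
            try:
                value = float(raw)
            except (TypeError, ValueError):
                summary["nan"][col] += 1  # empty or non-numeric cell
                continue
            if math.isnan(value):
                summary["nan"][col] += 1
                continue
            if value < 0:
                summary["negative"][col] += 1
            if col in FLAG_COLS:
                summary["flags"][col] += int(value)
    return summary


# Made-up sample rows, for illustration only.
SAMPLE = """Flow Duration,FIN Flag Count,RST Flag Count,Label
120,1,0,BENIGN
-1,0,1,DoS Hulk
NaN,2,0,BENIGN
"""

result = summarize_flows(io.StringIO(SAMPLE))
for name, value in result.items():
    print(name, dict(value) if isinstance(value, Counter) else value)
```

To run the same checks on a file from disk, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` wrapper.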
Access our refined versions of the CICIDS-2017 dataset, generated using NFStream:
- NFS-2023-nTE: This dataset version does not implement TCP flag-based flow expiration, aligning with the flow generation process in existing dataset versions.
- The code used for generating this dataset is available at No TCP Expiry.
- NFS-2023-TE: This version enables TCP flag-based flow expiration, offering a dataset that closely mirrors real-world network traffic characteristics.
- The code used for generating this dataset is available at TCP Expiry.
- The flow labelling mechanism adapted from CRiSIS-2022 is available at Labeller.
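
To illustrate how TCP flag-based expiration can change the resulting flow records, the toy sketch below groups packets into flows with and without such a rule. It is deliberately not NFStream code and not the generation code linked above. The `Packet` class, the `count_flows` function, and the rule itself (close a flow on the first FIN or RST packet) are simplifying assumptions. Idle and active timeouts, as well as bidirectional flow keys, are ignored.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Toy packet: a flow key plus the two TCP flags of interest."""
    key: tuple          # e.g. (src_ip, src_port, dst_ip, dst_port, proto)
    fin: bool = False
    rst: bool = False


def count_flows(packets, tcp_expiry):
    """Group packets into flows and return the packet count of each flow.

    Simplified rule (an assumption): with tcp_expiry=True, a flow is closed
    as soon as a packet carrying FIN or RST is seen, and the next packet
    with the same key opens a new flow.
    """
    active = {}     # key -> packet count of the currently open flow
    finished = []   # packet counts of closed flows, in closing order
    for pkt in packets:
        active[pkt.key] = active.get(pkt.key, 0) + 1
        if tcp_expiry and (pkt.fin or pkt.rst):
            finished.append(active.pop(pkt.key))
    # Flows still open at the end of the capture are flushed.
    finished.extend(active.values())
    return finished


# Made-up capture: six packets sharing one 5-tuple.
K = ("10.0.0.1", 40000, "10.0.0.2", 80, "tcp")
capture = [
    Packet(K), Packet(K), Packet(K, fin=True),
    Packet(K), Packet(K, rst=True),
    Packet(K),
]

flows_without_expiry = count_flows(capture, tcp_expiry=False)
flows_with_expiry = count_flows(capture, tcp_expiry=True)
print(flows_without_expiry, flows_with_expiry)
```

In this toy capture the same packets form one long flow without the rule and three shorter flows with it, which is why per-flow features and flow counts can differ between the two dataset versions.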
For the methodology and findings behind this project, please refer to the research paper associated with this repository. The paper describes the data preparation process, the analytical methods employed, and the implications of our findings for network anomaly detection, and it provides the context needed to interpret the notebooks and results presented here.