A project in machine learning and digital forensics for the courses DV2578 (Machine Learning) and DV2579 (Advanced Course in Digital Forensics).
In digital forensics, data carving is the act of extracting files directly from some memory media, without any metadata or known filesystem. Conventional techniques use simple heuristics such as magic numbers and headers. These techniques do not scale well due to a limited number of supported file types, slow processing speeds and insufficient accuracy.
Recently, machine learning has been applied to the subject, achieving state-of-the-art results in terms of scale, accuracy and speed. These techniques use efficient feature extraction to turn files into a small image or another representation of the features. The images are then fed to convolutional neural networks, which learn to identify parts of files.
These techniques focus on generality to identify files such as documents (.txt, .docx, .ppt, .pdf) and images (.jpg, .png). There is a gap in research when it comes to effectively identifying compressed files and the algorithm that was used to compress them. Compression algorithms seek to make data as dense as possible, which in turn likely yields a higher entropy than a typical file. In theory, this could make detection much harder.
This project aims to fill this gap, answering the following questions:
- How do compressed files compare to non-compressed files in terms of entropy?
- How can a machine-learning system be designed and trained to detect compression algorithms?
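To make the first question concrete, the byte-level Shannon entropy of a chunk can be estimated as in the sketch below. This is an illustration only, not the project's measurement code: the function name `shannon_entropy` and the zlib-compressed sample are assumptions chosen for the example.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample: repetitive plain text and its zlib-compressed form.
plain = b"".join(b"line %d of a plain text document\n" % i for i in range(2000))
compressed = zlib.compress(plain, 9)

print(f"plain:      {shannon_entropy(plain):.2f} bits/byte")
print(f"compressed: {shannon_entropy(compressed):.2f} bits/byte")
```

A value close to 8 bits per byte means the byte distribution is nearly uniform, which is what makes compressed chunks hard to tell apart by simple statistics.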
TL;DR: CompDec is a novel approach that uses machine learning to automatically detect the compression algorithm used for file fragments.
Predicted labels for some randomly chosen samples. Format: prediction (confidence) (label).
Quickstart
Dataset
Development
Development - Quickstart
Development - Quickstart - Setup
Development - Quickstart - Data Preparation
Development - Quickstart - Training and Evaluation
Development - Tools
Note: These instructions are only for inference using the pre-trained model.
First, download the latest release from releases. The release contains three files: a pre-trained model, a Python script and a Dockerfile.
If you would rather not install the prerequisites listed under Development - Quickstart, build the Docker image instead:
cd compdec
docker build -t compdec .
Now you may use the tool natively or via Docker:
# Docker
docker run -it -v "/path/to/samples:/samples" compdec /samples/unknown-file1.bin /samples/unknown-file2.bin
# Native
python3 ./compdec.py /path/to/samples/unknown-file1.bin /path/to/samples/unknown-file2.bin
The tool will produce output like so:
/path/to/samples/unknown-file1.bin
7z : 0.00%
brotli : 0.00%
bzip2 : 0.00%
compress : 0.00%
gzip : 0.00%
lz4 : 100.00%
rar : 0.00%
zip : 0.00%
/path/to/samples/unknown-file2.bin
7z : 0.00%
brotli : 0.00%
bzip2 : 0.00%
compress : 100.00%
gzip : 0.00%
lz4 : 0.00%
rar : 0.00%
zip : 0.00%
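If the output needs to be consumed by another program, it can be parsed into one mapping per file. The sketch below assumes the exact layout shown above (a path line followed by `name : percentage%` lines). `parse_predictions` is a hypothetical helper and not part of the release.

```python
from typing import Dict, Optional


def parse_predictions(output: str) -> Dict[str, Dict[str, float]]:
    """Parse CompDec-style output into {path: {algorithm: confidence in percent}}."""
    results: Dict[str, Dict[str, float]] = {}
    current: Optional[str] = None
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        name, separator, value = line.rpartition(" : ")
        if separator and value.endswith("%"):
            if current is None:
                raise ValueError("prediction line before any file path")
            results[current][name.strip()] = float(value[:-1])
        else:
            # Any other non-empty line is treated as the path of a new file.
            current = line
            results[current] = {}
    return results


sample = """/path/to/samples/unknown-file1.bin
7z : 0.00%
lz4 : 100.00%
"""
predictions = parse_predictions(sample)
# Pick the algorithm with the highest confidence for one file.
best = max(predictions["/path/to/samples/unknown-file1.bin"].items(), key=lambda kv: kv[1])
print(best)
```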
The samples directory contains file chunks, visualizations and the results of NIST statistical tests performed on the dataset.
Below is an example visualization and NIST test for the 7-zip tool.
...
SUMMARY
-------
monobit_test 0.23712867340389365 PASS
frequency_within_block_test 0.28036273314388394 PASS
runs_test 0.11846733945572493 PASS
longest_run_ones_in_a_block_test 0.5251306363531703 PASS
binary_matrix_rank_test 0.0 FAIL
dft_test 0.753290157881333 PASS
non_overlapping_template_matching_test 0.9999999736364428 PASS
overlapping_template_matching_test 0.0 FAIL
maurers_universal_test 0.0 FAIL
linear_complexity_test 0.0 FAIL
serial_test 0.1862667243373838 PASS
approximate_entropy_test 0.18385318163162168 PASS
cumulative_sums_test 0.17770673343194865 PASS
random_excursion_test 0.24443855795386374 PASS
random_excursion_variant_test 0.013229883923921373 PASS
There are two pseudo-random samples, random and urandom, taken from /dev/random and /dev/urandom respectively. There is also a true random sample, true-random, taken from random.org. Each of these samples has a NIST test report, available in the .txt file with the same name. Each pseudo-random and true random sample consists of 4096 bytes.
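To give a feel for what a single line in such a report means, here is a minimal sketch of the first test in the summary, the frequency (monobit) test as defined in NIST SP 800-22. It is not the code used by the project. The significance level of 0.01 is the default suggested by NIST and is assumed here to be the PASS threshold.

```python
import math
import os


def monobit_test(data: bytes, alpha: float = 0.01):
    """Frequency (monobit) test from NIST SP 800-22.

    Returns (p_value, passed). The sequence is considered random with
    respect to this test when p_value >= alpha.
    """
    n = len(data) * 8
    if n == 0:
        raise ValueError("need at least one byte")
    ones = sum(bin(byte).count("1") for byte in data)
    # Map bits to +1/-1 and sum them: ones contribute +1, zeros contribute -1.
    s_n = ones - (n - ones)
    s_obs = abs(s_n) / math.sqrt(n)
    p_value = math.erfc(s_obs / math.sqrt(2))
    return p_value, p_value >= alpha


# A 4096-byte sample from the OS random source, the same size as the samples above.
p, passed = monobit_test(os.urandom(4096))
print(f"monobit_test {p} {'PASS' if passed else 'FAIL'}")
```

The test only checks that ones and zeros are roughly balanced, so well-compressed data can pass it while still failing the more structural tests further down the report.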
Prerequisites:
- Ubuntu 20.04 for training and evaluation
- macOS 11 for development and CPU inference
- CuDNN 8.0.4
- Tensorflow 2.4
- CUDA 11.1
- Python 3.8
- matplotlib
- seaborn
- numpy
- pyyaml
- h5py
- PIL
- Docker 19
To start, first clone this repository.
git clone --recurse-submodules https://github.com/AlexGustafsson/compdec.git && cd compdec
To train the model, you'll need some training data. The paper uses the GovDocs dataset, but any larger dataset with a wide variety of files should work fine. For ease of use, a tool is included to download the data. The commands below download a small subset of the dataset, suitable for testing and developing. This procedure can be repeated for any number of available threads.
mkdir -p data
./tools/govdocs.sh download data threads/thread0.zip
unzip -d data/govdocs data/threads/thread0.zip
Given the base data, we can now compress it using the available tools. These tools require Docker and the Docker images available as part of this project. Build and tag them using ./tools/build.sh.
./tools/create-dataset.sh ./data/govdocs ./data/dataset
Now we'll need an index of the dataset that lists which files there are and how large they are. It is created using the following command. In this case we're picking chunks of at most 4096 bytes, a common block size in widely used file systems.
python3 ./tools/create_index.py 4096 ./data/dataset > ./data/index.csv
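What "chunks of at most 4096 bytes" means in practice can be sketched as follows. This is an illustration under the assumption that chunks are consecutive and that a trailing partial chunk is kept. `iter_chunks` is a hypothetical helper, not a function from the project's tools.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield consecutive chunks of at most chunk_size bytes; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


# 10,000 bytes split into 4096-byte chunks: two full chunks and one partial chunk.
sizes = [len(chunk) for chunk in iter_chunks(io.BytesIO(bytes(10_000)))]
print(sizes)
```

For a file on disk, the same function can be used with `open(path, "rb")` as the stream.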
As part of our analysis we want to study the entropy of compressed files, which starts with a stratified sample of the population.
With the index created, the following command extracts such a sample. In this case we're picking a strata size of 20 samples and using the seed seed.
python3 ./tools/stratified_sampling.py seed ./data/index.csv 20 > ./data/strata.csv
Using the stratified sample, we can run the NIST statistical test suite on them using the following command:
python3 ./tools/nist_test.py ./data/strata.csv > ./data/tests.txt
We can now create two strata, one for training and one for evaluation. This works like the previous step, but uses the even sampling tool so that each algorithm contributes the same number of samples. This ensures that algorithms that compress poorly (and therefore yield more chunks) are not over-represented.
python3 ./tools/even_sampling.py seed ./data/index.csv 80 > ./data/training-strata.csv
python3 ./tools/even_sampling.py seed ./data/index.csv 20 ./data/evaluation-strata.csv > ./data/test-strata.csv
Make sure that you apply an appropriate split of the data. Although a small number was used in this example, you may use the full sample size of the dataset.
Given the dataset, we can now train a model like so:
python3 ./model/train.py --model-name my-model --training-strata ./data/training-strata.csv --evaluation-strata ./data/evaluation-strata.csv --save-model --enable-tensorboard --enable-gpu
The training will create a checkpoints file under ./data/checkpoints/my-model. The trained model will be created in ./data/models/my-model.h5. The model will overwrite any existing file with the same name.
To start TensorBoard run the following command:
# --bind_all optional. Makes the site available to the local network
tensorboard --logdir ./data/tensorboard --bind_all
With the model trained we can predict the algorithm of a file or chunk using the following script:
python3 ./model/predict.py --model ./data/models/my-model.h5 --sample ./data/dataset/000233/compressed.brotli
We'll get output like so:
7z : 0.34%
brotli : 95.39%
bzip2 : 0.20%
compress : 0.06%
gzip : 3.07%
lz4 : 0.57%
rar : 0.27%
zip : 0.09%
The prediction utility requires at least as many bytes as the model was trained with. By default this is 4096 bytes, but it can be changed.
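The size requirement follows from the fixed input shape of the network. The sketch below shows one way a 4096-byte chunk could be laid out as a square single-channel image. Both the 64x64 layout and the scaling to the range 0 to 1 are assumptions made for illustration; the project's actual preprocessing lives in its dataset utilities and may differ. `chunk_to_image` is a hypothetical name.

```python
import math
from typing import List

CHUNK_SIZE = 4096  # default number of bytes the model was trained with
IMAGE_SIZE = math.isqrt(CHUNK_SIZE)  # 64, assuming a square single-channel image


def chunk_to_image(data: bytes) -> List[List[float]]:
    """Turn the first CHUNK_SIZE bytes into an IMAGE_SIZE x IMAGE_SIZE grid of values in [0, 1]."""
    if len(data) < CHUNK_SIZE:
        raise ValueError(f"need at least {CHUNK_SIZE} bytes, got {len(data)}")
    chunk = data[:CHUNK_SIZE]
    return [
        [byte / 255.0 for byte in chunk[row * IMAGE_SIZE:(row + 1) * IMAGE_SIZE]]
        for row in range(IMAGE_SIZE)
    ]


image = chunk_to_image(bytes(range(256)) * 16)
print(len(image), len(image[0]))
```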
To evaluate the performance of the model, one can render a confusion matrix like so:
python3 ./model/plot.py --type confusion-matrix --model ./data/models/my-model.h5 --strata ./data/evaluation-strata.csv
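For readers unfamiliar with the term, a confusion matrix counts how often each true class was predicted as each class. The sketch below builds one in plain Python, independently of the plotting script. The class names are taken from the prediction output above; the labels and predictions are made up for the example.

```python
from typing import Dict, List, Sequence

# Class names as they appear in the prediction output above.
CLASS_NAMES = ["7z", "brotli", "bzip2", "compress", "gzip", "lz4", "rar", "zip"]


def confusion_matrix(labels: Sequence[str], predictions: Sequence[str]) -> List[List[int]]:
    """Build a matrix where rows are true labels and columns are predicted labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    index: Dict[str, int] = {name: i for i, name in enumerate(CLASS_NAMES)}
    matrix = [[0] * len(CLASS_NAMES) for _ in CLASS_NAMES]
    for truth, predicted in zip(labels, predictions):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


# Hypothetical labels and predictions for four chunks.
truth = ["gzip", "gzip", "zip", "lz4"]
predicted = ["gzip", "zip", "zip", "lz4"]
matrix = confusion_matrix(truth, predicted)
# Correct predictions lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(CLASS_NAMES))) / len(truth)
print(accuracy)
```

Off-diagonal cells show which algorithms are mistaken for each other, for example gzip chunks classified as zip.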
An example plot for a model trained on 2M samples for 5 epochs looks like this:
The network architecture is based on the work of Q. Chen et al.
For instructions on how to train and evaluate the model, refer to the quickstart.
The model is defined as a Keras model in model/utilities/model_utilities.py:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation="relu", padding="same", input_shape=(dataset_utilities.IMAGE_SIZE, dataset_utilities.IMAGE_SIZE, 1)))
model.add(tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation="relu", padding="same"))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
model.add(tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation="relu", padding="same"))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
model.add(tf.keras.layers.Conv2D(filters=126, kernel_size=(3, 3), activation="relu", padding="same"))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
model.add(tf.keras.layers.Conv2D(filters=256, kernel_size=(3, 3), activation="relu", padding="same"))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(2048, activation="relu"))
model.add(tf.keras.layers.Dense(2048, activation="relu"))
model.add(tf.keras.layers.Dense(len(dataset_utilities.CLASS_NAMES), activation="softmax"))
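To see how the spatial size shrinks through the network, the layer arithmetic can be traced by hand. The sketch below assumes an input of 64x64 (a 4096-byte chunk laid out as a square image); the real value of `IMAGE_SIZE` is defined in the project's dataset utilities and is not shown here. It relies on two Keras behaviours: `padding="same"` convolutions with stride 1 keep the size, and `MaxPooling2D` defaults to `padding="valid"`.

```python
def conv_same(size: int) -> int:
    """A convolution with padding="same" and stride 1 keeps the spatial size."""
    return size


def max_pool(size: int, pool: int = 3, stride: int = 2) -> int:
    """Valid-padding pooling: floor((size - pool) / stride) + 1."""
    return (size - pool) // stride + 1


IMAGE_SIZE = 64  # assumption: a 4096-byte chunk laid out as a 64x64 image

size = IMAGE_SIZE
# Filter count of the last convolution before each of the four pooling layers.
for filters in (32, 64, 126, 256):
    size = max_pool(conv_same(size))
    print(f"after pooling: {size}x{size}x{filters}")

# Number of features entering the first Dense layer.
flattened = size * size * 256
print("flattened features:", flattened)
```

Under this assumption, most of the parameters sit in the dense layers rather than in the convolutions.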
Chunking tool for splitting a file into chunks.
Usage:
./tools/chunk.sh <chunk size> <input file> <output directory>
Example:
# Extract 4096B chunks from this file to the output directory
./tools/chunk.sh 4096 ./tools/chunk.sh ./output
Example output:
file,"chunk size",size
"./tools/chunk.sh",4096,999
Create an index for the dataset.
Usage:
python3 tools/create_index.py <chunk size> <input directory>
Example:
python3 tools/create_index.py 4096 ./data/dataset
Example output:
"file path","file size","chunk size","chunks",extension
"/path/to/compdec/data/dataset/thread0.zip",322469174,4096,78728,"application/zip"
"/path/to/compdec/data/dataset/909/909820.pdf",291569,4096,72,"application/pdf"
"/path/to/compdec/data/dataset/135/135778.pdf",14013,4096,4,"application/pdf"
"/path/to/compdec/data/dataset/135/135495.html",18127,4096,5,"text/html"
...
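The chunks column in the example output is consistent with rounding the file size up to a whole number of chunks. The following sketch reproduces the numbers above; it is not taken from create_index.py, and `chunk_count` is a hypothetical name.

```python
def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of chunks needed to cover file_size bytes, counting a trailing partial chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return -(-file_size // chunk_size)  # ceiling division without floats


# File sizes taken from the example output above.
for size in (322469174, 291569, 14013, 18127):
    print(size, chunk_count(size, 4096))
```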
This is a tool to simplify communication with GovDocs: https://digitalcorpora.org/corpora/files.
Usage:
./tools/govdocs.sh download <target-directory> <file 1> [file 2] [file 3] ...
Example:
# Download a single thread (about 300MB)
./tools/govdocs.sh download data threads/thread0.zip
Example output:
[Download started] http://downloads.digitalcorpora.org/corpora/files/govdocs1/threads/thread0.zip -> data/threads/thread0.zip
[Download complete] http://downloads.digitalcorpora.org/corpora/files/govdocs1/threads/thread0.zip -> data/threads/thread0.zip
This is a tool to perform a stratified sampling of a dataset.
Usage:
python3 ./tools/stratified_sampling.py <seed> <index path> <strata size>
Example:
python3 tools/stratified_sampling.py 1.3035772690 index.csv 20
Example output:
extension,samples,frequency
"zip",78728,0.35
"pdf",37438,0.17
"html",3590,0.016
"txt",45112,0.2
"jpeg",9875,0.044
"docx",6659,0.03
"xml",598,0.0027
"ppt",29038,0.13
"gif",580,0.0026
"csv",679,0.003
"xls",6953,0.031
"ps",2535,0.011
"png",604,0.0027
"flash",362,0.0016
Total samples: 224026
Strata size: 20
"file path",offset,"chunk size",extension
"/path/to/compdec/data/dataset/thread0.zip",108646400,4096,"zip"
"/path/to/compdec/data/dataset/191/191969.txt",125845504,4096,"txt"
"/path/to/compdec/data/dataset/354/354930.doc",307200,4096,"docx"
"/path/to/compdec/data/dataset/thread0.zip",34136064,4096,"zip"
...
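The idea behind stratified sampling is that each file type contributes to the sample in proportion to its share of the population, as the frequency column above suggests. The sketch below shows proportional allocation with a seeded random generator. It is an assumption about the general technique, not a description of how stratified_sampling.py allocates samples, and the population is made up.

```python
import random
from typing import Dict, List


def stratified_sample(strata: Dict[str, List[str]], size: int, seed: str) -> Dict[str, List[str]]:
    """Draw about `size` items in total, each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    if total == 0:
        raise ValueError("strata must not all be empty")
    sample = {}
    for name in sorted(strata):
        items = strata[name]
        # Proportional allocation, rounded to the nearest integer and capped at the stratum size.
        quota = min(len(items), round(size * len(items) / total))
        sample[name] = rng.sample(items, quota)
    return sample


# Hypothetical population: chunk identifiers grouped by file type.
population = {
    "zip": [f"zip-{i}" for i in range(700)],
    "pdf": [f"pdf-{i}" for i in range(200)],
    "txt": [f"txt-{i}" for i in range(100)],
}
picked = stratified_sample(population, 20, "seed")
print({name: len(items) for name, items in picked.items()})
```

Using the same seed yields the same sample, which keeps the analysis reproducible.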
This is a tool to perform an even sampling of a dataset.
Usage:
python3 ./tools/even_sampling.py <seed> <index path> <strata size>
Example:
python3 tools/even_sampling.py 1.3035772690 index.csv 20
Example output:
extension,samples,frequency
"zip",78728,0.35
"pdf",37438,0.17
"html",3590,0.016
"txt",45112,0.2
"jpeg",9875,0.044
"docx",6659,0.03
"xml",598,0.0027
"ppt",29038,0.13
"gif",580,0.0026
"csv",679,0.003
"xls",6953,0.031
"ps",2535,0.011
"png",604,0.0027
"flash",362,0.0016
Total samples: 224026
Strata size: 20
"file path",offset,"chunk size",extension
"/path/to/compdec/data/dataset/thread0.zip",108646400,4096,"zip"
"/path/to/compdec/data/dataset/191/191969.txt",125845504,4096,"txt"
"/path/to/compdec/data/dataset/354/354930.doc",307200,4096,"docx"
"/path/to/compdec/data/dataset/thread0.zip",34136064,4096,"zip"
...
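In contrast to stratified sampling, even sampling draws the same number of items from every class regardless of how large the class is. The sketch below illustrates the idea. Whether the strata size argument of even_sampling.py is a per-class or a total count is not stated here, so the per-class interpretation and the name `even_sample` are assumptions.

```python
import random
from typing import Dict, List


def even_sample(strata: Dict[str, List[str]], per_class: int, seed: str) -> Dict[str, List[str]]:
    """Draw the same number of items from every class, regardless of class size."""
    if not strata:
        raise ValueError("strata must not be empty")
    rng = random.Random(seed)
    smallest = min(len(items) for items in strata.values())
    if per_class > smallest:
        raise ValueError(f"smallest class has only {smallest} items")
    return {name: rng.sample(strata[name], per_class) for name in sorted(strata)}


# Hypothetical population: one algorithm yields far more chunks than the other.
population = {
    "gzip": [f"gzip-{i}" for i in range(50)],
    "compress": [f"compress-{i}" for i in range(500)],
}
picked = even_sample(population, 20, "seed")
print({name: len(items) for name, items in picked.items()})
```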
This is a tool to simplify interfacing with various compression algorithms. Due to its dependencies, it's preferably used via Docker. To build it, run ./tools/build.sh.
Instead of ./tools/compress.sh, you may use docker run -it --rm compdec:compress.
Usage:
# Show versions of used tools
./tools/compress.sh versions
# Show this help dialog
./tools/compress.sh help
# Compress a file with all algorithms
./tools/compress.sh compress <output prefix> <input file>
Example:
./tools/compress.sh compress output/compressed-file input/test-file
This is a tool to simplify creating the dataset (compressing GovDocs).
Usage:
./tools/create-dataset.sh <base-dir> <target-dir>
Examples:
./tools/create-dataset.sh ./data/govdocs ./data/dataset
# Only compress part of the dataset
MAXIMUM_FILES=10 ./tools/create-dataset.sh ./data/govdocs ./data/dataset
This is a tool to perform the NIST statistical test suite on samples.
Usage:
python3 ./tools/nist_test.py ./data/strata.csv