From e4a825976e56bde128629e2240dc41ddc919838e Mon Sep 17 00:00:00 2001
From: Fran
Date: Fri, 27 Oct 2023 19:13:38 +0200
Subject: [PATCH] updated readme

---
 README.md                                     | 125 +++++----
 out.html                                      | 264 ++++++++++++++++++
 out.md                                        | 264 ++++++++++++++++++
 .../tomfran/lsm/tree/LSMTreeAddBenchmark.java |   2 +-
 .../tomfran/lsm/tree/LSMTreeGetBenchmark.java |   3 +-
 .../com/tomfran/lsm/utils/BenchmarkUtils.java |   3 +-
 src/main/java/com/tomfran/lsm/Main.java       |  41 +--
 .../java/com/tomfran/lsm/tree/LSMTree.java    |   2 +-
 8 files changed, 624 insertions(+), 80 deletions(-)
 create mode 100644 out.html
 create mode 100644 out.md

diff --git a/README.md b/README.md
index 2b481b8..f56701d 100644
--- a/README.md
+++ b/README.md
@@ -12,31 +12,33 @@ An implementation of the Log-Structured Merge Tree (LSM tree) data structure in
 1. [SSTable](#sstable-1)
 2. [Skip-List](#skip-list-1)
 3. [Tree](#tree-1)
-5. [Implementation status](#Implementation-status)
+5. [Possible future improvements](#possible-improvements)
+6. [References](#references)

-## Console
+### Console

 To interact with a toy tree you can use `./gradlew run -q` to spawn a console.

 ![console.png](misc%2Fconsole.png)

----
-
 # Architecture

+An overview of the architecture, from SSTables, the disk-resident portion of the database, to Skip Lists, used
+as in-memory buffers, and finally to the combination of the two to create the insertion, lookup and deletion
+primitives.
+
 ## SSTable

 Sorted String Table (SSTable) is a collection of files modelling key-value pairs in sorted order by key.
 It is used as a persistent storage for the LSM tree.

-### Components
+**Components**

 - _Data_: key-value pairs in sorted order by key, stored in a file;
 - _Sparse index_: sparse index containing key and offset of the corresponding key-value pair in the data;
 - _Bloom filter_: a [probabilistic data structure](https://en.wikipedia.org/wiki/Bloom_filter) used to test whether a
   key is in the SSTable.

-### Key lookup
+**Key lookup**

 The basic idea is to use the sparse index to find the key-value pair in the data file.
 The steps are:
@@ -49,7 +51,7 @@ The steps are:
 The search is as lazy as possible, meaning that we read the minimum amount of data from disk, for instance, if the
 next key length is smaller than the one we are looking for, we can skip the whole key-value pair.

-### Persistence
+**Persistence**

 A table is persisted to disk when it is created. A base filename is defined, and three files are present:

 - `<base_filename>.data`: data file;
 - `<base_filename>.index`: index file;
 - `<base_filename>.bloom`: bloom filter file.

-**Data format**
+Data format:

 - `n`: number of key-value pairs;
 - `<key_len_1, value_len_1, key_1, value_1, ... key_n, value_n>`: key-value pairs.

-**Index format**
+Index format:

 - `s`: number of entries in the whole table;
 - `n`: number of entries in the index;
 - `o_1, o_2 - o_1, ..., o_n - o_n-1`: offsets of the key-value pairs in the data file, skipping the first one;
 - `s_1, s_2, ..., s_n`: remaining keys after a sparse index entry, used to exit from search;
 - `<key_len_1, key_1, ... key_len_n, key_n>`: keys in the index.

-**Filter format**
+Filter format:

 - `m`: number of bits in the bloom filter;
 - `k`: number of hash functions;
 - `n`: size of underlying long array;
 - `b_1, b_2, ..., b_n`: bits of the bloom filter.

 To save space, all integers are stored in
 [variable-length encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html),
 and offsets in the index are stored as [deltas](https://en.wikipedia.org/wiki/Delta_encoding).
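The variable-length integer coding and the delta trick used by the index are easy to picture with a small sketch. The following is only an illustrative VByte-style coder in Java, not necessarily the exact encoding used in this repository:

```java
import java.io.ByteArrayOutputStream;

class VByteSketch {

    // Encode a non-negative int using 7 bits per byte; the high bit marks the final byte.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (value >= 128) {
            out.write(value & 0x7F);
            value >>>= 7;
        }
        out.write(value | 0x80);
        return out.toByteArray();
    }

    // Decode a single value starting at offset (the advanced offset is omitted for brevity).
    static int decode(byte[] bytes, int offset) {
        int value = 0, shift = 0;
        while ((bytes[offset] & 0x80) == 0) {
            value |= (bytes[offset++] & 0x7F) << shift;
            shift += 7;
        }
        return value | ((bytes[offset] & 0x7F) << shift);
    }
}
```

Storing the index offsets as deltas then simply means encoding `o_i - o_(i-1)` with such a coder and prefix-summing while reading the index back.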
----
-
 ## Skip-List

 A [skip-list](https://en.wikipedia.org/wiki/Skip_list) is a probabilistic data structure that allows fast search,
 insertion and deletion of elements in a sorted sequence.

 In the LSM tree, it is used as an in-memory data structure to store key-value pairs in sorted order by key.
 Once the skip-list reaches a certain size, it is flushed to disk as an SSTable.

-### Operations details
+**Operations details**

 The idea of a skip list is similar to a classic linked list. We have nodes with forward pointers, but also levels.
 We can think about a level as a fast lane between nodes. By carefully constructing them at insertion time, searches
 are faster, as they can use higher levels to skip unwanted nodes.

 Given `n` elements, a skip list has `log(n)` levels, the first level containing all the elements. By increasing the
 level, the number of elements is cut roughly by half.

 To locate an element, we start from the top level and move forward until we find an element greater than the one
 we are looking for. Then we move down to the next level and repeat the process until we find the element.

 Insertions, deletions, and updates are done by first locating the element, then performing the operation on the node.
 All of them have an average time complexity of `O(log(n))`.

----
-
 ## Tree

-...
+Having defined SSTables and Skip Lists, we can obtain the final structure as a combination of the two.
+The main idea is to use the latter as an in-memory buffer, while the former efficiently stores flushed
+buffers.
+
+**Insertion**

-### Components
+Each insert goes directly to a Memtable, which is a Skip List under the hood, so the response time is quite fast.
+There exists a threshold, over which the mutable structure is made immutable by appending it to the _immutable
+memtables LIFO list_ and replacing it with a new mutable list.

-...
+The immutable memtable list is asynchronously consumed by a background thread, which takes the next available
+list and creates a disk-resident SSTable with its content.

-### Insertion
+**Lookup**

-...
+While looking for a key, we proceed as follows:

-### Lookup
+1. Look into the in-memory buffer: if the key was recently written, it is likely here; if not present, continue;
+2. Look into the immutable memtables list, iterating from the most recent to the oldest; if not present, continue;
+3. Look into the disk tables, iterating from the most recent to the oldest; if not present, return null.

-...
+**Deletions**

-### Write-ahead logging
+To delete a key, we do not need to remove all of its replicas from the on-disk tables; we just need a special
+value called a _tombstone_. Hence a deletion is the same as an insertion, but with the value set to null. While
+looking for a key, if we encounter a null value we simply return null as a result.

-...
+**SSTable Compaction**

----
+The most expensive operation while looking for a key is certainly the disk search, which is why bloom filters are
+crucial for negative lookups on SSTables. But no bloom filter can save us if too many tables are available to search,
+hence we need _compaction_.
+
+When flushing a Memtable, we create an SSTable of level one. When the first level reaches a certain threshold,
+all its tables are merged into a level-two table, and so on. This permits us to save storage and query fewer
+tables in lookups.
+
+Note that this style of compaction is not standard; there are more sophisticated techniques, but for the sake of
+this project this simple level-based compaction works well.

 # Benchmarks

 I am using [JMH](https://openjdk.java.net/projects/code-tools/jmh/) to run benchmarks, the results are obtained on
 AMD Ryzen™ 5 4600H with 16GB of RAM and 512GB SSD.

-> Take those with a grain of salt, development is still in progress.
-
 To run them use `./gradlew jmh`.
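Before the benchmark listings, it is worth making the Tree lookup order described above concrete. Here is a rough sketch of a tombstone-aware `get` layered over the three sources; the `KeyValueSource` interface and the empty-array tombstone convention are illustrative assumptions, not the repository's exact API:

```java
import java.util.List;

interface KeyValueSource {
    byte[] get(byte[] key); // stored value, an empty array for a tombstone, or null if absent
}

class LookupSketch {

    KeyValueSource memtable;          // current mutable buffer
    List<KeyValueSource> immutables;  // immutable memtables, most recent first
    List<KeyValueSource> tables;      // disk-resident SSTables, most recent first

    byte[] get(byte[] key) {
        byte[] v = memtable.get(key);
        if (v != null)
            return tombstone(v) ? null : v;

        for (KeyValueSource m : immutables)   // newest immutable buffer first
            if ((v = m.get(key)) != null)
                return tombstone(v) ? null : v;

        for (KeyValueSource t : tables)       // bloom filters make most of these cheap negative checks
            if ((v = t.get(key)) != null)
                return tombstone(v) ? null : v;

        return null;                          // not present anywhere
    }

    private boolean tombstone(byte[] v) {
        return v.length == 0;                 // assumed deletion marker
    }
}
```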
-## SSTable
+**SSTable**

 - Negative access: the key is not present in the table, hence the Bloom filter will likely stop the search;
 - Random access: the key is present in the table, the order of the keys is random.

 ```
 Benchmark                                       Mode  Cnt        Score        Error  Units
 c.t.l.sstable.SSTableBenchmark.negativeAccess  thrpt    5  3316202.976 ±  32851.546  ops/s
 c.t.l.sstable.SSTableBenchmark.randomAccess    thrpt    5     7989.945 ±     40.689  ops/s
 ```

-## Bloom filter
+**Bloom filter**

 - Add: add keys to a 1M keys Bloom filter with 0.01 false positive rate;
 - Contains: test whether the keys are present in the Bloom filter.

 ```
 Benchmark                                       Mode  Cnt        Score        Error  Units
 c.t.l.bloom.BloomFilterBenchmark.add           thrpt    5  3190753.307 ±  74744.764  ops/s
 c.t.l.bloom.BloomFilterBenchmark.contains      thrpt    5  3567392.634 ± 220377.613  ops/s
 ```

-## Skip-List
+**Skip-List**

 - Get: get keys from a 100k keys skip-list;
 - Add/Remove: add and remove keys from a 100k keys skip-list.

 ```
 Benchmark                                       Mode  Cnt        Score        Error  Units
 c.t.l.memtable.SkipListBenchmark.addRemove     thrpt    5   430239.471 ±   4825.990  ops/s
 c.t.l.memtable.SkipListBenchmark.get           thrpt    5   487265.620 ±   8201.227  ops/s
 ```

-## Tree
+**Tree**

 - Get: get elements from a tree with 1M keys;
 - Add: add 1M distinct elements to a tree with a memtable size of 2^18

 ```
 Benchmark                                       Mode  Cnt        Score        Error  Units
 c.t.l.tree.LSMTreeAddBenchmark.add             thrpt    5   540788.751 ±  54491.134  ops/s
 c.t.l.tree.LSMTreeGetBenchmark.get             thrpt    5     9426.951 ±    241.190  ops/s
 ```

----
-
-## Implementation status
-
-- [x] SSTable
-  - [x] Init
-  - [x] Read
-  - [x] Compaction
-  - [x] Ints compression
-  - [x] Bloom filter
-  - [x] Indexes persistence
-  - [x] File initialization
-- [x] Skip-List
-  - [x] Operations
-  - [x] Iterator
-- [x] Tree
-  - [x] Operations
-  - [x] Background flush
-  - [x] Background compaction
-  - [ ] Write ahead log
-- [x] Benchmarks
-  - [x] SSTable
-  - [x] Bloom filter
-  - [x] Skip-List
-  - [x] Tree
+## Possible improvements
+
+There is certainly space for improvement on this project:
+
+1. Blocked bloom filters: a variant of the classic array-like bloom filter which is more cache-efficient;
+2. Search fingers in the Skip list: the idea is to keep a pointer to the last search, and start from there with
+   subsequent queries;
+3. Proper level compaction in the LSM tree;
+4. Write-ahead log for insertions: without one, a crash makes all the in-memory writes disappear;
+5. Proper recovery: handle crashes and reboots, using existing SSTables and the write-ahead log.
+
+I don't realistically have the time to do all of this; perhaps the first two points will be handled in the future.
+
+## References
+
+- [Database Internals](https://www.databass.dev/) by Alex Petrov, specifically the chapters about Log-Structured
+  Storage and File Formats;
+- [A Skip List Cookbook](https://api.drum.lib.umd.edu/server/api/core/bitstreams/17176ef8-8330-4a6c-8b75-4cd18c570bec/content)
+  by William Pugh.
+
+_If you found this useful or interesting do not hesitate to ask clarifying questions or get in touch!_

diff --git a/out.html b/out.html
new file mode 100644
index 0000000..9c6e7c0
--- /dev/null
+++ b/out.html
@@ -0,0 +1,264 @@

diff --git a/out.md b/out.md
new file mode 100644
index 0000000..9c6e7c0
--- /dev/null
+++ b/out.md
@@ -0,0 +1,264 @@

diff --git a/src/jmh/java/com/tomfran/lsm/tree/LSMTreeAddBenchmark.java b/src/jmh/java/com/tomfran/lsm/tree/LSMTreeAddBenchmark.java
index ba4b08e..3b94187 100644
--- a/src/jmh/java/com/tomfran/lsm/tree/LSMTreeAddBenchmark.java
+++ b/src/jmh/java/com/tomfran/lsm/tree/LSMTreeAddBenchmark.java
@@ -28,7 +28,7 @@ public void setup() throws IOException {
     }

     @TearDown
-    public void teardown() throws IOException, InterruptedException {
+    public void teardown() throws InterruptedException {
         tree.stop();
         Thread.sleep(5000);
         BenchmarkUtils.deleteDir(DIR);
diff --git a/src/jmh/java/com/tomfran/lsm/tree/LSMTreeGetBenchmark.java b/src/jmh/java/com/tomfran/lsm/tree/LSMTreeGetBenchmark.java
index 53639e0..3443f10 100644
--- a/src/jmh/java/com/tomfran/lsm/tree/LSMTreeGetBenchmark.java
+++ b/src/jmh/java/com/tomfran/lsm/tree/LSMTreeGetBenchmark.java
@@ -35,8 +35,9 @@ public void setup() throws IOException {
     }

     @TearDown
-    public void teardown() throws IOException {
+    public void teardown() throws InterruptedException {
         tree.stop();
+        Thread.sleep(5000);
         BenchmarkUtils.deleteDir(DIR);
     }

diff --git a/src/jmh/java/com/tomfran/lsm/utils/BenchmarkUtils.java b/src/jmh/java/com/tomfran/lsm/utils/BenchmarkUtils.java
index 4de9e23..7806492 100644
--- a/src/jmh/java/com/tomfran/lsm/utils/BenchmarkUtils.java
+++ b/src/jmh/java/com/tomfran/lsm/utils/BenchmarkUtils.java
@@ -4,7 +4,6 @@
 import com.tomfran.lsm.types.ByteArrayPair;

 import java.io.File;
-import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.util.Random;
@@ -13,7 +12,7 @@

 public class BenchmarkUtils {

-    public static LSMTree initTree(Path dir, int memSize, int levelSize) throws IOException {
+    public static LSMTree initTree(Path dir, int memSize, int levelSize) {

         // setup directory
         if (Files.exists(dir)) deleteDir(dir);
diff --git a/src/main/java/com/tomfran/lsm/Main.java b/src/main/java/com/tomfran/lsm/Main.java
index b7a7048..a710c71 100644
--- a/src/main/java/com/tomfran/lsm/Main.java
+++ b/src/main/java/com/tomfran/lsm/Main.java
@@ -13,7 +13,7 @@ public class Main {

     static final String DIRECTORY = "LSM-data";

-    public static void main(String[] args) {
+    public static void main(String[] args) throws InterruptedException {

         if (new File(DIRECTORY).exists())
             deleteDir();
@@ -48,25 +48,28 @@ public static void main(String[] args) {
             System.out.print("> ");
             String command = scanner.nextLine();

-            String[] parts = command.split(" ");
-
-            String msg;
-            switch (parts[0]) {
-                case "s", "set" -> {
-                    tree.add(new ByteArrayPair(parts[1].getBytes(), parts[2].getBytes()));
-                    System.out.println("ok");
-                }
-                case "d", "del" -> {
-                    tree.delete(parts[1].getBytes());
-                    System.out.println("ok");
-                }
-                case "g", "get" -> {
-                    byte[] value = tree.get(parts[1].getBytes());
-                    System.out.println((value == null || value.length == 0) ? "not found" : new String(value));
+            try {
+                String[] parts = command.split(" ");
+
+                switch (parts[0]) {
+                    case "s", "set" -> {
+                        tree.add(new ByteArrayPair(parts[1].getBytes(), parts[2].getBytes()));
+                        System.out.println("ok");
+                    }
+                    case "d", "del" -> {
+                        tree.delete(parts[1].getBytes());
+                        System.out.println("ok");
+                    }
+                    case "g", "get" -> {
+                        byte[] value = tree.get(parts[1].getBytes());
+                        System.out.println((value == null || value.length == 0) ? "not found" : new String(value));
+                    }
+                    case "h", "help" -> System.out.println(help);
+                    case "e", "exit" -> exit = true;
+                    default -> System.out.println("Unknown command");
                 }
-                case "h", "help" -> System.out.println(help);
-                case "e", "exit" -> exit = true;
-                default -> System.out.println("Unknown command");
+            } catch (Exception e) {
+                System.out.printf("### error while executing command: \"%s\"\n", command);
             }
         }
         tree.stop();
diff --git a/src/main/java/com/tomfran/lsm/tree/LSMTree.java b/src/main/java/com/tomfran/lsm/tree/LSMTree.java
index 6382cfc..82e9ca1 100644
--- a/src/main/java/com/tomfran/lsm/tree/LSMTree.java
+++ b/src/main/java/com/tomfran/lsm/tree/LSMTree.java
@@ -132,7 +132,7 @@ public byte[] get(byte[] key) {
     /**
      * Stop the background threads.
      */
-    public void stop() {
+    public void stop() throws InterruptedException {
         memtableFlusher.shutdownNow();
         tableCompactor.shutdownNow();
     }