update benchmark and readme

dreulavelle · Mar 28, 2024 · 762bf2e
1 parent 9d83b4b
Showing 6 changed files with 184 additions and 48 deletions.
diff --git a/Makefile b/Makefile
@@ -36,7 +36,7 @@ coverage: clean
 	@poetry run pytest --cov=$(SRC_DIR) --cov-report=xml --cov-report=html --cov-report=term
 
 benchmark:
-	@poetry run python benchmarks/rank.py
+	@poetry run python benchmarks/rank.py --quiet
 
 pr-ready: clean format lint check test
 

diff --git a/README.md b/README.md
@@ -390,6 +390,87 @@ This will continue to grow though as we expand on functionality, so keep checkin
 
 > :warning: Don't see something you want in the list? Submit a [Feature Request](https://github.com/dreulavelle/rank-torrent-name/issues/new?assignees=dreulavelle&labels=kind%2Ffeature%2Cstatus%2Ftriage&projects=&template=---feature-request.yml) to have it added!
 
+## Performance Benchmarks
+
+This section shows how RTN performs under different loads, from parsing a single title to ranking a thousand. Use these numbers as a baseline when tuning RTN for your own workload.
+
+### Benchmark Categories
+
+We categorize benchmarks into two main processes:
+- **Parsing**: Measures the time to parse a title and return a `ParsedData` object.
+- **Ranking**: Measures the time to parse a title and then evaluate it against the defined criteria. Because ranking always includes parsing, this is the more "real-world" scenario and the one to watch when integrating RTN.
+
+### Benchmark Results
+
+To facilitate comparison, we've compiled the results into a single table:
+
+| Operation                                  | Items Count | Mean Time    | Standard Deviation |
+|--------------------------------------------|-------------|--------------|--------------------|
+| **Parsing Benchmark (Single item)**        | 1           | 620 us       | 35 us              |
+| **Batch Parse Benchmark (Small batch)**    | 10          | 6.06 ms      | 0.11 ms            |
+| **Batch Parse Benchmark (Large batch)**    | 1000        | 640 ms       | 8 ms               |
+| **Ranking Benchmark (Single item)**        | 1           | 660 us       | 38 us              |
+| **Batch Rank Benchmark (Small batch)**     | 10          | 24.6 ms      | 4.1 ms             |
+| **Batch Rank Benchmark (Large batch)**     | 1000        | 3.13 s       | 0.15 s             |
+
+### Benchmark Settings
+
+- **Small batch parsing** used a `chunk_size` of `10`.
+- **Large batch parsing** used a `chunk_size` of `500`.
+- **Small batch ranking** used the default `max_workers` of `4`.
+- **Large batch ranking** used a `max_workers` of `8`.
+
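+Per item, batch parsing stays close to the single-item cost (roughly 0.6 ms per title in both batch sizes), while batch ranking costs more per title (roughly 2.5 ms in the small batch and 3.1 ms in the large batch) than a single `rank` call at 0.66 ms.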
+
+The next section describes how to tune RTN's batch settings based on these benchmarks.
+
+## Optimizing RTN Performance
+
+The performance benchmarks provided give a glimpse into how RTN handles different loads, from parsing single titles to ranking thousands. For developers looking to integrate RTN into their applications efficiently, here are some tips on tweaking performance:
+
+### 1. Adjusting Chunk Size for Batch Parsing
+The `batch_parse` function allows you to parse titles in batches, significantly reducing processing time for large datasets. However, the optimal `chunk_size` can vary depending on the dataset size and your system's resources.
+
+- For smaller datasets, a lower `chunk_size` might suffice, keeping overhead low.
+- For larger datasets, increasing `chunk_size` can reduce the number of batches processed and potentially lower overall processing time.
+
+Experiment with different `chunk_size` values to find the sweet spot for your particular use case.
+
+### 2. Tuning Concurrency in Batch Ranking
+The `batch_rank` function uses multiple threads to rank torrents in parallel, which can significantly speed up processing for large numbers of torrents.
+
+- The default `max_workers` value is set to `4`, but this might not be optimal for all systems.
+- Systems with higher CPU core counts might benefit from increasing `max_workers`, allowing more torrents to be processed simultaneously.
+- However, setting `max_workers` too high can lead to diminishing returns and increased overhead. Monitor your system's resource utilization to find an optimal setting.
+
+### 3. Leveraging ThreadPoolExecutor
+Both `batch_parse` and `batch_rank` utilize `ThreadPoolExecutor` for parallel processing. Adjusting the `max_workers` parameter can help manage how many threads are used for these operations, impacting performance and resource utilization.
+
+### 4. Custom Settings and Ranking Models
+Customizing `SettingsModel` and `RankingModel` allows you to tailor the parsing and ranking criteria to your needs, potentially streamlining the processing by focusing only on relevant data.
+
+- Evaluate which torrent attributes are essential for your application and adjust your settings model accordingly.
+- Consider disabling unnecessary custom ranks or attributes in the ranking model to simplify the ranking process.
+
+### Example: Tweaking Performance for Large Datasets
+
+Suppose you're processing a dataset of 10,000 torrent titles. You might start with a default `chunk_size` of `50` and `max_workers` of `4`. Through experimentation, you find that increasing `chunk_size` to `500` and `max_workers` to `8` cuts your processing time in half.
+
+```python
+from RTN import RTN, SettingsModel, DefaultRanking, batch_parse
+
+# Setup
+settings = SettingsModel()
+ranking_model = DefaultRanking()
+rtn = RTN(settings=settings, ranking_model=ranking_model)
+
+# Optimized batch parsing
+optimized_titles = [f"Title {i}" for i in range(1, 10001)]  # placeholder titles
+parsed_data = batch_parse(optimized_titles, chunk_size=500, max_workers=8)
+```
+
+By monitoring performance and adjusting parameters based on your specific requirements and system capabilities, you can significantly enhance RTN's efficiency in your projects.
+
 ## Contributing
 
 Contributions to RTN are welcome! Feel free to submit pull requests or open issues to suggest features or report bugs. More features are coming to RTN as we grow, and a lot is already planned!

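The `chunk_size` and `max_workers` advice in the README section above can be tried out without RTN installed. The following is a minimal sketch, not RTN's implementation: `fake_parse`, `chunked_parse`, and `time_settings` are hypothetical names, and `fake_parse` is a trivial stand-in for RTN's `parse`, so the timings only illustrate the experiment loop, not RTN's real costs.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def fake_parse(title: str) -> dict:
    """Hypothetical stand-in for RTN's parse(); it only splits the title."""
    return {"raw_title": title, "tokens": title.split(".")}


def parse_chunk(chunk: List[str]) -> List[dict]:
    """Parse every title in one chunk sequentially."""
    return [fake_parse(title) for title in chunk]


def chunked_parse(titles: List[str], chunk_size: int = 50, max_workers: int = 4) -> List[dict]:
    """Split titles into chunks and parse the chunks on a thread pool."""
    chunks = [titles[i:i + chunk_size] for i in range(0, len(titles), chunk_size)]
    results: List[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields chunk results in the same order as the input chunks
        for parsed in executor.map(parse_chunk, chunks):
            results.extend(parsed)
    return results


def time_settings(titles: List[str], chunk_size: int, max_workers: int) -> float:
    """Return the wall-clock seconds for one run with the given settings."""
    start = time.perf_counter()
    chunked_parse(titles, chunk_size=chunk_size, max_workers=max_workers)
    return time.perf_counter() - start


titles = [f"Movie.Title.{i}.1080p.BluRay.x264" for i in range(1, 1001)]

# Try a small grid of settings and keep the fastest combination
timings: Dict[Tuple[int, int], float] = {
    (chunk_size, max_workers): time_settings(titles, chunk_size, max_workers)
    for chunk_size in (10, 50, 500)
    for max_workers in (4, 8)
}
best = min(timings, key=timings.get)
```

A single run per setting is noisy; for real measurements, repeat each configuration several times or use a benchmarking tool, and swap in the real `batch_parse` call.
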
diff --git a/RTN/parser.py b/RTN/parser.py
@@ -100,7 +100,43 @@ def rank(self, raw_title: str, infohash: str) -> Torrent:
         )
 
     def batch_rank(self, torrents: List[Tuple[str, str]], max_workers: int = 4) -> List[Torrent]:
-        """Ranks a batch of torrents in parallel using multiple threads."""
+        """
+        Ranks a batch of torrents in parallel using multiple threads.
+
+        Parameters:
+            `torrents` (List[Tuple[str, str]]): A list of tuples containing the raw title and infohash of each torrent.
+            `max_workers` (int, optional): The maximum number of worker threads to use for parallel processing. Defaults to 4.
+
+        Returns:
+            List[Torrent]: A list of Torrent objects representing the ranked torrents.
+
+        Raises:
+            ValueError: If the title or infohash is not provided for any torrent.
+            TypeError: If the title or infohash is not a string.
+            ValueError: If the infohash is not a valid 40-character SHA-1 hash.
+
+        Example:
+            >>> torrents = [
+            ...     ("The Walking Dead S05E03 720p HDTV x264-ASAP[ettv]", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e7"),
+            ...     ("Example.Movie.2020.1080p.BluRay.x264-Example", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e8"),
+            ...     ("Example.Series.S2.2020", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e9"),
+            ... ]
+
+            >>> rtn = RTN(settings_model, ranking_model)
+            >>> ranked_torrents = rtn.batch_rank(torrents)
+            >>> len(ranked_torrents)
+            3
+            >>> isinstance(ranked_torrents[0], Torrent)
+            True
+            >>> isinstance(ranked_torrents[0].data, ParsedData)
+            True
+            >>> ranked_torrents[0].fetch
+            True
+            >>> ranked_torrents[0].rank > 0
+            True
+            >>> ranked_torrents[0].lev_ratio > 0.0
+            True
+        """
         with ThreadPoolExecutor(max_workers=max_workers) as executor:
             return list(executor.map(lambda t: self.rank(t[0], t[1]), torrents))
 
@@ -128,7 +164,15 @@ def parse(raw_title: str) -> ParsedData:
 
 
 def parse_chunk(chunk: List[str]) -> List[ParsedData]:
-    """Parses a chunk of torrent titles."""
+    """
+    Parses a chunk of torrent titles.
+
+    Args:
+        chunk (List[str]): A list of torrent titles to parse.
+
+    Returns:
+        List[ParsedData]: A list of ParsedData objects containing the parsed metadata from the torrent titles.
+    """
     return [parse(title) for title in chunk]
 
 

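Two properties of the `executor.map` pattern used in `batch_rank` above are worth keeping in mind: results come back in input order, and an exception raised in a worker is re-raised when the results are collected. The sketch below demonstrates both under stated assumptions: `fake_rank` and `batch_rank_sketch` are hypothetical names, and `fake_rank` is a stand-in for `RTN.rank` whose validation only loosely mirrors the docstring's `Raises` section.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_rank(raw_title: str, infohash: str) -> Tuple[str, int]:
    """Hypothetical stand-in for RTN.rank(); returns the title and a dummy score."""
    if not raw_title or not infohash:
        raise ValueError("Both the title and the infohash must be provided.")
    if len(infohash) != 40:
        raise ValueError("The infohash must be 40 characters long.")
    return raw_title, len(raw_title)


def batch_rank_sketch(torrents: List[Tuple[str, str]], max_workers: int = 4) -> List[Tuple[str, int]]:
    """Rank (title, infohash) pairs on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() consumes the iterator, so any worker exception surfaces here
        return list(executor.map(lambda t: fake_rank(t[0], t[1]), torrents))


torrents = [
    ("Example.Movie.2020.1080p.BluRay.x264-Example", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e8"),
    ("Example.Series.S2.2020", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e9"),
]
ranked = batch_rank_sketch(torrents)
```

Because one failing item aborts the whole call, callers that expect malformed input may prefer to validate the list first or catch the exception around the batch.
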
diff --git a/benchmarks/rank.py b/benchmarks/rank.py
@@ -1,51 +1,44 @@
 import pyperf
 
-from RTN import RTN, DefaultRanking, SettingsModel, parse
+from RTN import RTN, DefaultRanking, SettingsModel, batch_parse, parse
 
+# Setup
 settings = SettingsModel()
 ranking_model = DefaultRanking()
 rtn = RTN(settings=settings, ranking_model=ranking_model)
 
-def single_parse_benchmark_run():
-    parse("The.Mandalorian.S01E02.1080p.DSNP.WEB-DL.x264")
-
-def multi_parse_benchmark_run():
-    titles = [
-        "The.Matrix.1999.1080p.BluRay.x264",
-        "Inception.2010.720p.BRRip.x264",
-        "Avengers.Endgame.2019.2160p.UHD.BluRay.x265",
-        "Interstellar.2014.IMAX.BDRip.x264",
-        "Game.of.Thrones.S01E01.1080p.WEB-DL.x264",
-        "Breaking.Bad.S05E14.720p.HDTV.x264",
-        "The.Witcher.S02E05.2160p.NF.WEBRip.x265",
-        "The.Mandalorian.S01E02.1080p.DSNP.WEB-DL.x264",
-        "1917.2019.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1",
-        "Joker.2019.720p.BluRay.x264"
-    ]
-    for title in titles:
-        parse(title)
-
-def single_benchmark_run():
-    rtn.rank("The.Matrix.1999.1080p.BluRay.x264", "30bfd9a796679bbeb0e110c17f32148ab8fd5746")
-
-def multi_benchmark_run():
-    titles_infohashes = [
-        ("The.Matrix.1999.1080p.BluRay.x264", "30bfd9a796679bbeb0e110c17f32148ab8fd5746"),
-        ("Inception.2010.720p.BRRip.x264", "c9b4c5e5789c91823c2117b3550663c6bdd9b965"),
-        ("Avengers.Endgame.2019.2160p.UHD.BluRay.x265", "1ba1a10a4409727e85cdba10591a58558a615f13"),
-        ("Interstellar.2014.IMAX.BDRip.x264", "7c2c1525a61c6b1377ecbf3c1a3995285ebcd8f7"),
-        ("Game.of.Thrones.S01E01.1080p.WEB-DL.x264", "1205555d9771e3a32a065d96dd582d09495661dc"),
-        ("Breaking.Bad.S05E14.720p.HDTV.x264", "659cb95e90a52b0ad1bdeaed764716a715ad7599"),
-        ("The.Witcher.S02E05.2160p.NF.WEBRip.x265", "c379a0247a6068ce5fb2092cfd7851ac08d8487c"),
-        ("The.Mandalorian.S01E02.1080p.DSNP.WEB-DL.x264", "59a34ce306ec1332bf216b531bbc6a014e23e415"),
-        ("1917.2019.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1", "1e510d4b0f82eaf552bf7b24e4bba6bd3693341e"),
-        ("Joker.2019.720p.BluRay.x264", "fde14883bc2de07ae883bf1449eabc7c4e1a9b84")
-    ]
-    for title, infohash in titles_infohashes:
-        rtn.rank(title, infohash)
-
-runner = pyperf.Runner()
-runner.bench_func("Parsing Benchmark (1x)", single_parse_benchmark_run)
-runner.bench_func("Parsing Benchmark (10x)", multi_parse_benchmark_run)
-runner.bench_func("Ranking Benchmark (1x)", single_benchmark_run)
-runner.bench_func("Ranking Benchmark (10x)", multi_benchmark_run)
+
+titles_infohashes = [
+    (f"Movie.Title.{i}.1080p.BluRay.x264", "30bfd9a796679bbeb0e110c17f32148ab8fd5746")
+    for i in range(1, 1001)
+]
+titles = [title for title, _ in titles_infohashes]
+
+# Benchmark Functions
+def single_parse_benchmark():
+    parse(titles[0])
+
+def batch_parse_small_benchmark():
+    batch_parse(titles[:10], chunk_size=10)
+
+def batch_parse_large_benchmark():
+    batch_parse(titles, chunk_size=500)
+
+def single_rank_benchmark():
+    rtn.rank(*titles_infohashes[0])
+
+def batch_rank_small_benchmark():
+    rtn.batch_rank(titles_infohashes[:10], max_workers=4) # type: ignore
+
+def batch_rank_large_benchmark():
+    rtn.batch_rank(titles_infohashes, max_workers=8) # type: ignore
+
+
+runner = pyperf.Runner(loops=1)
+
+runner.bench_func("Parsing Benchmark (1x item)", single_parse_benchmark)
+runner.bench_func("Batch Parse Benchmark - Small - (10x items)", batch_parse_small_benchmark)
+runner.bench_func("Batch Parse Benchmark - Large - (1000 items)", batch_parse_large_benchmark)
+runner.bench_func("Ranking Benchmark (1x item)", single_rank_benchmark)
+runner.bench_func("Batch Rank Benchmark - Small - (10x items)", batch_rank_small_benchmark)
+runner.bench_func("Batch Rank Benchmark - Large - (1000 items)", batch_rank_large_benchmark)
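
The README table reports a mean time and a standard deviation for each benchmark. For a quick check without pyperf, the sketch below approximates both with the standard library. It is only a rough approximation: pyperf runs benchmarks in separate worker processes with warmups, which this does not, and `measure` and `workload` are hypothetical names, with `workload` standing in for one of the benchmark functions above.

```python
import statistics
import time
from typing import Callable, List, Tuple


def measure(func: Callable[[], object], runs: int = 20) -> Tuple[float, float]:
    """Time func() several times; return (mean, sample standard deviation) in seconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def workload() -> List[List[str]]:
    """Hypothetical workload standing in for a benchmark function."""
    return [f"Movie.Title.{i}.1080p.BluRay.x264".split(".") for i in range(1000)]


mean_s, stdev_s = measure(workload)
print(f"mean: {mean_s * 1e3:.3f} ms, std dev: {stdev_s * 1e3:.3f} ms")
```
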
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "rank-torrent-name"
-version = "0.1.5"
+version = "0.1.6"
 description = "Parse Torrents using PTN and Rank them according to your preferences!"
 authors = ["Spoked <dreu.lavelle@gmail.com>"]
 license = "MIT"
@@ -50,6 +50,7 @@ exclude = '''
   | buck-out
   | build
   | dist
+  | tests
 )/
 '''
 

diff --git a/tests/test_ranker.py b/tests/test_ranker.py
@@ -141,6 +141,23 @@ def test_rank_calculation_accuracy(settings_model, ranking_model):
     rank = get_rank(parsed_data, settings_model, ranking_model)
     assert rank == 273, f"Expected rank did not match, got {rank}"
 
+def test_batch_ranking(settings_model, ranking_model):
+    rtn = RTN(settings_model, ranking_model)
+    torrents = [
+        ("The Walking Dead S05E03 720p HDTV x264-ASAP[ettv]", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e7"),
+        ("Example.Movie.2020.1080p.BluRay.x264-Example", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e8"),
+        ("Example.Series.S2.2020", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e9"),
+    ]
+
+    ranked_torrents = rtn.batch_rank(torrents)
+    assert len(ranked_torrents) == 3
+    for torrent in ranked_torrents:
+        assert isinstance(torrent, Torrent)
+        assert isinstance(torrent.data, ParsedData)
+        assert torrent.fetch is True
+        assert torrent.rank > 0, f"Rank was {torrent.rank} instead of > 0"
+        assert torrent.lev_ratio > 0.0, f"Levenshtein ratio was {torrent.lev_ratio} instead of > 0.0"
+
 def test_preference_handling(custom_settings_model, ranking_model):
     # Test with preferred title with a preference for Season number in title
     # to make sure we can check before-after case. User should be able to set