Commit

update benchmark and readme
dreulavelle committed Mar 28, 2024
1 parent 9d83b4b commit 762bf2e
Showing 6 changed files with 184 additions and 48 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -36,7 +36,7 @@ coverage: clean
@poetry run pytest --cov=$(SRC_DIR) --cov-report=xml --cov-report=html --cov-report=term

benchmark:
@poetry run python benchmarks/rank.py
@poetry run python benchmarks/rank.py --quiet

pr-ready: clean format lint check test

81 changes: 81 additions & 0 deletions README.md
@@ -390,6 +390,87 @@ This will continue to grow though as we expand on functionality, so keep checkin

> :warning: Don't see something you want in the list? Submit a [Feature Request](https://github.com/dreulavelle/rank-torrent-name/issues/new?assignees=dreulavelle&labels=kind%2Ffeature%2Cstatus%2Ftriage&projects=&template=---feature-request.yml) to have it added!
## Performance Benchmarks

This section reports how RTN performs under different loads, from parsing a single title to ranking a batch of 1000. Use these numbers as a baseline when tuning RTN for your own workload.

### Benchmark Categories

We categorize benchmarks into two main processes:
- **Parsing**: Measures the time to parse a title and return a `ParsedData` object.
- **Ranking**: A comprehensive process that includes parsing and then evaluates the title based on defined criteria. This represents a more "real-world" scenario and is crucial for developers looking to integrate RTN effectively.

### Benchmark Results

To facilitate comparison, we've compiled the results into a single table:

| Operation | Items Count | Mean Time | Standard Deviation |
|--------------------------------------------|-------------|--------------|--------------------|
| **Parsing Benchmark (Single item)** | 1 | 620 us | 35 us |
| **Batch Parse Benchmark (Small batch)** | 10 | 6.06 ms | 0.11 ms |
| **Batch Parse Benchmark (Large batch)** | 1000 | 640 ms | 8 ms |
| **Ranking Benchmark (Single item)** | 1 | 660 us | 38 us |
| **Batch Rank Benchmark (Small batch)** | 10 | 24.6 ms | 4.1 ms |
| **Batch Rank Benchmark (Large batch)** | 1000 | 3.13 s | 0.15 s |

### Benchmark Settings

- **Small batch parsing** used a `chunk_size` of `10`.
- **Large batch parsing** used a `chunk_size` of `500`.
- **Small batch ranking** used the default `max_workers` of `4`.
- **Large batch ranking** used a `max_workers` of `8`.

These results cover both small and large inputs, from a single title up to batches of 1000.
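
A quick way to read the table is to convert each mean into a per-item cost. The sketch below does only that arithmetic on the figures reported above; the dictionary keys and the `per_item_us` helper are names made up for illustration and are not part of RTN.

```python
# Mean times copied from the results table, converted to seconds.
# Keys are illustrative labels, not RTN identifiers.
results = {
    "parse_single": (1, 620e-6),
    "parse_small_batch": (10, 6.06e-3),
    "parse_large_batch": (1000, 640e-3),
    "rank_single": (1, 660e-6),
    "rank_small_batch": (10, 24.6e-3),
    "rank_large_batch": (1000, 3.13),
}


def per_item_us(items: int, mean_seconds: float) -> float:
    """Return the mean cost per item in microseconds."""
    return mean_seconds / items * 1e6


per_item = {name: per_item_us(*row) for name, row in results.items()}

for name, value in per_item.items():
    print(f"{name:>18}: {value:8.1f} us/item")
```

Read this way, parsing stays near 0.6 ms per title at every size, while ranking rises from 0.66 ms for a single item to roughly 2.5 ms and 3.1 ms per item in the small and large batches. It may be worth measuring on your own hardware before assuming that more threads will help.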


## Optimizing RTN Performance

The performance benchmarks provided give a glimpse into how RTN handles different loads, from parsing single titles to ranking thousands. For developers looking to integrate RTN into their applications efficiently, here are some tips on tweaking performance:

### 1. Adjusting Chunk Size for Batch Parsing
The `batch_parse` function parses titles in batches, which can reduce processing time for large datasets. The optimal `chunk_size` depends on the dataset size and your system's resources.

- For smaller datasets, a lower `chunk_size` might suffice, keeping overhead low.
- For larger datasets, increasing `chunk_size` can reduce the number of batches processed and potentially lower overall processing time.

Experiment with different `chunk_size` values to find the sweet spot for your particular use case.
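
One possible way to run that experiment is sketched below. It does not call RTN: `fake_parse` is a stand-in for the real `parse`, and `chunked` and `time_chunk_size` are hypothetical helpers written for this example. Swap in the real parser to get meaningful numbers.

```python
import time
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def fake_parse(title: str) -> Dict[str, object]:
    """Stand-in for RTN's parse(); NOT the real parser."""
    return {"raw_title": title, "tokens": title.split(".")}


def time_chunk_size(
    titles: List[str],
    chunk_size: int,
    parse_one: Callable[[str], object] = fake_parse,
) -> float:
    """Return the seconds taken to parse all titles chunk by chunk."""
    start = time.perf_counter()
    for chunk in chunked(titles, chunk_size):
        for title in chunk:
            parse_one(title)
    return time.perf_counter() - start


titles = [f"Movie.Title.{i}.1080p.BluRay.x264" for i in range(1, 1001)]

# Time a few candidate sizes and keep the fastest one.
timings = {size: time_chunk_size(titles, size) for size in (10, 50, 500)}
best = min(timings, key=timings.get)
print(f"fastest chunk_size in this run: {best}")
```

A single timing is noisy, so repeating each measurement several times (or using a tool such as `pyperf`, as the project's benchmark script does) gives a more reliable comparison.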

### 2. Tuning Concurrency in Batch Ranking
The `batch_rank` function uses multiple threads to rank torrents in parallel, which can significantly speed up processing for large numbers of torrents.

- The default `max_workers` value is set to `4`, but this might not be optimal for all systems.
- Systems with higher CPU core counts might benefit from increasing `max_workers`, allowing more torrents to be processed simultaneously.
- However, setting `max_workers` too high can lead to diminishing returns and increased overhead. Monitor your system's resource utilization to find an optimal setting.
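
The same kind of comparison can be made for `max_workers`. The sketch below mirrors the `ThreadPoolExecutor` pattern but ranks with `fake_rank`, a stand-in that is not RTN's ranking logic; `timed_batch` is likewise a hypothetical helper for this example.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_rank(title: str, infohash: str) -> Tuple[str, int]:
    """Stand-in for RTN.rank(); NOT the real ranking logic."""
    return infohash, len(title)


def timed_batch(
    torrents: List[Tuple[str, str]], max_workers: int
) -> Tuple[List[Tuple[str, int]], float]:
    """Rank all torrents on a thread pool and return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input.
        ranked = list(executor.map(lambda t: fake_rank(t[0], t[1]), torrents))
    return ranked, time.perf_counter() - start


# Placeholder titles with zero-padded 40-character hex strings as infohashes.
torrents = [
    (f"Movie.Title.{i}.1080p.BluRay.x264", f"{i:040x}") for i in range(1, 1001)
]

candidates = sorted({1, 4, 8, os.cpu_count() or 1})
timings = {}
for workers in candidates:
    ranked, elapsed = timed_batch(torrents, workers)
    timings[workers] = elapsed
    print(f"max_workers={workers}: {elapsed * 1000:.2f} ms")
```

In CPython, threads usually give little speedup for CPU-bound pure-Python work because of the global interpreter lock, so it is worth including `max_workers=1` as a baseline in any comparison.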

### 3. Leveraging ThreadPoolExecutor
Both `batch_parse` and `batch_rank` utilize `ThreadPoolExecutor` for parallel processing. Adjusting the `max_workers` parameter can help manage how many threads are used for these operations, impacting performance and resource utilization.

### 4. Custom Settings and Ranking Models
Customizing `SettingsModel` and `RankingModel` allows you to tailor the parsing and ranking criteria to your needs, potentially streamlining the processing by focusing only on relevant data.

- Evaluate which torrent attributes are essential for your application and adjust your settings model accordingly.
- Consider disabling unnecessary custom ranks or attributes in the ranking model to simplify the ranking process.

### Example: Tweaking Performance for Large Datasets

Suppose you're processing a dataset of 10,000 torrent titles. You might start with a `chunk_size` of `50` and a `max_workers` of `4`. Through experimentation, you might find that a larger `chunk_size` such as `500` and a `max_workers` of `8` reduce your processing time; the actual gain depends on your data and hardware.

```python
from RTN import RTN, SettingsModel, DefaultRanking, batch_parse

# Setup
settings = SettingsModel()
ranking_model = DefaultRanking()
rtn = RTN(settings=settings, ranking_model=ranking_model)

# Optimized batch parsing (placeholder titles stand in for real torrent names)
optimized_titles = [f"Title {i}" for i in range(1, 10001)]
parsed_data = batch_parse(optimized_titles, chunk_size=500, max_workers=8)
```

By monitoring performance and adjusting parameters based on your specific requirements and system capabilities, you can significantly enhance RTN's efficiency in your projects.

## Contributing

Contributions to RTN are welcome! Feel free to submit pull requests or open issues to suggest features or report bugs. More features are coming as RTN grows, and a lot is already planned.
48 changes: 46 additions & 2 deletions RTN/parser.py
@@ -100,7 +100,43 @@ def rank(self, raw_title: str, infohash: str) -> Torrent:
)

def batch_rank(self, torrents: List[Tuple[str, str]], max_workers: int = 4) -> List[Torrent]:
"""Ranks a batch of torrents in parallel using multiple threads."""
"""
Ranks a batch of torrents in parallel using multiple threads.
Parameters:
`torrents` (List[Tuple[str, str]]): A list of tuples containing the raw title and infohash of each torrent.
`max_workers` (int, optional): The maximum number of worker threads to use for parallel processing. Defaults to 4.
Returns:
List[Torrent]: A list of Torrent objects representing the ranked torrents.
Raises:
ValueError: If the title or infohash is not provided for any torrent.
TypeError: If the title or infohash is not a string.
ValueError: If the infohash is not a valid 40-character SHA-1 hash.
Example:
>>> torrents = [
... ("The Walking Dead S05E03 720p HDTV x264-ASAP[ettv]", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e7"),
... ("Example.Movie.2020.1080p.BluRay.x264-Example", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e8"),
... ("Example.Series.S2.2020", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e9"),
... ]
>>> rtn = RTN(settings_model, ranking_model)
>>> ranked_torrents = rtn.batch_rank(torrents)
>>> len(ranked_torrents)
3
>>> isinstance(ranked_torrents[0], Torrent)
True
>>> isinstance(ranked_torrents[0].data, ParsedData)
True
>>> ranked_torrents[0].fetch
True
>>> ranked_torrents[0].rank > 0
True
>>> ranked_torrents[0].lev_ratio > 0.0
True
"""
with ThreadPoolExecutor(max_workers=max_workers) as executor:
return list(executor.map(lambda t: self.rank(t[0], t[1]), torrents))

@@ -128,7 +164,15 @@ def parse(raw_title: str) -> ParsedData:


def parse_chunk(chunk: List[str]) -> List[ParsedData]:
"""Parses a chunk of torrent titles."""
"""
Parses a chunk of torrent titles.
Args:
chunk (List[str]): A list of torrent titles to parse.
Returns:
List[ParsedData]: A list of ParsedData objects containing the parsed metadata from the torrent titles.
"""
return [parse(title) for title in chunk]


81 changes: 37 additions & 44 deletions benchmarks/rank.py
@@ -1,51 +1,44 @@
import pyperf

from RTN import RTN, DefaultRanking, SettingsModel, parse
from RTN import RTN, DefaultRanking, SettingsModel, batch_parse, parse

# Setup
settings = SettingsModel()
ranking_model = DefaultRanking()
rtn = RTN(settings=settings, ranking_model=ranking_model)

def single_parse_benchmark_run():
parse("The.Mandalorian.S01E02.1080p.DSNP.WEB-DL.x264")

def multi_parse_benchmark_run():
titles = [
"The.Matrix.1999.1080p.BluRay.x264",
"Inception.2010.720p.BRRip.x264",
"Avengers.Endgame.2019.2160p.UHD.BluRay.x265",
"Interstellar.2014.IMAX.BDRip.x264",
"Game.of.Thrones.S01E01.1080p.WEB-DL.x264",
"Breaking.Bad.S05E14.720p.HDTV.x264",
"The.Witcher.S02E05.2160p.NF.WEBRip.x265",
"The.Mandalorian.S01E02.1080p.DSNP.WEB-DL.x264",
"1917.2019.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1",
"Joker.2019.720p.BluRay.x264"
]
for title in titles:
parse(title)

def single_benchmark_run():
rtn.rank("The.Matrix.1999.1080p.BluRay.x264", "30bfd9a796679bbeb0e110c17f32148ab8fd5746")

def multi_benchmark_run():
titles_infohashes = [
("The.Matrix.1999.1080p.BluRay.x264", "30bfd9a796679bbeb0e110c17f32148ab8fd5746"),
("Inception.2010.720p.BRRip.x264", "c9b4c5e5789c91823c2117b3550663c6bdd9b965"),
("Avengers.Endgame.2019.2160p.UHD.BluRay.x265", "1ba1a10a4409727e85cdba10591a58558a615f13"),
("Interstellar.2014.IMAX.BDRip.x264", "7c2c1525a61c6b1377ecbf3c1a3995285ebcd8f7"),
("Game.of.Thrones.S01E01.1080p.WEB-DL.x264", "1205555d9771e3a32a065d96dd582d09495661dc"),
("Breaking.Bad.S05E14.720p.HDTV.x264", "659cb95e90a52b0ad1bdeaed764716a715ad7599"),
("The.Witcher.S02E05.2160p.NF.WEBRip.x265", "c379a0247a6068ce5fb2092cfd7851ac08d8487c"),
("The.Mandalorian.S01E02.1080p.DSNP.WEB-DL.x264", "59a34ce306ec1332bf216b531bbc6a014e23e415"),
("1917.2019.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1", "1e510d4b0f82eaf552bf7b24e4bba6bd3693341e"),
("Joker.2019.720p.BluRay.x264", "fde14883bc2de07ae883bf1449eabc7c4e1a9b84")
]
for title, infohash in titles_infohashes:
rtn.rank(title, infohash)

runner = pyperf.Runner()
runner.bench_func("Parsing Benchmark (1x)", single_parse_benchmark_run)
runner.bench_func("Parsing Benchmark (10x)", multi_parse_benchmark_run)
runner.bench_func("Ranking Benchmark (1x)", single_benchmark_run)
runner.bench_func("Ranking Benchmark (10x)", multi_benchmark_run)

titles_infohashes = [
(f"Movie.Title.{i}.1080p.BluRay.x264", "30bfd9a796679bbeb0e110c17f32148ab8fd5746")
for i in range(1, 1001)
]
titles = [title for title, _ in titles_infohashes]

# Benchmark Functions
def single_parse_benchmark():
parse(titles[0])

def batch_parse_small_benchmark():
batch_parse(titles[:10], chunk_size=10)

def batch_parse_large_benchmark():
batch_parse(titles, chunk_size=500)

def single_rank_benchmark():
rtn.rank(*titles_infohashes[0])

def batch_rank_small_benchmark():
rtn.batch_rank(titles_infohashes[:10], max_workers=4) # type: ignore

def batch_rank_large_benchmark():
rtn.batch_rank(titles_infohashes, max_workers=8) # type: ignore


runner = pyperf.Runner(loops=1)

runner.bench_func("Parsing Benchmark (1x item)", single_parse_benchmark)
runner.bench_func("Batch Parse Benchmark - Small - (10x items)", batch_parse_small_benchmark)
runner.bench_func("Batch Parse Benchmark - Large - (1000 items)", batch_parse_large_benchmark)
runner.bench_func("Ranking Benchmark (1x item)", single_rank_benchmark)
runner.bench_func("Batch Rank Benchmark - Small - (10x items)", batch_rank_small_benchmark)
runner.bench_func("Batch Rank Benchmark - Large - (1000 items)", batch_rank_large_benchmark)
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "rank-torrent-name"
version = "0.1.5"
version = "0.1.6"
description = "Parse Torrents using PTN and Rank them according to your preferences!"
authors = ["Spoked <dreu.lavelle@gmail.com>"]
license = "MIT"
@@ -50,6 +50,7 @@ exclude = '''
| buck-out
| build
| dist
| tests
)/
'''

17 changes: 17 additions & 0 deletions tests/test_ranker.py
@@ -141,6 +141,23 @@ def test_rank_calculation_accuracy(settings_model, ranking_model):
rank = get_rank(parsed_data, settings_model, ranking_model)
assert rank == 273, f"Expected rank did not match, got {rank}"

def test_batch_ranking(settings_model, ranking_model):
rtn = RTN(settings_model, ranking_model)
torrents = [
("The Walking Dead S05E03 720p HDTV x264-ASAP[ettv]", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e7"),
("Example.Movie.2020.1080p.BluRay.x264-Example", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e8"),
("Example.Series.S2.2020", "c08a9ee8ce3a5c2c08865e2b05406273cabc97e9"),
]

ranked_torrents = rtn.batch_rank(torrents)
assert len(ranked_torrents) == 3
for torrent in ranked_torrents:
assert isinstance(torrent, Torrent)
assert isinstance(torrent.data, ParsedData)
assert torrent.fetch is True
assert torrent.rank > 0, f"Rank was {torrent.rank} instead of > 0"
assert torrent.lev_ratio > 0.0, f"Levenshtein ratio was {torrent.lev_ratio} instead of > 0.0"

def test_preference_handling(custom_settings_model, ranking_model):
# Test with preferred title with a preference for Season number in title
# to make sure we can check before-after case. User should be able to set