This project provides a comprehensive benchmark for comparing the performance of different file formats (CSV, Parquet, and Arrow) and data processing libraries (Polars, DuckDB, and Pandas) in Python.
The project consists of four main Python scripts:
- `generate_data.py`: Generates fake transaction data.
- `export_data.py`: Exports the generated data to CSV, Parquet, and Arrow formats.
- `methods.py`: Contains the methods used to benchmark file reads for each library (a sketch of this shape follows the list).
- `benchmark.py`: Runs the benchmark tests on the exported files.
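The split between `methods.py` and `benchmark.py` suggests a small timing harness. A minimal sketch of what such methods could look like, assuming Parquet inputs and these function names (neither is confirmed by the project's actual code):

```python
# Hypothetical sketch of benchmark methods; function names and paths are assumptions.
import time

import duckdb
import pandas as pd
import polars as pl


def read_polars(path: str) -> int:
    """Read a Parquet file with Polars and return its row count."""
    return pl.read_parquet(path).height


def read_duckdb(path: str) -> int:
    """Read a Parquet file with DuckDB into memory and return its row count."""
    return len(duckdb.sql(f"SELECT * FROM '{path}'").df())


def read_pandas(path: str) -> int:
    """Read a Parquet file with pandas and return its row count."""
    return len(pd.read_parquet(path))


def time_read(fn, path: str) -> float:
    """Return the elapsed seconds for a single read."""
    start = time.perf_counter()
    fn(path)
    return time.perf_counter() - start
```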
- Docker Desktop
- Build the Docker image: `docker build -t file-reading-benchmark .`
- Run the Docker container: `docker run -it file-reading-benchmark`

You can customize the number of records generated by setting the `NUM_RECORDS` environment variable: `docker run -e NUM_RECORDS=500000 -it file-reading-benchmark`. This example will generate and benchmark 500,000 records instead of the default 10,000. Be patient when going above 100,000 rows, since generating the files takes time.
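Inside the container, `generate_data.py` presumably reads this variable from the environment. A minimal sketch of that pattern (the default matches the README; the variable handling itself is an assumption):

```python
# Hypothetical sketch: read NUM_RECORDS from the environment, defaulting to 10,000.
import os

NUM_RECORDS = int(os.environ.get("NUM_RECORDS", "10000"))
print(f"Generating {NUM_RECORDS:,} fake transaction records...")
```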
Running the container will:
- Generate a fake transaction dataset with foreign key relationships to user/product datasets.
- Export the data to CSV, Parquet, and Arrow formats (see the sketch after this list).
- Run the benchmark tests on each file format using Polars, DuckDB, and Pandas.
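As a rough illustration of the export step (a sketch only; the actual `export_data.py` may use different APIs and schemas), writing one table to all three formats with Polars could look like:

```python
# Hypothetical sketch: export one table to CSV, Parquet, and Arrow (IPC).
import polars as pl

transactions = pl.DataFrame({
    "transaction_id": [1, 2, 3],  # primary key
    "user_id": [10, 20, 10],      # foreign key into the users dataset
    "amount": [9.99, 24.50, 3.75],
})

transactions.write_csv("output/transactions.csv")
transactions.write_parquet("output/transactions.parquet")
transactions.write_ipc("output/transactions.arrow")  # Arrow IPC file format
```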
You can check the `.csv` files generated in the `output` folder.
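For a quick look at the results from Python (the file name below is hypothetical; check the `output` folder for the actual names):

```python
# Hypothetical: load one of the generated CSV files for a quick inspection.
import pandas as pd

df = pd.read_csv("output/benchmark_results.csv")  # file name is an assumption
print(df.head())
```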
You can also read the conclusions in 🗂️ Pandas vs. Polars vs. DuckDb. Who "wins"?
- Get tips, learnings, and tricks for your Data career!
- Join the Substack newsletter for content like this and more to improve your Data career!