Search2.0 lets users search their local files using natural language, powered by LLM agents. It uses tools such as PDF search and code search to find relevant files. For example, a user might ask, “Find the CSYE7230 project proposal,” and the app will return the most relevant file paths and a summary, and let the user view the content. We will build our agentic system using Llama Stack, developed by Meta.
We plan to scale the PDF search tool using ColPali for production. ColPali eliminates the need for complex and fragile layout recognition or OCR pipelines by using a single model that understands both the text and visual content (e.g., layout, charts) of a document. It has delivered the best results so far. We will experiment with the following ideas:
- Use smaller models such as Llama3.2 and a key-value (KV) cache to reduce latency
- Cache warming while users are typing
- KV cache compression
- Binary quantization
- Model distillation
- Sync local files and remote indexes using hash and hierarchical file traversal
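The last idea in the list, hash-based syncing, can be sketched with the standard library alone. This is a minimal illustration and not the project's implementation: `file_digest`, `snapshot`, and `diff_snapshots` are hypothetical names, SHA-256 is an assumed choice of hash, and the snapshot is flat (a fully hierarchical variant would also hash each directory from its children so that unchanged subtrees can be skipped).

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: Path) -> Dict[str, str]:
    """Walk `root` recursively and map each relative file path to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """List the paths a remote index would need to add, drop, or re-embed."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Comparing the previous snapshot with a fresh one yields only the files whose content changed, so unchanged files would not need to be re-indexed.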
TODO
- We use Llama3.2-vision to generate a synthetic evaluation dataset.
- We use the CVPR 2019 Papers dataset as our source of PDF documents.
- We use LangChain to generate synthetic data and create a baseline for our evaluation.
We use the CVPR 2019 Papers dataset from Kaggle, which contains over 1,000 academic papers from the CVPR 2019 conference. From this dataset, 5 papers were randomly selected to generate 10 test sets (cvpr2019_5papers_testset_12q.csv) using the Ragas framework. The dataset can be found at friedahuang/cvpr2019_5papers_testset_12q. See ragas_evaluate.py for the implementation. The selected papers are:
- Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf
- Aakur_A_Perceptual_Prediction_Framework_for_Self_Supervised_Event_Segmentation_CVPR_2019_paper.pdf
- Abati_Latent_Space_Autoregression_for_Novelty_Detection_CVPR_2019_paper.pdf
- Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.pdf
- Abbasnejad_A_Generative_Adversarial_Density_Estimator_CVPR_2019_paper.pdf
Additionally, we use the Hugging Face dataset m-ric/huggingface_doc to generate 347 question-answer pairs (QA couples). The synthetically generated QA couples can be found at friedahuang/m-ric_huggingface_doc_347. We will focus on this evaluation dataset because it is larger and we have already established a baseline RAG system benchmarked against it. See benchmark_rag.py for the implementation.
- Next.js
- TypeScript
- shadcn/ui
- Python
- SQLAlchemy: Python SQL toolkit and Object Relational Mapper
- Llama Stack: Standardize the building blocks needed to bring generative AI applications to market
- pgvector: An extension of PostgreSQL with the ability to store and search vector embeddings alongside regular data
- LangChain: Framework for LLM applications (used only for evaluation)
- PostgreSQL 17
- pgvector
- Psycopg3: PostgreSQL database adapter for Python
- Alembic: Database migrations tool
- ColPali: A vision retriever based on the ColBERT architecture and the PaliGemma model
- Llama3.2: llama3.2:latest
- Vercel
- GCP
- Loguru: Simplified Python logging
- pre-commit: Multi-language pre-commit hooks manager
- Ruff: Fast Python linter and formatter
To set up the project locally:
- Clone the repository:
git clone https://github.com/frieda-huang/csye7230.git
- Install dependencies:
poetry install
- Set up pre-commit hooks:
pre-commit install
- Create a database user:
CREATE USER searchagent_user WITH PASSWORD 'your_secure_password';
- Create a new database:
CREATE DATABASE searchagent OWNER searchagent_user;
- Grant necessary privileges:
GRANT ALL PRIVILEGES ON DATABASE searchagent TO searchagent_user;
- Add the PostgreSQL connection string to .env:
DATABASE_URL=postgresql+psycopg://searchagent_user:your_secure_password@localhost:5432/searchagent
- Use the searchagent_user in psql
psql -U searchagent_user -d searchagent
- Ensure searchagent_user has superuser privileges by logging in as a superuser and running:
ALTER USER searchagent_user WITH SUPERUSER;
- Enable the pgvector extension
CREATE EXTENSION vector;
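Before starting the app, the DATABASE_URL value from the steps above can be sanity-checked. The sketch below is only an illustration: `describe_database_url` is a hypothetical helper, and it assumes the variable has already been loaded into the environment from .env. It uses only the standard library and does not open a connection.

```python
import os
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a SQLAlchemy-style URL into its parts without exposing the password."""
    parts = urlsplit(url)
    # The scheme has the form "dialect+driver", e.g. "postgresql+psycopg".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "has_password": parts.password is not None,
    }


if __name__ == "__main__":
    # Assumption: DATABASE_URL was exported from .env before running this script.
    url = os.environ.get("DATABASE_URL")
    if url:
        print(describe_database_url(url))
```

For the example URL above, the helper reports the `postgresql` dialect with the `psycopg` driver, user `searchagent_user`, host `localhost`, port `5432`, and database `searchagent`.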
- Find the config file by running SHOW config_file; (on macOS, it is at /opt/homebrew/var/postgresql@17/postgresql.conf)
- Use PgTune to set initial values for PostgreSQL server parameters
For example, on my machine (Apple M2 Pro), I have the following initial settings:
# DB Version: 17
# OS Type: mac
# DB Type: web
# Total Memory (RAM): 32 GB
# CPUs num: 12
# Connections num: 100
# Data Storage: ssd
max_connections = 100
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 20971kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 12
max_parallel_workers_per_gather = 4
max_parallel_workers = 12
max_parallel_maintenance_workers = 4
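The memory-related values above follow simple ratios of total RAM. The sketch below uses rules of thumb that are an assumption on our part: they were chosen because they reproduce the numbers in this listing for a 32 GB, 100-connection "web" profile, and they are not taken from PgTune's source. `suggest_memory_settings` and `format_kb` are hypothetical helpers.

```python
KB_PER_GB = 1024 * 1024


def format_kb(kb: int) -> str:
    """Render a size in kB using the largest unit that divides it evenly."""
    if kb % KB_PER_GB == 0:
        return f"{kb // KB_PER_GB}GB"
    if kb % 1024 == 0:
        return f"{kb // 1024}MB"
    return f"{kb}kB"


def suggest_memory_settings(total_ram_gb: int, max_connections: int, workers_per_gather: int) -> dict:
    """Assumed heuristics for a 'web' workload; not PgTune's actual code."""
    total_kb = total_ram_gb * KB_PER_GB
    shared_buffers = total_kb // 4            # one quarter of RAM
    effective_cache_size = total_kb * 3 // 4  # three quarters of RAM
    maintenance_work_mem = min(total_kb // 16, 2 * KB_PER_GB)  # capped at 2GB
    # Memory left after shared_buffers, divided across connections (times 3)
    # and then across the parallel workers each query may use.
    work_mem = (total_kb - shared_buffers) // (max_connections * 3) // workers_per_gather
    return {
        "shared_buffers": format_kb(shared_buffers),
        "effective_cache_size": format_kb(effective_cache_size),
        "maintenance_work_mem": format_kb(maintenance_work_mem),
        "work_mem": format_kb(work_mem),
    }


print(suggest_memory_settings(total_ram_gb=32, max_connections=100, workers_per_gather=4))
```

Treat the result as a starting point only; the values PgTune generates for your own hardware remain the reference.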
We use pg_stat_statements to monitor overall query performance in PostgreSQL.
When investigating high I/O activity, the most time-consuming queries, or high memory usage, remember to run pg_stat_statements_reset() to flush data from pg_stat_statements, and VACUUM ANALYZE; to reduce bloat and refresh query planner statistics.
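As an illustration of the kind of check described here, the sketch below lists the statements with the highest cumulative execution time. It assumes the pg_stat_statements extension is enabled and that `conn` is a psycopg 3 connection (which exposes `execute()`); `top_queries` is a hypothetical helper, not part of the project.

```python
TOP_QUERIES_SQL = """
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT {limit}
"""


def top_queries(conn, limit: int = 5):
    """Return (query, calls, total_exec_time, mean_exec_time) rows, slowest first."""
    # int() guards the only interpolated value, so no user text reaches the SQL.
    sql = TOP_QUERIES_SQL.format(limit=int(limit))
    return conn.execute(sql).fetchall()
```

A typical session would call `top_queries(conn)` after a representative workload, then reset the statistics before the next measurement.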
- Create a new migration by running
alembic revision --autogenerate -m "YOUR MSG"
- Apply the new migration by running
alembic upgrade head
When updating column types in the ORM, ensure the database schema reflects these changes.
For instance, if changing the vector type in the query table from full precision to half-precision (halfvec), apply the corresponding database migration.
ALTER TABLE query
ALTER COLUMN vector_embedding
TYPE HALFVEC(128)[]
USING vector_embedding::HALFVEC(128)[];
For the flattened_embedding table:
ALTER TABLE flattened_embedding
ALTER COLUMN vector_embedding
TYPE HALFVEC(128)
USING vector_embedding::HALFVEC(128);
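To see what the switch to halfvec buys in storage, the per-value sizes documented by pgvector (4 bytes per dimension plus 8 for vector, 2 bytes per dimension plus 8 for halfvec) can be worked through for the 128-dimensional embeddings used here. The function names are illustrative only, and the figures ignore array and row overhead.

```python
def vector_bytes(dimensions: int) -> int:
    """Storage for one full-precision pgvector `vector` value."""
    return 4 * dimensions + 8


def halfvec_bytes(dimensions: int) -> int:
    """Storage for one half-precision pgvector `halfvec` value."""
    return 2 * dimensions + 8


dims = 128
full, half = vector_bytes(dims), halfvec_bytes(dims)
print(full, half)  # 520 264
print(f"saved per vector: {1 - half / full:.1%}")  # saved per vector: 49.2%
```

Because a multi-vector retriever stores many embeddings per page, this roughly halved per-vector footprint applies to every row of flattened_embedding.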
If PostgreSQL reports an invalid password after the computer failed to shut down cleanly, run the following commands:
cd /opt/homebrew/var/postgresql@17
rm postmaster.pid
brew services restart postgresql@17
We will only access the user’s home directory, which contains most user-accessible files and data. The home directory includes:
- Desktop
- Documents
- Downloads
- Pictures
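One way to enforce this scope is to resolve every candidate path and check that it falls under an allowed folder. The sketch below is an illustration, not the project's code: it assumes the four folders listed above are the only search roots (the text does not say whether other folders under the home directory are included), and `search_roots` and `is_searchable` are hypothetical names.

```python
from pathlib import Path
from typing import List, Optional

SEARCH_FOLDERS = ("Desktop", "Documents", "Downloads", "Pictures")


def search_roots(home: Optional[Path] = None) -> List[Path]:
    """Return the resolved folders that the search agent may read."""
    base = home if home is not None else Path.home()
    return [(base / name).resolve() for name in SEARCH_FOLDERS]


def is_searchable(path: Path, home: Optional[Path] = None) -> bool:
    """True if `path` resolves to a search root or to a location inside one."""
    # Resolving first collapses ".." segments and symlinks, so a path cannot
    # escape an allowed folder by traversal.
    resolved = Path(path).expanduser().resolve()
    return any(resolved == root or root in resolved.parents for root in search_roots(home))
```

Checking the resolved path rather than the raw string means an input such as `~/Documents/../.ssh/id_rsa` is rejected.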
>>> response = searchagent.query("find csye7230 project proposal")
>>> response.documents
Output:
Document(metadata={'source': '../proposals/csye7230_project_proposal_part_a.pdf'}, page_content='...')
Document(metadata={'source': '../proposals/csye7230_project_proposal_part_b.pdf'}, page_content='...')
Document(metadata={'source': '../proposals/csye7230_project_benchmarking_report.pdf'}, page_content='...')
>>> response.answer
Output:
The following files match your query:
1. csye7230_project_proposal_part_a.pdf
`../proposals/csye7230_project_proposal_part_a.pdf`
This PDF contains the project proposal for CSYE7230, detailing objectives, methodologies, and expected outcomes.
2. csye7230_project_proposal_part_b.pdf
`../proposals/csye7230_project_proposal_part_b.pdf`
This PDF outlines the implementation plan for project proposal part B, focusing on architecture and design choices.
3. csye7230_project_benchmarking_report.pdf
`../proposals/csye7230_project_benchmarking_report.pdf`
This PDF presents the benchmarking report for CSYE7230, evaluating the performance metrics and analysis of the project components.
- Python: Production-Level Coding Practices
- RAG Evaluation (LLM-as-a-judge)
- Analyze file system and folder structures with Python
- Pytest Best Practices
- Implement semantic cache to improve a RAG system
- Set up eval pipeline
- Reranking
- Evaluating Chunking Strategies for Retrieval
- A Reddit post on a new chunking algorithm
- 5 levels of text splitting
- A blog on ripgrep—a line-oriented search tool that recursively searches the current directory for a regex pattern
- microsearch—a search engine in 80 lines of Python
- Agent architectures
- Embedding Quantization
- Llama3.2 is here
- The Ultimate Guide to Vector Database Landscape - 2024 and Beyond
- pgvector: Multi-vector support
- pg_stat_statements