Search2.0

Search2.0 lets users search their local files using natural language, powered by LLM agents. It uses tools such as PDF search and code search to find relevant files. For example, a user might ask, “Find the CSYE7230 project proposal,” and the app will return the most relevant file paths and a summary, and let the user view the content. We will build our agentic system using Llama Stack, developed by Meta.

PDFSearchTool

We plan to scale the PDF search tool using ColPali for production. ColPali eliminates the need for complex and fragile layout recognition or OCR pipelines by using a single model that understands both the text and visual content (e.g., layout, charts) of a document. It has delivered the best results so far. We will experiment with the following ideas:

Use smaller models such as Llama3.2 and a key-value (KV) cache to reduce latency
Cache warming when users are typing
KV cache compression
Binary Quantization
Model distillation
Sync local files and remote indexes using hash and hierarchical file traversal
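
The last idea in the list, syncing local files with the index by hash and directory traversal, could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names (`file_digest`, `snapshot`, `diff_index`), the choice of SHA-256, and the `.pdf` filter are hypothetical and not taken from the project code.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: Path, suffixes: tuple[str, ...] = (".pdf",)) -> dict[str, str]:
    """Walk `root` top-down and map each matching relative path to its digest."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.lower().endswith(suffixes):
                path = Path(dirpath) / name
                result[str(path.relative_to(root))] = file_digest(path)
    return result


def diff_index(local: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare a local snapshot with the digests recorded in the index."""
    return {
        "added": sorted(p for p in local if p not in indexed),
        "changed": sorted(p for p in local if p in indexed and local[p] != indexed[p]),
        "removed": sorted(p for p in indexed if p not in local),
    }
```

Only the files reported as `added` or `changed` would need to be re-embedded, and `removed` entries could be dropped from the index.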

CodeSearchTool

TODO

Evaluation

We use Llama3.2-vision to generate a synthetic evaluation dataset.
We use the CVPR 2019 Papers dataset as our source of PDF documents.
We use LangChain to generate synthetic data and create a baseline for our evaluation.

Dataset

We use the CVPR 2019 Papers dataset from Kaggle, which contains over 1,000 academic papers from the CVPR 2019 conference. From this dataset, 5 papers were randomly selected to generate 10 test sets (cvpr2019_5papers_testset_12q.csv) using the Ragas framework. The dataset can be found at friedahuang/cvpr2019_5papers_testset_12q. See ragas_evaluate.py for the implementation. The selected papers are:

Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf
Aakur_A_Perceptual_Prediction_Framework_for_Self_Supervised_Event_Segmentation_CVPR_2019_paper.pdf
Abati_Latent_Space_Autoregression_for_Novelty_Detection_CVPR_2019_paper.pdf
Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.pdf
Abbasnejad_A_Generative_Adversarial_Density_Estimator_CVPR_2019_paper.pdf
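
A random selection like the one above can be made reproducible by fixing a seed. The sketch below is a hypothetical helper, not the code in ragas_evaluate.py; the name `sample_papers` and its parameters are assumptions for illustration.

```python
import random
from pathlib import Path


def sample_papers(pdf_dir: Path, k: int = 5, seed: int = 0) -> list[Path]:
    """Pick k PDFs from pdf_dir; the same seed returns the same selection."""
    pdfs = sorted(pdf_dir.glob("*.pdf"))  # sort first so the input order is stable
    if len(pdfs) < k:
        raise ValueError(f"need at least {k} PDFs, found {len(pdfs)}")
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(pdfs, k)
```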

Additionally, we use the Hugging Face dataset (m-ric/huggingface_doc) to generate 347 question-answer pairs (QA couples). The synthetically generated QA couples can be found at friedahuang/m-ric_huggingface_doc_347. We will focus on this evaluation dataset because it is larger and we have already established a baseline RAG system benchmarked against it. See benchmark_rag.py for the implementation.

Tech Stack

Frontend

Next.js
TypeScript
shadcn/ui

Backend

Python
SQLAlchemy: Python SQL toolkit and Object Relational Mapper
Llama Stack: Standardize the building blocks needed to bring generative AI applications to market
pgvector: An extension of PostgreSQL with the ability to store and search vector embeddings alongside regular data
LangChain: Framework for LLM applications (used only for evaluation)

Database

PostgreSQL 17
pgvector
Psycopg3: PostgreSQL database adapter for Python
Alembic: Database migrations tool

Models

ColPali: A vision retriever based on the ColBERT architecture and the PaliGemma model
Llama3.2: llama3.2:latest
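
ColBERT-style retrievers such as ColPali score a document with late interaction: each query token embedding is matched against its best document (here, page patch) embedding, and the best matches are summed. The sketch below shows that scoring rule in plain Python on toy 2-dimensional vectors. It is a minimal illustration, not the project's retrieval code, and `maxsim_score` is a hypothetical name.

```python
def maxsim_score(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Late-interaction (MaxSim) score between a query and one page.

    For each query vector, take the largest dot product against any page
    vector, then sum those maxima over all query vectors.
    """
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy example: two query vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for each query vector
page_b = [[0.5, 0.5]]              # one generic vector
ranked = sorted(
    {"page_a": page_a, "page_b": page_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
```

In this toy case `page_a` ranks first because it contains a close match for each query vector.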

DevOps

Vercel
GCP

MLOps

Unsloth: Finetune & train LLMs
RunPod: Cloud computing platform for ML apps
Ollama: Run LLMs locally

Code Quality & Tooling

Loguru: Simplified Python logging
pre-commit: Multi-language pre-commit hooks manager
Ruff: Fast Python linter and formatter

Setup

To set up the project locally:

Clone the repository:
```
git clone https://github.com/frieda-huang/csye7230.git
```

Install dependencies:
```
poetry install
```
Set up pre-commit hooks:
```
pre-commit install
```

Database Setup

Create a database user: `CREATE USER searchagent_user WITH PASSWORD 'your_secure_password';`
Create a new database: `CREATE DATABASE searchagent OWNER searchagent_user;`
Grant the necessary privileges: `GRANT ALL PRIVILEGES ON DATABASE searchagent TO searchagent_user;`
Add the PostgreSQL connection string to .env: `DATABASE_URL=postgresql+psycopg://searchagent_user:your_secure_password@localhost:5432/searchagent`
Connect as searchagent_user in psql: `psql -U searchagent_user -d searchagent`
Give searchagent_user superuser privileges by running the following while logged in as an existing superuser: `ALTER USER searchagent_user WITH SUPERUSER;`
Enable the pgvector extension: `CREATE EXTENSION vector;`
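
If the password contains characters such as `@`, `:` or `/`, they need to be percent-encoded before they go into DATABASE_URL. The helper below is a small sketch of that step using only the standard library; `build_database_url` is a hypothetical name and not part of the project.

```python
from urllib.parse import quote_plus


def build_database_url(
    user: str,
    password: str,
    host: str = "localhost",
    port: int = 5432,
    database: str = "searchagent",
) -> str:
    """Build a SQLAlchemy-style URL for the psycopg (v3) driver.

    The user and password are percent-encoded so that special characters
    do not break URL parsing.
    """
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Value to place after DATABASE_URL= in .env
url = build_database_url("searchagent_user", "your_secure_password")
```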

Tune Postgres Server Performance

Find the config file with `SHOW config_file;`. On macOS with Homebrew, it is at /opt/homebrew/var/postgresql@17/postgresql.conf
Use PgTune to set initial values for Postgres server parameters

For example, on my machine (Apple M2 Pro), I have the following initial settings:

```
# DB Version: 17
# OS Type: mac
# DB Type: web
# Total Memory (RAM): 32 GB
# CPUs num: 12
# Connections num: 100
# Data Storage: ssd

max_connections = 100
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 20971kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 12
max_parallel_workers_per_gather = 4
max_parallel_workers = 12
max_parallel_maintenance_workers = 4
```
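
The memory settings above are consistent with a few simple rules of thumb: a quarter of RAM for shared_buffers, three quarters for effective_cache_size, one sixteenth (capped at 2GB) for maintenance_work_mem, and the remaining RAM divided across connections and parallel workers for work_mem. The sketch below reproduces the numbers under those assumptions. It is an inference from the values shown, not a copy of PgTune's algorithm, and `estimate_memory_settings` is a hypothetical name.

```python
def estimate_memory_settings(
    total_ram_gb: int,
    max_connections: int,
    parallel_workers_per_gather: int,
) -> dict[str, int]:
    """Estimate memory-related settings in kB from assumed rules of thumb."""
    ram_kb = total_ram_gb * 1024 * 1024
    shared_buffers = ram_kb // 4
    # Spread what is left after shared_buffers over connections (x3 as a
    # safety factor) and over the parallel workers each query may use.
    work_mem = (ram_kb - shared_buffers) // (max_connections * 3) // parallel_workers_per_gather
    return {
        "shared_buffers_kB": shared_buffers,
        "effective_cache_size_kB": ram_kb * 3 // 4,
        "maintenance_work_mem_kB": min(ram_kb // 16, 2 * 1024 * 1024),
        "work_mem_kB": work_mem,
    }


settings = estimate_memory_settings(
    total_ram_gb=32, max_connections=100, parallel_workers_per_gather=4
)
```

With the inputs from the example machine, this gives 8GB, 24GB, 2GB, and 20971kB, matching the configuration shown.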

Monitor Performance

We use pg_stat_statements to monitor overall query performance in PostgreSQL.

Caveat

Before investigating high I/O activity, the most time-consuming queries, or high memory usage, call pg_stat_statements_reset() to clear the statistics already collected in pg_stat_statements, so that the results reflect only the workload being measured.

Maintenance

Run `VACUUM ANALYZE;` to reduce bloat and refresh query planner statistics.

Run Database Migration

Create a new migration by running `alembic revision --autogenerate -m "YOUR MSG"`
Apply the new migration by running `alembic upgrade head`

Caveat on PostgreSQL Column Type Updates

When updating column types in the ORM, ensure the database schema reflects these changes.

For instance, if changing the vector type in the query table from full precision to half-precision (halfvec), apply the corresponding database migration.

```
ALTER TABLE query
ALTER COLUMN vector_embedding
TYPE HALFVEC(128)[]
USING vector_embedding::HALFVEC(128)[];
```

For the flattened_embedding table:

```
ALTER TABLE flattened_embedding
ALTER COLUMN vector_embedding
TYPE HALFVEC(128)
USING vector_embedding::HALFVEC(128);
```
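
Half precision stores each component as a 16-bit float, so converting existing embeddings rounds every value. A quick way to get a feel for the size of that rounding before migrating is to round-trip sample values through the 16-bit format with the standard library. This is an illustrative sketch only; `to_half` and `max_rounding_error` are hypothetical helpers, and the sample vector is made up.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_rounding_error(vector: list[float]) -> float:
    """Largest absolute change any component sees when stored as half precision."""
    return max(abs(v - to_half(v)) for v in vector)


# Made-up embedding components for illustration.
sample = [0.1, -0.25, 0.3333, 0.5]
error = max_rounding_error(sample)
```

Values that are exactly representable in 16 bits, such as 0.5 or -0.25, survive unchanged, while values such as 0.1 shift slightly.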

If PostgreSQL reports an invalid password after the computer did not shut down cleanly, remove the stale postmaster.pid file and restart the service:

```
cd /opt/homebrew/var/postgresql
rm postmaster.pid
brew services restart postgresql@17
```
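
Before deleting postmaster.pid, it can be worth checking that the process it names is really gone, since removing the file while the server is still running is unsafe. The first line of postmaster.pid holds the server's process ID. The sketch below checks whether that PID is alive. It assumes a POSIX system (such as macOS), and the helper names are hypothetical.

```python
import os
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale_pid_file(pid_file: Path) -> bool:
    """True if the PID on the first line of the file is not a running process."""
    first_line = pid_file.read_text().splitlines()[0]
    return not pid_is_running(int(first_line))
```

One limitation: after a reboot the same PID may have been reused by an unrelated process, so a "running" result does not prove that PostgreSQL itself is up.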

File Access Scope

We will only access the user’s home directory, which contains most user-accessible files and data. The home directory includes:

Desktop
Documents
Downloads
Pictures
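
One way to enforce this scope is to resolve every candidate path and confirm that it still lies under the home directory, which also rejects `..` segments and symlinks that point elsewhere. The sketch below is an assumption about how such a check might look, not the project's implementation; `is_in_scope` is a hypothetical name and it needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def is_in_scope(path: Path, home: Path) -> bool:
    """True if `path`, after resolving symlinks and '..', is inside `home`."""
    resolved = path.expanduser().resolve()
    return resolved.is_relative_to(home.resolve())


# Example: check a path against the current user's home directory.
home_dir = Path.home()
allowed = is_in_scope(home_dir / "Documents" / "proposal.pdf", home_dir)
```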

Example

>>> response = searchagent.query("find csye7230 project proposal")
>>> response.documents

Output:

Document(metadata={'source': '../proposals/csye7230_project_proposal_part_a.pdf'}, page_content='...')
Document(metadata={'source': '../proposals/csye7230_project_proposal_part_b.pdf'}, page_content='...')
Document(metadata={'source': '../proposals/csye7230_project_benchmarking_report.pdf'}, page_content='...')

>>> response.answer

Output:

The following files match your query:

1. csye7230_project_proposal_part_a.pdf
`../proposals/csye7230_project_proposal_part_a.pdf`

This PDF contains the project proposal for CSYE7230, detailing objectives, methodologies, and expected outcomes.

2. csye7230_project_proposal_part_b.pdf
`../proposals/csye7230_project_proposal_part_b.pdf`

This PDF outlines the implementation plan for project proposal part B, focusing on architecture and design choices.

3. csye7230_project_benchmarking_report.pdf
`../proposals/csye7230_project_benchmarking_report.pdf`

This PDF presents the benchmarking report for CSYE7230, evaluating the performance metrics and analysis of the project components.
