Skip to content

frieda-huang/csye7230

Repository files navigation

Search2.0

Search2.0 allows users to search their local files using natural language, powered by LLM agents. It use tools like PDF and code search to find relevant files. For example, a user might ask, “Find the CSYE7230 project proposal,” and the app will return the most relevant file paths, a summary, and allow them to view the content. We will build our agentic system using Llama Stack, developed by Meta.

PDFSearchTool

We plan to scale the PDF search tool using ColPali for production. ColPali eliminates the need for complex and fragile layout recognition or OCR pipelines by using a single model that understands both the text and visual content (e.g., layout, charts) of a document. It has delivered the best results so far. We will experiment with the following ideas:

  • Use smaller models such as Llama3.2 and key value cache to reduce latency
  • Cache warming when users are typing
  • KV cache compression
  • Binary Quantization
  • Model distillation
  • Sync local files and remote indexes using hash and hierarchical file traversal

CodeSearchTool

TODO

Evaluation

  1. We use LLama3.2-vision to generate a synthetic evaluation dataset.

  2. We use the CVPR 2019 Papers dataset as our source of PDF documents.

  3. We use Langchain to generate synthetic data and create a baseline for our evaluation.

Dataset

We use the CVPR 2019 Papers dataset from Kaggle, containing over 1,000 academic papers from the CVPR 2019 conference. From this dataset, 5 papers were randomly selected to generate 10 test sets (cvpr2019_5papers_testset_12q.csv) using the Ragas framework. The dataset can be found at friedahuang/cvpr2019_5papers_testset_12q See ragas_evaluate.py for the implementation.

  1. Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf
  2. Aakur_A_Perceptual_Prediction_Framework_for_Self_Supervised_Event_Segmentation_CVPR_2019_paper.pdf
  3. Abati_Latent_Space_Autoregression_for_Novelty_Detection_CVPR_2019_paper.pdf
  4. Abavisani_Improving_the_Performance_of_Unimodal_Dynamic_Hand-Gesture_Recognition_With_Multimodal_CVPR_2019_paper.pdf
  5. Abbasnejad_A_Generative_Adversarial_Density_Estimator_CVPR_2019_paper.pdf

Additionally, we use the Huggingface dataset (m-ric/huggingface_doc) to generate 347 question-answer pairs (QA couples). The synthetically generated QA couples can be found at friedahuang/m-ric_huggingface_doc_347. We will focus on this evaluation dataset because it has a larger volume and we've already established a baseline RAG system benchmarked against it. See benchmark_rag.py for the implementation.

Tech Stack

Frontend

Backend

  • Python
  • SQLAlchemy: Python SQL toolkit and Object Relational Mapper
  • Llama Stack: Standardize the building blocks needed to bring generative AI applications to market
  • pgvector: An extension of PostgreSQL with the ability to store and search vector embeddings alongside regular data
  • LangChain: Framework for LLM applications (It is only used for evaluation purpose)

Database

Models

  • ColPali: A vision retriever based on the ColBERT architecture and the PaliGemma model
  • Llama3.2: llama3.2:latest

Devops

  • Vercel
  • GCP

MLops

  • Unsloth: Finetune & train LLMs
  • RunPod: Cloud computing platform for ML apps
  • Ollama: Run LLM locally

Code Quality & Tooling

  • Loguru: Simplified Python logging
  • pre-commit: Multi-language pre-commit hooks manager
  • Ruff: Fast Python linter and formatter

Setup

Instructions on how to set up the project locally. For example:

  1. Clone the repository:

    git clone https://github.com/frieda-huang/csye7230.git
    
  2. Install dependencies:

    poetry install
    
  3. Set up pre-commit hooks:

    pre-commit install
    

Database Setup

  1. Create a Database user: CREATE USER searchagent_user WITH PASSWORD 'your_secure_password';
  2. Create a new database: CREATE DATABASE searchagent OWNER searchagent_user;
  3. Grant necessary privileges: GRANT ALL PRIVILEGES ON DATABASE searchagent TO searchagent_user;
  4. Add PostgreSQL connection to .env DATABASE_URL=postgresql+psycopg://searchagent_user:your_secure_password@localhost:5432/searchagent
  5. Use the searchagent_user in psql psql -U searchagent_user -d searchagent
  6. Ensure searchagent_user has superuser privileges by logging as the superuser ALTER USER searchagent_user WITH SUPERUSER;
  7. Enable the pgvector extension CREATE EXTENSION vector;

Tune Postgres Server Performance

  1. Find config file with SHOW config_file; in mac, it's in /opt/homebrew/var/postgresql@17/postgresql.conf
  2. Use PgTune to set initial values for Postgres server parameters

For example, on my machine (Apple M2 Pro), I have the following initial settings

# DB Version: 17
# OS Type: mac
# DB Type: web
# Total Memory (RAM): 32 GB
# CPUs num: 12
# Connections num: 100
# Data Storage: ssd

max_connections = 100
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 20971kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 12
max_parallel_workers_per_gather = 4
max_parallel_workers = 12
max_parallel_maintenance_workers = 4

Monitor Performance

We use pg_stat_statements to monitor overall query performance in PostgreSQL

Caveat

When querying high I/O activity, top time-consuming queries, or high memory usage, remember to use pg_stat_statements_reset() to flush data from pg_stat_statements

Maintenance

VACUUM ANALYZE; to reduce bloat and refresh query planner statistics

Run Database Migration

  1. Create new migration by running alembic revision --autogenerate -m "YOUR MSG"
  2. Apply new migration by running alembic upgrade head

Caveat on PostgreSQL Column Type Updates

When updating column types in the ORM, ensure the database schema reflects these changes.

For instance, if changing the vector type in the query table from full precision to half-precision (halfvec), apply the corresponding database migration.

ALTER TABLE query
ALTER COLUMN vector_embedding
TYPE HALFVEC(128)[]
USING vector_embedding::HALFVEC(128)[];

For flattened_embedding table

ALTER TABLE flattened_embedding
ALTER COLUMN vector_embedding
TYPE HALFVEC(128)
USING vector_embedding::HALFVEC(128);

To deal with invalid password after the computer didn't complete the shutdown process, we need to execute the following commands:

cd /opt/homebrew/var/postgresql

rm postmaster.pid

brew services restart postgresql@17

File Access Scope

We will only access the user’s home directory, which contains most user-accessible files and data. The home directory includes:

  • Desktop
  • Documents
  • Downloads
  • Pictures

Example

>>> response = searchagent.query("find csye7230 project proposal")
>>> response.documents

Output:

Document(metadata={'source': '../proposals/csye7230_project_proposal_part_a.pdf'}, page_content='...')
Document(metadata={'source': '../proposals/csye7230_project_proposal_part_b.pdf'}, page_content='...')
Document(metadata={'source': '../proposals/csye7230_project_benchmarking_report.pdf'}, page_content='...')

>>> response.answer

Output:

The following files match your query:

1. csye7230_project_proposal_part_a.pdf
`../proposals/csye7230_project_proposal_part_a.pdf`

This PDF contains the project proposal for CSYE7230, detailing objectives, methodologies, and expected outcomes.

2. csye7230_project_proposal_part_b.pdf
`../proposals/csye7230_project_proposal_part_b.pdf`

This PDF outlines the implementation plan for project proposal part B, focusing on architecture and design choices.

3. csye7230_project_benchmarking_report.pdf
`../proposals/csye7230_project_benchmarking_report.pdf`

This PDF presents the benchmarking report for CSYE7230, evaluating the performance metrics and analysis of the project components.

Links to Docs

Resources

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published