Author: Shefali Shrivastava
This project features a semi-automated LLMOps pipeline that scrapes README.md files from popular GitHub repositories in data science and machine learning to surface innovative project ideas and technical insights. The data pipeline, orchestrated entirely with Airflow, ingests the README.md files and generates embeddings using the LlamaIndex framework and OpenAI's legacy embedding model. It then deploys a chatbot with retrieval-augmented generation (RAG) capabilities, powered by OpenAI's GPT-3.5, that answers user queries against those embeddings. The entire pipeline is containerized with Docker and hosted on an AWS EC2 instance (t2.xlarge), with PostgreSQL as the primary database and Google Cloud Storage as intermediate storage.
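At query time, the RAG step boils down to embedding the user's question and ranking the stored README embeddings by vector similarity (in this project, pgvector performs that search inside PostgreSQL). As a rough, self-contained illustration of the ranking logic, here is the same idea in plain Python; the toy 3-dimensional vectors stand in for real 1536-dimensional OpenAI embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Rank stored (doc_id, vector) pairs by similarity to the query vector
    and return the ids of the k closest documents."""
    ranked = sorted(stored,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy embeddings for three README documents.
stored = [
    ("repo-a", [1.0, 0.0, 0.0]),
    ("repo-b", [0.0, 1.0, 0.0]),
    ("repo-c", [0.9, 0.1, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], stored))  # ['repo-a', 'repo-c']
```

The retrieved documents are then passed to GPT-3.5 as context for generating the final answer; pgvector's `<=>`-style operators replace this brute-force loop in production.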
Notice: All information provided is intended solely for educational purposes. When generating ideas using the bot, it is your responsibility to conduct thorough research and ensure the inclusion of accurate citations where applicable. Do not imitate any repository unless it is legally permissible under the license provided. The author takes no responsibility for any conflicts arising therefrom.
This project is designed to empower students and amateur data enthusiasts by providing a platform to discover creative project ideas and gain technical insights in data science and machine learning. Common LLMs, trained on generic data, often suggest basic projects like house price prediction or movie recommendation systems, which are generally insufficient for demonstrating technical skills by industry standards. This platform, through its chatbot ("GitHub Project Ideas") with retrieval-augmented generation (RAG) capabilities, helps users find more advanced and relevant projects by searching specific topics within README.md files from top GitHub repositories.
Beyond its primary educational purpose, this project also presents business potential. While it is designed to be a resource for learning and development in data science, the growing demand for advanced resources highlights its potential market value. Although not created for commercial use, it presents a unique resource that could cater to a broad and eager audience in the field.
- Automated Data Collection: Scraped README.md files from popular GitHub repositories on a weekly basis.
- Data Processing and Embeddings: Utilized LlamaIndex framework and OpenAI models for embeddings generation, and PostgreSQL vector database for storage.
- Interactive Chatbot: Deployed a RAG chatbot using GPT-3.5 on Streamlit.*
- Containerized Deployment: Used Docker for seamless deployment and scalability.
- Cloud Hosting: Hosted on an AWS EC2 t2.xlarge instance.
- Modularized Code: Aimed to design the codebase with modular components to enhance maintainability and reusability.

*The web app is currently unavailable for open access due to resource constraints, but it can be replicated using this repository. Artifacts such as screenshots and videos can be found in the src/assets folder.
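The weekly collection step relies on the GitHub REST API, which exposes a repository's README at the /repos/{owner}/{repo}/readme endpoint and returns the file content base64-encoded inside a JSON body. A minimal sketch of fetching and decoding it with only the standard library (the repository names and PAT below are placeholders, and the helper names are illustrative, not taken from this codebase):

```python
import base64
import json
import urllib.request

def readme_request(owner, repo, token):
    """Build an authenticated request for a repository's README
    via the GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    return urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })

def decode_readme(payload):
    """Extract and decode the base64-encoded README content
    from the API's JSON response body."""
    data = json.loads(payload)
    return base64.b64decode(data["content"]).decode("utf-8")

req = readme_request("octocat", "Hello-World", "your_pat")
print(req.full_url)  # https://api.github.com/repos/octocat/Hello-World/readme
# The actual fetch would be: urllib.request.urlopen(req).read()
```

In the pipeline itself, a scheduled Airflow DAG performs this extraction and stages the raw files in Google Cloud Storage before embedding.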
Figure 1: Overall architecture of the model
Figure 2: Data ingestion pipeline
To run this repository, you will need the following setup:
- AWS EC2
- Docker
- Google Cloud Storage (with read and write access, and the ability to set up external connections)
- GitHub PAT
- OpenAI API token
- PostgreSQL database (with read and write access)
1. Clone the repository:

   git clone https://github.com/shefalishr95/RAG-app-using-LlamaIndex.git

2. Create a `.env` file in the root directory with the following content:

   AIRFLOW_UID=50000
   AIRFLOW_GID=0
   OPENAI_API_KEY='your_api_key'
   GITHUB_API_KEY='your_api_key'

   Note: If you use Docker Swarm or another tool to secure secrets such as API keys and connection details, you do not need to store API keys in the `.env` file.

3. Build the Airflow and PostgreSQL containers: Use the provided docker-compose.yaml to build and run the containers for Airflow and PostgreSQL.

4. Set up external connections: Configure the Google Cloud Storage and PostgreSQL connections in Airflow, either through the UI or programmatically. If you choose the latter, you will need to add the connection code manually.

5. Install pgvector: Ensure that the pgvector extension is installed in your PostgreSQL container. It is required for storing embeddings and running vector similarity searches.

6. Open the Airflow UI: Access the Airflow UI at http://your-host:8080.

7. Trigger the DAGs: Trigger both DAGs, adjusting the start date and schedule frequency to your requirements.
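For the "programmatic" option in the connections step, Airflow also reads any environment variable named AIRFLOW_CONN_<CONN_ID> as a connection URI, which avoids clicking through the UI. A hedged sketch of building such a URI for the PostgreSQL connection (the connection id, credentials, and helper name below are placeholders, not values from this repository):

```python
from urllib.parse import quote

def postgres_conn_uri(user, password, host, port, schema):
    """Build an Airflow-style connection URI. Airflow parses variables
    named AIRFLOW_CONN_<CONN_ID> in this scheme://user:pass@host:port/schema
    format; special characters in credentials must be percent-encoded."""
    return f"postgres://{quote(user)}:{quote(password)}@{host}:{port}/{schema}"

# Exported as e.g. AIRFLOW_CONN_POSTGRES_DEFAULT before starting the containers:
print(postgres_conn_uri("airflow", "s3cr3t!", "postgres", 5432, "airflow"))
# postgres://airflow:s3cr3t%21@postgres:5432/airflow
```

The same pattern works for the Google Cloud Storage connection with its own URI scheme; see the Airflow documentation for the exact format each connection type expects.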
RAG-app-using-LlamaIndex
├─ config
│ └─ config.json
├─ dags
│ ├─ etl_dag.py
│ └─ generate_embeddings.py
├─ logs
│ └─ scheduler
│ └─ latest
├─ plugins
├─ src
│ └─ assets
│ ├─ diagrams
│ │ ├─ data_ingestion.JPG
│ │ └─ overall_architecture.JPG
│ ├─ screenshots
│ │ ├─ DAGs-Airflow.png
│ │ ├─ docker-containers.png
│ │ ├─ generate_and_load_embeddings-Grid-Airflow.png
│ │ ├─ github_data_scraping_etl-Grid-Airflow.png
│ │ ├─ overview.png
│ │ ├─ pgadmin-1.png
│ │ └─ pgadmin-2.png
│ └─ videos
│ ├─ GitHub Project Ideas-anamoly detection-papers only.webm
│ ├─ GitHub Project Ideas-CV.mp4
│ ├─ GitHub Project Ideas-time series forecasting.webm
│ ├─ GitHub Project Ideas-time series.mp4
│ ├─ GitHub Project Ideas-zero shot learning.webm
│ └─ main-page.gif
├─ streamlit
│ ├─ app.py
│ └─ requirements.txt
├─ utils
│ ├─ __pycache__
│ │ ├─ embed.cpython-311.pyc
│ │ ├─ extract.cpython-311.pyc
│ │ ├─ process.cpython-311.pyc
│ │ └─ __init__.cpython-311.pyc
│ ├─ embed.py
│ ├─ extract.py
│ ├─ process.py
│ └─ __init__.py
├─ docker-compose.yaml
├─ Dockerfile
├─ LICENSE
├─ README.md
└─ requirements.txt
This project is licensed under the MIT License. See the LICENSE file for more details.