Author: Shefali Shrivastava
This project features a semi-automated LLMOps pipeline that scrapes README.md files from popular GitHub repositories in data science and machine learning to surface innovative project ideas and technical insights. The data pipeline, orchestrated entirely with Airflow, ingests the README.md files and generates embeddings using the LlamaIndex framework and OpenAI's legacy embedding model. It then deploys a chatbot with retrieval-augmented generation (RAG) capabilities, powered by OpenAI's GPT-3.5, that answers user queries against those embeddings. The entire pipeline is containerized with Docker and hosted on an AWS EC2 instance (t2.xlarge), with PostgreSQL as the primary database and Google Cloud Storage as intermediate storage.
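At query time, the RAG step boils down to embedding the user's question and ranking the stored README embeddings by vector similarity (in this project, pgvector performs that search inside PostgreSQL). As a rough, self-contained illustration of the ranking logic, here is the same idea in plain Python; the toy 3-dimensional vectors stand in for real 1536-dimensional OpenAI embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Rank stored (doc_id, vector) pairs by similarity to the query vector
    and return the ids of the k closest documents."""
    ranked = sorted(stored,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy embeddings for three README documents.
stored = [
    ("repo-a", [1.0, 0.0, 0.0]),
    ("repo-b", [0.0, 1.0, 0.0]),
    ("repo-c", [0.9, 0.1, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], stored))  # ['repo-a', 'repo-c']
```

The retrieved documents are then passed to GPT-3.5 as context for generating the final answer; pgvector's `<=>`-style operators replace this brute-force loop in production.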
Notice: All information provided is intended solely for educational purposes. When generating ideas using the bot, it is your responsibility to conduct thorough research and ensure the inclusion of accurate citations where applicable. Do not imitate any repository unless it is legally permissible under the license provided. The author takes no responsibility for any conflicts arising therefrom.
This project is designed to empower students and amateur data enthusiasts by providing a platform to discover creative project ideas and gain technical insights in data science and machine learning. Common LLMs, trained on generic data, often suggest basic projects like house price prediction or movie recommendation systems, which are generally insufficient for demonstrating technical skills by industry standards. This platform, through its chatbot ("GitHub Project Ideas") with retrieval-augmented generation (RAG) capabilities, helps users find more advanced and relevant projects by searching specific topics within README.md files from top GitHub repositories.
Beyond its primary educational purpose, this project also presents business potential. While it is designed to be a resource for learning and development in data science, the growing demand for advanced resources highlights its potential market value. Although not created for commercial use, it presents a unique resource that could cater to a broad and eager audience in the field.
- Automated Data Collection: Scraped README.md files from popular GitHub repositories on a weekly basis.
- Data Processing and Embeddings: Utilized LlamaIndex framework and OpenAI models for embeddings generation, and PostgreSQL vector database for storage.
- Interactive Chatbot: Deployed a RAG chatbot using GPT-3.5 on Streamlit.*
- Containerized Deployment: Used Docker for seamless deployment and scalability.
- Cloud Hosting: Hosted on an AWS EC2 t2.xlarge instance.
- Modularized Code: Aimed to design the codebase with modular components to enhance maintainability and reusability.

*The web app is currently unavailable for open access due to resource constraints, but it can be replicated using this repository. Artifacts such as screenshots and videos can be found in the src/assets folder.
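The weekly collection step relies on the GitHub REST API, which exposes a repository's README at the /repos/{owner}/{repo}/readme endpoint and returns the file content base64-encoded inside a JSON body. A minimal sketch of fetching and decoding it with only the standard library (the repository names and PAT below are placeholders, and the helper names are illustrative, not taken from this codebase):

```python
import base64
import json
import urllib.request

def readme_request(owner, repo, token):
    """Build an authenticated request for a repository's README
    via the GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    return urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })

def decode_readme(payload):
    """Extract and decode the base64-encoded README content
    from the API's JSON response body."""
    data = json.loads(payload)
    return base64.b64decode(data["content"]).decode("utf-8")

req = readme_request("octocat", "Hello-World", "your_pat")
print(req.full_url)  # https://api.github.com/repos/octocat/Hello-World/readme
# The actual fetch would be: urllib.request.urlopen(req).read()
```

In the pipeline itself, a scheduled Airflow DAG performs this extraction and stages the raw files in Google Cloud Storage before embedding.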
Figure 1: Overall architecture of the model
Figure 2: Data ingestion pipeline
To run this repository, you will need the following setup:
- AWS EC2
- Docker
- Google Cloud Storage (with read and write access, and the ability to set up external connections)
- GitHub PAT
- OpenAI API token
- PostgreSQL database (with read and write access)
1. Clone the repository:

   git clone https://github.com/shefalishr95/RAG-app-using-LlamaIndex.git

2. Create a `.env` file in the root directory with the following content:

   AIRFLOW_UID=50000
   AIRFLOW_GID=0
   OPENAI_API_KEY='your_api_key'
   GITHUB_API_KEY='your_api_key'

   Note: If you use Docker Swarm or another tool to secure secrets such as API keys and connection details, you do not need to store API keys in the `.env` file.

3. Build the Airflow and PostgreSQL containers: Use the provided docker-compose.yaml to build and run the containers for Airflow and PostgreSQL.

4. Set up external connections: Configure the Google Cloud Storage and PostgreSQL connections in Airflow, either through the UI or programmatically. If you choose the latter, you will need to add the connection code manually.

5. Install pgvector: Ensure that the pgvector extension is installed in your PostgreSQL container. It is required for storing embeddings and running vector similarity searches.

6. Open the Airflow UI: Access the Airflow UI at http://your-host:8080.

7. Trigger the DAGs: Trigger both DAGs, adjusting the start date and schedule frequency to your requirements.
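For the "programmatic" option in the connections step, Airflow also reads any environment variable named AIRFLOW_CONN_<CONN_ID> as a connection URI, which avoids clicking through the UI. A hedged sketch of building such a URI for the PostgreSQL connection (the connection id, credentials, and helper name below are placeholders, not values from this repository):

```python
from urllib.parse import quote

def postgres_conn_uri(user, password, host, port, schema):
    """Build an Airflow-style connection URI. Airflow parses variables
    named AIRFLOW_CONN_<CONN_ID> in this scheme://user:pass@host:port/schema
    format; special characters in credentials must be percent-encoded."""
    return f"postgres://{quote(user)}:{quote(password)}@{host}:{port}/{schema}"

# Exported as e.g. AIRFLOW_CONN_POSTGRES_DEFAULT before starting the containers:
print(postgres_conn_uri("airflow", "s3cr3t!", "postgres", 5432, "airflow"))
# postgres://airflow:s3cr3t%21@postgres:5432/airflow
```

The same pattern works for the Google Cloud Storage connection with its own URI scheme; see the Airflow documentation for the exact format each connection type expects.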
RAG-app-using-LlamaIndex
├─ config
│ └─ config.json
├─ dags
│ ├─ etl_dag.py
│ └─ generate_embeddings.py
├─ logs
│ └─ scheduler
│ └─ latest
├─ plugins
├─ src
│ └─ assets
│ ├─ diagrams
│ │ ├─ data_ingestion.JPG
│ │ └─ overall_architecture.JPG
│ ├─ screenshots
│ │ ├─ DAGs-Airflow.png
│ │ ├─ docker-containers.png
│ │ ├─ generate_and_load_embeddings-Grid-Airflow.png
│ │ ├─ github_data_scraping_etl-Grid-Airflow.png
│ │ ├─ overview.png
│ │ ├─ pgadmin-1.png
│ │ └─ pgadmin-2.png
│ └─ videos
│ ├─ GitHub Project Ideas-anamoly detection-papers only.webm
│ ├─ GitHub Project Ideas-CV.mp4
│ ├─ GitHub Project Ideas-time series forecasting.webm
│ ├─ GitHub Project Ideas-time series.mp4
│ ├─ GitHub Project Ideas-zero shot learning.webm
│ └─ main-page.gif
├─ streamlit
│ ├─ app.py
│ └─ requirements.txt
├─ utils
│ ├─ __pycache__
│ │ ├─ embed.cpython-311.pyc
│ │ ├─ extract.cpython-311.pyc
│ │ ├─ process.cpython-311.pyc
│ │ └─ __init__.cpython-311.pyc
│ ├─ embed.py
│ ├─ extract.py
│ ├─ process.py
│ └─ __init__.py
├─ docker-compose.yaml
├─ Dockerfile
├─ LICENSE
├─ README.md
└─ requirements.txt
This project is licensed under the MIT License. See the LICENSE file for more details.