Reddit Data Scraper 📊

Python Streamlit PRAW Pandas

A powerful Reddit data scraping tool with a user-friendly Streamlit interface. Extract posts and comments from subreddits or specific posts with ease.

🚀 Features

  • 📱 User-friendly web interface
  • 🔍 Scrape posts from any subreddit
  • 💬 Extract comments from specific posts
  • 📊 Export data to CSV
  • ⏱️ Time-based filtering
  • 🔄 Caching for better performance

🛠️ Tech Stack

  • Python - Core programming language
  • Streamlit - Web interface framework
  • PRAW - Reddit API wrapper
  • Pandas - Data manipulation and analysis
  • python-dotenv - Environment variable management

📋 Prerequisites

⚙️ Installation

  1. Clone the repository:
     git clone https://github.com/pakagronglb/reddit-scraper.git
     cd reddit-scraper
  2. Create a virtual environment:
     python -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
     pip install -r requirements.txt
  4. Set up environment variables: Create a .env file in the project root:
     REDDIT_CLIENT_ID=your_client_id
     REDDIT_CLIENT_SECRET=your_client_secret
     REDDIT_USER_AGENT=your_user_agent
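
The three variables in the .env file are what PRAW needs to open a read-only connection to Reddit. The snippet below is a minimal sketch of how they might be loaded and checked. It is not taken from main.py, and the helper names `read_credentials` and `make_reddit_client` are hypothetical.

```python
import os

REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def read_credentials(env=None):
    """Return the three Reddit settings from `env`, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


def make_reddit_client():
    """Build a read-only PRAW client (hypothetical helper, not the app's code)."""
    # Third-party imports stay local so read_credentials() works without them.
    from dotenv import load_dotenv
    import praw

    load_dotenv()  # copies KEY=value pairs from .env into os.environ
    return praw.Reddit(**read_credentials())
```

Failing early with the list of missing names tends to be easier to debug than an authentication error raised later by the Reddit API.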

🚀 Usage

  1. Start the application:
     streamlit run main.py
  2. Access the web interface at http://localhost:8501
  3. Choose your scraping option:
    • Subreddit Posts: Enter subreddit name, post limit, and time filter
    • Specific Post: Enter the Reddit post URL
  4. Click "Scrape" and download the results as CSV
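
The two scraping options above map onto a small number of PRAW calls. The following is a rough sketch of that flow, not the code in main.py. It assumes the "Subreddit Posts" option reads the subreddit's top listing, and the function names and chosen columns are hypothetical.

```python
VALID_TIME_FILTERS = ("hour", "day", "week", "month", "year", "all")


def post_to_row(post):
    """Flatten one PRAW submission (or any object with the same attributes)."""
    return {
        "id": post.id,
        "title": post.title,
        "score": post.score,
        "num_comments": post.num_comments,
        "created_utc": post.created_utc,
        "url": post.url,
    }


def scrape_subreddit(reddit, name, limit=10, time_filter="week"):
    """Return rows for the top posts of a subreddit within a time window."""
    if time_filter not in VALID_TIME_FILTERS:
        raise ValueError(f"time_filter must be one of {VALID_TIME_FILTERS}")
    posts = reddit.subreddit(name).top(time_filter=time_filter, limit=limit)
    return [post_to_row(post) for post in posts]


def scrape_comments(reddit, post_url):
    """Return one row per comment of the post at `post_url`."""
    submission = reddit.submission(url=post_url)
    # Drop the "load more comments" placeholders instead of fetching them.
    submission.comments.replace_more(limit=0)
    return [
        {"id": c.id, "author": str(c.author), "score": c.score, "body": c.body}
        for c in submission.comments.list()
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text with Pandas (imported lazily)."""
    import pandas as pd

    return pd.DataFrame(rows).to_csv(index=False)
```

Here `reddit` is expected to be a `praw.Reddit` instance. The string returned by `rows_to_csv` could be passed to Streamlit's `st.download_button` to offer the CSV download.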

🌐 Deployment

Streamlit Cloud

  1. Push your code to GitHub
  2. Visit share.streamlit.io
  3. Connect your repository
  4. Add your Reddit API credentials in Streamlit secrets
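
On Streamlit Cloud there is no .env file, so the credentials have to come from Streamlit secrets. One possible way to support both setups is sketched below. It assumes the secrets use the same key names as the .env file, and the helpers `pick_credentials` and `load_credentials` are hypothetical.

```python
import os

KEYS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def pick_credentials(secrets, env):
    """Prefer a value from `secrets`, fall back to `env`, else None."""
    return {key: secrets.get(key) or env.get(key) for key in KEYS}


def load_credentials():
    """Read credentials from st.secrets when available, else from os.environ."""
    import streamlit as st  # imported lazily; only needed inside the app

    try:
        secrets = dict(st.secrets)
    except Exception:
        # With no secrets file, Streamlit raises on access; the exact
        # exception type differs between Streamlit versions.
        secrets = {}
    return pick_credentials(secrets, os.environ)
```

With this arrangement the same code path can run locally with a .env file and on Streamlit Cloud with secrets.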

Heroku

  1. Create a Heroku app:
     heroku create your-app-name
  2. Set environment variables:
     heroku config:set REDDIT_CLIENT_ID=your_client_id
     heroku config:set REDDIT_CLIENT_SECRET=your_client_secret
     heroku config:set REDDIT_USER_AGENT=your_user_agent
  3. Deploy:
     git push heroku main

📝 Configuration

  • requirements.txt - Project dependencies
  • .env - Local environment variables
  • Procfile - Heroku deployment configuration
  • runtime.txt - Python runtime specification

🔒 Security

  • Never commit your .env file or .streamlit/secrets.toml
  • Use environment variables for sensitive data
  • Keep your Reddit API credentials secure

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👏 Acknowledgments

📧 Contact

Your Name - @pakagronglb

Project Link: https://github.com/pakagronglb/reddit-scraper
