Web scraping with Selenium lets you gather the data you need through Selenium WebDriver browser automation: Selenium loads the target webpage and collects data at scale. This article demonstrates how to do web scraping using Selenium.
Give the scraper one URL from the github.com domain. It can be any GitHub page, such as a user's repository, a user's home page, a search page, or an advanced search page. The scraper analyses the given URL and fetches the related URLs. For example, given the URL https://github.com/Parth971, it fetches the URLs of all repositories belonging to the user Parth971.
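As a rough illustration of the idea, a minimal Selenium sketch that collects repository links from a user page might look like this (this is not the project's actual code; the CSS selector is an assumption about GitHub's markup, and pagination is ignored for brevity):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://github.com/Parth971?tab=repositories")
# Each repository entry links to github.com/<user>/<repo>;
# the itemprop selector is an assumption and may need adjusting
# if GitHub's markup changes.
repo_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.CSS_SELECTOR, "a[itemprop='name codeRepository']")
]
driver.quit()
print(repo_links)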
Python Version: 3.8+
Set up and activate a Python virtual environment, then install the requirements:
# Create the environment (Python 3.8+)
python -m venv myenv
# For Linux
source myenv/bin/activate
# For Windows
myenv\Scripts\activate
pip install -r requirements.txt
python script0.py
To see all collected repository links, open /outputs/collected_links.txt
To see the logs of the link-collection step, open /outputs/scraping_links.log
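For context, here is a hedged sketch of how collected links might be written and logged (not script0.py itself; the file names simply mirror the outputs listed above, and the outputs/ directory is assumed to exist):

import logging

logging.basicConfig(filename="outputs/scraping_links.log", level=logging.INFO)

def save_links(links):
    # One repository URL per line, mirroring collected_links.txt
    with open("outputs/collected_links.txt", "w") as f:
        for link in links:
            f.write(link + "\n")
            logging.info("Collected %s", link)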
python script1.py
To see all downloaded repositories as zip files, open /RepoDownloads/
To see the links of successfully downloaded repositories, open /outputs/downloaded_link.txt
To see the links that failed to download, open /outputs/failed_link.txt
To see the logs of the download step, open /outputs/downloading_links.log
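The download step itself can be pictured with a plain requests-based sketch (a swapped-in illustration, not script1.py's actual mechanism; the repository name is hypothetical, and GitHub serves zip snapshots at /archive/refs/heads/<branch>.zip, where the default branch may be main or master):

import requests

url = "https://github.com/Parth971/example-repo/archive/refs/heads/main.zip"  # hypothetical repo
response = requests.get(url, timeout=60)
if response.ok:
    # Save the snapshot under RepoDownloads/
    with open("RepoDownloads/example-repo.zip", "wb") as f:
        f.write(response.content)
else:
    print("Download failed:", response.status_code)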
python script2.py
To see the list of unzipped file names, open /outputs/unzipped_repositories.txt
To see the list of files that failed to unzip, open /outputs/unzip_failed_link.txt
To see the logs of the unzip step, open /outputs/unzip.log
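The unzip step maps naturally onto Python's standard zipfile module; a minimal sketch, assuming the downloaded archives sit in RepoDownloads/:

import zipfile
from pathlib import Path

for archive in Path("RepoDownloads").glob("*.zip"):
    try:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.with_suffix(""))  # extract next to the zip
        print("Unzipped:", archive.name)
    except zipfile.BadZipFile:
        print("Failed to unzip:", archive.name)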
- To set the number of repositories to download:
# 1. Set an integer to download a specific number of links
# 2. Set None to download all links
links_to_download = 15
- To set the initial link (a URL of a user's repositories page, or of a search over repositories, commits, issues, discussions, or wikis):
LINK = 'https://github.com/search?q=django+celery+drf'
- To set the ban waiting time (in seconds):
ban_waiting_time = 30
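How ban_waiting_time is typically used can be sketched as follows (an assumption about the project's retry logic, not its exact code; fetch() stands in for a hypothetical page-fetching helper):

import time

ban_waiting_time = 30  # seconds, matching the setting above

def fetch_with_backoff(fetch, url, retries=3):
    for _ in range(retries):
        page = fetch(url)  # fetch() is a hypothetical helper returning page text
        if "rate limit" not in page.lower():
            return page
        time.sleep(ban_waiting_time)  # wait out the temporary ban, then retry
    raise RuntimeError(f"Still rate-limited after {retries} attempts: {url}")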