Web scraping with Selenium lets you gather the data you need through Selenium WebDriver browser automation: Selenium loads the target webpage and collects data at scale. This article demonstrates how to do web scraping using Selenium.
Give the scraper one URL from the github.com domain. It can be any GitHub page, such as a user's repository, a user's home page, a search page, or an advanced search page. The scraper analyses the given URL and fetches the related URLs. For example, given the URL https://github.com/Parth971, it fetches the URLs of all repositories belonging to the user Parth971.
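As a rough illustration of the idea, a minimal Selenium sketch that collects repository links from a user page might look like this (this is not the project's actual code; the CSS selector is an assumption about GitHub's markup, and pagination is ignored for brevity):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://github.com/Parth971?tab=repositories")
# Each repository entry links to github.com/<user>/<repo>;
# the itemprop selector is an assumption and may need adjusting
# if GitHub's markup changes.
repo_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.CSS_SELECTOR, "a[itemprop='name codeRepository']")
]
driver.quit()
print(repo_links)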
Python Version: 3.8+
Set up and activate a Python virtual environment, then install the requirements:
# Create the environment (Python 3.8+)
python -m venv myenv
# For Linux
source myenv/bin/activate
# For Windows
myenv\Scripts\activate
pip install -r requirements.txt
python script0.py
To see all collected repository links, open /outputs/collected_links.txt
To see the logs of the link-collection step, open /outputs/scraping_links.log
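For context, here is a hedged sketch of how collected links might be written and logged (not script0.py itself; the file names simply mirror the outputs listed above, and the outputs/ directory is assumed to exist):

import logging

logging.basicConfig(filename="outputs/scraping_links.log", level=logging.INFO)

def save_links(links):
    # One repository URL per line, mirroring collected_links.txt
    with open("outputs/collected_links.txt", "w") as f:
        for link in links:
            f.write(link + "\n")
            logging.info("Collected %s", link)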
python script1.py
To see all downloaded repositories as zip files, open /RepoDownloads/
To see the links of successfully downloaded repositories, open /outputs/downloaded_link.txt
To see the links that failed to download, open /outputs/failed_link.txt
To see the logs of the download step, open /outputs/downloading_links.log
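The download step itself can be pictured with a plain requests-based sketch (a swapped-in illustration, not script1.py's actual mechanism; the repository name is hypothetical, and GitHub serves zip snapshots at /archive/refs/heads/<branch>.zip, where the default branch may be main or master):

import requests

url = "https://github.com/Parth971/example-repo/archive/refs/heads/main.zip"  # hypothetical repo
response = requests.get(url, timeout=60)
if response.ok:
    # Save the snapshot under RepoDownloads/
    with open("RepoDownloads/example-repo.zip", "wb") as f:
        f.write(response.content)
else:
    print("Download failed:", response.status_code)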
python script2.py
To see the list of unzipped file names, open /outputs/unzipped_repositories.txt
To see the list of files that failed to unzip, open /outputs/unzip_failed_link.txt
To see the logs of the unzip step, open /outputs/unzip.log
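The unzip step maps naturally onto Python's standard zipfile module; a minimal sketch, assuming the downloaded archives sit in RepoDownloads/:

import zipfile
from pathlib import Path

for archive in Path("RepoDownloads").glob("*.zip"):
    try:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.with_suffix(""))  # extract next to the zip
        print("Unzipped:", archive.name)
    except zipfile.BadZipFile:
        print("Failed to unzip:", archive.name)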
- To set the number of repositories to download:
# 1. Set an integer to download a specific number of links
# 2. Set None to download all links
links_to_download = 15
- To set the initial link (a URL of a user's repositories page, or of a search over repositories, commits, issues, discussions, or wikis):
LINK = 'https://github.com/search?q=django+celery+drf'
- To set the ban waiting time (in seconds):
ban_waiting_time = 30
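How ban_waiting_time is typically used can be sketched as follows (an assumption about the project's retry logic, not its exact code; fetch() stands in for a hypothetical page-fetching helper):

import time

ban_waiting_time = 30  # seconds, matching the setting above

def fetch_with_backoff(fetch, url, retries=3):
    for _ in range(retries):
        page = fetch(url)  # fetch() is a hypothetical helper returning page text
        if "rate limit" not in page.lower():
            return page
        time.sleep(ban_waiting_time)  # wait out the temporary ban, then retry
    raise RuntimeError(f"Still rate-limited after {retries} attempts: {url}")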