This project is a Python-based web scraper and data analyzer that extracts quotes, authors, and associated tags from the website Quotes to Scrape (https://quotes.toscrape.com). It processes the data into structured CSV files and includes functionality for filtering and analyzing quotes by tag.
- Scrapes quotes, authors, and tags from multiple pages.
- Cleans and processes the data for further analysis.
- Identifies the top 10 most frequent tags on the website.
- Creates two CSV files (see the sketch after this list):
  - `data1.csv`: all scraped quotes, with a binary column for each of the top 10 tags.
  - `data2.csv`: quotes filtered by selected tags and sorted by author name.
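Below is a minimal sketch of the scraping and tag-handling approach described above, assuming the standard Quotes to Scrape page structure (`div.quote`, `span.text`, `small.author`, `a.tag`). The function name, column layout, and example tag selection are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch only; the real pipeline lives in Script.ipynb.
from collections import Counter

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com"

def scrape_quotes():
    """Walk every page and collect quote text, author, and tags."""
    rows, page = [], 1
    while True:
        resp = requests.get(f"{BASE_URL}/page/{page}/")
        soup = BeautifulSoup(resp.text, "html.parser")
        quotes = soup.select("div.quote")
        if not quotes:  # an empty page means we are past the last page
            break
        for q in quotes:
            rows.append({
                "quote": q.select_one("span.text").get_text(strip=True),
                "author": q.select_one("small.author").get_text(strip=True),
                "tags": [t.get_text(strip=True) for t in q.select("a.tag")],
            })
        page += 1
    return pd.DataFrame(rows)

df = scrape_quotes()

# Top 10 most frequent tags across all scraped quotes.
tag_counts = Counter(t for tags in df["tags"] for t in tags)
top_tags = [tag for tag, _ in tag_counts.most_common(10)]

# data1.csv: one binary column per top tag.
for tag in top_tags:
    df[tag] = df["tags"].apply(lambda tags, tag=tag: int(tag in tags))
df.to_csv("data1.csv", index=False)

# data2.csv: quotes filtered by selected tags, sorted by author.
selected = ["love", "life"]  # example selection
mask = df["tags"].apply(lambda tags: any(t in tags for t in selected))
df[mask].sort_values("author").to_csv("data2.csv", index=False)
```

The full pipeline, including the NLTK-based text processing, lives in `Script.ipynb`.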
- Clone this repository:
  ```bash
  git clone https://github.com/yourusername/yourrepository.git
  ```
- Navigate to the project directory:
  ```bash
  cd yourrepository
  ```
- Install the required Python libraries:
  ```bash
  pip install -r requirements.txt
  ```
  (Make sure to include a `requirements.txt` file with libraries such as `requests`, `beautifulsoup4`, `pandas`, and `nltk`.)
- Download the NLTK stopwords and tokenizer resources:
  ```python
  import nltk
  nltk.download('stopwords')
  nltk.download('punkt')
  ```
- Run the Jupyter Notebook:
  ```bash
  jupyter notebook Script.ipynb
  ```
  or convert it to a plain Python script and run that instead:
  ```bash
  jupyter nbconvert --to script Script.ipynb
  python Script.py
  ```
- The output CSV files (`data1.csv` and `data2.csv`) will be saved in the project directory and can be loaded as shown below.
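For example, the generated files can be inspected with pandas; the `love` column below is an assumed example and stands in for any of the top-10 binary tag columns actually present in `data1.csv`:

```python
import pandas as pd

# Load the generated file and show quotes that carry the "love" tag.
# "love" is an assumed example column name; substitute any of the
# top-10 binary tag columns written to data1.csv.
df = pd.read_csv("data1.csv")
print(df[df["love"] == 1][["quote", "author"]].head())
```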
This project was developed by:
This project is licensed under the MIT License.