A robust, modular web crawler built in Python for extracting and saving content from websites. It extracts text from both HTML and PDF files and saves it in a structured format with metadata.
- Extracts and saves text content from HTML and PDF files
- Adds metadata (URL and timestamp) to each saved file
- Concurrent crawling with configurable workers
- Robust error handling and detailed logging
- Configurable through YAML files
- URL sanitization and normalization
- State preservation and recovery
- Rate limiting and polite crawling
- Command-line interface
- Saves images (PNG, JPG, JPEG) in the `image` folder in their respective formats
- Displays ASCII art ("POWERED", "BY", "M-LAI") every 5 steps during the crawl
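As a rough illustration of the HTML and PDF text extraction listed above, the sketch below uses two of the project's dependencies, `beautifulsoup4` and `PyPDF2`. The function names are hypothetical and may not match the actual code in `src/extractors.py`.

```python
# Hypothetical sketch of HTML/PDF text extraction; the real logic lives in src/extractors.py.
from io import BytesIO

from bs4 import BeautifulSoup
from PyPDF2 import PdfReader


def extract_html_text(html: str) -> str:
    """Strip tags and return the visible text of an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style blocks so only readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Concatenate the text of every page in a PDF document."""
    reader = PdfReader(BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```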
- Clone the repository:

  ```bash
  git clone https://github.com/simonpierreboucher/Crawler.git
  cd Crawler
  ```

- Create and activate a virtual environment:

  ```bash
  # On Unix/macOS
  python3 -m venv venv
  source venv/bin/activate

  # On Windows
  python -m venv venv
  .\venv\Scripts\activate
  ```

- Install the package and dependencies:

  ```bash
  pip install -e .
  ```
Each extracted page is saved as a text file with the following format:
```
URL: https://www.example.com/page
Timestamp: 2024-11-12 23:45:12
====================================================================================================

[Extracted content from the page]

====================================================================================================
End of content from: https://www.example.com/page
```
Images are saved in the `image` folder in their respective formats (PNG, JPG, JPEG).
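For illustration, the snippet below writes a page in the metadata format shown above. The `save_page` helper and the filename scheme are assumptions, not the crawler's actual code; only the header/footer layout follows the documented format.

```python
# Hypothetical helper that writes extracted text using the header/footer format shown above.
from datetime import datetime
from pathlib import Path


def save_page(url: str, content: str, output_dir: str = "text") -> Path:
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    separator = "=" * 100
    # A naive filename scheme for the sketch; the real crawler applies its own URL sanitization.
    filename = url.replace("https://", "").replace("http://", "").replace("/", "_") + ".txt"
    path = Path(output_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"URL: {url}\n"
        f"Timestamp: {timestamp}\n"
        f"{separator}\n\n"
        f"{content}\n\n"
        f"{separator}\n"
        f"End of content from: {url}\n",
        encoding="utf-8",
    )
    return path
```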
Configure the crawler through `config/settings.yaml`:
```yaml
domain:
  name: "www.example.com"
  start_url: "https://www.example.com"

timeouts:
  connect: 10
  read: 30
  max_retries: 3
  max_redirects: 5

crawler:
  max_workers: 5
  max_queue_size: 10000
  chunk_size: 8192
  delay_min: 1
  delay_max: 3

files:
  max_length: 200
  max_url_length: 2000
  max_log_size: 10485760  # 10MB
  max_log_backups: 5

excluded:
  extensions:
    - ".jpg"
    - ".jpeg"
    - ".png"
    - ".gif"
    - ".css"
    - ".js"
    - ".ico"
    - ".xml"
  patterns:
    - "login"
    - "logout"
    - "signin"
    - "signup"
```
```bash
# Run with default settings
python run.py

# Run with a custom configuration file and output directory
python run.py --config path/to/config.yaml --output path/to/output

# Resume a previous crawl
python run.py --resume
```
- `--config, -c`: Path to configuration file (default: `config/settings.yaml`)
- `--output, -o`: Output directory for crawled content (default: `text`)
- `--resume, -r`: Resume from previous crawl state
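These options map naturally onto a `click` command (click is a listed dependency). The following is only a sketch of how `run.py` might expose them, not its actual implementation.

```python
# Hypothetical click-based entry point mirroring the options above; the real run.py may differ.
import click


@click.command()
@click.option("--config", "-c", default="config/settings.yaml", help="Path to configuration file.")
@click.option("--output", "-o", default="text", help="Output directory for crawled content.")
@click.option("--resume", "-r", is_flag=True, help="Resume from previous crawl state.")
def main(config: str, output: str, resume: bool) -> None:
    click.echo(f"config={config} output={output} resume={resume}")
    # The real entry point would construct and start the crawler here.


if __name__ == "__main__":
    main()
```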
```
crawler/
│
├── config/
│   ├── __init__.py
│   └── settings.yaml
│
├── src/
│   ├── __init__.py
│   ├── constants.py
│   ├── session.py
│   ├── extractors.py
│   ├── processors.py
│   ├── crawler.py
│   └── utils.py
│
├── requirements.txt
├── setup.py
└── run.py
```
- requests>=2.31.0
- beautifulsoup4>=4.12.2
- PyPDF2>=3.0.1
- fake-useragent>=1.1.1
- tldextract>=5.0.1
- urllib3>=2.0.7
- pyyaml>=6.0.1
- click>=8.1.7
The crawler includes:
- Automatic retries for failed requests
- Detailed logging of all errors
- Graceful shutdown on interruption
- State preservation on errors
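As an illustration of the automatic retries listed above, the sketch below wires `requests` to `urllib3`'s `Retry` (both listed dependencies), using the `max_retries` and timeout values from the configuration. The crawler's actual session setup in `src/session.py` may differ.

```python
# Hypothetical retry-enabled session; the project's actual setup lives in src/session.py.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(max_retries: int = 3) -> requests.Session:
    retry = Retry(
        total=max_retries,
        backoff_factor=1,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "HEAD"],
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_session()
response = session.get("https://www.example.com", timeout=(10, 30))  # (connect, read) from config
```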
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Simon-Pierre Boucher - Initial work - GitHub
- 0.2
  - Added metadata to saved files
  - Improved error handling
  - Enhanced logging system
  - Display ASCII art at regular intervals (every 5 steps)

- 0.1
  - Initial release
  - Basic functionality with HTML and PDF support
  - Configurable crawling parameters
Project Link: https://github.com/simonpierreboucher/Crawler