Web Crawler for Text Extraction

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

Features

  • Extracts and saves text content from HTML and PDF files (see the extraction sketch after this list)
  • Adds metadata (URL and timestamp) to each saved file
  • Concurrent crawling with configurable workers
  • Robust error handling and detailed logging
  • Configurable through YAML files
  • URL sanitization and normalization
  • State preservation and recovery
  • Rate limiting and polite crawling
  • Command-line interface
  • Saves images (PNG, JPG, JPEG) to an image folder in their original formats
  • Displays ASCII art ("POWERED", "BY", "M-LAI") every 5 steps during the crawl
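
The extraction described above relies on beautifulsoup4 and PyPDF2 (both listed under Dependencies). A minimal sketch of how that extraction might look is shown below; the function names and the content-type check are illustrative assumptions, not the repository's actual code.

    # Minimal sketch of HTML and PDF text extraction; function names are
    # illustrative, not the repository's actual API.
    import io

    import requests
    from bs4 import BeautifulSoup
    from PyPDF2 import PdfReader

    def extract_html_text(html: str) -> str:
        """Return the visible text of an HTML page, without scripts or styles."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)

    def extract_pdf_text(pdf_bytes: bytes) -> str:
        """Concatenate the text of every page in a PDF document."""
        reader = PdfReader(io.BytesIO(pdf_bytes))
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    response = requests.get("https://www.example.com/page", timeout=(10, 30))
    if response.headers.get("Content-Type", "").startswith("application/pdf"):
        text = extract_pdf_text(response.content)
    else:
        text = extract_html_text(response.text)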

Installation

  1. Clone the repository:

    git clone https://github.com/simonpierreboucher/Crawler.git
    cd Crawler

  2. Create and activate a virtual environment:

    # On Unix/MacOS
    python3 -m venv venv
    source venv/bin/activate
    
    # On Windows
    python -m venv venv
    .\venv\Scripts\activate

  3. Install the package and dependencies:

    pip install -e .

Output Format

Each extracted page is saved as a text file with the following format:

URL: https://www.example.com/page
Timestamp: 2024-11-12 23:45:12
====================================================================================================

[Extracted content from the page]

====================================================================================================
End of content from: https://www.example.com/page

Images are saved to the image folder in their original format (PNG, JPG, or JPEG).
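
A minimal sketch of how a page could be written in this format is shown below; the sanitize_filename helper and the default output directory are illustrative assumptions, not the project's actual implementation.

    # Sketch of writing one crawled page in the format shown above; the
    # sanitize_filename helper is an illustrative assumption.
    from datetime import datetime
    from pathlib import Path
    from urllib.parse import urlparse

    SEPARATOR = "=" * 100

    def sanitize_filename(url: str, max_length: int = 200) -> str:
        """Turn a URL into a safe, length-bounded file name."""
        parsed = urlparse(url)
        name = (parsed.netloc + parsed.path).strip("/").replace("/", "_") or "index"
        return name[:max_length] + ".txt"

    def save_page(url: str, text: str, output_dir: str = "text") -> Path:
        """Write the extracted text with a URL/timestamp header and footer."""
        path = Path(output_dir) / sanitize_filename(url)
        path.parent.mkdir(parents=True, exist_ok=True)
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        path.write_text(
            f"URL: {url}\n"
            f"Timestamp: {timestamp}\n"
            f"{SEPARATOR}\n\n"
            f"{text}\n\n"
            f"{SEPARATOR}\n"
            f"End of content from: {url}\n",
            encoding="utf-8",
        )
        return path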

Configuration

Configure the crawler through config/settings.yaml:

domain:
  name: "www.example.com"
  start_url: "https://www.example.com"

timeouts:
  connect: 10
  read: 30
  max_retries: 3
  max_redirects: 5

crawler:
  max_workers: 5
  max_queue_size: 10000
  chunk_size: 8192
  delay_min: 1
  delay_max: 3

files:
  max_length: 200
  max_url_length: 2000
  max_log_size: 10485760  # 10MB
  max_log_backups: 5

excluded:
  extensions:
    - ".jpg"
    - ".jpeg"
    - ".png"
    - ".gif"
    - ".css"
    - ".js"
    - ".ico"
    - ".xml"
  
  patterns:
    - "login"
    - "logout"
    - "signin"
    - "signup"

Usage

Basic Usage

python run.py

With Custom Configuration

python run.py --config path/to/config.yaml --output path/to/output

Resume Previous Crawl

python run.py --resume

Command-line Options

  • --config, -c: Path to configuration file (default: config/settings.yaml)
  • --output, -o: Output directory for crawled content (default: text)
  • --resume, -r: Resume from previous crawl state
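
Given that click is listed as a dependency, the run.py entry point plausibly wires these options up as in the sketch below; the crawl function here is a stand-in, not the crawler's real entry function.

    # Sketch of a click-based entry point exposing the options listed above;
    # crawl() is a stand-in for the crawler's real entry function.
    import click

    def crawl(config_path: str, output_dir: str, resume: bool) -> None:
        """Placeholder for the actual crawl routine."""
        click.echo(f"config={config_path} output={output_dir} resume={resume}")

    @click.command()
    @click.option("--config", "-c", default="config/settings.yaml",
                  show_default=True, help="Path to configuration file.")
    @click.option("--output", "-o", default="text",
                  show_default=True, help="Output directory for crawled content.")
    @click.option("--resume", "-r", is_flag=True,
                  help="Resume from previous crawl state.")
    def main(config: str, output: str, resume: bool) -> None:
        """Run the web crawler."""
        crawl(config_path=config, output_dir=output, resume=resume)

    if __name__ == "__main__":
        main()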

Project Structure

crawler/
│
├── config/
│   ├── __init__.py
│   └── settings.yaml
│
├── src/
│   ├── __init__.py
│   ├── constants.py
│   ├── session.py
│   ├── extractors.py
│   ├── processors.py
│   ├── crawler.py
│   └── utils.py
│
├── requirements.txt
├── setup.py
└── run.py

Dependencies

  • requests>=2.31.0
  • beautifulsoup4>=4.12.2
  • PyPDF2>=3.0.1
  • fake-useragent>=1.1.1
  • tldextract>=5.0.1
  • urllib3>=2.0.7
  • pyyaml>=6.0.1
  • click>=8.1.7

Error Handling

The crawler includes:

  • Automatic retries for failed requests
  • Detailed logging of all errors
  • Graceful shutdown on interruption
  • State preservation on errors
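
These behaviours can be approximated with requests and urllib3, both listed as dependencies. The sketch below is illustrative only: the retry settings, state file name, and loop structure are assumptions rather than the repository's actual code.

    # Sketch of the retry / shutdown behaviour described above; retry values
    # and the state file name are illustrative assumptions.
    import json
    import logging

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("crawler")

    def build_session(max_retries: int = 3) -> requests.Session:
        """Build a session that retries transient HTTP failures with backoff."""
        retry = Retry(total=max_retries, backoff_factor=1,
                      status_forcelist=(429, 500, 502, 503, 504))
        session = requests.Session()
        session.mount("https://", HTTPAdapter(max_retries=retry))
        session.mount("http://", HTTPAdapter(max_retries=retry))
        return session

    def crawl_loop(queue: list, visited: set) -> None:
        """Fetch queued URLs, log failures, and save state on interruption."""
        session = build_session()
        try:
            while queue:
                url = queue.pop(0)
                try:
                    session.get(url, timeout=(10, 30))
                    visited.add(url)
                except requests.RequestException as exc:
                    logger.error("Failed to fetch %s: %s", url, exc)
        except KeyboardInterrupt:
            logger.info("Interrupted - saving crawl state")
        finally:
            with open("crawler_state.json", "w", encoding="utf-8") as fh:
                json.dump({"queue": queue, "visited": sorted(visited)}, fh)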

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

  • Simon-Pierre Boucher - Initial work - GitHub

Version History

  • 0.2

    • Added metadata to saved files
    • Improved error handling
    • Enhanced logging system
    • Display ASCII art at regular intervals (every 5 steps)
  • 0.1

    • Initial Release
    • Basic functionality with HTML and PDF support
    • Configurable crawling parameters

Contact

Project Link: https://github.com/simonpierreboucher/Crawler