RMNCLDYO/firecrawl-toolkit

Firecrawl Toolkit

The Firecrawl Toolkit gives developers a simple way to work with web content through crawling, scraping, and mapping. It covers web crawling, content extraction, and site mapping, with features such as custom actions, multiple output formats, and batch processing, all in one package with minimal dependencies.

Features

  • Web Crawling: Traverse websites with customizable depth and path controls, supporting both internal and external link processing
  • Content Extraction: Extract content in multiple formats (Markdown, HTML, raw HTML) with smart content filtering
  • Site Mapping: Generate comprehensive site maps with advanced search and subdomain capabilities
  • Batch Processing: Process multiple URLs simultaneously with unified configurations
  • Custom Actions: Automate complex interactions (clicking, scrolling, form filling) during scraping
  • Device Emulation: Switch between mobile and desktop views with customizable headers
  • Geolocation: Simulate different locations with country and language preferences
  • Smart Retry Logic: Built-in retry mechanism with real-time status monitoring and webhooks
  • Lightweight Design: Minimal dependencies powered by requests for easy setup and deployment
  • Robust Error Handling: Comprehensive error catching and validation system
  • Parameter Validation: Extensive validation for all API inputs and configurations
  • Multiple Output Formats: Support for various output types (Markdown, HTML, screenshots, etc.)

Installation

  1. Clone the repository:

    git clone https://github.com/RMNCLDYO/firecrawl-toolkit.git
  2. Navigate to the repository folder:

    cd firecrawl-toolkit
  3. Install the required dependencies:

    pip install -r requirements.txt

Configuration

  1. Obtain an API key from Firecrawl.

  2. You have three options for managing your API key:

    • Setting it as an environment variable on your device (recommended for everyday use)

      • Navigate to your terminal.
      • Add your API key like so:
        export FIRECRAWL_API_KEY=your_api_key

      This method allows the API key to be loaded automatically when using the wrapper.

    • Using an .env file (recommended for development):

      • Install python-dotenv if you haven't already: pip install python-dotenv.
      • Create a .env file in the project's root directory.
      • Add your API key to the .env file like so:
        FIRECRAWL_API_KEY=your_api_key

      This method allows the API key to be loaded automatically when using the wrapper.

    • Direct Input:

      • If you prefer not to use a .env file, you can directly pass your API key as an argument to the wrapper function.

        Wrapper

        api_key="your_api_key"

      This method requires manually inputting your API key each time you initiate an API call.
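The three options above can be thought of as a lookup order. The following is a minimal sketch of that idea, not the toolkit's actual implementation: `resolve_api_key` and `read_dotenv_value` are hypothetical helpers, the precedence (explicit argument, then environment variable, then `.env` file) is an assumption, and the small `.env` parser stands in for python-dotenv so the example needs only the standard library.

```python
import os
from pathlib import Path
from typing import Optional


def read_dotenv_value(name: str, path: str = ".env") -> Optional[str]:
    """Return the value of `name` from a simple KEY=value file, or None."""
    env_file = Path(path)
    if not env_file.is_file():
        return None
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip('"').strip("'")
    return None


def resolve_api_key(api_key: Optional[str] = None, dotenv_path: str = ".env") -> str:
    """Hypothetical helper: explicit argument, then environment, then .env file."""
    key = (
        api_key
        or os.environ.get("FIRECRAWL_API_KEY")
        or read_dotenv_value("FIRECRAWL_API_KEY", dotenv_path)
    )
    if not key:
        raise ValueError("FIRECRAWL_API_KEY is not set")
    return key
```

Failing early with a clear message when no key is found tends to be easier to debug than an authentication error returned by the API later on.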

Usage

Web Crawling

For traversing websites and extracting content with customizable depth and path controls.

import firecrawl

# Basic crawling
firecrawl.crawl(
    url="https://example.com",
    formats=["markdown", "html"]
)

Content Scraping

For extracting content from specific URLs with custom actions and formatting.

import firecrawl

# Single URL scraping
firecrawl.scrape(
    url="https://example.com",
    formats=["markdown", "html"],
    onlyMainContent=True
)

Batch Scraping

For processing multiple URLs simultaneously with shared configurations.

import firecrawl

# Batch scraping
firecrawl.batch_scrape(
    urls=["https://example.com", "https://sitemaps.org"],
    formats=["markdown", "html"]
)
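Before submitting a batch, it can help to clean the URL list so that duplicates and malformed entries are caught locally. This is a small sketch using only the standard library; `prepare_urls` is a hypothetical helper and not part of the toolkit.

```python
from urllib.parse import urlparse


def prepare_urls(urls):
    """Drop duplicates (keeping the first occurrence) and reject non-HTTP(S) URLs."""
    seen = set()
    cleaned = []
    for url in urls:
        candidate = url.strip()
        parsed = urlparse(candidate)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute HTTP(S) URL: {url!r}")
        if candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


urls = prepare_urls(
    ["https://example.com", "https://sitemaps.org", "https://example.com"]
)
print(urls)  # ['https://example.com', 'https://sitemaps.org']
```

The cleaned list can then be passed as the `urls` argument of `firecrawl.batch_scrape`, as in the example above.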

Site Mapping

For generating comprehensive site maps with search capabilities.

import firecrawl

firecrawl.map(
    url="https://example.com",
    includeSubdomains=True,
    limit=1000
)

Advanced Configuration

| Description | Parameter | Type | Example |
|---|---|---|---|
| Output Formats | formats | List | ["markdown", "html", "rawHtml"] |
| Main Content Only | onlyMainContent | Boolean | True |
| Include Tags | includeTags | List | ["article", "main"] |
| Exclude Tags | excludeTags | List | ["nav", "footer"] |
| Custom Headers | headers | Dict | {"User-Agent": "Custom"} |
| Wait Time | waitFor | Integer | 1000 |
| Mobile View | mobile | Boolean | False |
| Custom Actions | actions | List | [{"type": "click", "selector": "#btn"}] |
| Location | location | Dict | {"country": "US", "languages": ["en-US"]} |
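As an illustration of how the parameters in the table might be combined, the sketch below collects them into one dictionary and checks each value against the type listed in the table. `build_options` and `PARAM_TYPES` are hypothetical names; the toolkit performs its own validation, and this does not reproduce it.

```python
# Expected Python types for the parameters listed in the table.
PARAM_TYPES = {
    "formats": list,
    "onlyMainContent": bool,
    "includeTags": list,
    "excludeTags": list,
    "headers": dict,
    "waitFor": int,
    "mobile": bool,
    "actions": list,
    "location": dict,
}


def build_options(**options):
    """Return the options as a dict after a basic name and type check."""
    for name, value in options.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            raise KeyError(f"Unknown parameter: {name}")
        # bool is a subclass of int, so reject True/False where an integer is expected.
        if expected is int and isinstance(value, bool):
            raise TypeError(f"{name} must be an integer, got a boolean")
        if not isinstance(value, expected):
            raise TypeError(
                f"{name} must be {expected.__name__}, got {type(value).__name__}"
            )
    return dict(options)


options = build_options(
    formats=["markdown", "html"],
    onlyMainContent=True,
    excludeTags=["nav", "footer"],
    waitFor=1000,
    location={"country": "US", "languages": ["en-US"]},
)
```

Assuming the wrapper functions accept these parameters as keyword arguments, as the usage examples do for `formats` and `onlyMainContent`, the result could be passed along as `firecrawl.scrape(url="https://example.com", **options)`.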

Available Formats

  • markdown: Formatted Markdown content
  • html: Clean HTML content
  • rawHtml: Original HTML content
  • links: Extracted links
  • extract: Custom content extraction
  • screenshot: Page screenshot
  • screenshot@fullPage: Full page screenshot
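Format names are case-sensitive strings, so a typo such as `rawhtml` is easy to make. The following sketch checks a list against the names above and suggests the closest match; `check_formats` is a hypothetical helper, and the toolkit's own validation may behave differently.

```python
from difflib import get_close_matches

# Format names as listed in this section.
SUPPORTED_FORMATS = [
    "markdown",
    "html",
    "rawHtml",
    "links",
    "extract",
    "screenshot",
    "screenshot@fullPage",
]


def check_formats(formats):
    """Return the formats unchanged, or raise ValueError on an unknown name."""
    for name in formats:
        if name not in SUPPORTED_FORMATS:
            hint = get_close_matches(name, SUPPORTED_FORMATS, n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            raise ValueError(f"Unsupported format {name!r}{suffix}")
    return list(formats)
```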

Supported Actions

The toolkit supports the following page interactions:

| Action Type | Description | Parameters |
|---|---|---|
| wait | Wait for element/time | milliseconds, selector |
| click | Click elements | selector |
| write | Input text | selector, text |
| press | Press keyboard keys | key |
| scroll | Scroll page | direction, amount |
| screenshot | Take screenshots | fullPage |
| scrape | Get current page state | None |
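To make the table concrete, here is a sketch of an action sequence together with a check that each action uses only the parameters listed for its type. `check_actions` and `ALLOWED_PARAMS` are hypothetical names, the selectors are made up for illustration, and the table does not say which parameters are required, so the check only rejects unknown types and unlisted parameters.

```python
# Parameters listed for each action type in the table.
ALLOWED_PARAMS = {
    "wait": {"milliseconds", "selector"},
    "click": {"selector"},
    "write": {"selector", "text"},
    "press": {"key"},
    "scroll": {"direction", "amount"},
    "screenshot": {"fullPage"},
    "scrape": set(),
}


def check_actions(actions):
    """Raise ValueError for unknown action types or unlisted parameters."""
    for index, action in enumerate(actions):
        kind = action.get("type")
        if kind not in ALLOWED_PARAMS:
            raise ValueError(f"Action {index}: unknown type {kind!r}")
        extra = set(action) - {"type"} - ALLOWED_PARAMS[kind]
        if extra:
            raise ValueError(
                f"Action {index} ({kind}): unexpected parameters {sorted(extra)}"
            )
    return actions


# Illustrative sequence: wait, fill in a search box, submit, capture the page.
actions = check_actions([
    {"type": "wait", "milliseconds": 1000},
    {"type": "write", "selector": "#search", "text": "firecrawl"},
    {"type": "press", "key": "Enter"},
    {"type": "scrape"},
])
```

The resulting list has the same shape as the `actions` example in the Advanced Configuration table, so it could be supplied through that parameter.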

Error Handling and Safety

| Error Type | Description | Solution |
|---|---|---|
| ConfigurationError | Missing or invalid configuration | Check config.yaml and API key |
| ValidationError | Invalid request parameters | Verify parameter values |
| APIError | API-related issues | Check error message for details |
| NetworkError | Connection problems | Verify internet connection |
| ResponseError | Invalid API response | Check response format expectations |
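The table suggests that different error types call for different reactions: a connection problem may be worth retrying, while invalid parameters will fail the same way every time. The sketch below illustrates that distinction on the caller's side. The two exception classes are local stand-ins, because where the toolkit defines its exceptions is not shown here, and `call_with_retry` is a hypothetical helper, separate from the toolkit's built-in retry mechanism.

```python
import time


class NetworkError(Exception):
    """Stand-in for the toolkit's NetworkError (real import path not shown here)."""


class ValidationError(Exception):
    """Stand-in for the toolkit's ValidationError (real import path not shown here)."""


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on NetworkError only."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except NetworkError:
            # Give up after the last attempt; other exceptions propagate at once.
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `call_with_retry(lambda: firecrawl.scrape(url="https://example.com", formats=["markdown"]))` would then retry transient connection failures while letting validation errors surface immediately, assuming the toolkit raises exceptions with these names.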

🀝 Contributing

Contributions are welcome!

Please refer to CONTRIBUTING.md for detailed guidelines on how to contribute to this project.

Issues and Support

Encountered a bug? We'd love to hear about it. Please follow these steps to report any issues:

  1. Check if the issue has already been reported.
  2. Use the Bug Report template to create a detailed report.
  3. Submit the report here.

Your report will help us make the project better for everyone.

Feature Requests

Got an idea for a new feature? Feel free to suggest it. Here's how:

  1. Check if the feature has already been suggested or implemented.
  2. Use the Feature Request template to create a detailed request.
  3. Submit the request here.

Your suggestions for improvements are always welcome.

Versioning and Changelog

Stay up-to-date with the latest changes and improvements in each version:

  • CHANGELOG.md provides detailed descriptions of each release.

Security

Your security is important to us. If you discover a security vulnerability, please follow our responsible disclosure guidelines found in SECURITY.md. Please refrain from disclosing any vulnerabilities publicly until said vulnerability has been reported and addressed.

License

Licensed under the MIT License. See LICENSE for details.
