The WebCrawler project is a concurrent web crawler written in Go, designed to efficiently traverse websites, extract links, and build a hierarchical sitemap of the entire site.

karan56625/webcrawler

Webcrawler and Webcrawler-Client

Overview

Webcrawler: A Go application that crawls websites and generates a sitemap.

Webcrawler-Client: A Go client application that interacts with the Webcrawler service to initiate crawling and retrieve sitemaps.
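To make the idea of a hierarchical sitemap concrete, here is a minimal sketch of how a flat list of crawled URLs could be folded into a tree keyed by path segment. It is an illustration only and not necessarily how this project builds its sitemap; the `node` type and the `insert` and `render` methods are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// node is a hypothetical sitemap tree node; each child key is one path segment.
type node struct {
	children map[string]*node
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds the path of rawURL to the tree, one path segment per level.
func (n *node) insert(rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	cur := n
	for _, seg := range strings.Split(strings.Trim(u.Path, "/"), "/") {
		if seg == "" {
			continue // root path or empty segment
		}
		child, ok := cur.children[seg]
		if !ok {
			child = newNode()
			cur.children[seg] = child
		}
		cur = child
	}
	return nil
}

// render returns the tree as indented lines, with children sorted by name.
func (n *node) render(indent int) string {
	names := make([]string, 0, len(n.children))
	for name := range n.children {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		b.WriteString(strings.Repeat("  ", indent) + "/" + name + "\n")
		b.WriteString(n.children[name].render(indent + 1))
	}
	return b.String()
}

func main() {
	root := newNode()
	for _, link := range []string{
		"https://example.com/docs/install",
		"https://example.com/docs/usage",
		"https://example.com/blog",
	} {
		if err := root.insert(link); err != nil {
			fmt.Println("skipping:", err)
		}
	}
	fmt.Print(root.render(0))
}
```

For the three example URLs, the sketch prints `/blog` and `/docs` at the top level, with `/install` and `/usage` indented under `/docs`.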

Prerequisites

Go (Golang)

Docker

Kubernetes

Access to a Kubernetes Cluster

  • You need access to a Kubernetes cluster to deploy the applications. This can be a local cluster (like Minikube) or a managed Kubernetes service (like GKE, EKS, or AKS).

Git

  • Required for cloning the repository.
  • Install it from the official Git site.

How to run

Running Locally

Webcrawler

Clone the Repository

git clone https://github.com/karan56625/webcrawler.git
cd webcrawler

Install Dependencies

Ensure Go is installed, then run:

go mod download

Build the Project

Build the executable:

make build

Run the Application

Start the Webcrawler server:

./bin/webcrawler

You can now access the crawler at http://localhost:8081/crawl?url=<target-url>
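As an illustration, the endpoint can also be called from a small Go program. This is a sketch under assumptions: the `crawlURL` helper is a hypothetical name, the 60-second timeout is an arbitrary choice, and the response body is printed verbatim because its format is not described here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// crawlURL builds the request URL for the /crawl endpoint and
// query-escapes the target so it is safe to pass as a parameter.
func crawlURL(base, target string) string {
	q := url.Values{}
	q.Set("url", target)
	return base + "/crawl?" + q.Encode()
}

func main() {
	endpoint := crawlURL("http://localhost:8081", "https://redhat.com")

	// Crawling a site can take a while; the timeout value is an assumption.
	client := &http.Client{Timeout: 60 * time.Second}
	resp, err := client.Get(endpoint)
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading response failed:", err)
		return
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```

Escaping the target URL matters when it contains its own query string, which would otherwise be parsed as extra parameters of the /crawl request.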

A few configuration options can be set through environment variables.

NUMBER_OF_WORKER - The number of workers used to crawl URLs.

WORKER_QUEUE_LENGTH - The length of the worker queue. URLs found during crawling are placed in this queue so that any worker can pick them up.
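The relationship between the two settings can be sketched as a worker pool that reads from a buffered channel. This is a simplified illustration, not the project's actual code: the `envInt` and `process` helpers and the fallback values 4 and 100 are assumptions made for the example, and the workers only count URLs instead of fetching them.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

// envInt reads a positive integer from the environment and falls back to
// def when the variable is unset or invalid. The defaults are assumptions.
func envInt(name string, def int) int {
	v, err := strconv.Atoi(os.Getenv(name))
	if err != nil || v <= 0 {
		return def
	}
	return v
}

// process fans urls out to `workers` goroutines through a queue buffered to
// queueLen entries and returns how many URLs were handled.
func process(urls []string, workers, queueLen int) int {
	queue := make(chan string, queueLen)
	var wg sync.WaitGroup
	var mu sync.Mutex
	handled := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue { // a real worker would fetch the page and extract links
				mu.Lock()
				handled++
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		queue <- u // blocks while the queue is full
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	workers := envInt("NUMBER_OF_WORKER", 4)
	queueLen := envInt("WORKER_QUEUE_LENGTH", 100)
	urls := []string{"https://example.com/", "https://example.com/a", "https://example.com/b"}
	fmt.Printf("handled %d URLs with %d workers\n", process(urls, workers, queueLen), workers)
}
```

A longer queue lets producers run further ahead of slow workers at the cost of memory. In a real crawler the workers also push newly discovered links back onto the queue, which needs extra care to avoid blocking once the queue fills up.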

Webcrawler-Client

Clone the Repository

git clone https://github.com/karan56625/webcrawler.git
cd webcrawler

Install Dependencies

Ensure Go is installed, then run:

go mod download

Build the Webcrawler Client

Build the executable:

make build-client

Run the Application

Make sure the Webcrawler server is running. You can run it in a separate terminal.

Run the Webcrawler client to crawl any URL:

./bin/webcrawler-client -url <target-url>

e.g.

./bin/webcrawler-client -url https://redhat.com

A few configuration options can be set through environment variables.

WEBCRAWLER_HOST - Host of the Webcrawler server. By default, it is http://localhost.

WEBCRAWLER_PORT - Port of the Webcrawler server. By default, it is 8081.
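A client could resolve these two variables into a base URL roughly as follows. This is a sketch under assumptions rather than the client's actual code: the `baseURL` helper is a hypothetical name, and it takes the lookup function as a parameter so it can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// baseURL combines WEBCRAWLER_HOST and WEBCRAWLER_PORT into a base URL,
// applying the documented defaults when a variable is empty or unset.
// getenv is injected so the function can be tested with a fake environment.
func baseURL(getenv func(string) string) string {
	host := getenv("WEBCRAWLER_HOST")
	if host == "" {
		host = "http://localhost"
	}
	port := getenv("WEBCRAWLER_PORT")
	if port == "" {
		port = "8081"
	}
	return host + ":" + port
}

func main() {
	// With neither variable set this prints the default server address.
	fmt.Println(baseURL(os.Getenv))
}
```

Note that the host value includes the scheme (http://), so an override such as WEBCRAWLER_HOST=http://crawler.example would need to include it as well.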

Using Docker

Webcrawler

Build the Docker Image

Navigate to the Webcrawler project directory and build the webcrawler server image:

make docker-webcrawler

or

docker build -f docker/webcrawler/Dockerfile -t webcrawler:1.0 .

Run the Docker Container

Run the container:

make docker-webcrawler-run

or

docker run -p 8081:8081 webcrawler:1.0

You can now access the crawler at http://localhost:8081/crawl?url=<target-url>

Webcrawler-Client

Build the Docker Image

Navigate to the Webcrawler project directory and build the webcrawler client image:

make docker-webcrawler-client

or

docker build -f docker/webcrawler-client/Dockerfile -t webcrawler-client:1.0 .

Run the container:

docker run --network="host" --rm webcrawler-client:1.0 -url <target-url>

e.g.

docker run --network="host" --rm webcrawler-client:1.0 -url https://redhat.com

Make sure the environment variables WEBCRAWLER_HOST and WEBCRAWLER_PORT are set so that the client can connect to the Webcrawler service.

Using Kubernetes

Webcrawler

Apply the Kubernetes configuration files:

kubectl apply -f k8s/webcrawler-deployment.yaml
kubectl apply -f k8s/webcrawler-service.yaml

Access the Service

If you are using an Ingress controller to manage external access, verify that it is correctly configured and functioning. Create the required Ingress to access the service. See: https://kubernetes.io/docs/concepts/services-networking/ingress/.

By default, the service defined in k8s/webcrawler-service.yaml is of type LoadBalancer. You can change the type to suit your requirements. If you have no load balancer or Ingress controller configured, you might not be able to access the service, because no external IP will be assigned.

In that case, you can access the service through port forwarding:

kubectl port-forward svc/webcrawler-service 8081:80

You can now access the crawler at http://localhost:8081/crawl?url=<target-url>

Run the Webcrawler client to crawl any URL, e.g.:

./bin/webcrawler-client -url https://redhat.com
