Scooter-sharing system use case: This project demonstrates a local and cloud execution of automated data collection and cleaning pipelines.

sumitdeole/data-pipelines


Data Engineering Case Study: Gans E-scooter-sharing System

Overview

Gans is a startup developing an e-scooter-sharing system, with aspirations to operate in major European cities. Its success hinges on strategically placing scooters where users need them, and the data engineering team has been tasked with gathering weather and flight information to make data-driven predictions about scooter placement.

Project Execution

The project unfolds in three phases: building a local pipeline, migrating it to the cloud, and writing up the results. The ultimate goal is a fully automated data pipeline, outlined in the following flowchart: Flowchart

Phase 1: Local Pipeline

In this initial phase, a local data pipeline is built, encompassing the collection, transformation, and storage of data in an SQL database. The entire process is detailed below.

1. Create an Empty SQL Database

An empty SQL database named gans_cities is created. The Enhanced Entity-Relationship (EER) diagram for the database is shown here: EER_diagram

Code to Create SQL DB
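A minimal sketch of this step, assuming a local MySQL server reachable with the SQLAlchemy/PyMySQL stack; the credentials below are placeholders, not the project's actual configuration:

```python
from sqlalchemy import create_engine, text  # pip install sqlalchemy pymysql

def build_connection_string(user, password, host, port, schema=""):
    """Assemble a SQLAlchemy connection string for MySQL via the PyMySQL driver."""
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{schema}"

# Placeholder credentials -- replace with your own MySQL settings:
# engine = create_engine(build_connection_string("root", "my_password", "127.0.0.1", 3306))
# with engine.connect() as connection:
#     connection.execute(text("CREATE DATABASE IF NOT EXISTS gans_cities;"))
```

Once the empty schema exists, the tables from the EER diagram can be created either in MySQL Workbench or with further `CREATE TABLE` statements executed the same way.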

2. Select Major European Cities using Ninja API

A list of countries where Gans operates is predefined, and the Ninja API is employed to select large cities in these countries with populations exceeding 1,000,000.

Code for City Selection using Ninja API, City Data File
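The city-selection call can be sketched as follows, assuming the API Ninjas `/v1/city` endpoint and an API key obtained from their site; the `filter_city_names` helper is an illustrative addition, not necessarily part of the original script:

```python
import requests

CITY_API_URL = "https://api.api-ninjas.com/v1/city"  # API Ninjas city endpoint

def fetch_large_cities(country_code, api_key, min_population=1_000_000):
    """Query API Ninjas for cities in one country above a population threshold."""
    response = requests.get(
        CITY_API_URL,
        headers={"X-Api-Key": api_key},
        params={"country": country_code, "min_population": min_population},
    )
    response.raise_for_status()
    return response.json()  # list of dicts with name, population, etc.

def filter_city_names(records, min_population=1_000_000):
    """Defensive re-check of the threshold before loading names into SQL."""
    return [c["name"] for c in records if c.get("population", 0) >= min_population]
```

Looping `fetch_large_cities` over the predefined country list yields the candidate cities for the rest of the pipeline.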

3. Web Scraping using BeautifulSoup

Web scraping is performed on wiki pages to extract population, average elevation, and country information for the selected cities. The data is then integrated into the SQL database.

Code for Web Scraping, City Data File (Web Scraped), Code for SQL Integration using SQLAlchemy, Integrated Data File for SQL
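The scraping step can be sketched like this. In the real pipeline the HTML would come from a live `requests.get` of each city's Wikipedia page; the snippet below uses a small sample infobox instead, and the exact row labels on Wikipedia vary by city:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def parse_population(html):
    """Pull the population figure out of a Wikipedia-style infobox table."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("th", string=lambda s: s and "Population" in s)
    if header is None:
        return None
    value = header.find_next("td").get_text(strip=True)
    return int(value.replace(",", ""))

# Sample snippet standing in for a fetched Wikipedia page
SAMPLE_INFOBOX = """
<table class="infobox">
  <tr><th>Population</th><td>1,790,658</td></tr>
</table>
"""
```

The parsed values are then assembled into a DataFrame and written to the `gans_cities` database with SQLAlchemy, as in the linked SQL-integration code.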

4. Weather Data Collection via OpenWeather API

Weather forecast data for the selected cities is collected using the OpenWeather API. Various weather-related metrics are obtained and stored in the SQL database.

Code for Weather Data Collection, Weather Data File
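A sketch of the weather-collection step, assuming OpenWeather's free 5-day/3-hour forecast endpoint; the exact columns kept here (temperature, outlook, rain probability) are illustrative, and the full response contains many more metrics:

```python
import requests

FORECAST_URL = "https://api.openweathermap.org/data/2.5/forecast"  # 5-day / 3-hour forecast

def fetch_forecast(city_name, api_key):
    """Request the 5-day, 3-hourly forecast for one city."""
    response = requests.get(
        FORECAST_URL,
        params={"q": city_name, "appid": api_key, "units": "metric"},
    )
    response.raise_for_status()
    return response.json()

def flatten_forecast(payload):
    """Flatten the nested forecast JSON into flat rows ready for SQL insertion."""
    city = payload["city"]["name"]
    return [
        {
            "city": city,
            "forecast_time": item["dt_txt"],
            "temperature": item["main"]["temp"],
            "outlook": item["weather"][0]["description"],
            "rain_probability": item.get("pop", 0.0),
        }
        for item in payload["list"]
    ]
```

Running `flatten_forecast(fetch_forecast(city, key))` for each selected city produces the rows stored in the weather table.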

5. Flights Data Collection via AeroDataBox API

Flight arrival times for the cities of interest are obtained using the AeroDataBox API. The data includes airport ICAO codes, departure information, and scheduled arrival times.

Code for Flights Data Collection, Airport ICAO Codes Data File, Flights Data File
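The flights step can be sketched as below, assuming the AeroDataBox API is accessed through RapidAPI; note that the arrival time field name varies across AeroDataBox API versions, so the parser uses defensive `.get` lookups:

```python
import requests

def fetch_arrivals(icao, start_local, end_local, rapidapi_key):
    """Fetch arrivals for one airport over a time window via RapidAPI."""
    url = (
        "https://aerodatabox.p.rapidapi.com/flights/airports/icao/"
        f"{icao}/{start_local}/{end_local}"
    )
    response = requests.get(
        url,
        headers={
            "X-RapidAPI-Key": rapidapi_key,
            "X-RapidAPI-Host": "aerodatabox.p.rapidapi.com",
        },
        params={"direction": "Arrival", "withCancelled": "false"},
    )
    response.raise_for_status()
    return response.json()

def parse_arrivals(payload, arrival_icao):
    """Extract flight number, origin airport, and scheduled arrival time."""
    rows = []
    for flight in payload.get("arrivals", []):
        rows.append({
            "arrival_icao": arrival_icao,
            "flight_number": flight.get("number"),
            "departure_icao": flight.get("departure", {}).get("airport", {}).get("icao"),
            # Field name differs between AeroDataBox versions; adjust if needed.
            "scheduled_arrival": flight.get("arrival", {}).get("scheduledTimeLocal"),
        })
    return rows
```

Iterating over the stored airport ICAO codes and inserting the parsed rows fills the flights table.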

Phase 2: Cloud Pipeline

After establishing the data pipeline locally, I migrated it to the cloud to automate its execution:

1. Configured a fresh MySQL instance on the Google Cloud Platform (GCP).
2. Connected MySQL Workbench to the GCP SQL instance so the database remains accessible locally.
3. Built the schema and tables needed for this project within MySQL Workbench.
4. Deployed the data collection scripts (the get_weather and get_arrival_data functions noted above) as GCP Cloud Functions.
5. Used GCP Cloud Scheduler to execute the data collection scripts at specified intervals.

This cloud-based data pipeline gives Gans up-to-date insights into weather patterns and flight arrivals. The resulting cloud-scheduled weather and flight tables are shown below:

weather_table

flight_table
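The Cloud Functions deployment can be sketched as a single HTTP entry point that Cloud Scheduler invokes on a cron schedule. The helpers below are placeholder stubs standing in for the Phase 1 collection logic; the real deployment bundles those scripts together with the SQL credentials:

```python
def get_weather():
    """Placeholder stub for the Phase 1 weather-collection helper."""
    return [{"city": "Berlin", "temperature": 4.2}]

def get_arrival_data():
    """Placeholder stub for the Phase 1 flights-collection helper."""
    return [{"arrival_icao": "EDDB", "flight_number": "LH 1234"}]

def write_to_sql(rows, table_name):
    """Placeholder stub; the real function appends rows via SQLAlchemy."""
    print(f"Would insert {len(rows)} rows into `{table_name}`")

def update_database(request):
    """HTTP-triggered Cloud Function entry point, invoked by Cloud Scheduler."""
    write_to_sql(get_weather(), "weather")
    write_to_sql(get_arrival_data(), "flights")
    return "Data successfully updated"

# A matching Cloud Scheduler job (hypothetical name and schedule) might look like:
# gcloud scheduler jobs create http update-gans-db --schedule="0 6 * * *" --uri=<function URL>
```

Returning a short success string from the entry point lets Cloud Scheduler log whether each scheduled run completed.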

Phase 3: Write a Medium Article

The final deliverable for this project is a comprehensive Medium.com article summarizing all executed tasks and outcomes: Link here.
