Gans is a startup dedicated to developing an escooter-sharing system, with aspirations to operate in major European cities. The success of Gans hinges on strategically placing scooters where users need them, and the data engineering team has been tasked with gathering weather and flight information to make data-driven predictions about scooter locations.
The project will unfold in three phases, involving the creation of a local pipeline and its subsequent migration to the cloud. The ultimate goal is to establish a fully automated data pipeline, outlined in the following flowchart:
In this initial phase, a local data pipeline is built, encompassing the collection, transformation, and storage of data in an SQL database. The entire process is detailed below.
An empty SQL database named gans_cities is created. The Entity-Relationship (EER) diagram for the database is illustrated here:
A list of countries where Gans operates is predefined, and the Ninja API is employed to select large cities in these countries with populations exceeding 1,000,000.
Code for City Selection using Ninja API, City Data File
Web scraping is performed on wiki pages to extract population, average elevation, and country information for the selected cities. The data is then integrated into the SQL database.
Code for Web Scraping, City Data File (Web Scraped), Code for SQL Integration using SQLAlchemy, Integrated Data File for SQL
Weather forecast data for the selected cities is collected using the OpenWeather API. Various weather-related metrics are obtained and stored in the SQL database.
Code for Weather Data Collection, Weather Data File
Flight arrival times for the cities of interest are obtained using the AeroDataBox API. The data includes airport ICAO codes, departure information, and scheduled arrival times.
Code for Flights Data Collection, Airport ICAO Codes Data File, Flights Data File
After establishing the data pipeline locally, I now enhance its capabilities by migrating it to the cloud. I began by configuring a fresh MySQL instance on the Google Cloud Platform (GCP). To be able to access the SQL DB locally, I connected the MySQL Workbench to GCP SQL Instance. I then proceeded to build the schema and tables necessary for this project within MySQL Workbench. Afterward, I launched the data collection scripts, get_weather, and get_arrival_data functions noted above, using GCP Cloud Functions. Finally, I employed the GCP Cloud Scheduler to ensure that the data collection scripts execute at specified intervals. This cloud-based data pipeline will empower Gans with instantaneous insights into weather patterns and flight arrivals. The resultant cloud scheduled weather and flight tables are shown below:
The final deliverable for this project is a comprehensive Medium.com article summarizing all executed tasks and outcomes Link here.