This final semester project was carried out as a requirement of the MSc in Data Science and Business Analytics course in which I enrolled.
Topic: Building a Novel Predictive Model to Predict Tourist Travel Preferences for Effective Planning of Domestic Tour Packages
Presentation Deck: Click Here
Steps | Sections Involved | Tools Used | Main Packages Involved |
---|---|---|---|
1 | Initial Data Exploration | Python - Google Colab | N/A |
2 | Exploratory Data Analysis | R Programming | ggplot2 & dplyr |
3 | Data Pre-Processing | Python - Google Colab | Numpy, Pandas & Sklearn (LabelEncoder & OneHotEncoder) |
4 | Modelling (Clustering) | Python - Google Colab | KModes & Matplotlib |
5 | Feature Selection | R Programming | Boruta |
6 | Modelling (Classification) | Python - Google Colab | Sklearn (Model Selection, LogisticRegression, DecisionTreeClassifier, MLPClassifier, RandomForestClassifier) |
7 | Evaluation | Python - Google Colab | Sklearn (Classification report & Confusion matrix) |
8 | Deployment | Python - Google Colab | Streamlit, Pickle & Pyngrok / Ngrok |
- Tourism industry’s crucial contribution to Malaysia's Gross Domestic Product ✈️
- Industry badly affected due to the Covid-19 pandemic 😷
- Post Covid-19 recovery on path for the travel industry 📈
- Huge potential in utilising Machine Learning to attract tourists (and for the whole industry) 🤖
- The lack of a predictive model and focus on identifying tourist preferences has led to inefficient planning of domestic tour packages by Malaysian tourist operators
In addressing the issues associated with the design and scheduling of the tour packages, a few questions were developed:
- How to effectively cluster the collected data into several clusters for classification?
- What are the predictive modelling approaches that could effectively provide an accurate prediction of tourists’ clusters for efficient planning of domestic tour packages?
- What are the valid recommendations that could be provided to the relevant authorities to enhance the scheduling of tour packages?
The aim of this project is to develop a novel data mining solution to accurately predict tourist travel preferences for the scheduling of domestic tour packages in Malaysia.
With this, the 3 objectives of the project are listed below:
- To develop a clustering model to effectively cluster the collected data into respective clusters for classification purposes.
- To develop data mining models using predictive modelling approaches to predict tourist travel clusters for efficient planning of domestic tour packages.
- To draft relevant and valid recommendations for the relevant authorities.
- “CRoss-Industry Standard Process for Data Mining” or CRISP-DM methodology
- Frequently used for data science projects; a widely adopted data mining methodology for obtaining useful information from a dataset
- 6 stages involved in CRISP-DM
The data was collected through a questionnaire survey; see All Questions Asked for the full list of questions.
- Initial Data Exploration Repo: Click Here
- Exploratory Data Analysis (EDA) Repo: Click here
- Univariate Analysis
- Bivariate Analysis
Extra Note: What is EDA❓
- What questions are we trying to solve/prove?
- What kind of data do we have and how do we treat different types?
- What's missing from the data and how do we deal with it?
- Where are the outliers and why should we care about them?
- How can we add, change or remove features to get more out of our data?
- Data Pre-Processing & Clustering Repo: Click Here
- Level Combination (Combining the levels in categorical variables that had many levels)
- E.g. The “age” variable initially had 4 categories. However, the last two categories accounted for fewer than 5 observations, so the “35-49 years old” and “50 and above” groups were combined with the “26 - 34 years old” group.
- Unsupervised Learning: K-Modes Clustering
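
The level combination and K-Modes steps above can be sketched in plain Python. This is a minimal standard-library illustration, not the project's code (which, per the table above, used the KModes package). The function names, the made-up records and the "25 and below" level are all assumptions for the example.

```python
import random
from collections import Counter

# Mapping that follows the "age" example above: the two sparse older
# groups are folded into the "26 - 34 years old" group.
AGE_MERGE = {
    "35-49 years old": "26 - 34 years old",
    "50 and above": "26 - 34 years old",
}

def combine_levels(values, mapping):
    """Replace sparse levels with the level they are merged into."""
    return [mapping.get(v, v) for v in values]

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent category in each column of a group of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def nearest(row, modes):
    """Index of the mode with the smallest Hamming distance to the row."""
    return min(range(len(modes)), key=lambda j: hamming(row, modes[j]))

def k_modes(rows, k, n_iter=10, seed=0):
    """Tiny K-Modes loop: assign rows to the nearest mode, then update modes.

    Raises ValueError if k exceeds the number of distinct rows.
    """
    rng = random.Random(seed)
    modes = rng.sample(sorted(set(rows)), k)  # distinct starting modes
    for _ in range(n_iter):
        labels = [nearest(r, modes) for r in rows]
        new_modes = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            # Keep the old mode if a cluster ends up empty.
            new_modes.append(column_modes(members) if members else modes[j])
        if new_modes == modes:
            break
        modes = new_modes
    return [nearest(r, modes) for r in rows], modes

# "25 and below" is a hypothetical level name; all records are made up.
ages = combine_levels(
    ["25 and below", "26 - 34 years old", "35-49 years old", "50 and above"],
    AGE_MERGE,
)
trips = ["beach", "city", "city", "city"]  # made-up second attribute
rows = list(zip(ages, trips))
labels, modes = k_modes(rows, k=2)
print(labels, modes)
```

K-Modes replaces the Euclidean distance and means of K-Means with a mismatch count and per-column modes, which is why it suits purely categorical questionnaire data.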
- Data Pre-Processing & Modelling Repo: Click Here
- Feature Selection: Boruta Algorithm (Finding the answer to: which variables do not play a significant role in predicting the dependent variable?)
- Label encoding and One-hot encoding
- Logistic Regression / Decision Tree / Artificial Neural Network (ANN) / Random Forest
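
To make the two encoding schemes above concrete, here is a small standard-library sketch of what each one produces. The project itself used Sklearn's LabelEncoder and OneHotEncoder (see the table above); the helper names and the survey answers below are assumptions made up for illustration.

```python
def label_encode(values):
    """Map each category to an integer code (categories sorted alphabetically)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values], classes

# Made-up survey answers, for illustration only
transport = ["car", "bus", "car", "flight"]
codes, classes = label_encode(transport)
vectors, columns = one_hot_encode(transport)
print(classes, codes)
print(columns, vectors)
```

Label encoding gives one integer column and therefore implies an ordering between categories, while one-hot encoding avoids that implied order at the cost of one extra column per category.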
- Model Deployment Repo: Click Here
- A sample web application can be viewed below
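
The deployment step relies on Pickle to move a trained model into the web application. The sketch below shows only that save-and-reload round trip, assuming a hypothetical stand-in "model" (a dictionary of cluster profiles) instead of the project's trained classifier. The file name, function names and data are made up, and the Streamlit and Ngrok parts are not shown.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialise the model object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a previously saved model object (only from trusted files)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def predict_cluster(model, answers):
    """Return the index of the stored profile that best matches the answers."""
    profiles = model["profiles"]
    return min(
        range(len(profiles)),
        key=lambda j: sum(a != p for a, p in zip(answers, profiles[j])),
    )

# Hypothetical stand-in for a trained model; the profiles are made up.
model = {"profiles": [("beach", "family"), ("city", "solo")]}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(model, path)

# In the web application, the model would be loaded once and reused
# for every prediction request.
restored = load_model(path)
print(predict_cluster(restored, ("city", "solo")))
```

Unpickling can execute arbitrary code, so the application should only load model files it produced itself.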
Sample temporary web application UI - Sample 1
Web application UI prediction (Proof of Concept) - Sample 2
Aims - Accomplished
- K-Modes clustering model was successfully developed with the data collected from a questionnaire survey.
Objectives - Accomplished
- A total of 4 models were developed (LR, DT, ANN and RF)
- Suggestions for future model iterations and potential collaborations with relevant stakeholders were provided
- Eventually, ANN was selected as the final model to be deployed due to its high prediction accuracy and better evaluation metrics than the other three models.