A classic machine learning starter project that uses Python libraries to cluster a dataset of SMS messages into 'spam' and 'ham' with k-means.
The dataset is a collection of 5,574 SMS messages taken from the UCI Machine Learning Repository; each message is to be tagged as "spam" or "ham".
The whole pipeline consists of the following steps:
- Loading data
- Data wrangling and pre-processing
- Feature Selection
- Feature Vector Modelling
- k-means clustering and evaluation
- Writing results
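The steps above can be sketched end to end roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn is the library in use, the four inline messages are made-up stand-ins for the real dataset, and names such as `messages` and `cluster_to_label` are hypothetical.

```python
import csv
import io
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Loading data: a tiny hypothetical sample standing in for the SMS collection.
messages = [
    ("spam", "WINNER! Claim your free prize now"),
    ("spam", "Free prize waiting, claim it now"),
    ("ham", "Are we still meeting for lunch today?"),
    ("ham", "See you at lunch, running a bit late today"),
]
labels = [label for label, _ in messages]

# Data wrangling and pre-processing: lowercase only, as a placeholder.
texts = [text.lower() for _, text in messages]

# Feature selection and feature vector modelling: TF-IDF without stop words.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(texts)

# k-means clustering with two clusters (one hoped-for cluster per class).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# Evaluation: name each cluster after its majority label, then score agreement.
cluster_to_label = {}
for cluster in set(cluster_ids):
    members = [labels[i] for i, c in enumerate(cluster_ids) if c == cluster]
    cluster_to_label[cluster] = Counter(members).most_common(1)[0][0]
predicted = [cluster_to_label[c] for c in cluster_ids]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)

# Writing results: CSV rows in memory here; a real run would write to a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["message", "true_label", "predicted_label"])
for (_, text), truth, guess in zip(messages, labels, predicted):
    writer.writerow([text, truth, guess])
print(buffer.getvalue())
print("accuracy:", accuracy)
```

Because k-means is unsupervised, the cluster numbers carry no meaning on their own; the majority-label mapping is one simple way to compare clusters against the known tags.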
Although there are multiple methods for solving the problem, the TF-IDF (term frequency-inverse document frequency) approach is employed here with the aim of high prediction accuracy.
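To make the TF-IDF idea concrete, here is a small sketch using only the standard library. It uses the textbook weighting tf(t, d) * log(N / df(t)); library implementations often apply smoothing and normalisation on top, so their numbers will differ. The three documents and the helper name `tfidf` are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned messages.
docs = [
    "free prize call now",
    "call me after lunch",
    "free lunch today",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return {term: weight} for one document, using tf * log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tfidf(tokens) for tokens in tokenized]

# Terms that are rare across the collection get larger weights.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.4f}")
```

In the first document, "prize" and "now" occur nowhere else and so outweigh "free" and "call", which each appear in two of the three documents. That emphasis on distinctive terms is what the clustering step relies on.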