Algorithms and techniques of Data Mining and Machine Learning for analyzing massive datasets. Emphasis on system building with Spark. Case studies and applications.
Data mining is a fundamental skill for massive data analysis. At a high level, it allows the analyst to discover patterns in data and transform them into usable products. The course teaches data mining algorithms for analyzing very large data sets. It has an applied focus: it prepares students to apply data mining techniques to build systems and solve real-world problems.
Environment: Python 3.6, Scala 2.12, JDK 1.8 and Spark 3.1.2
Most assignments may use only the standard Python libraries and Spark RDDs.
No. | Topic | Programming | Tags |
---|---|---|---|
1 | Spark Operation | Python | Spark, PySpark |
2 | Frequent Itemset | Python | SON, A-Priori, MultiHash, PCY |
3 | Recommendation System | Python | LSH, Jaccard similarity, Pearson similarity, Collaborative filtering, Recommendation system |
4 | Community Detection | Python | Girvan-Newman Algorithm, GraphFrames |
5 | Data Stream | Python | Bloom Filter, Flajolet-Martin Algorithm, Reservoir sampling |
6 | Clustering | Python | Bradley-Fayyad-Reina (BFR) Algorithm, K-Means |
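
As a flavor of the data stream topic in the table above, here is a minimal sketch of reservoir sampling in plain Python (standard library only). It is an illustration under stated assumptions, not the assignment solution: the function name `reservoir_sample` and its `seed` parameter are hypothetical, and the actual assignment may fix the sample size and random calls differently.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration; works on streams of unknown length
    and keeps only k items in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k/n by drawing a slot
            # uniformly from [0, n) and replacing only if it falls in [0, k).
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(1000), 10, seed=42)` returns 10 distinct values from the range, and a stream shorter than `k` is returned unchanged.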
Environment: Python 3.6, Scala 2.12, JDK 1.8 and Spark 3.1.2
Any external Python library may be used as long as it is available on Vocareum. Data pre-processing and post-processing must use only Spark RDDs.
Topic | Programming | Tags | RMSE |
---|---|---|---|
Hybrid Recommendation System | Python | XGBoost, Yelp Data, Model-based recommendation system | 0.979346 |
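
The RMSE column reports root-mean-square error, where lower is better. The sketch below shows how that metric is computed in plain Python. It assumes the score compares predicted ratings with true ratings on a held-out set, which this README does not state explicitly; the function name `rmse` is hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Sum of squared differences between each prediction and its true value.
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Average the squared error, then take the square root.
    return math.sqrt(squared_error / len(predicted))
```

For instance, `rmse([3.5, 4.0, 2.0], [4.0, 4.0, 3.0])` averages the squared errors 0.25, 0 and 1 before taking the square root.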