
The Machine Learnings repo consists of open-source machine learning projects covering various domains. It provides users with access to diverse projects, complete with documentation, tutorials, and datasets. Users can engage with the community, contribute to projects, and stay updated with the latest developments.


ranzeet013/Machine_Learning_Projects


Project Names and Notebook Completion Status
01. Classifying Iris Species ✅
02. EDA On Banking Dataset ✅
03. EDA On Hotel Booking Dataset ✅
04. Glass Type Classification ✅
05. Migration Prediction ✅
06. EDA On Jobs Dataset ✅
07. Mobile Price Classification ✅
08. Multiple Sclerosis (MS) Disease Classification ✅
09. Mushroom Classification ✅
10. Predicting Body Activity ✅
11. Predicting CO2 Emission Per Capita ✅
12. Predicting Cervical Cancer ✅
13. Predicting Diamond Price ✅
14. Predicting Future Sales ✅
15. Predicting Student Success Rate ✅
16. Predicting Video Game Sales ✅
17. Sarcasm Classification ✅
18. Stephen Curry NBA Stat Analysis ✅
19. Wine Class Prediction ✅
20. Visualizing_Nobel_Prize_History ✅
21. 911 Calls Capstone ✅
22. Airline Passenger Prediction ✅
23. Credit Card Clustering ✅
24. Predicting Electricity Production ✅
25. Predicting Beer Production in Aus ✅
26. Dogecoin Price Prediction ✅
27. Housing Price Prediction ✅
28. Medical Insurance Premium Prediction ✅
29. Water Quality Clustering ✅
30. Customer Churn Rate Prediction ✅
31. Visualizing Students Performance ✅

Project Descriptions:

01. Classifying Iris Species :

In this project, I started by loading the Iris dataset from scikit-learn. The primary goal was to classify the distinct Iris species using both K-Nearest Neighbors (KNN) and Random Forest models. After acquiring the dataset, I split it into training and testing sets to ensure a thorough evaluation of model performance. I then generated a correlation matrix, revealing potential relationships among variables and providing valuable insights into the dataset's intrinsic structure. The next steps involved an assessment of model performance using metrics such as accuracy, classification reports, and confusion matrices. To verify the models' ability to handle new data effectively, I used cross-validation techniques, specifically KFold and LeaveOneOut. The final touch involved utilizing the ConfusionMatrixDisplay function for a detailed and visually insightful interpretation of the classification results.

Link : Classifying Iris Species
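The workflow above can be sketched as follows (a minimal illustration, not the notebook's exact code):

```python
# Sketch of the Iris workflow: load, split, fit KNN and Random Forest,
# then estimate generalization with KFold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for model in (KNeighborsClassifier(n_neighbors=5),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # KFold cross-validation checks how well the model handles unseen folds
    cv_acc = cross_val_score(
        model, X, y,
        cv=KFold(n_splits=5, shuffle=True, random_state=42)).mean()
    print(f"{type(model).__name__}: test={test_acc:.2f}, cv={cv_acc:.2f}")
```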

02. EDA On Banking Dataset :

In this analysis, I started the project by importing and exploring a financial dataset related to marketing interactions. I examined the dataset, addressing missing values and generating descriptive statistics for both numerical and categorical features. Various visualization techniques, including bar charts, heatmaps, scatter plots, histograms, and box plots, were employed to uncover patterns and trends in the data. Specific analyses included scrutinizing the distribution of the target variable ("y") and exploring correlations between different features. Crosstab and pivot tables were utilized to gain deeper insights into relationships among categorical variables. The project prioritized not only numerical summaries but also visual exploration to provide a comprehensive understanding of the financial dataset, laying the groundwork for further analysis and decision-making in the marketing context.

Link: EDA On Banking Dataset
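The crosstab and pivot-table step can be sketched like this (toy data; the "job", "age", and "y" columns are assumptions based on the standard bank-marketing schema):

```python
import pandas as pd

# Toy frame standing in for the banking dataset
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services", "technician"],
    "age": [34, 41, 29, 52, 38],
    "y":   ["yes", "no", "no", "yes", "no"],
})

# Crosstab: how often each job subscribed ("y") vs. not
ct = pd.crosstab(df["job"], df["y"])
print(ct)

# Pivot table: mean age per job and outcome
print(df.pivot_table(values="age", index="job", columns="y", aggfunc="mean"))
```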

03. EDA On Hotel Booking Dataset :

In this project, the analysis begins with the exploration of a hotel-related dataset, focusing on various aspects of booking details and guest demographics. The dataset undergoes an examination, encompassing the handling of missing values and the generation of descriptive statistics for both numerical and categorical features. Diverse visualization techniques, including bar charts, heatmaps, scatter plots, histograms, and box plots, are systematically employed to unveil underlying patterns and trends in the data. The analysis specifically targets the distribution of the target variable ("is_canceled") and explores correlations between different features. Additionally, crosstab and pivot tables are utilized to delve deeper into relationships among categorical variables, providing valuable insights. The project places equal emphasis on numerical summaries and visual exploration, contributing to a holistic understanding of the hotel dataset and establishing a foundation for subsequent analyses and decision-making within the hospitality context.

Link: EDA On Hotel Booking Dataset

04. Glass Type Classification :

In this Glass Type Classification project, I initiated the analysis by loading a dataset containing information about glass samples, particularly their chemical composition. The dataset includes attributes such as Refractive Index, Sodium, Magnesium, Aluminum, Silicon, Potassium, Calcium, Barium, Iron, and the target variable "Type," representing the glass type. Exploratory Data Analysis (EDA) was performed to examine the dataset, handling missing values and addressing duplicates. Various visualization techniques, including bar charts for glass type distribution, correlation matrices, and bar plots illustrating correlations with the target variable, were employed to reveal patterns and relationships among features. The project then proceeded to split the dataset into training and testing sets for model evaluation. A Random Forest Classifier was selected and trained on the training set, followed by a comprehensive error analysis using metrics like accuracy, precision, recall, and a confusion matrix.

Link: Glass Type Classification
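The classifier-and-error-analysis portion can be sketched as below (synthetic stand-in features; the real notebook uses the glass dataset's chemical-composition columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))            # stand-in for RI, Na, Mg, ... Fe
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for the glass "Type"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```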

05. Migration Prediction :

In this project on predicting migration rates, I started the analysis by loading a dataset containing socio-economic and demographic factors related to migration. The objective was to develop a machine learning model capable of accurately predicting migration rates. After importing the dataset, I explored its contents, addressing missing values and imputing them with appropriate strategies, such as filling with median values. I then performed data exploration, identifying unique values in categorical columns and transforming them into numerical representations using factorization. The dataset was split into training and testing sets for model evaluation, and a Random Forest Regressor was chosen as the predictive model. The analysis also involved visualizations, including a line plot showing the total migration rates over the years and a heatmap illustrating the correlation matrix between different features.

Link: Migration Prediction
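The imputation-and-factorization steps can be sketched as follows (the column names here are invented for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative frame standing in for the migration dataset
df = pd.DataFrame({
    "country": ["NZ", "AU", "NZ", "UK", "AU", "UK"],
    "year":    [2000, 2000, 2001, 2001, 2002, 2002],
    "rate":    [1.2, None, 1.4, 2.0, 1.8, 2.1],
})

# Fill missing values with the median, as in the notebook
df["rate"] = df["rate"].fillna(df["rate"].median())

# factorize() turns category labels into integer codes
df["country"], _ = pd.factorize(df["country"])

model = RandomForestRegressor(random_state=0)
model.fit(df[["country", "year"]], df["rate"])
print(model.predict(df[["country", "year"]]))
```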

07. Mobile Price Classification :

In this Mobile Price Classification project, I began by loading a dataset containing mobile phone features for predicting price ranges. The goal was to develop a machine learning model capable of classifying mobile phone prices into different categories based on their specifications. The dataset was processed, and statistical information was gathered, providing insights into the characteristics of mobile phone prices. A correlation matrix was created to understand the relationships between different features, and the correlation with the price range was visualized. The dataset was then scaled to ensure equal contribution of features to the model. Subsequently, the dataset was split into training and testing sets, and a Logistic Regression model was trained. The model's predictions were evaluated using accuracy, and error analysis was performed using a confusion matrix and classification report.

Link: Mobile Price Classification
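The scale-then-classify step can be sketched like this (synthetic features on deliberately mixed scales, mimicking phone specs such as RAM and battery capacity):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5)) * [1, 10, 100, 1000, 5]  # mixed feature scales
y = (X[:, 2] / 100 + X[:, 0] > 0).astype(int)          # stand-in price range

# Scale so every feature contributes comparably to the model
X_scaled = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=1)
clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```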

08. Multiple Sclerosis (MS) Disease Classification :

In this analysis, the project began by importing and exploring a dataset focused on predicting the conversion of clinically isolated syndrome to multiple sclerosis (MS). Thorough examination included addressing missing values, notably in the 'Schooling' and 'Initial_Symptom' columns, through mean value imputation. Descriptive statistics provided insights into both numerical and categorical features. Visualization techniques, such as bar charts for 'group' distribution and a heatmap for the correlation matrix, revealed relationships between features. Crosstab and pivot tables deepened exploration of categorical variable relationships like 'Gender' and 'group'. Prioritizing both numerical summaries and visual exploration, the approach facilitated a comprehensive understanding, especially regarding the target variable ('group').

Link: Multiple Sclerosis (MS) Disease Classification
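The mean-imputation step for 'Schooling' and 'Initial_Symptom' can be sketched as below (toy values; only the column names come from the description):

```python
import pandas as pd

# Toy frame mirroring the columns the notebook imputes with mean values
df = pd.DataFrame({
    "Schooling":       [12.0, None, 16.0, 10.0],
    "Initial_Symptom": [3.0, 5.0, None, 3.0],
    "group":           [1, 0, 1, 0],
})

for col in ["Schooling", "Initial_Symptom"]:
    df[col] = df[col].fillna(df[col].mean())

print(df.isna().sum())  # all missing values are now filled
```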

09. Mushroom Classification :

In this Mushroom Classification project, the primary objective is to develop an accurate system for classifying different types of mushrooms based on their characteristics. The analysis begins by exploring the dataset containing information about mushrooms; steps are taken to handle missing values and perform label encoding to convert categorical variables into numerical representations. Descriptive statistics, including the dataset's shape, information, and the distribution of the target variable ("class"), are analyzed. A correlation matrix and a correlation bar plot with the target variable are employed for feature analysis. The dataset is split into training and testing sets, and features are scaled using StandardScaler. Four machine learning classifiers, namely Random Forest Classifier, Logistic Regression, Decision Tree Classifier, and K-Nearest Neighbors (KNN), are implemented and evaluated for accuracy using confusion matrices. Additionally, the Support Vector Classifier (SVC) is applied. The accuracy scores of each algorithm are summarized in a table, indicating high accuracy for most classifiers, with Random Forest Classifier achieving a perfect accuracy score of 1.0.

Link: Mushroom Classification
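Since every mushroom attribute is categorical, the label-encoding step looks roughly like this (tiny made-up sample; real columns like "cap-shape" and "odor" follow the same pattern):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny sample; every column is categorical, including the target "class"
df = pd.DataFrame({
    "cap-shape": ["x", "b", "x", "f"],
    "odor":      ["p", "a", "n", "p"],
    "class":     ["p", "e", "e", "p"],
})

# One LabelEncoder per column maps each category to an integer code
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
print(encoders["class"].classes_)  # original labels, in encoded order
```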

10. Predicting Body Activity :

In this body activity prediction project, the dataset is explored and preprocessed, incorporating essential libraries for analysis and visualization. Exploratory data analysis (EDA) is conducted to understand the dataset's characteristics, followed by the concatenation of training and testing sets. After scaling and principal component analysis (PCA), the data is split for model development using the K-Nearest Neighbors (KNN) algorithm. Cross-validation with 10 folds is employed for evaluation, and the results, including accuracy, confusion matrix, and classification report, are presented.

Link: Predicting Body Activity
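The scale → PCA → KNN pipeline with 10-fold cross-validation can be sketched as below (synthetic correlated "sensor" channels stand in for the real features):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 1))              # shared underlying signal
X = base + 0.5 * rng.normal(size=(200, 20))   # 20 correlated channels
y = (base[:, 0] > 0).astype(int)              # stand-in activity label

# Scale -> PCA -> KNN, mirroring the notebook's preprocessing order
pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=10)   # 10-fold cross-validation
print(f"10-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```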

11. Predicting CO2 Emission Per Capita :

The "Predicting CO2 Emission Per Capita" project focuses on developing a data-driven solution to forecast carbon dioxide emissions on a per capita basis. Initial steps involve importing relevant libraries and loading a dataset containing environmental, atmospheric, and socio-economic attributes. Data processing includes handling missing values, exploring statistical information, and visualizing CO2 emissions trends over time. A correlation matrix and attribute correlation analysis provide insights into relationships within the dataset. The dataset is then split for model development, utilizing linear regression and random forest regression algorithms. The models are trained, predictions are made, and their respective scores are evaluated.

Link: Predicting CO2 Emission Per Capita
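Comparing the two regressors' scores can be sketched like this (synthetic data; the stand-in features are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 3))  # stand-ins for socio-economic factors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Fit both models and compare their held-out R^2 scores
for model in (LinearRegression(), RandomForestRegressor(random_state=3)):
    model.fit(X_train, y_train)
    print(f"{type(model).__name__} R^2: {model.score(X_test, y_test):.3f}")
```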

12. Predicting Cervical Cancer :

The "Predicting Cervical Cancer" project involves the development of a machine learning model to predict cervical cancer likelihood based on patient attributes. The initial steps include importing Python libraries, loading the dataset, and performing exploratory data analysis (EDA) to handle missing data, conduct statistical analysis, visualize data distributions, and analyze feature correlations. The correlation matrix guides data preprocessing steps such as filling missing values and scaling features. The dataset is then split into training, testing, and validation sets for effective model training and evaluation. The XGBoost classifier is selected for model development, trained on the training set, and evaluated on both training and testing sets. Error analysis, including a confusion matrix and classification report, is performed to assess model accuracy and identify areas for improvement.

Link: Predicting Cervical Cancer

13. Predicting Diamond Price :

In the "Predicting Diamond Price" project, the goal is to construct a predictive model for estimating diamond prices based on diverse characteristics. The project starts by importing essential Python libraries. Following this, the dataset is loaded and subjected to thorough exploratory data analysis (EDA), which involves visualizations, statistical analysis, and correlation assessments to uncover attribute relationships. Through data preprocessing steps such as encoding categorical variables and scaling features, the dataset is prepared for model development. The dataset is strategically split into training and testing sets to facilitate effective model training and evaluation. Two regression models, the Decision Tree Regressor and the Random Forest Regressor, are employed and evaluated using performance metrics like the R-squared score and mean squared error. Visualizations are utilized to showcase the model's performance.

Link: Predicting Diamond Price
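The two-regressor comparison can be sketched as follows (synthetic carat/cut features; the real notebook uses the full diamond attribute set):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(4)
carat = rng.uniform(0.2, 3.0, 500)
cut = rng.integers(0, 5, 500)  # cut quality, already label-encoded
price = 4000 * carat + 200 * cut + rng.normal(scale=300, size=500)

X = np.column_stack([carat, cut])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=4)

# Evaluate both tree-based regressors with R^2 and MSE
for model in (DecisionTreeRegressor(random_state=4),
              RandomForestRegressor(random_state=4)):
    pred = model.fit(X_train, y_train).predict(X_test)
    print(type(model).__name__,
          "R2:", round(r2_score(y_test, pred), 3),
          "MSE:", round(mean_squared_error(y_test, pred), 1))
```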

14. Predicting Future Sales :

The "Predicting Future Sales" project aims to develop a machine learning model to forecast future sales based on historical data. Key steps include importing Python libraries, loading the sales dataset, and conducting exploratory data analysis (EDA) to understand patterns and relationships. The project emphasizes data visualization for insights into attribute distributions and employs statistical information and correlation matrices for preprocessing. The dataset is split into training and testing sets, and an XGBoost classifier is chosen for model development. Evaluation metrics, including confusion matrices and classification reports, assess the model's performance on both training and testing sets.

Link: Predicting Future Sales

15. Predicting Student Success Rate :

The "Predicting Student Success Rate" project focuses on developing a machine learning model for forecasting students' academic performance using historical data. The project begins by importing essential Python libraries; a dataset containing comprehensive information about students' enrollment and academic performance is then loaded for exploratory data analysis (EDA). EDA involves examining data distributions, correlations, and dependencies, facilitating the identification of outliers and trends. Statistical information is utilized to gain insights into the dataset, and correlation matrices guide feature selection. The dataset is preprocessed through label encoding, and irrelevant features are dropped to enhance model performance. The correlation of features with the target variable (dropout) is analyzed, leading to informed feature selection decisions. The dataset is split into training and testing sets, and scaling is applied for numerical feature normalization. The RandomForestClassifier is chosen for model development, and hyperparameter tuning is conducted using Grid Search to optimize model performance. The model is evaluated, and error analysis metrics such as accuracy, precision, and F1-score are calculated. Confusion matrices visualize model performance.

Link: Predicting Student Success Rate
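The Grid Search tuning step can be sketched like this (synthetic data, and a small illustrative parameter grid; the notebook's actual grid may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for the dropout label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# Grid Search tries every parameter combination with cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=5),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)
print("best params  :", grid.best_params_)
print("test accuracy:", round(grid.score(X_test, y_test), 3))
```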

16. Predicting Video Game Sales :

The "Predicting Video Game Sales" project aims to develop a regression model for forecasting global video game sales based on features such as rank, regional sales data (North America, Europe, Japan, and others), and other attributes like platform, year, genre, and publisher. The project starts by importing essential Python libraries and loading a comprehensive dataset containing information about various video games. Data processing involves handling missing values, and statistical information and correlation matrices are employed for data exploration. The correlation between features and the target variable (global sales) is analyzed to inform feature selection. The dataset is split into training and testing sets, and a Linear Regression model is chosen for development. The model is trained, evaluated, and tested, with predictions made on the testing set. Performance metrics, including the coefficient of determination (R-squared), are calculated to assess the model's accuracy.

Link: Predicting Video Game Sales

17. Sarcasm Classification :

The "Sarcasm Detection" project focuses on developing a model for identifying sarcasm in text headlines. The dataset consists of headlines and corresponding labels indicating whether each headline is sarcastic or not. The project begins with importing essential Python libraries and reading the dataset. The dataset is processed to extract relevant columns, and the feature (headlines) and label (is_sarcastic) are selected. The dataset is split into training and testing sets for model evaluation. The Bernoulli Naive Bayes algorithm is chosen for sarcasm detection. The CountVectorizer is employed to convert text data into a format suitable for model training. The model is trained, evaluated, and tested, with accuracy as the primary evaluation metric. Error analysis involves calculating accuracy scores, generating classification reports, and visualizing a confusion matrix.

Link: Sarcasm Classification
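The CountVectorizer + Bernoulli Naive Bayes pipeline can be sketched as below (a tiny made-up corpus; the real dataset has thousands of labeled headlines):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up corpus with sarcastic (1) and serious (0) headlines
headlines = [
    "scientists discover water on mars",
    "area man wins argument with toaster",
    "stock markets close higher today",
    "local cat elected mayor in landslide",
]
is_sarcastic = [0, 1, 0, 1]

# CountVectorizer turns text into a bag-of-words matrix
vec = CountVectorizer()
X = vec.fit_transform(headlines)

clf = BernoulliNB().fit(X, is_sarcastic)
print(clf.predict(vec.transform(["man wins argument"])))
```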

18. Stephen Curry NBA Stat Analysis :

In this NBA player box score analysis, the project begins by importing relevant Python libraries and utilizing the Sportsdataverse library to load NBA player box score data for the 2022 season, with a particular focus on Stephen Curry's performance. The dataset is explored by examining the structure and information it contains, addressing any missing values, and generating descriptive statistics. Various statistical analyses are conducted, such as identifying the maximum minutes played and exploring specific game scenarios. The distribution of points scored by Stephen Curry is visualized using a bar chart, providing insights into his scoring patterns. This project emphasizes both numerical summaries and visual exploration, employing line plots to showcase trends over time.

Link: Stephen Curry NBA Stat Analysis

19. Wine Class Prediction :

In this wine class prediction project, I initiated the analysis by importing essential Python libraries. The focus was on exploring a wine dataset containing information about different wine samples. The initial steps involved examining the dataset's structure, checking for missing values, and obtaining an overview of the dataset through descriptive statistics. Visualizations played a crucial role in the exploratory data analysis (EDA), employing bar charts to illustrate the distribution of wine classes. Additionally, a correlation matrix heatmap was generated to reveal relationships between various wine features. The correlation of individual features with the target variable, representing wine classes, was also investigated using bar plots. The dataset was split into training and testing sets, and numerical feature scaling was applied to standardize the data. For model development, a RandomForestClassifier was selected, and its performance was evaluated using accuracy scores, confusion matrices, and classification reports.

Link: Wine Class Prediction

20. Visualizing_Nobel_Prize_History :

In the "Visualization of Nobel Prize Datasets" project, essential Python libraries were employed to explore and analyze extensive datasets related to Nobel Prize laureates and their achievements. The analysis covers a range of aspects, including the distribution of Nobel Prizes across countries, the dominance of the United States, and the percentage of female winners over decades and categories. Visualizations such as bar charts, line plots, and scatter plots are utilized to unveil trends and patterns in the data. The project also investigates the age at which Nobel Prize winners receive their awards, providing insights into variations across different fields. The comprehensive visualizations contribute to a deeper understanding of Nobel Prize history, facilitating informed decision-making based on the identified patterns and trends in the datasets.

Link: Visualizing_Nobel_Prize_History

21. 911 Calls Capstone :

The provided Python script initiates the exploration of a 911 emergency call dataset by importing essential libraries, loading the dataset into a Pandas dataframe, and investigating potential outliers or patterns. Key columns, such as 'lat,' 'lng,' 'desc,' 'zip,' 'title,' 'timeStamp,' 'twp,' 'addr,' and 'e,' are identified and analyzed. The code employs visualizations, including line plots and count plots, to examine the frequency of emergency calls categorized by reasons ('Traffic,' 'EMS,' 'Fire') over time. Specific dates, such as March 2, 2018, and November 15, 2018, are scrutinized, offering insights into the distribution of call reasons on those days. The concluding line plots provide a detailed representation of the counts of emergency calls for each reason over time, contributing to a comprehensive understanding of patterns in the 911 call dataset.

Link: 911 Calls Capstone
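Deriving the call reason and grouping by date can be sketched like this (a minimal stand-in frame; real 'title' values follow the "Reason: detail" pattern):

```python
import pandas as pd

# Minimal stand-in for the 911 dataframe
df = pd.DataFrame({
    "title": ["EMS: BACK PAINS/INJURY", "Fire: GAS-ODOR/LEAK",
              "Traffic: VEHICLE ACCIDENT -", "EMS: CARDIAC EMERGENCY"],
    "timeStamp": ["2018-03-02 08:00:00", "2018-03-02 09:30:00",
                  "2018-11-15 17:45:00", "2018-11-15 18:10:00"],
})

# Derive the call reason from the title prefix before the colon
df["Reason"] = df["title"].str.split(":").str[0]

# Parse timestamps so calls can be grouped by date
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df["Date"] = df["timeStamp"].dt.date

print(df.groupby(["Date", "Reason"]).size())
```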
