This project predicts tags for StackOverflow questions using machine learning models. It treats the task as a multi-label classification problem: each question is assigned one or more relevant tags based on its content.
- Source: Kaggle Facebook Recruiting III - Keyword Extraction competition.
- Features: Question titles, bodies, and associated tags.
- Size: 6 million rows of training data (questions with tags).
- Objective: Predict tags for StackOverflow questions based on the question’s title and body.
- Challenge: Multi-label classification where each question can have multiple tags.
Data Preprocessing:
- Removed HTML tags and special characters.
- Applied tokenization and stemming.
- Vectorized text data using TF-IDF and CountVectorizer.
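The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample questions are made up, `clean_text` and `simple_stem` are hypothetical helpers, and the suffix stripper is only a crude stand-in for a real stemmer (a real pipeline would more likely use one from a library such as NLTK). TF-IDF is shown; `CountVectorizer` can be swapped in the same way.

```python
import html
import re

from sklearn.feature_extraction.text import TfidfVectorizer

TAG_RE = re.compile(r"<[^>]+>")              # HTML tags such as <p> or <code>
NON_ALNUM_RE = re.compile(r"[^a-z0-9+#]+")   # keeps '+' and '#' so "c++" and "c#" survive


def simple_stem(token: str) -> str:
    """Crude suffix stripper; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_text(raw: str) -> str:
    """Strip HTML, lowercase, drop special characters, tokenize and stem."""
    text = html.unescape(TAG_RE.sub(" ", raw)).lower()
    tokens = NON_ALNUM_RE.sub(" ", text).split()
    return " ".join(simple_stem(t) for t in tokens)


# Made-up sample questions, for illustration only.
docs = [
    "<p>How do I parse <code>JSON</code> strings in Python?</p>",
    "<p>Sorting a vector in C++ &amp; removing duplicates</p>",
]
cleaned = [clean_text(d) for d in docs]

# token_pattern=r"\S+" keeps the whitespace-separated tokens produced above,
# including tokens like "c++" that the default pattern would split.
vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
X = vectorizer.fit_transform(cleaned)  # sparse matrix, one row per question
print(cleaned)
print(X.shape)
```

Keeping `+` and `#` during cleaning is a design choice made for this sketch, since many programming tags contain those characters.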
Clustering and Dimensionality Reduction:
- Used Truncated SVD for dimensionality reduction on high-dimensional text data.
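As a rough sketch of the dimensionality reduction step, scikit-learn's `TruncatedSVD` can be applied directly to a sparse TF-IDF matrix. The documents and the choice of `n_components=2` below are assumptions for illustration; the number of components used in the project is not stated here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample questions, for illustration only.
docs = [
    "read csv file into pandas dataframe",
    "merge two pandas dataframes",
    "javascript click event on button",
    "javascript array map and filter",
    "center a div with css",
    "css grid layout columns",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse, one column per term

# TruncatedSVD works on sparse input without densifying it,
# which is why it suits high-dimensional text features.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (n_docs, 2)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", svd.explained_variance_ratio_.sum())
```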
Models Implemented:
- Logistic Regression with OneVsRestClassifier for multi-label classification.
- SGDClassifier with L1 penalty for improved tag prediction accuracy.
- Multilabel K-Nearest Neighbors (MLkNN) for tag prediction.
- Evaluation Metrics: F1 score (micro and macro), Hamming Loss.
- Best Model Performance:
  - Macro F1 score: 0.77
  - Micro F1 score: 0.85
  - Hamming Loss: 0.14
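To make the modelling and evaluation steps concrete, here is a minimal sketch that trains two of the listed models in a one-vs-rest setup and computes the three metrics with scikit-learn. The toy questions and tags are made up, the hyperparameters (`max_iter=1000`, the default hinge loss for `SGDClassifier`) are assumptions, and the scores it prints say nothing about the results reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score, hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, made up for illustration only.
train_texts = [
    "read csv file into pandas dataframe",
    "python list comprehension syntax",
    "merge two pandas dataframes in python",
    "javascript click event on html button",
    "center a div with html and css",
    "javascript array map and filter",
    "python script to parse html page",
    "pandas groupby with multiple columns",
]
train_tags = [
    ["python", "pandas"],
    ["python"],
    ["python", "pandas"],
    ["javascript", "html"],
    ["html"],
    ["javascript"],
    ["python", "html"],
    ["pandas"],
]
test_texts = ["pandas dataframe column rename", "javascript html form submit"]
test_tags = [["pandas"], ["javascript", "html"]]

# Turn tag lists into a binary indicator matrix: one column per tag.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_tags)
Y_test = mlb.transform(test_tags)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# OneVsRestClassifier fits one binary classifier per tag.
models = {
    "logreg": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "sgd_l1": OneVsRestClassifier(SGDClassifier(penalty="l1", random_state=0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    results[name] = {
        # micro: pools all (question, tag) decisions; macro: averages per-tag F1.
        "micro_f1": f1_score(Y_test, Y_pred, average="micro", zero_division=0),
        "macro_f1": f1_score(Y_test, Y_pred, average="macro", zero_division=0),
        # Fraction of (question, tag) cells predicted incorrectly; lower is better.
        "hamming_loss": hamming_loss(Y_test, Y_pred),
    }
    print(name, results[name])
```

MLkNN is not part of scikit-learn, so it is left out of this sketch; the scikit-multilearn package provides one implementation.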
The project predicts relevant tags for StackOverflow questions using multi-label classification, with data preprocessing and model selection used to improve performance. The best model reaches a micro F1 score of 0.85, a macro F1 score of 0.77, and a Hamming Loss of 0.14.