# 🍷 Wine Quality Classification: From Linear Models to Feature-Engineered KNN

## 📋 Project Overview

This project explores the problem of wine quality classification (Low, Medium, High) using a structured machine learning pipeline. The goal was not just to build a high-performing model, but to deeply understand:

- How different models behave on the same dataset
- Whether performance is limited by model choice or data structure
- How feature engineering impacts separability

The project follows a progressive modeling approach, starting from simple linear models and evolving toward more flexible algorithms with engineered features.
## 🧠 Problem Statement

Wine quality is inherently subjective and continuous in nature, yet it is framed here as a multi-class classification problem.

The key challenge: significant overlap between quality classes, especially the medium category.
## 🔄 Workflow & Methodology

### 🔹 1. Exploratory Data Analysis (EDA)

- Univariate analysis: identified skewness, outliers, and feature distributions
- Bivariate analysis, key relationships discovered:
  - Alcohol, sulphates → positive correlation with quality
  - Volatile acidity → negative correlation
- Observed non-linear and threshold effects

📌 Early insight: wine quality behaves like a gradient, not distinct clusters.
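A single correlation scan already surfaces these relationships. A minimal sketch, assuming the UCI red-wine CSV with its `;` separator (file name is an assumption):

```python
import pandas as pd

# Load the dataset; path and ';' separator assumed (UCI convention)
df = pd.read_csv("winequality-red.csv", sep=";")

# Correlation of every numeric feature with the raw 0-10 quality score;
# alcohol and sulphates should rank positive, volatile acidity negative
print(df.corr(numeric_only=True)["quality"].sort_values(ascending=False))
```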
### 🔹 2. Data Preprocessing

- Handling missing/invalid values
- Feature scaling using StandardScaler
- Train-test split with stratification (to preserve class distribution)
- Encoding the target into three classes: low, medium, high
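A minimal sketch of this preprocessing, continuing from the loading step above. The cut points for binning the 0-10 score into three classes (≤5 low, 6 medium, ≥7 high) are a common convention but an assumption here:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bin the 0-10 quality score into three classes (cut points assumed)
df["quality_label"] = pd.cut(
    df["quality"], bins=[0, 5, 6, 10], labels=["low", "medium", "high"]
)

X = df.drop(columns=["quality", "quality_label"])
y = df["quality_label"]

# Stratified split preserves the class distribution in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, to avoid leaking test statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```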
### 🔹 3. Baseline Models: LDA & QDA

- Linear Discriminant Analysis (LDA) established a linear separability baseline and showed:
  - Strong clustering for low-quality wines
  - Heavy overlap between medium and high
- Quadratic Discriminant Analysis (QDA) introduced non-linear boundaries:
  - No improvement; slight overfitting

📌 Key takeaway: increasing model complexity alone does not solve class overlap.
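Continuing from the preprocessing sketch, the two discriminant baselines can be compared in a few lines (exact accuracies will vary with the split and binning):

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.metrics import accuracy_score

# LDA: one shared covariance matrix -> linear decision boundaries
lda = LinearDiscriminantAnalysis().fit(X_train_scaled, y_train)

# QDA: per-class covariance matrices -> quadratic (non-linear) boundaries
qda = QuadraticDiscriminantAnalysis().fit(X_train_scaled, y_train)

for name, model in [("LDA", lda), ("QDA", qda)]:
    acc = accuracy_score(y_test, model.predict(X_test_scaled))
    print(f"{name}: {acc:.2%}")
```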
### 🔹 4. Logistic Regression (Raw Features)

- Multinomial Logistic Regression with scaling
- Achieved 64% accuracy (the best linear model)
- Insights:
  - Low-quality wines → clearly separable
  - Medium class → a transition zone
  - High class → frequently confused with medium
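A sketch of this step, reusing the scaled splits from above. scikit-learn's default `lbfgs` solver handles the multinomial case natively, and the per-class report is what exposes the medium/high confusion:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Multinomial (softmax) logistic regression; hyperparameters are
# illustrative defaults, not the project's confirmed settings
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Per-class precision/recall shows which classes get confused
print(classification_report(y_test, log_reg.predict(X_test_scaled)))
```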
### 🔹 5. K-Nearest Neighbors (Raw Features)

- Tuned using GridSearchCV
- Accuracy: 63%
- Behavior:
  - Improved detection of high-quality wines
  - Reduced performance on the low-quality class

📌 Insight: local patterns exist, but they are not strong enough in the raw feature space.
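A sketch of the tuning step; the grid values below are assumptions, since the original search space is not specified:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumed search space: odd neighbor counts plus both weighting schemes
param_grid = {
    "n_neighbors": list(range(3, 31, 2)),
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validation on the scaled training set
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train_scaled, y_train)

print("Best params:", grid.best_params_)
print("Test accuracy:", grid.score(X_test_scaled, y_test))
```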
### 🔹 6. Feature Engineering

Created domain-driven features to capture interactions:

- Ratios (e.g., alcohol/density)
- Balance features (e.g., sulphates vs. acidity)
- Non-linear transformations

Goal: improve feature-space geometry and class separability.
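The exact formulas are not given in the project, so the sketch below illustrates each feature family with assumed definitions (column names follow the UCI dataset):

```python
import numpy as np

def add_engineered_features(frame):
    out = frame.copy()
    # Ratio feature: interaction between two related measures
    out["alcohol_density_ratio"] = out["alcohol"] / out["density"]
    # Balance feature: preservative vs. acidity trade-off
    out["sulphate_acidity_balance"] = out["sulphates"] / (
        out["volatile acidity"] + 1e-6
    )
    # Non-linear transform: compress the long right tail of residual sugar
    out["log_residual_sugar"] = np.log1p(out["residual sugar"])
    return out

X_train_eng = add_engineered_features(X_train)
X_test_eng = add_engineered_features(X_test)
```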
### 🔹 7. Logistic Regression (Engineered Features)

- No performance improvement (~64%)

📌 Insight: the engineered features encode non-linear relationships that a linear decision boundary cannot fully exploit.
### 🔹 8. KNN (Engineered Features): Final Model

- Accuracy: 68% (best performance)
- Improved across all classes:
  - High-quality detection ✅
  - Medium-class stability ✅
  - Low class remains strong

📌 Key breakthrough: feature engineering + local learning significantly improves performance.
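A sketch of the final evaluation on the engineered features; `n_neighbors=15` is a placeholder for whatever value the grid search actually selected:

```python
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Rescale the engineered set: KNN is distance-based, so scaling matters
eng_scaler = StandardScaler()
X_train_eng_scaled = eng_scaler.fit_transform(X_train_eng)
X_test_eng_scaled = eng_scaler.transform(X_test_eng)

# n_neighbors is a placeholder; in practice, reuse the GridSearchCV winner
knn_final = KNeighborsClassifier(n_neighbors=15, weights="distance")
knn_final.fit(X_train_eng_scaled, y_train)

# Rows = true class, columns = predicted; off-diagonal cells show
# exactly where any remaining medium/high confusion sits
labels = ["low", "medium", "high"]
print(confusion_matrix(y_test, knn_final.predict(X_test_eng_scaled), labels=labels))
```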
## 📊 Model Comparison Summary

| Model                 | Accuracy | Key Insight              |
|-----------------------|----------|--------------------------|
| LDA                   | 61%      | Linear baseline          |
| QDA                   | 61%      | No gain, overfitting     |
| Logistic (Raw)        | 64%      | Best linear model        |
| KNN (Raw)             | 63%      | Captures local structure |
| Logistic (Engineered) | 64%      | No improvement           |
| KNN (Engineered)      | 68%      | ✅ Best model            |
## 🧠 Key Learnings

- **Class overlap is the core challenge**
  - The medium class behaves as a transition zone
  - High vs. medium separation remains difficult
- **Model complexity ≠ better performance**
  - QDA did not outperform LDA
  - Non-linearity alone is not enough
- **Feature engineering is critical**
  - Raw features limit performance
  - Engineered features improve representation
- **Model-feature alignment matters**
  - Logistic Regression → prefers linear relationships
  - KNN → benefits from local, non-linear structure
  - The best results come from aligning feature design with model capability
## 🏁 Final Conclusion

The best performance (68% accuracy) was achieved using KNN with engineered features, demonstrating that:

- Wine quality classification is a fuzzy-boundary problem
- Feature engineering can reshape the problem space
- Local learning methods can better exploit complex feature interactions
## 🚀 Future Improvements

- Try Random Forest / Gradient Boosting (XGBoost, LightGBM)
- Perform advanced feature selection
- Explore dimensionality reduction (PCA)
- Reframe the problem as regression, or as binary classification (high vs. rest); see the sketch below
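A sketch of the reframing options, with an assumed threshold for the binary case:

```python
# Binary reframing (assumed threshold: quality >= 7 counts as "high")
y_binary = (df["quality"] >= 7).astype(int)  # 1 = high, 0 = rest

# Regression alternative: predict the raw 0-10 score directly
y_regression = df["quality"].astype(float)
```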
## 💡 Takeaway

This project is not just about building models; it is about understanding why models succeed or fail, and how data representation shapes outcomes.