datascientistshorya/WineQT-classification

# 🍷 Wine Quality Classification: From Linear Models to Feature-Engineered KNN

## 📌 Project Overview

This project explores the problem of wine quality classification (Low, Medium, High) using a structured machine learning pipeline. The goal was not just to build a high-performing model, but to deeply understand:

- How different models behave on the same dataset
- Whether performance is limited by model choice or by data structure
- How feature engineering impacts separability

The project follows a progressive modeling approach, starting from simple linear models and evolving toward more flexible algorithms with engineered features.

## 🧠 Problem Statement

Wine quality is inherently subjective and continuous, yet it is framed here as a multi-class classification problem.

The key challenge: significant overlap between quality classes, especially the medium category.

πŸ” Workflow & Methodology πŸ”Ή 1. Exploratory Data Analysis (EDA)

- Univariate analysis:
  - Identified skewness, outliers, and feature distributions
- Bivariate analysis — key relationships discovered:
  - Alcohol and sulphates → positive correlation with quality
  - Volatile acidity → negative correlation with quality
  - Non-linear and threshold effects observed

👉 Early insight: wine quality behaves like a gradient, not distinct clusters.

### 🔹 2. Data Preprocessing

- Handling missing/invalid values
- Feature scaling using StandardScaler
- Train-test split with stratification (to preserve class distribution)
- Encoding the target into three classes: low, medium, high
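The preprocessing steps can be sketched as follows. This is a minimal illustration on synthetic stand-in data (the repository's WineQT CSV isn't loaded here); the quality cut points and the three example columns are assumptions modeled on the standard wine-quality schema.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the WineQT data: quality is an integer score (3-8).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 500),
    "sulphates": rng.normal(0.65, 0.15, 500),
    "volatile acidity": rng.normal(0.53, 0.18, 500),
    "quality": rng.integers(3, 9, 500),
})

# Bin the integer score into three classes (cut points are an assumption).
df["quality_label"] = pd.cut(df["quality"], bins=[2, 4, 6, 8],
                             labels=["low", "medium", "high"])

X = df.drop(columns=["quality", "quality_label"])
y = df["quality_label"]

# Stratified split preserves the class distribution in both folds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training fold only, to avoid leakage into the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```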

### 🔹 3. Baseline Models

**✅ Linear Discriminant Analysis (LDA)**

- Established the linear-separability baseline
- Showed strong clustering for low-quality wines, but heavy overlap between medium and high

**⚠️ Quadratic Discriminant Analysis (QDA)**

- Introduced non-linear decision boundaries
- No improvement → slight overfitting

👉 Key takeaway: increasing model complexity alone does not solve class overlap.
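The LDA/QDA comparison can be sketched like this. The overlapping synthetic classes below are a stand-in for the medium/high confusion described above; actual accuracies on the wine data will differ.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

# Three synthetic classes: one well separated, two heavily overlapping,
# mimicking the low vs. medium/high structure in the wine data.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, (150, 4)),   # "low"  -- well separated
    rng.normal(3.0, 1.0, (150, 4)),   # "medium"
    rng.normal(3.6, 1.0, (150, 4)),   # "high" -- overlaps medium
])
y = np.array(["low"] * 150 + ["medium"] * 150 + ["high"] * 150)

lda_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
qda_acc = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5).mean()
# When classes genuinely overlap, QDA's extra flexibility (per-class
# covariances) buys little -- the boundary isn't the bottleneck.
```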

### 🔹 4. Logistic Regression (Raw Features)

- Multinomial logistic regression on scaled features
- Achieved 64% accuracy (best linear model)
- Insights:
  - Low-quality wines → clearly separable
  - Medium class → a transition zone
  - High class → frequently confused with medium
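A sketch of the multinomial setup: with scikit-learn's default lbfgs solver, LogisticRegression fits a multinomial (softmax) model for multi-class targets, so no extra flag is needed. The data and parameters below are illustrative stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative 3-class problem with 11 features (the wine data has 11
# physicochemical features); not the real dataset.
X, y = make_classification(n_samples=600, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Pipeline keeps scaling and fitting together, so the scaler is fit
# only on training data.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```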

### 🔹 5. K-Nearest Neighbors (Raw Features)

- Tuned using GridSearchCV
- Accuracy: 63%
- Behavior:
  - Improved detection of high-quality wines
  - Reduced performance on the low-quality class

👉 Insight: local patterns exist, but are not strong enough in the raw feature space.
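The GridSearchCV tuning might be set up like this; the actual grid used in the project isn't listed, so the parameter values here are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)

# Scaling matters for KNN: otherwise distances are dominated by
# whichever feature has the largest raw units.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [5, 11, 21],        # assumed grid
                "knn__weights": ["uniform", "distance"]},
    cv=5)
grid.fit(X, y)
best_k = grid.best_params_["knn__n_neighbors"]
```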

### 🔹 6. Feature Engineering

Created domain-driven features to capture interactions:

- Ratios (e.g., alcohol/density)
- Balance features (e.g., sulphates vs. acidity)
- Non-linear transformations

Goal: improve the feature-space geometry and class separability.
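The three feature families could be constructed like this. The exact engineered features in the notebook aren't enumerated, so the column names and formulas below are illustrative examples of each category, on synthetic stand-in data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in columns (means roughly match red-wine typical values).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 200),
    "density": rng.normal(0.9967, 0.002, 200),
    "sulphates": rng.normal(0.65, 0.15, 200),
    "volatile acidity": rng.normal(0.53, 0.18, 200),
    "fixed acidity": rng.normal(8.3, 1.7, 200),
})

# Ratio feature: alcohol relative to density.
df["alcohol_density_ratio"] = df["alcohol"] / df["density"]

# Balance feature: preservative level vs. spoilage-related acidity
# (epsilon guards against division by zero).
df["sulphate_acidity_balance"] = df["sulphates"] / (df["volatile acidity"] + 1e-6)

# Non-linear transform to tame right-skewed distributions.
df["log_fixed_acidity"] = np.log1p(df["fixed acidity"])
```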

### 🔹 7. Logistic Regression (Engineered Features)

- No performance improvement (~64%)

👉 Insight: the engineered features introduced non-linear relationships that a linear model cannot exploit.

### 🔹 8. KNN (Engineered Features) — ✅ Final Model

- Accuracy: 68% (best performance)
- Improved across all classes:
  - High-quality detection ↑
  - Medium-class stability ↑
  - Low class remains strong

👉 Key breakthrough: feature engineering + local learning significantly improves performance.
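One way to assemble the final model: keeping the feature engineering inside the pipeline (here via FunctionTransformer) means the engineered features and the scaler are recomputed on each CV training fold, avoiding leakage. The `add_ratios` helper is a hypothetical stand-in for the project's actual engineered features, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_ratios(X):
    # Append a ratio of the first two columns as an extra feature
    # (hypothetical stand-in for features like alcohol/density).
    ratio = X[:, [0]] / (np.abs(X[:, [1]]) + 1e-6)
    return np.hstack([X, ratio])

X, y = make_classification(n_samples=500, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)

pipe = Pipeline([
    ("engineer", FunctionTransformer(add_ratios)),  # feature engineering
    ("scale", StandardScaler()),                    # then scaling
    ("knn", KNeighborsClassifier(n_neighbors=11)),  # then local learning
])
acc = cross_val_score(pipe, X, y, cv=5).mean()
```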

## 📊 Model Comparison Summary

| Model | Accuracy | Key Insight |
|---|---|---|
| LDA | 61% | Linear baseline |
| QDA | 61% | No gain, overfitting |
| Logistic (Raw) | 64% | Best linear model |
| KNN (Raw) | 63% | Captures local structure |
| Logistic (Engineered) | 64% | No improvement |
| KNN (Engineered) | **68%** | ✅ Best model |

## 🧠 Key Learnings

1. **Class Overlap is the Core Challenge**
   - The medium class behaves as a transition zone
   - High vs. medium separation remains difficult
2. **Model Complexity ≠ Better Performance**
   - QDA did not outperform LDA
   - Non-linearity alone is not enough
3. **Feature Engineering is Critical**
   - Raw features limit performance
   - Engineered features improve the representation
4. **Model-Feature Alignment Matters**
   - Logistic Regression prefers linear relationships
   - KNN benefits from local, non-linear structure
   - The best results come from aligning feature design with model capability

## 🚀 Final Conclusion

The best performance (68% accuracy) was achieved using KNN with engineered features, demonstrating that:

- Wine quality classification is a fuzzy-boundary problem
- Feature engineering can reshape the problem space
- Local learning methods can better exploit complex feature interactions

## 📈 Future Improvements

- Try Random Forest / gradient boosting (XGBoost, LightGBM)
- Perform advanced feature selection
- Explore dimensionality reduction (PCA)
- Reframe the problem as regression, or as binary classification (high vs. rest)
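The high-vs-rest reframing amounts to a one-line relabeling; the labels below are illustrative.

```python
import numpy as np

# Hypothetical 3-class labels as produced earlier in the pipeline.
y = np.array(["low", "medium", "high", "medium", "high", "low"])

# High-vs-rest: collapse the fuzzy low/medium boundary into one class,
# leaving a single, cleaner decision boundary to learn.
y_binary = (y == "high").astype(int)
```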

## 💡 Takeaway

This project is not just about building models; it's about understanding why models succeed or fail, and how data representation shapes outcomes.

https://www.linkedin.com/in/shorya-bisht-a20144349/

## About

Wine Quality Classification project exploring LDA, QDA, Logistic Regression, and KNN with feature engineering. The progression shows class overlap as the key challenge. Best performance was achieved with KNN + engineered features (68% accuracy), highlighting the impact of feature design and local learning methods.
