A collection of unsupervised machine learning projects built with Python and Scikit-learn, exploring clustering, dimensionality reduction, probabilistic modelling, and recommendation systems on real-world datasets.
| Folder | Technique | Description |
|---|---|---|
| `kmean-hierarchical_clustering` | K-Means + Hierarchical | Segment data into meaningful groups |
| `gaussian_mixture` | Gaussian Mixture Models | Probabilistic soft-assignment clustering |
| `principal_component_analysis` | PCA | Reduce high-dimensional data while preserving variance |
| `dimensional_reduction` | t-SNE / UMAP | Non-linear dimensionality reduction for visualisation |
| `recommendation_system` | Collaborative Filtering | Build a personalised item recommendation engine |
| `clustering_geospatial` | K-Means + GeoPandas | Cluster world countries by GDP and population on a map |
Applied K-Means clustering to world country data, using GDP and population as features, then visualised the resulting clusters as choropleth maps and multidimensional scatter overlays.
Key techniques:
- K-Means clustering on economic indicators
- Geospatial visualisation with GeoPandas
- Multidimensional plotting — colour (GDP), size (population), shape (continent)
- Coordinate Reference System (CRS) handling
Dataset: Natural Earth Low Resolution — country geometries, GDP, population, continent
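GDP and population are typically heavy-tailed, which matters because K-Means relies on Euclidean distances. The sketch below shows one possible way to prepare such features, assuming a log transform followed by standardisation; the notebook may do this differently. The values are synthetic stand-ins, not the Natural Earth data, and `n_clusters=4` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for per-country GDP and population (NOT real data).
rng = np.random.default_rng(0)
gdp = rng.lognormal(mean=11.0, sigma=2.0, size=150)
population = rng.lognormal(mean=16.0, sigma=1.5, size=150)

# Log-transform the heavy-tailed features, then standardise them so that
# neither feature dominates the distances K-Means minimises.
features = np.column_stack([np.log10(gdp), np.log10(population)])
scaled = StandardScaler().fit_transform(features)

# n_clusters=4 is an assumption made for this example only.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
```

With a real GeoDataFrame, the labels could be stored as a new column and passed to `GeoDataFrame.plot(column=..., categorical=True, legend=True)` to draw the choropleth.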
Compared flat (K-Means) and hierarchical (agglomerative) clustering approaches on structured datasets. Used the elbow method to determine optimal cluster count and dendrograms to visualise merge distances.
Key techniques:
- K-Means with elbow method for optimal k
- Agglomerative hierarchical clustering
- Dendrogram visualisation
- Silhouette score evaluation
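The elbow and silhouette checks listed above can be sketched as follows. This is a minimal illustration on synthetic blobs from `make_blobs`, used here as a stand-in because the notebook's dataset is not named; the range of `k` values is also an assumption.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the project's structured dataset.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

inertias = {}     # within-cluster sum of squares, plotted for the elbow method
silhouettes = {}  # mean silhouette score per k (higher is better)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest silhouette score as a simple heuristic.
best_k = max(silhouettes, key=silhouettes.get)

# Fit the hierarchical counterpart with the same number of clusters.
agg_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)
```

Plotting `inertias` against `k` gives the elbow curve. For the dendrogram, `scipy.cluster.hierarchy.linkage(X, method="ward")` followed by `scipy.cluster.hierarchy.dendrogram` is one common route.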
Modelled data distributions using probabilistic Gaussian Mixture Models (GMM), allowing soft cluster assignments where each data point has a probability of belonging to each cluster.
Key techniques:
- Expectation-Maximisation (EM) algorithm
- Soft vs hard cluster assignment
- BIC/AIC model selection
- Comparison with K-Means hard clustering
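As a rough sketch of BIC-based model selection and soft assignment, the example below fits several `GaussianMixture` models and keeps the one with the lowest BIC. The synthetic data and the candidate range of 1 to 6 components are assumptions for illustration, not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data standing in for the project's dataset.
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.2, random_state=0)

# Fit one GMM per candidate component count and record its BIC (lower is better).
models = {}
bic = {}
for n in range(1, 7):
    gmm = GaussianMixture(n_components=n, covariance_type="full", random_state=0).fit(X)
    models[n] = gmm
    bic[n] = gmm.bic(X)

best_n = min(bic, key=bic.get)
best = models[best_n]

# Soft assignment: one probability per component for every point.
probs = best.predict_proba(X)
# Hard assignment: the most probable component, comparable to K-Means labels.
hard = best.predict(X)
```

Swapping `gmm.bic(X)` for `gmm.aic(X)` gives the AIC variant of the same selection loop.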
Applied PCA to reduce high-dimensional datasets to 2–3 components while retaining as much variance as possible, enabling faster modelling and cleaner visualisations.
Key techniques:
- Explained variance ratio analysis
- Scree plots for component selection
- 2D and 3D PCA projections
- PCA as preprocessing for downstream ML
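The explained-variance analysis can be illustrated with a short sketch. It uses scikit-learn's bundled digits dataset as a stand-in (the notebook's data is not specified here), and the 90% variance threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1797 samples x 64 pixel features, used only as example data.
X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect how variance accumulates.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%.
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)

# 2D projection for plotting.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```

Plotting `pca.explained_variance_ratio_` against the component index gives the scree plot, and `X_2d` can be passed straight to a scatter plot.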
Used non-linear dimensionality reduction techniques to reveal structure in high-dimensional data that a linear method such as PCA cannot capture.
Key techniques:
- t-SNE for cluster visualisation
- UMAP for faster non-linear projection
- Comparison of linear vs non-linear reduction
- Hyperparameter tuning (perplexity, n_neighbors)
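To make the linear versus non-linear comparison concrete, here is a minimal sketch that projects the same data with PCA and with t-SNE at two perplexity values. The digits subset and the perplexity values (5 and 30) are assumptions chosen to keep the example fast, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# A 500-sample subset keeps t-SNE quick; this is example data only.
X = load_digits().data[:500]

# Linear baseline for comparison.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# t-SNE at a low and a moderate perplexity: low values emphasise very local
# neighbourhoods, higher values take more neighbours into account.
embeddings = {}
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
```

Scatter-plotting `X_pca` next to each entry of `embeddings`, coloured by `load_digits().target[:500]`, shows how the projections differ. UMAP follows the same fit/transform pattern through `umap.UMAP(n_neighbors=...)`, but it comes from the separate `umap-learn` package.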
Built a collaborative filtering recommendation engine that suggests items to users based on patterns in historical interactions, the general approach behind recommenders at services such as Netflix and Spotify.
Key techniques:
- User-based and item-based collaborative filtering
- Cosine similarity for user/item matching
- Matrix factorisation concepts
- Evaluation with precision and recall
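Item-based collaborative filtering with cosine similarity can be sketched in a few lines of NumPy. The rating matrix below is made up for illustration, and the helper names `cosine_similarity_matrix` and `recommend` are hypothetical rather than taken from the notebook.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# Values are invented for this example.
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
], dtype=float)


def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the columns (items) of m."""
    norms = np.linalg.norm(m, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalised = m / norms
    return normalised.T @ normalised


def recommend(user, ratings, top_n=2):
    """Rank a user's unrated items by similarity-weighted sum of their ratings."""
    sim = cosine_similarity_matrix(ratings)
    np.fill_diagonal(sim, 0.0)  # an item should not vote for itself
    scores = sim @ ratings[user]
    scores[ratings[user] > 0] = -np.inf  # exclude items already rated
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked[:top_n] if np.isfinite(scores[i])]


recs = recommend(0, ratings)
```

Transposing `ratings` before computing similarities gives the user-based variant, where similar users rather than similar items drive the scores.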
| Library | Purpose |
|---|---|
| `scikit-learn` | Clustering, PCA, GMM, model evaluation |
| `pandas` | Data loading, manipulation |
| `numpy` | Numerical computation |
| `matplotlib` | All visualisations and plots |
| `seaborn` | Statistical visualisation |
| `geopandas` | Geospatial data handling and mapping |
| `shapely` | Geometric operations |

    git clone https://github.com/Lami14/alx_data_science.git
    cd alx_data_science
    pip install scikit-learn pandas numpy matplotlib seaborn geopandas shapely
    cd kmean-hierarchical_clustering
    jupyter notebook

Each folder contains its own notebook with inline explanations and outputs.

- How unsupervised learning finds hidden structure in unlabelled data
- When to choose K-Means vs GMM vs hierarchical clustering for a given problem
- How PCA and t-SNE complement each other — PCA for speed, t-SNE for visualisation
- How real-world recommendation systems work under the hood using similarity metrics
- How to represent economic and demographic data visually on geographic maps
- Add evaluation metrics (silhouette score, Davies-Bouldin index) across all projects
- Deploy the recommendation system as a Flask API
- Add an interactive map dashboard using Folium or Plotly
- Extend geospatial analysis with South African provincial data
MIT License — feel free to fork and build on this work.
Built by Lamla · Unsupervised ML Portfolio · ALX Africa Data Science Programme 🇿🇦