AI-powered synthetic dataset generation for machine learning education. Uses the Gemini API to design realistic, pedagogically useful datasets for teaching ML classification algorithms.
Currently supports: SVM (Support Vector Machines), KNN (K-Nearest Neighbors), Decision Trees, Logistic Regression, Naive Bayes, Neural Networks (MLP)
ml-dataset-generators/
│
├── README.md # This file
├── .env # API keys (GEMINI_API_KEY) — not committed
├── requirements.txt # Python dependencies
├── batch_examples.json # Example batch config for SVM generation
├── batch_knn_examples.json # Example batch config for KNN generation
├── batch_dt_examples.json # Example batch config for Decision Tree generation
├── batch_lr_examples.json # Example batch config for Logistic Regression generation
├── batch_nb_examples.json # Example batch config for Naive Bayes generation
├── batch_mlp_examples.json # Example batch config for Neural Network generation
│
├── svm/ # SVM dataset generator
│ ├── __init__.py
│ ├── generate.py # CLI entry point
│ ├── gemini_client.py # Gemini prompt + JSON schema for SVM datasets
│ ├── pipeline.py # SVM model fitting (train/test/accuracy)
│ ├── writers.py # Output file generation (CSV, JSON, PNG, MD)
│ ├── batch.py # Batch processing mode
│ ├── svm_explorer.ipynb # Interactive student notebook
│ ├── dataset-gen-prompt v2.md # Gemini prompt spec used to build this generator
│ └── output/ # All generated SVM datasets
│ ├── index.md # Human-readable index of all datasets
│ ├── index.json # Machine-readable index
│ └── <dataset_slug>/ # One folder per generated dataset
│ ├── dataset.csv
│ ├── metadata.json
│ ├── regenerate.py
│ ├── datasheet.md
│ └── visualization.png
│
├── knn/ # KNN dataset generator
│ ├── __init__.py
│ ├── generate.py # CLI entry point
│ ├── gemini_client.py # Gemini prompt + JSON schema for KNN datasets
│ ├── pipeline.py # KNN model fitting (k-sweep, learning curve, comparisons)
│ ├── writers.py # Output file generation (CSV, JSON, PNG, MD)
│ ├── batch.py # Batch processing mode
│ ├── knn_explorer.ipynb # Interactive student notebook
│ ├── dataset-gen-prompt.md # Gemini prompt spec used to build this generator
│ └── output/ # All generated KNN datasets
│ ├── index.md # Human-readable index of all datasets
│ ├── index.json # Machine-readable index
│ └── <dataset_slug>/ # One folder per generated dataset
│ ├── dataset.csv
│ ├── metadata.json
│ ├── regenerate.py
│ ├── datasheet.md
│ └── visualization.png
│
├── decision_tree/ # Decision Tree dataset generator
│ ├── __init__.py
│ ├── generate.py # CLI entry point
│ ├── gemini_client.py # Gemini prompt + JSON schema for DT datasets
│ ├── pipeline.py # DT model fitting (depth sweep, pruning, RF comparison)
│ ├── writers.py # Output file generation (CSV, JSON, PNGs, MD)
│ ├── batch.py # Batch processing mode
│ ├── dt_explorer.ipynb # Interactive student notebook
│ ├── dataset-gen-prompt.md # Gemini prompt spec used to build this generator
│ └── output/ # All generated Decision Tree datasets
│ ├── index.md # Human-readable index of all datasets
│ ├── index.json # Machine-readable index
│ └── <dataset_slug>/ # One folder per generated dataset
│ ├── dataset.csv
│ ├── metadata.json
│ ├── regenerate.py
│ ├── datasheet.md
│ ├── tree.png
│ ├── depth_and_pruning.png
│ ├── feature_importance.png
│ ├── boundary.png # 2D datasets only
│ └── rf_comparison.png # --compare-rf only
│
├── logistic_regression/ # Logistic Regression dataset generator
│ ├── __init__.py
│ ├── generate.py # CLI entry point
│ ├── gemini_client.py # Gemini prompt + JSON schema for LR datasets
│ ├── pipeline.py # LR model fitting (C-sweep, penalty comparison, SVM comparison)
│ ├── writers.py # Output file generation (CSV, JSON, PNGs, MD)
│ ├── batch.py # Batch processing mode
│ ├── lr_explorer.ipynb # Interactive student notebook
│ ├── dataset-gen-prompt.md # Gemini prompt spec used to build this generator
│ └── output/ # All generated Logistic Regression datasets
│ ├── index.md # Human-readable index of all datasets
│ ├── index.json # Machine-readable index
│ └── <dataset_slug>/ # One folder per generated dataset
│ ├── dataset.csv
│ ├── metadata.json
│ ├── regenerate.py
│ ├── datasheet.md
│ ├── boundary.png # Decision boundary line (all datasets)
│ └── coefficients.png # Signed coefficient bar chart (all datasets)
│
├── naive_bayes/ # Naive Bayes dataset generator
│ ├── __init__.py
│ ├── generate.py # CLI entry point
│ ├── gemini_client.py # Gemini prompt + JSON schema for NB datasets
│ ├── pipeline.py # NB model fitting (variant selection, correlation analysis)
│ ├── writers.py # Output file generation (CSV, JSON, PNGs, MD)
│ ├── batch.py # Batch processing mode
│ ├── nb_explorer.ipynb # Interactive student notebook
│ ├── dataset-gen-prompt.md # Gemini prompt spec used to build this generator
│ └── output/ # All generated Naive Bayes datasets
│ ├── index.md # Human-readable index of all datasets
│ ├── index.json # Machine-readable index
│ └── <dataset_slug>/ # One folder per generated dataset
│ ├── dataset.csv
│ ├── metadata.json
│ ├── regenerate.py
│ ├── datasheet.md
│ ├── distributions.png # Per-class feature distributions (all datasets)
│ └── boundary.png # Decision boundary (all datasets)
│
├── neural_network/ # Neural Network (MLP) dataset generator
│ ├── __init__.py
│ ├── generate.py # CLI entry point
│ ├── gemini_client.py # Gemini prompt + JSON schema for MLP datasets
│ ├── pipeline.py # MLP fitting (boundary postprocessing, multi-model comparisons)
│ ├── writers.py # Output file generation (CSV, JSON, PNGs, MD)
│ ├── batch.py # Batch processing mode
│ ├── mlp_explorer.ipynb # Interactive student notebook (curriculum capstone)
│ ├── dataset-gen-prompt.md # Gemini prompt spec used to build this generator
│ └── output/ # All generated Neural Network datasets
│ ├── index.md # Human-readable index of all datasets
│ ├── index.json # Machine-readable index
│ └── <dataset_slug>/ # One folder per generated dataset
│ ├── dataset.csv
│ ├── metadata.json
│ ├── regenerate.py
│ ├── datasheet.md
│ ├── boundary.png # Decision regions (all datasets)
│ ├── loss_curve.png # Training loss over iterations (all datasets)
│ └── architecture.png # Network diagram (all datasets)
│
├── utilities/ # Shared code reused by all generators
│ ├── __init__.py
│ ├── data_generator.py # generate_dataset(): Gaussian sampling with overlap
│ └── index_manager.py # update_index(): maintains index.md and index.json
│
└── docs/
└── gemini-documentation.md # Gemini API reference notes
- Clone the repository and create a virtual environment:
  python -m venv .venv
  source .venv/bin/activate # Windows: .venv\Scripts\activate
  pip install -r requirements.txt
- Create a .env file in the project root:
  GEMINI_API_KEY=your_api_key_here
Run all commands from the project root directory.
python -m svm.generate "a medical dataset about predicting diabetes risk"
python -m svm.generate "sports performance classification" --difficulty-label mid
python -m svm.generate "environmental sensor data" --overlap 0.25 --rows 300 --seed 42
python -m svm.generate "two-class industrial fault detection" --force-2d
Options:

| Flag | Description |
|---|---|
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: great=0.1, mid=0.5, bad=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |

python -m svm.generate --batch batch_examples.json
The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed.
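A config file can be sanity-checked against this key list before a long batch run. The sketch below is not part of the generators: `check_batch` and `ALLOWED_KEYS` are hypothetical names, and the rules (required prompt, known keys only, overlap within 0.0–1.0) are taken from the descriptions in this README rather than from the actual batch.py.

```python
import json

# Keys this README lists for the SVM batch format (other generators accept more).
ALLOWED_KEYS = {"prompt", "overlap", "difficulty_label", "rows", "features", "force_2d", "seed"}


def check_batch(entries):
    """Return a list of problems found; an empty list means the config looks usable."""
    if not isinstance(entries, list):
        return ["top level must be a JSON array"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be an object")
            continue
        prompt = entry.get("prompt")
        if not isinstance(prompt, str) or not prompt.strip():
            problems.append(f"entry {i}: 'prompt' is required")
        unknown = sorted(set(entry) - ALLOWED_KEYS)
        if unknown:
            problems.append(f"entry {i}: unknown keys {unknown}")
        overlap = entry.get("overlap")
        if overlap is not None:
            if not isinstance(overlap, (int, float)) or not 0.0 <= overlap <= 1.0:
                problems.append(f"entry {i}: overlap must be a number in 0.0-1.0")
    return problems


# One valid entry and one with a missing prompt and a misspelled key.
sample = json.loads("""
[
  {"prompt": "diabetes diagnosis from blood markers", "difficulty_label": "great", "seed": 42},
  {"difficulty_label": "mid", "rows": 300, "overlapp": 0.2}
]
""")
problems = check_batch(sample)
for line in problems:
    print(line)
```

In practice the list would come from `json.load()` on the batch file. An example config for the SVM generator follows.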
[
{ "prompt": "diabetes diagnosis from blood markers", "difficulty_label": "great", "seed": 42 },
{ "prompt": "elite vs recreational runner classification", "difficulty_label": "mid", "rows": 300 }
]
Run all commands from the project root directory.
python -m knn.generate "a medical dataset about predicting diabetes risk"
python -m knn.generate "neighborhood classification of housing types" --difficulty-label mid
python -m knn.generate "environmental sensor readings" --overlap 0.3 --rows 300 --seed 42
python -m knn.generate "complex non-linear boundary dataset" --force-2d
Options:

| Flag | Description |
|---|---|
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: great=0.1, mid=0.4, bad=0.75 (tighter than SVM) |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |

python -m knn.generate --batch batch_knn_examples.json
The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed.
[
{ "prompt": "housing type classification by neighborhood metrics", "difficulty_label": "mid", "seed": 1, "force_2d": true },
{ "prompt": "tissue sample cancer classification", "difficulty_label": "mid", "seed": 2, "features": 8 }
]
Run all commands from the project root directory.
python -m decision_tree.generate "a medical dataset about predicting diabetes risk"
python -m decision_tree.generate "loan default prediction" --difficulty-label mid
python -m decision_tree.generate "plant classification from measurements" --boundary-angle 0.0 --true-depth 2
python -m decision_tree.generate "noisy sensor fault detection" --overlap 0.4 --compare-rf
Options:

| Flag | Description |
|---|---|
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: great=0.1, mid=0.5, bad=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
| `--boundary-angle FLOAT` | 0.0=axis-aligned (tree's best case), 1.0=fully diagonal (staircase required) |
| `--true-depth INT` | Number of sequential splits the true boundary requires (1–6) |
| `--compare-rf` | Also fit a RandomForestClassifier and produce rf_comparison.png |

python -m decision_tree.generate --batch batch_dt_examples.json
The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, boundary_angle, true_depth, compare_rf.
[
{ "prompt": "fruit quality sorting based on a single clear chemical threshold", "difficulty_label": "great", "seed": 1, "force_2d": true },
{ "prompt": "customer churn where the boundary is diagonal through usage-satisfaction space", "boundary_angle": 0.9, "seed": 2, "compare_rf": true }
]
Run all commands from the project root directory.
python -m logistic_regression.generate "a medical dataset about predicting diabetes risk"
python -m logistic_regression.generate "loan default prediction" --difficulty-label mid
python -m logistic_regression.generate "linearly separable species classification" --nonlinearity 0.0
python -m logistic_regression.generate "circular boundary dataset" --nonlinearity 0.8 --separation 0.3
python -m logistic_regression.generate "sparse feature selection demo" --multiclass --compare-svm
Options:

| Flag | Description |
|---|---|
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: great=0.1, mid=0.5, bad=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
| `--nonlinearity FLOAT` | 0.0=linear boundary (LR's best case), 1.0=fully non-linear (circular/XOR) |
| `--separation FLOAT` | 0.0=low-confidence probabilities (0.4–0.6 range), 1.0=high-confidence (near 0 and 1) |
| `--multiclass` | Generate 3-class dataset using one-vs-rest strategy |
| `--compare-svm` | Also fit a LinearSVC and overlay its boundary on boundary.png |

python -m logistic_regression.generate --batch batch_lr_examples.json
The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, nonlinearity, separation, multiclass, compare_svm.
[
{ "prompt": "diabetes diagnosis from blood markers", "difficulty_label": "great", "seed": 1, "force_2d": true },
{ "prompt": "gene expression cancer classification", "features": 6, "seed": 2 },
{ "prompt": "three-class species classifier", "multiclass": true, "seed": 3 }
]
Run all commands from the project root directory.
python -m naive_bayes.generate "a medical symptom dataset for disease diagnosis"
python -m naive_bayes.generate "email spam classification" --variant multinomial
python -m naive_bayes.generate "sensor fault detection" --correlation 0.0 --difficulty-label great
python -m naive_bayes.generate "correlated financial indicators" --correlation 0.8 --compare-lr
python -m naive_bayes.generate "binary feature presence dataset" --variant bernoulli
Options:

| Flag | Description |
|---|---|
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: great=0.1, mid=0.5, bad=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
| `--variant STR` | NB variant: gaussian (default), multinomial, or bernoulli |
| `--correlation FLOAT` | Feature correlation 0.0=independent (NB's best case), 1.0=fully correlated |
| `--compare-lr` | Also fit a LogisticRegression and overlay its boundary on boundary.png |

If --variant is not provided, Gemini selects the most appropriate variant for the scenario.
python -m naive_bayes.generate --batch batch_nb_examples.json
The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, variant, correlation, compare_lr.
[
{ "prompt": "medical symptom classification with independent features", "variant": "gaussian", "seed": 1 },
{ "prompt": "spam detection from word counts", "variant": "multinomial", "seed": 2 },
{ "prompt": "binary keyword presence classifier", "variant": "bernoulli", "seed": 3, "compare_lr": true }
]
python -m neural_network.generate "a spiral classification dataset"
python -m neural_network.generate "XOR-like boundary dataset" --force-2d --layers 2 --neurons 32
python -m neural_network.generate "complex sensor data" --compare-all --seed 42
python -m neural_network.generate "medical diagnosis classification" --difficulty-label mid --layers 3
python -m neural_network.generate --batch batch_mlp_examples.json

| Argument | Type | Description |
|---|---|---|
| `prompt` | string | Natural language description of the dataset to generate |
| `--rows INT` | int | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | float | Class overlap from 0.0 (separated) to 1.0 (mixed). Overrides `--difficulty-label` |
| `--difficulty-label LABEL` | great/mid/bad | Difficulty alias: great=0.1, mid=0.5, bad=0.9 |
| `--features INT` | int | Number of features (overrides Gemini's choice) |
| `--force-2d` | flag | Force exactly 2 features for clean 2D boundary visualization |
| `--seed INT` | int | Random seed for reproducibility |
| `--batch FILE` | path | Path to a JSON batch config file |
| `--layers INT` | int | Number of hidden layers (1–5); overrides Gemini's suggestion |
| `--neurons INT` | int | Neurons per hidden layer; overrides Gemini's suggestion |
| `--compare-all` | flag | Fit all five prior algorithms and include comparison in metadata and datasheet |

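The interaction between `--overlap` and `--difficulty-label` can be pictured as a small lookup with a precedence rule. This is a minimal sketch, not the generators' code: `resolve_overlap` and `DIFFICULTY_OVERLAP` are hypothetical names, the label values are copied from the option tables in this README, and the precedence of an explicit overlap is documented here only for the neural network generator, so applying it to the others is an assumption.

```python
# Label-to-overlap shorthands as listed in this README's option tables.
DIFFICULTY_OVERLAP = {
    "default": {"great": 0.1, "mid": 0.5, "bad": 0.9},
    "knn": {"great": 0.1, "mid": 0.4, "bad": 0.75},  # KNN uses tighter values
}


def resolve_overlap(algorithm, overlap=None, difficulty_label=None):
    """Pick the overlap to use; an explicit overlap wins over a difficulty label."""
    if overlap is not None:
        if not 0.0 <= overlap <= 1.0:
            raise ValueError("overlap must be between 0.0 and 1.0")
        return overlap
    if difficulty_label is not None:
        table = DIFFICULTY_OVERLAP.get(algorithm, DIFFICULTY_OVERLAP["default"])
        return table[difficulty_label]
    # Assumption: with neither flag set, the value is decided elsewhere.
    return None


print(resolve_overlap("neural_network", difficulty_label="mid"))
print(resolve_overlap("knn", difficulty_label="mid"))
print(resolve_overlap("neural_network", overlap=0.25, difficulty_label="bad"))
```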
python -m neural_network.generate --batch batch_mlp_examples.json
The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, layers, neurons, compare_all.
[
{ "prompt": "spiral classification dataset", "force_2d": true, "rows": 500, "seed": 1 },
{ "prompt": "XOR pattern data", "layers": 1, "neurons": 2, "force_2d": true, "seed": 2 },
{ "prompt": "interleaved crescents boundary", "force_2d": true, "compare_all": true, "seed": 7 }
]
Each generated dataset is saved to the output/<dataset_slug>/ folder of its generator (for example svm/output/<dataset_slug>/, knn/output/<dataset_slug>/, or decision_tree/output/<dataset_slug>/) and contains:

| File | Description |
|---|---|
| `dataset.csv` | The dataset with feature columns and a label column |
| `metadata.json` | Full spec, model results, generation parameters (includes knn_concept_focus for KNN, dt_concept_focus for DT) |
| `regenerate.py` | Self-contained script to recreate dataset.csv exactly (numpy + pandas only, no API call) |
| `datasheet.md` | Educator-facing documentation: composition, difficulty, model results, teaching notes |
| `visualization.png` | 2×2 subplot grid (KNN) or scatter/pairplot (SVM) |
| `tree.png` | Decision tree diagram at suggested_max_depth (Decision Tree only) |
| `depth_and_pruning.png` | Depth curve + pruning curve (Decision Tree only) |
| `feature_importance.png` | Feature importance bar chart (Decision Tree only) |
| `boundary.png` | Axis-aligned decision regions with true split overlays (Decision Tree, 2D only) |
| `rf_comparison.png` | Random Forest vs Decision Tree comparison (Decision Tree, `--compare-rf` only) |

A running index is maintained in the output/ subfolder of each algorithm's folder.
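The index files are the intended way to browse generated datasets. When only the folder layout is at hand, a directory scan gives a similar list. The sketch below relies only on the layout described in this README (one folder per dataset containing dataset.csv and metadata.json); `list_datasets` is a hypothetical helper, the slugs are made up, and nothing is assumed about the schema of index.json.

```python
import tempfile
from pathlib import Path


def list_datasets(output_dir):
    """Return the names of dataset folders that hold both dataset.csv and metadata.json."""
    return sorted(
        folder.name
        for folder in Path(output_dir).iterdir()
        if folder.is_dir()
        and (folder / "dataset.csv").is_file()
        and (folder / "metadata.json").is_file()
    )


# Demo on a throwaway folder that mimics an output/ directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for slug in ("runner_types", "diabetes_risk"):  # hypothetical dataset slugs
        (root / slug).mkdir()
        (root / slug / "dataset.csv").write_text("x1,x2,label\n")
        (root / slug / "metadata.json").write_text("{}")
    (root / "index.md").write_text("# index\n")  # plain files are skipped
    (root / "incomplete").mkdir()  # folder without the expected files is skipped
    found = list_datasets(root)

print(found)
```

Against a real checkout the call would be something like `list_datasets("svm/output")`, run from the project root.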
svm/svm_explorer.ipynb is a scaffolded, hands-on learning environment for students.
To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in svm/output/.
The notebook covers:
- Data exploration and visualization
- Train/test split and StandardScaler normalization
- SVM hyperparameter tuning (kernel, C, gamma, degree)
- Decision boundary visualization
- Kernel comparison (linear, rbf, poly, sigmoid)
- C × gamma accuracy heatmap
knn/knn_explorer.ipynb is a scaffolded notebook for KNN exploration.
To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in knn/output/.
The notebook covers:
- Data exploration and class balance analysis
- Train/test split with raw vs scaled feature range comparison
- KNN fitting with adjustable K
- Accuracy vs K curve with train/test comparison
- Decision boundary visualization (2D) or PCA projection (3D+)
- Nearest neighbor inspection with color-coded neighbor lines
- Learning curve analysis
- Conditional advanced sections: curse of dimensionality, feature scale, distance metric comparison
- SVM comparison cell with discussion prompt
decision_tree/dt_explorer.ipynb is a scaffolded notebook for Decision Tree exploration.
To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in decision_tree/output/.
The notebook covers:
- Data exploration and feature type identification (numerical vs categorical)
- Train/test split with a note explaining why StandardScaler is NOT needed for decision trees
- Decision tree fitting and classification report
- Tree diagram generation with `plot_tree()`: students trace the path by hand
- Depth curve: sweep max_depth, identify the overfitting cliff
- Pruning with `ccp_alpha`: find the optimal pruning level
- Feature importance bar chart with signal vs noise feature annotations
- Boundary visualization (2D): staircase boundary with true split overlays
- Tree instability demo: refit on 5 random splits, observe root split variation
- Random Forest extension: compare accuracy and feature importances to a single tree
- Algorithm comparison cell: Decision Tree vs SVM vs KNN accuracy bar chart
naive_bayes/nb_explorer.ipynb is a scaffolded notebook for Naive Bayes exploration.
To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in naive_bayes/output/.
The notebook covers:
- Data exploration with class balance and feature type detection (continuous/counts/binary)
- Train/test split with preprocessing notes by variant (GaussianNB: no scaling; Multinomial: non-negative; Bernoulli: binary)
- Naive Bayes fitting with all three variants and classification report
- Per-class distribution visualization: the signature NB output showing what the model knows internally
- Independence assumption analysis: feature correlation matrix and discussion of double-counting
- Probability output exploration: histogram comparison showing overconfidence when features are correlated
- Accuracy vs calibration: calibration curve showing NB can be right for the wrong reasons
- Laplace smoothing demo (MultinomialNB/BernoulliNB): trigger zero-probability error, then fix it
- Variant comparison: fit all three NB variants on the same data
- Prior probability manipulation: adjust class priors and observe accuracy changes per class
- Logistic Regression comparison: probability histogram overlay showing NB overconfidence vs LR calibration
- Decision Tree comparison: accuracy comparison with discussion of rules vs distributions
neural_network/mlp_explorer.ipynb is a scaffolded capstone notebook for Neural Network (MLP) exploration.
To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in neural_network/output/.
The notebook covers:
- Data exploration and class balance analysis
- Train/test split with StandardScaler — with a note explaining why scaling is required for MLP (unlike DT and NB)
- MLP fitting with `MLPClassifier` and full classification report
- Loss curve visualization: plot `loss_curve_` and identify convergence point
- Architecture diagram: count total weights and discuss parameter-to-data ratio
- Decision boundary visualization (2D) or PCA projection (3D+)
- Architecture sweep: from `(4,)` to `(64,64,64)`; plot accuracy vs parameter count
- Activation comparison: relu / tanh / sigmoid; compare loss curves and boundaries
- Overfitting demo: large network on small data subset; observe train/test accuracy gap
- Black box exploration: inspect `mlp.coefs_[0]` and compare interpretability to Decision Tree
- All-algorithm comparison: fit all 6 algorithms; horizontal bar chart
- Curriculum capstone reflection: 5 synthesis questions linking MLP back to all prior algorithms
- Next steps: PyTorch, Keras, TensorFlow pointers for production-scale deep learning
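The parameter counts used in the architecture diagram and architecture sweep steps can be worked out by hand for a fully connected network. The sketch below is an illustration rather than notebook code: `count_parameters` is a hypothetical helper, it counts biases as well as weights, and it assumes a 2-feature binary dataset with a single output unit.

```python
def count_parameters(n_features, hidden_layers, n_outputs=1):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    sizes = [n_features, *hidden_layers, n_outputs]
    # Each layer contributes fan_in * fan_out weights and fan_out biases.
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


small = count_parameters(2, (4,))
large = count_parameters(2, (64, 64, 64))
rows = 500  # dataset size used in one of the batch examples above

print(small, large)
print(round(large / rows, 2), "parameters per data row for the large network")
```

With 2 input features, `(4,)` has 17 parameters and `(64,64,64)` has 8577, so on a 500-row dataset the large network has roughly 17 parameters per row, which is the kind of ratio the overfitting demo is built around.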
The Decision Tree generator supports eight dt_concept_focus values. Each one drives the Gemini prompt design, DT-specific post-processing, pipeline comparisons, and datasheet content.

| Focus | What it teaches |
|---|---|
| `axis_aligned_boundary` | Tree's best case: true boundary is a clean threshold on 1–2 features; tree finds it in minimal splits |
| `diagonal_boundary` | Tree's core weakness: diagonal boundary requires a staircase approximation visible in boundary.png |
| `overfitting_depth` | Unconstrained tree memorizes training data; the depth curve's overfitting cliff is directly observable |
| `feature_importance` | Strong contrast in feature signal strength; tree exposes which features matter and which are noise |
| `categorical_structure` | Hierarchical if-then structure matching real-world categorical logic; tree splits map to readable rules |
| `noisy_threshold` | Noise near the decision boundary causes unstable splits; labels are flipped near threshold values |
| `deep_structure` | True boundary requires 4–5 sequential splits; students observe why shallow trees fail |
| `ensemble_motivation` | Single tree is unstable; Random Forest comparison shows how averaging trees reduces instability |

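The contrast between axis_aligned_boundary and diagonal_boundary shows up even with a single split. The sketch below is a toy illustration, not generator code: it brute-forces the best depth-1 split (`best_stump_accuracy` is a hypothetical helper) on a regular 20×20 grid, once with an axis-aligned true boundary and once with a diagonal one.

```python
def best_stump_accuracy(points, labels):
    """Best accuracy reachable by one axis-aligned threshold on either feature."""
    best = 0.0
    for feature in (0, 1):
        for threshold in sorted({p[feature] for p in points}):
            predicted = [p[feature] >= threshold for p in points]
            correct = sum(pred == y for pred, y in zip(predicted, labels))
            # A stump may assign either class to either side of the split.
            accuracy = max(correct, len(points) - correct) / len(points)
            best = max(best, accuracy)
    return best


n = 20
points = [((i + 0.5) / n, (j + 0.5) / n) for i in range(n) for j in range(n)]
axis_aligned = [x > 0.5 for x, _ in points]  # boundary is a threshold on one feature
diagonal = [x > y for x, y in points]        # boundary cuts across both features

axis_score = best_stump_accuracy(points, axis_aligned)
diagonal_score = best_stump_accuracy(points, diagonal)
print(axis_score, diagonal_score)
```

One split recovers the axis-aligned boundary exactly, while the best single split on the diagonal data gets only three quarters of the grid right. Each extra level of depth adds another step to the staircase.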
batch_dt_examples.json contains one entry for each concept focus. Run all eight at once with:
python -m decision_tree.generate --batch batch_dt_examples.json
The KNN generator supports ten knn_concept_focus values. Each one drives the Gemini prompt design, any KNN-specific post-processing applied to the dataset, what comparisons the pipeline runs, and what the educator's datasheet emphasises.

| Focus | What it teaches |
|---|---|
| `k_effect` | How K controls the bias-variance tradeoff; suggested K is intentionally suboptimal |
| `curse_of_dimensionality` | Distance becomes meaningless in high dimensions; use 6–10 features |
| `noise_sensitivity` | Deep-interior outliers poison local neighborhoods more than boundary noise |
| `class_imbalance` | Minority class gets outvoted by neighbors; one class is 3–4× larger |
| `nonlinear_boundary` | Irregular, jagged boundary KNN handles naturally but SVM cannot |
| `multimodal_classes` | Each class exists in 2+ disconnected regions; no global boundary works |
| `disconnected_regions` | One class forms an island entirely surrounded by the other class |
| `feature_scale` | Features at different scales distort Euclidean distance; scaling is critical |
| `irrelevant_features` | 1–2 features carry no class signal; students discover which features matter |
| `distance_metric` | Correlated features distort Euclidean distance but not Manhattan |

batch_knn_examples.json contains one entry for each concept focus, in order. Run all ten at once with:
python -m knn.generate --batch batch_knn_examples.json
Each entry's prompt is written to naturally describe the target concept without naming it, so Gemini selects the right knn_concept_focus from context. The parameters (force_2d, features, rows, difficulty_label) are chosen to make the concept effect as clear as possible.
1. k_effect — seed 1, force_2d, mid
"a wine quality dataset classifying wines as premium or standard based on two chemical measurements, where using too few neighbors picks up noise from individual outlier bottles and using too many neighbors averages away the subtle local patterns that distinguish quality tiers — making the choice of K critical to accuracy"
force_2d keeps the decision boundary visible and the K-sweep curve interpretable. mid difficulty (overlap=0.4) creates a zone where K matters — too small and the model overfits to individual points; too large and it smooths away real signal. The prompt explicitly names both failure modes so Gemini understands the data needs enough noise that K=1 underperforms. Gemini is instructed to set a suggested_k that is intentionally suboptimal, so students can find a better K themselves using the accuracy-vs-K curve.
2. curse_of_dimensionality — seed 2, features: 8, rows: 300, mid
"a drug screening dataset classifying compounds as active or inactive based on many molecular descriptors, where adding more features makes distance-based classification progressively less reliable because points become nearly equidistant from each other in high-dimensional space"
Eight features push KNN into the regime where Euclidean distance starts to lose meaning — all points are roughly equidistant in high dimensions, so the notion of "nearest neighbour" breaks down. The prompt explicitly names the equidistance effect to prevent Gemini from choosing irrelevant_features instead. rows: 300 is set deliberately larger than the default range to give the K-sweep enough data to show a meaningful trend, and to let students run the notebook's curse demo (adding features one at a time and watching accuracy fall).
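The equidistance effect can be seen without any ML library. The sketch below is a toy illustration under stated assumptions: points are drawn uniformly from the unit hypercube (not from the generator's Gaussian sampler), and `relative_contrast` is a hypothetical helper measuring how much farther the farthest point is than the nearest one.

```python
import math
import random


def relative_contrast(n_features, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_features)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_features)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# The contrast shrinks as features are added: neighbours become less distinct.
for n_features in (2, 8, 100):
    print(n_features, round(relative_contrast(n_features), 2))
```

A large contrast means the nearest neighbour is much closer than a typical point. As the contrast falls toward zero, "nearest" carries less and less information.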
3. noise_sensitivity — seed 3, force_2d, great
"a quality control dataset for manufacturing where a few mislabeled defective parts were accidentally placed deep in the passing-product storage region"
great difficulty (overlap=0.1) makes the two classes well-separated overall, so the only hard-to-classify points are the deep-interior outliers inserted by the post-processor. This isolates the lesson: KNN is hypersensitive to outliers that land far inside the wrong class's territory, because they corrupt every query point's neighbourhood in that region. Boundary noise matters less — KNN can simply average it away.
4. class_imbalance — seed 4, rows: 400, mid
"a clinical trial dataset where one treatment group is three times larger than the control group, predicting whether patients respond well or poorly to a new drug"
rows: 400 ensures the minority class still has enough members (~100) to produce a meaningful class-level accuracy comparison. The 3:1 imbalance means that for any query point near the boundary, the majority class wins by default, so students see the minority precision/recall collapse while overall accuracy looks deceptively high. No force_2d is needed because imbalance is a counting phenomenon, not a geometric one; the effect appears regardless of dimensionality.
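The gap between overall accuracy and minority recall is easiest to see in the limiting case where every prediction is the majority class. This is a worked arithmetic sketch with made-up labels, not pipeline output; a real KNN fit sits somewhere between this extreme and a balanced classifier.

```python
# 400 rows at a 3:1 ratio: 300 majority rows and 100 minority rows.
labels = ["majority"] * 300 + ["minority"] * 100

# Limiting case: the majority class wins every vote.
predictions = ["majority"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_hits = sum(p == y == "minority" for p, y in zip(predictions, labels))
minority_recall = minority_hits / labels.count("minority")

print(f"accuracy={accuracy:.2f}  minority recall={minority_recall:.2f}")
```

Accuracy is 0.75 even though not a single minority row is recovered, which is why the per-class metrics matter for this dataset.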
5. nonlinear_boundary — seed 5, force_2d, mid
"a geology dataset classifying two rock types based on two soil measurements, where the boundary between them follows a complex winding path rather than any straight line or smooth curve — the classes intermingle along a jagged frontier but neither forms separate disconnected clusters"
force_2d is essential here — the whole point is to visualise a boundary that no straight line or smooth curve could separate cleanly. The prompt explicitly rules out disconnected clusters to prevent Gemini from choosing multimodal_classes instead. KNN handles it naturally because it makes purely local decisions, never committing to a global shape. The mid difficulty adds some noise to prevent the boundary from being trivially jagged. The datasheet includes a note on how a linear SVM would fail and an rbf SVM might partially succeed.
6. multimodal_classes — seed 6, force_2d, great
"a geology dataset classifying rock formations where two mineral types each appear in multiple disconnected deposits across a survey region based on seismic and magnetic readings"
great difficulty (overlap=0.1) ensures each mode is a tight, clearly visible cluster. The pedagogical goal is to show that even when the classes are individually easy to identify, a global boundary (linear SVM, logistic regression) cannot work because each class occupies multiple disconnected regions of feature space. KNN handles it gracefully by consulting local neighbours. force_2d lets students see all four or more clusters at once.
7. disconnected_regions — seed 7, force_2d, mid
"a marine biology dataset where a rare reef fish species forms isolated colonies surrounded entirely by a dominant predator species, classified by water temperature and depth"
The post-processor physically relocates a pocket of Class B points to sit entirely inside Class A's territory. Unlike multimodal_classes (where both classes have multiple modes), here only one class forms an island inside the other — making it topologically impossible for a convex decision boundary to succeed. mid difficulty adds some background noise so the island is not surrounded by a perfectly clean moat. force_2d is required to see the island structure.
8. feature_scale — seed 8, force_2d, great
"a sensor dataset from two types of industrial machines where one sensor reports in millivolts and another reports in kilopascals, making raw distance-based classification unreliable"
great difficulty means the classes are genuinely separable — but only once features are on the same scale. The post-processor enforces that one feature's range is at least 10× the other's, so raw Euclidean distance is dominated by the large-scale feature and the small-scale feature is ignored entirely. The pipeline runs a feature_scale_comparison: unscaled KNN vs StandardScaler KNN. Students see a large accuracy gap that closes when scaling is applied. force_2d keeps the distortion visible.
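The effect of scale on neighbour lookup can be reproduced by hand on four points. The sketch below is illustrative only: the values are invented, the class signal is placed in the small-range feature by assumption, and standardization is done with the population standard deviation, which is what StandardScaler uses.

```python
import math
from statistics import mean, pstdev


def nearest_label(train, query):
    """Label of the single nearest training point (1-NN, Euclidean distance)."""
    return min(train, key=lambda row: math.dist(row[0], query))[1]


def make_scaler(points):
    """Return a function that z-scores a point using the training means and stds."""
    columns = list(zip(*points))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return lambda p: tuple((v - m) / s for v, m, s in zip(p, means, stds))


# (small-range feature, large-range feature); only the first carries class signal.
train = [
    ((0.1, 200.0), "A"),
    ((0.2, 800.0), "A"),
    ((0.9, 250.0), "B"),
    ((0.8, 750.0), "B"),
]
query = (0.15, 260.0)  # first feature says class A

raw_label = nearest_label(train, query)

scale = make_scaler([point for point, _ in train])
scaled_train = [(scale(point), label) for point, label in train]
scaled_label = nearest_label(scaled_train, scale(query))

print("raw:", raw_label, "scaled:", scaled_label)
```

On the raw values the large-range feature decides the neighbour and the query is labelled B. After scaling, the informative feature counts again and the label flips to A.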
9. irrelevant_features — seed 9, features: 5, mid
"a patient health dataset with several physiological measurements for predicting hypertension risk, where some measurements are medically irrelevant noise features"
Five features with 1–2 flagged is_irrelevant: true in the spec. The irrelevant features add random noise to every distance calculation, degrading KNN performance. Students are prompted to compare accuracy when those features are dropped. The educator's datasheet identifies which features are irrelevant and why. No force_2d — you need more than 2 features to demonstrate the effect, and the notebook's feature-selection section works across any dimensionality.
10. distance_metric — seed 10, force_2d, mid
"a financial dataset classifying loan applicants as low-risk or high-risk where income and credit utilization are strongly correlated, making Euclidean distance misleading"
Strongly correlated features create an elongated point cloud. Euclidean distance treats both axes as independent and equally informative, so it measures "closeness" along a diagonal that ignores the correlation structure. Manhattan distance (p=1) is less affected. The pipeline runs a distance_metric_comparison at suggested_k, showing Euclidean vs Manhattan accuracy. force_2d lets students see the correlation ellipse and understand why the metric matters.
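To see why the choice of metric can change which points count as "near", the sketch below compares the two metrics on a pair of hand-picked points. The points and the helper names `minkowski` and `nearest` are assumptions made for illustration; this is not the pipeline's distance_metric_comparison.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def nearest(query, candidates, p):
    """Name of the candidate closest to `query` under the order-p metric."""
    return min(candidates, key=lambda name: minkowski(query, candidates[name], p))


# Hypothetical points: one neighbor lies along the diagonal, one along an axis.
query = (0.0, 0.0)
candidates = {"diagonal": (3.0, 3.0), "axis": (0.0, 5.0)}

for p, label in [(2, "Euclidean"), (1, "Manhattan")]:
    dists = {name: round(minkowski(query, pt, p), 3) for name, pt in candidates.items()}
    print(f"{label:9s} distances={dists} nearest={nearest(query, candidates, p)}")
```

Under Euclidean distance the diagonal point is closer (about 4.243 versus 5), while under Manhattan distance the axis point is closer (5 versus 6). With correlated features most neighbors lie along a diagonal, which is where the two metrics disagree most.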
The Logistic Regression generator supports eight lr_concept_focus values. Each one drives the Gemini prompt design, LR-specific post-processing, pipeline comparisons, and datasheet content.
| Focus | What it teaches |
|---|---|
| `linear_boundary` | LR's best case: true boundary is linear, coefficients are clean and interpretable |
| `nonlinear_boundary` | LR's core weakness: circular or XOR boundary; best-fit line is visibly inadequate |
| `probability_confidence` | Contrast between high-confidence (near 0/1) and low-confidence (near 0.5) predictions |
| `coefficient_interpretation` | Features with meaningful, domain-coherent signed weights; coefficients tell a story |
| `l1_regularization` | Several irrelevant features; L1 drives their coefficients to exactly zero |
| `l2_regularization` | Correlated features; L2 shrinks coefficients uniformly; contrast with L1 sparsity |
| `perfect_separation` | Perfectly separable data exposing coefficient divergence; regularization as the fix |
| `class_imbalance` | Majority class dominates probability outputs; minority class gets poor calibration |
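
The coefficient divergence behind `perfect_separation` can be seen with a few lines of plain gradient ascent. This is a minimal sketch under stated assumptions: a one-feature logistic regression with no intercept, a hand-written optimizer (`fit_weight` is a hypothetical helper, not part of the generator), and made-up data and step sizes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weight(xs, ys, steps, l2=0.0, lr=0.5):
    """Gradient ascent on the (optionally L2-penalized) log-likelihood of a
    one-feature, no-intercept logistic regression. Returns the final weight."""
    w = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys)) - l2 * w
        w += lr * grad
    return w


# Perfectly separable toy data: negative x is class 0, positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for steps in (500, 5000):
    plain = fit_weight(xs, ys, steps)
    ridge = fit_weight(xs, ys, steps, l2=0.1)
    print(f"steps={steps:5d}  unregularized w={plain:.3f}  L2-penalized w={ridge:.3f}")
```

Without a penalty the weight keeps growing the longer the optimizer runs, because a steeper sigmoid always fits separable data slightly better. With the L2 term the weight settles at a finite value.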
batch_lr_examples.json contains one entry for each concept focus. Run all eight at once with:

    python -m logistic_regression.generate --batch batch_lr_examples.json

The Naive Bayes generator supports eight nb_concept_focus values. Each one drives the Gemini prompt design, variant-specific data generation, pipeline analysis, and datasheet content.
| Focus | What it teaches |
|---|---|
| `independence_holds` | Features are truly independent; NB assumption is valid; fast, accurate, and well-calibrated |
| `independence_violated` | Strongly correlated features; NB double-counts; probabilities overconfident despite decent accuracy |
| `naive_works_anyway` | Moderate correlation; NB gets the right label despite technically wrong probabilities |
| `gaussian_fit` | GaussianNB only; per-class feature distributions are cleanly Gaussian; model fits them correctly |
| `distribution_mismatch` | GaussianNB on skewed or bimodal features; Gaussian assumption breaks; likelihoods miscalculated |
| `prior_dominance` | Class imbalance so strong the prior overwhelms the likelihood; minority class rarely predicted |
| `zero_frequency` | MultinomialNB or BernoulliNB only; some feature values absent from training; Laplace smoothing required |
| `high_dimensional` | Many features with sparse signal; demonstrates NB's strength relative to more complex models |
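
The `zero_frequency` problem and its Laplace-smoothing fix come down to simple arithmetic. The sketch below uses hypothetical word counts and a hypothetical helper, `likelihoods`; it illustrates additive smoothing in general and is not taken from the generator's pipeline.

```python
from collections import Counter


def likelihoods(counts, vocabulary, alpha=0.0):
    """P(word | class) from raw counts with additive (Laplace) smoothing alpha."""
    total = sum(counts[word] for word in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denom for word in vocabulary}


# Hypothetical training counts for one class; "meeting" never appears in it.
vocabulary = ["offer", "free", "meeting"]
spam_counts = Counter({"offer": 3, "free": 2})

raw = likelihoods(spam_counts, vocabulary, alpha=0.0)
smoothed = likelihoods(spam_counts, vocabulary, alpha=1.0)

# A document's class score is a product of per-word likelihoods, so a single
# zero wipes out all other evidence.
document = ["offer", "free", "meeting"]
raw_score = 1.0
smoothed_score = 1.0
for word in document:
    raw_score *= raw[word]
    smoothed_score *= smoothed[word]

print("unsmoothed:", raw, "score =", raw_score)
print("smoothed:  ", smoothed, "score =", smoothed_score)
```

With alpha = 1 the unseen word gets probability (0 + 1) / (5 + 3) = 0.125 instead of 0, so the class is no longer ruled out by one missing count.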
batch_nb_examples.json contains one entry for each concept focus. Run all eight at once with:

    python -m naive_bayes.generate --batch batch_nb_examples.json

The Neural Network generator supports eight mlp_concept_focus values. Each one drives the Gemini prompt design, boundary postprocessing, pipeline behavior, visualizations, and datasheet content.
| Focus | What it teaches |
|---|---|
| `universal_approximation` | MLP can learn any boundary given sufficient capacity; non-linear dataset proves the point |
| `underfitting_architecture` | Too few neurons/layers cannot capture the true boundary; accuracy plateau is visible in the loss curve |
| `overfitting_small_data` | Large network on small data memorizes training labels; train/test accuracy gap is pronounced |
| `activation_comparison` | relu / tanh / sigmoid produce different loss curves and boundaries; relu is the practical default |
| `training_instability` | High overlap + large learning rate causes oscillating loss curve; convergence is not guaranteed |
| `black_box_contrast` | MLP accuracy vs Decision Tree interpretability; weights are unreadable, tree rules are not |
| `depth_necessity` | Complex boundary requires multiple layers; shallow network fails; deep network succeeds |
| `simple_data_overkill` | Linearly separable data where Logistic Regression matches MLP; extra capacity adds no benefit |
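
The gap between a linear model and a network with a hidden layer can be shown on XOR, the smallest non-linear boundary. In this sketch the network weights are set by hand rather than trained, and the search over linear units is a coarse brute-force grid; both are illustrative assumptions, not how the generator fits its models.

```python
from itertools import product


def relu(z):
    return max(0.0, z)


def tiny_mlp(x1, x2):
    """Hand-set 2-2-1 ReLU network (weights chosen by hand, not trained)."""
    h1 = relu(x1 + x2)          # grows with the number of active inputs
    h2 = relu(x1 + x2 - 1.0)    # non-zero only when both inputs are active
    return h1 - 2.0 * h2        # 1 when exactly one input is active, else 0


def linear_unit(x1, x2, w1, w2, b):
    """Single linear threshold unit: the boundary family a linear model has."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


xor_points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

mlp_correct = all(tiny_mlp(*x) == y for x, y in xor_points.items())

# Brute-force a coarse grid of linear weights; none classifies all four points.
grid = [i / 2 for i in range(-10, 11)]
linear_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(linear_unit(*x, w1, w2, b) == y for x, y in xor_points.items())
]

print("hidden-layer network solves XOR:", mlp_correct)
print("linear units on the grid that solve XOR:", len(linear_solutions))
```

The empty result is not an artifact of the grid. The four XOR constraints contradict each other for any linear unit, which is why at least one hidden layer is needed.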
batch_mlp_examples.json contains one entry for each concept focus. Run all eight at once with:

    python -m neural_network.generate --batch batch_mlp_examples.json

The six generators form a progressive curriculum. Each algorithm is introduced in a context where the previous one fails:
| Step | Algorithm | Key limitation exposed | Next algorithm's answer |
|---|---|---|---|
| 1 | Logistic Regression | Linear boundary only | SVM: maximise margin, use kernels |
| 2 | SVM | Black-box kernel; distance-based | KNN: purely local, no global model |
| 3 | KNN | Sensitive to scale, irrelevant features, high dims | Decision Tree: rule-based, interpretable |
| 4 | Decision Tree | Unstable splits; staircase boundary | Naive Bayes: probabilistic, low variance |
| 5 | Naive Bayes | Independence assumption; overconfident probs | MLP: learns feature interactions, calibrated |
| 6 | Neural Network (MLP) | Black box; needs data and tuning | (open question: deep learning, ensembles) |
The --compare-all flag on the Neural Network generator fits all five prior algorithms on the same dataset and writes the full comparison to metadata.json and datasheet.md, closing the curriculum loop.
This project follows an algorithm-per-folder pattern. Each algorithm lives in its own top-level folder and is a self-contained Python package. The utilities/ folder holds logic shared across all generators.
Create a new folder (e.g., knn/) with this layout:
knn/
├── __init__.py # Makes it a Python package
├── generate.py # CLI entry point; defines OUTPUT_ROOT pointing to knn/output/
├── gemini_client.py # Gemini prompt + response schema tailored to KNN datasets
├── pipeline.py # KNN-specific model fitting; imports generate_dataset from utilities
├── writers.py # Output file generation; imports update_index from utilities
├── batch.py # Batch processing mode
├── knn_explorer.ipynb # Interactive student notebook for KNN
├── dataset-gen-prompt.md # The Gemini prompt spec used to design this generator
└── output/ # All generated KNN datasets (created automatically)
├── index.md
├── index.json
└── <dataset_slug>/
| Utility | What it does | How to use it |
|---|---|---|
| `utilities.data_generator.generate_dataset` | Generates a synthetic DataFrame from a Gemini spec using Gaussian sampling and an overlap parameter. Algorithm-agnostic. | `from utilities.data_generator import generate_dataset` |
| `utilities.index_manager.update_index` | Appends an entry to index.md and index.json in any output folder. | `from utilities.index_manager import update_index` |
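
The signature of update_index is not shown here, so the following is only a hypothetical stand-in for the idea of appending one entry to both index files. The function name `append_index_entry`, the entry fields, the Markdown line format, and the example slugs are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def append_index_entry(output_root, slug, description):
    """Hypothetical stand-in for update_index: record one dataset in both
    index.json (a list of objects) and index.md (one bullet per dataset)."""
    root = Path(output_root)
    root.mkdir(parents=True, exist_ok=True)

    json_path = root / "index.json"
    entries = json.loads(json_path.read_text()) if json_path.exists() else []
    entries.append({"slug": slug, "description": description})
    json_path.write_text(json.dumps(entries, indent=2))

    with open(root / "index.md", "a", encoding="utf-8") as md:
        md.write(f"- [{slug}]({slug}/datasheet.md): {description}\n")


# Usage against a throwaway directory so no real output folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    append_index_entry(tmp, "reef_fish_island", "KNN island dataset")
    append_index_entry(tmp, "sensor_scale", "KNN feature-scale dataset")
    index_entries = json.loads((Path(tmp) / "index.json").read_text())
    index_lines = (Path(tmp) / "index.md").read_text(encoding="utf-8").splitlines()

print([entry["slug"] for entry in index_entries])
print(index_lines)
```

Because the helper takes the output folder as an argument, the same function can serve every generator's output/ directory, which is the point of keeping it in utilities/.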
- `gemini_client.py`: The Gemini system instruction and JSON schema are algorithm-specific. Write a new one tailored to KNN or decision trees (e.g., ask Gemini to recommend a `k` value instead of an SVM kernel).
- `pipeline.py`: Algorithm fitting logic is specific to each ML model. Import `generate_dataset` from utilities and add your own fitting function.
- `writers.py`: Output file content varies per algorithm (e.g., a KNN datasheet discusses neighbor counts, not support vectors).
All generators are run as Python modules from the project root:

    python -m knn.generate "your prompt"
    python -m decision_tree.generate "your prompt" --batch batch.json

In your generate.py, set OUTPUT_ROOT to an absolute path built from the module file's own location, so it works from any working directory:

    import os

    OUTPUT_ROOT = os.path.join(os.path.dirname(os.path.abspath(__file__)), "output")

See svm/ for a complete, working example of this pattern.