ML Dataset Generators

AI-powered synthetic dataset generation for machine learning education. Uses the Gemini API to design realistic, pedagogically useful datasets for teaching ML classification algorithms.

Currently supports: SVM (Support Vector Machines), KNN (K-Nearest Neighbors), Decision Trees, Logistic Regression, Naive Bayes, Neural Networks (MLP)

Project Structure

ml-dataset-generators/
│
├── README.md                       # This file
├── .env                            # API keys (GEMINI_API_KEY) — not committed
├── requirements.txt                # Python dependencies
├── batch_examples.json             # Example batch config for SVM generation
├── batch_knn_examples.json         # Example batch config for KNN generation
├── batch_dt_examples.json          # Example batch config for Decision Tree generation
├── batch_lr_examples.json          # Example batch config for Logistic Regression generation
├── batch_nb_examples.json          # Example batch config for Naive Bayes generation
├── batch_mlp_examples.json         # Example batch config for Neural Network generation
│
├── svm/                            # SVM dataset generator
│   ├── __init__.py
│   ├── generate.py                 # CLI entry point
│   ├── gemini_client.py            # Gemini prompt + JSON schema for SVM datasets
│   ├── pipeline.py                 # SVM model fitting (train/test/accuracy)
│   ├── writers.py                  # Output file generation (CSV, JSON, PNG, MD)
│   ├── batch.py                    # Batch processing mode
│   ├── svm_explorer.ipynb          # Interactive student notebook
│   ├── dataset-gen-prompt v2.md    # Gemini prompt spec used to build this generator
│   └── output/                     # All generated SVM datasets
│       ├── index.md                # Human-readable index of all datasets
│       ├── index.json              # Machine-readable index
│       └── <dataset_slug>/         # One folder per generated dataset
│           ├── dataset.csv
│           ├── metadata.json
│           ├── regenerate.py
│           ├── datasheet.md
│           └── visualization.png
│
├── knn/                            # KNN dataset generator
│   ├── __init__.py
│   ├── generate.py                 # CLI entry point
│   ├── gemini_client.py            # Gemini prompt + JSON schema for KNN datasets
│   ├── pipeline.py                 # KNN model fitting (k-sweep, learning curve, comparisons)
│   ├── writers.py                  # Output file generation (CSV, JSON, PNG, MD)
│   ├── batch.py                    # Batch processing mode
│   ├── knn_explorer.ipynb          # Interactive student notebook
│   ├── dataset-gen-prompt.md       # Gemini prompt spec used to build this generator
│   └── output/                     # All generated KNN datasets
│       ├── index.md                # Human-readable index of all datasets
│       ├── index.json              # Machine-readable index
│       └── <dataset_slug>/         # One folder per generated dataset
│           ├── dataset.csv
│           ├── metadata.json
│           ├── regenerate.py
│           ├── datasheet.md
│           └── visualization.png
│
├── decision_tree/                  # Decision Tree dataset generator
│   ├── __init__.py
│   ├── generate.py                 # CLI entry point
│   ├── gemini_client.py            # Gemini prompt + JSON schema for DT datasets
│   ├── pipeline.py                 # DT model fitting (depth sweep, pruning, RF comparison)
│   ├── writers.py                  # Output file generation (CSV, JSON, PNGs, MD)
│   ├── batch.py                    # Batch processing mode
│   ├── dt_explorer.ipynb           # Interactive student notebook
│   ├── dataset-gen-prompt.md       # Gemini prompt spec used to build this generator
│   └── output/                     # All generated Decision Tree datasets
│       ├── index.md                # Human-readable index of all datasets
│       ├── index.json              # Machine-readable index
│       └── <dataset_slug>/         # One folder per generated dataset
│           ├── dataset.csv
│           ├── metadata.json
│           ├── regenerate.py
│           ├── datasheet.md
│           ├── tree.png
│           ├── depth_and_pruning.png
│           ├── feature_importance.png
│           ├── boundary.png        # 2D datasets only
│           └── rf_comparison.png   # --compare-rf only
│
├── logistic_regression/            # Logistic Regression dataset generator
│   ├── __init__.py
│   ├── generate.py                 # CLI entry point
│   ├── gemini_client.py            # Gemini prompt + JSON schema for LR datasets
│   ├── pipeline.py                 # LR model fitting (C-sweep, penalty comparison, SVM comparison)
│   ├── writers.py                  # Output file generation (CSV, JSON, PNGs, MD)
│   ├── batch.py                    # Batch processing mode
│   ├── lr_explorer.ipynb           # Interactive student notebook
│   ├── dataset-gen-prompt.md       # Gemini prompt spec used to build this generator
│   └── output/                     # All generated Logistic Regression datasets
│       ├── index.md                # Human-readable index of all datasets
│       ├── index.json              # Machine-readable index
│       └── <dataset_slug>/         # One folder per generated dataset
│           ├── dataset.csv
│           ├── metadata.json
│           ├── regenerate.py
│           ├── datasheet.md
│           ├── boundary.png        # Decision boundary line (all datasets)
│           └── coefficients.png    # Signed coefficient bar chart (all datasets)
│
├── naive_bayes/                    # Naive Bayes dataset generator
│   ├── __init__.py
│   ├── generate.py                 # CLI entry point
│   ├── gemini_client.py            # Gemini prompt + JSON schema for NB datasets
│   ├── pipeline.py                 # NB model fitting (variant selection, correlation analysis)
│   ├── writers.py                  # Output file generation (CSV, JSON, PNGs, MD)
│   ├── batch.py                    # Batch processing mode
│   ├── nb_explorer.ipynb           # Interactive student notebook
│   ├── dataset-gen-prompt.md       # Gemini prompt spec used to build this generator
│   └── output/                     # All generated Naive Bayes datasets
│       ├── index.md                # Human-readable index of all datasets
│       ├── index.json              # Machine-readable index
│       └── <dataset_slug>/         # One folder per generated dataset
│           ├── dataset.csv
│           ├── metadata.json
│           ├── regenerate.py
│           ├── datasheet.md
│           ├── distributions.png   # Per-class feature distributions (all datasets)
│           └── boundary.png        # Decision boundary (all datasets)
│
├── neural_network/                 # Neural Network (MLP) dataset generator
│   ├── __init__.py
│   ├── generate.py                 # CLI entry point
│   ├── gemini_client.py            # Gemini prompt + JSON schema for MLP datasets
│   ├── pipeline.py                 # MLP fitting (boundary postprocessing, multi-model comparisons)
│   ├── writers.py                  # Output file generation (CSV, JSON, PNGs, MD)
│   ├── batch.py                    # Batch processing mode
│   ├── mlp_explorer.ipynb          # Interactive student notebook (curriculum capstone)
│   ├── dataset-gen-prompt.md       # Gemini prompt spec used to build this generator
│   └── output/                     # All generated Neural Network datasets
│       ├── index.md                # Human-readable index of all datasets
│       ├── index.json              # Machine-readable index
│       └── <dataset_slug>/         # One folder per generated dataset
│           ├── dataset.csv
│           ├── metadata.json
│           ├── regenerate.py
│           ├── datasheet.md
│           ├── boundary.png        # Decision regions (all datasets)
│           ├── loss_curve.png      # Training loss over iterations (all datasets)
│           └── architecture.png    # Network diagram (all datasets)
│
├── utilities/                      # Shared code reused by all generators
│   ├── __init__.py
│   ├── data_generator.py           # generate_dataset(): Gaussian sampling with overlap
│   └── index_manager.py            # update_index(): maintains index.md and index.json
│
└── docs/
    └── gemini-documentation.md     # Gemini API reference notes

Installation

Clone the repository and create a virtual environment:

```
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Create a .env file in the project root:
```
GEMINI_API_KEY=your_api_key_here
```
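How the generators read this key is not shown in this README. As a minimal sketch, assuming a plain `KEY=VALUE` file and only the standard library, the lookup could work as follows. The helper names `load_env_file` and `get_gemini_key` are hypothetical, and the project may use a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return key
```

Checking the process environment first lets a key exported in the shell take precedence over the file.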

Usage: SVM Dataset Generation

Run all commands from the project root directory.

Single dataset

```
python -m svm.generate "a medical dataset about predicting diabetes risk"
python -m svm.generate "sports performance classification" --difficulty-label mid
python -m svm.generate "environmental sensor data" --overlap 0.25 --rows 300 --seed 42
python -m svm.generate "two-class industrial fault detection" --force-2d
```

Options:

| Flag | Description |
| --- | --- |
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: `great`=0.1, `mid`=0.5, `bad`=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
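To give a feel for what `--overlap` and `--seed` control, here is a minimal sketch of two-class Gaussian sampling in which a higher overlap pulls the class means together. The mapping from overlap to mean separation (`max_gap * (1 - overlap)`) and the names `sample_two_classes` and `mean_gap` are assumptions for illustration. The real logic lives in `utilities/data_generator.py` and may differ.

```python
import random


def sample_two_classes(rows=200, overlap=0.5, seed=42, max_gap=6.0):
    """Return (x1, x2, label) tuples drawn from two unit-variance Gaussians."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0.0 and 1.0")
    rng = random.Random(seed)          # seeded so the same call repeats exactly
    gap = max_gap * (1.0 - overlap)    # assumed mapping: 0 -> far apart, 1 -> same mean
    data = []
    for i in range(rows):
        label = i % 2                  # alternate labels for balanced classes
        center = gap / 2 if label else -gap / 2
        data.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0), label))
    return data


def mean_gap(data):
    """Distance between the two class means along the first feature."""
    m0 = [r[0] for r in data if r[2] == 0]
    m1 = [r[0] for r in data if r[2] == 1]
    return abs(sum(m1) / len(m1) - sum(m0) / len(m0))
```

With these assumptions, `mean_gap(sample_two_classes(overlap=0.1))` comes out much larger than the same call with `overlap=0.9`, and repeating a call with the same seed returns identical rows.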

Batch mode

```
python -m svm.generate --batch batch_examples.json
```

The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed.

```
[
  { "prompt": "diabetes diagnosis from blood markers", "difficulty_label": "great", "seed": 42 },
  { "prompt": "elite vs recreational runner classification", "difficulty_label": "mid", "rows": 300 }
]
```
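Before starting a long batch run it can help to check the config file up front. The sketch below validates the structure described above (a JSON array of objects, `prompt` required, only the listed keys allowed). The function name `validate_batch` is hypothetical, and rejecting unknown keys is a design choice of this sketch, not necessarily how `batch.py` behaves.

```python
import json

# Keys listed above for the SVM batch config.
ALLOWED_KEYS = {"prompt", "overlap", "difficulty_label", "rows", "features", "force_2d", "seed"}


def validate_batch(text):
    """Parse a batch config string and return its entries, or raise ValueError."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("batch config must be a JSON array")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} is not an object")
        if not entry.get("prompt"):
            raise ValueError(f"entry {i} is missing the required 'prompt'")
        unknown = set(entry) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"entry {i} has unknown keys: {sorted(unknown)}")
    return entries
```

The other generators accept extra keys (for example `boundary_angle` or `variant`), so `ALLOWED_KEYS` would need extending for them.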

Usage: KNN Dataset Generation

Run all commands from the project root directory.

Single dataset

```
python -m knn.generate "a medical dataset about predicting diabetes risk"
python -m knn.generate "neighborhood classification of housing types" --difficulty-label mid
python -m knn.generate "environmental sensor readings" --overlap 0.3 --rows 300 --seed 42
python -m knn.generate "complex non-linear boundary dataset" --force-2d
```

Options:

| Flag | Description |
| --- | --- |
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: `great`=0.1, `mid`=0.4, `bad`=0.75 (tighter than SVM) |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
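Because the KNN shorthand values differ from the SVM ones, the same label resolves to a different overlap depending on the generator. A minimal sketch of that resolution, using the values from the option tables: the names `DIFFICULTY_PRESETS` and `resolve_overlap` are hypothetical, and the rule that an explicit `--overlap` wins over the label is documented only for the neural network generator, so applying it here is an assumption.

```python
# Preset values copied from the SVM and KNN option tables.
DIFFICULTY_PRESETS = {
    "svm": {"great": 0.1, "mid": 0.5, "bad": 0.9},
    "knn": {"great": 0.1, "mid": 0.4, "bad": 0.75},
}


def resolve_overlap(algorithm, overlap=None, difficulty_label=None):
    """Return the overlap to use, or None when neither option was given."""
    if overlap is not None:
        if not 0.0 <= overlap <= 1.0:
            raise ValueError("overlap must be between 0.0 and 1.0")
        return overlap                      # assumed: explicit value takes priority
    if difficulty_label is not None:
        return DIFFICULTY_PRESETS[algorithm][difficulty_label]
    return None                             # assumed: the choice is left to Gemini
```

For example, `resolve_overlap("knn", difficulty_label="mid")` gives 0.4, while the same label for `"svm"` gives 0.5.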

Batch mode

```
python -m knn.generate --batch batch_knn_examples.json
```

The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed.

```
[
  { "prompt": "housing type classification by neighborhood metrics", "difficulty_label": "mid", "seed": 1, "force_2d": true },
  { "prompt": "tissue sample cancer classification", "difficulty_label": "mid", "seed": 2, "features": 8 }
]
```

Usage: Decision Tree Dataset Generation

Run all commands from the project root directory.

Single dataset

```
python -m decision_tree.generate "a medical dataset about predicting diabetes risk"
python -m decision_tree.generate "loan default prediction" --difficulty-label mid
python -m decision_tree.generate "plant classification from measurements" --boundary-angle 0.0 --true-depth 2
python -m decision_tree.generate "noisy sensor fault detection" --overlap 0.4 --compare-rf
```

Options:

| Flag | Description |
| --- | --- |
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: `great`=0.1, `mid`=0.5, `bad`=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
| `--boundary-angle FLOAT` | 0.0=axis-aligned (tree's best case), 1.0=fully diagonal (staircase required) |
| `--true-depth INT` | Number of sequential splits the true boundary requires (1–6) |
| `--compare-rf` | Also fit a RandomForestClassifier and produce `rf_comparison.png` |
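The effect of `--boundary-angle` can be illustrated without any ML library. The sketch below labels random 2D points with a straight boundary and then measures how well a single axis-aligned threshold (a depth-1 tree) recovers it. Mapping the flag linearly onto a 0 to 45 degree rotation is an assumption made for this example, as are the function names; the generator's actual geometry is not documented here.

```python
import math
import random


def label_points(points, boundary_angle):
    """Label by a line through the origin rotated boundary_angle * 45 degrees."""
    theta = math.radians(45.0 * boundary_angle)
    # Normal vector of the boundary; angle 0 reduces to the rule x > 0.
    nx, ny = math.cos(theta), math.sin(theta)
    return [1 if x * nx + y * ny > 0 else 0 for x, y in points]


def best_single_split_accuracy(points, labels):
    """Accuracy of the best one-threshold rule on either feature."""
    best = 0.0
    for axis in (0, 1):
        for threshold in sorted({p[axis] for p in points}):
            preds = [1 if p[axis] > threshold else 0 for p in points]
            acc = sum(a == b for a, b in zip(preds, labels)) / len(labels)
            best = max(best, acc, 1.0 - acc)   # allow either orientation
    return best


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
aligned = best_single_split_accuracy(points, label_points(points, 0.0))
diagonal = best_single_split_accuracy(points, label_points(points, 1.0))
```

Under these assumptions one split recovers the axis-aligned boundary perfectly, while the diagonal boundary leaves a share of points misclassified that only additional staircase splits can reduce.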

Batch mode

```
python -m decision_tree.generate --batch batch_dt_examples.json
```

The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, boundary_angle, true_depth, compare_rf.

```
[
  { "prompt": "fruit quality sorting based on a single clear chemical threshold", "difficulty_label": "great", "seed": 1, "force_2d": true },
  { "prompt": "customer churn where the boundary is diagonal through usage-satisfaction space", "boundary_angle": 0.9, "seed": 2, "compare_rf": true }
]
```

Usage: Logistic Regression Dataset Generation

Run all commands from the project root directory.

Single dataset

```
python -m logistic_regression.generate "a medical dataset about predicting diabetes risk"
python -m logistic_regression.generate "loan default prediction" --difficulty-label mid
python -m logistic_regression.generate "linearly separable species classification" --nonlinearity 0.0
python -m logistic_regression.generate "circular boundary dataset" --nonlinearity 0.8 --separation 0.3
python -m logistic_regression.generate "sparse feature selection demo" --multiclass --compare-svm
```

Options:

| Flag | Description |
| --- | --- |
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: `great`=0.1, `mid`=0.5, `bad`=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
| `--nonlinearity FLOAT` | 0.0=linear boundary (LR's best case), 1.0=fully non-linear (circular/XOR) |
| `--separation FLOAT` | 0.0=low-confidence probabilities (0.4–0.6 range), 1.0=high-confidence (near 0 and 1) |
| `--multiclass` | Generate a 3-class dataset using a one-vs-rest strategy |
| `--compare-svm` | Also fit a LinearSVC and overlay its boundary on `boundary.png` |
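One way to picture `--separation` is as a scale on the logit that logistic regression feeds through the sigmoid. The sketch below assumes exactly that: separation 0.0 keeps the logit small so predictions stay near 0.5, and 1.0 stretches it so predictions saturate. The function `predicted_probability`, the `max_scale` constant, and the linear mapping are all illustrative assumptions, not the generator's documented behaviour.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(distance, separation, max_scale=10.0):
    """Probability of class 1 for a point at a signed distance from the boundary.

    Assumed mapping: separation linearly scales the logit from 0.2 up to max_scale.
    """
    scale = 0.2 + separation * (max_scale - 0.2)
    return sigmoid(scale * distance)


low = predicted_probability(1.0, separation=0.0)    # stays inside the 0.4-0.6 band
high = predicted_probability(1.0, separation=1.0)   # close to 1
```

The same point one unit from the boundary is an uncertain call at low separation and a near-certain one at high separation, which is the contrast the flag description refers to.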

Batch mode

```
python -m logistic_regression.generate --batch batch_lr_examples.json
```

The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, nonlinearity, separation, multiclass, compare_svm.

```
[
  { "prompt": "diabetes diagnosis from blood markers", "difficulty_label": "great", "seed": 1, "force_2d": true },
  { "prompt": "gene expression cancer classification", "features": 6, "seed": 2 },
  { "prompt": "three-class species classifier", "multiclass": true, "seed": 3 }
]
```

Usage: Naive Bayes Dataset Generation

Run all commands from the project root directory.

Single dataset

```
python -m naive_bayes.generate "a medical symptom dataset for disease diagnosis"
python -m naive_bayes.generate "email spam classification" --variant multinomial
python -m naive_bayes.generate "sensor fault detection" --correlation 0.0 --difficulty-label great
python -m naive_bayes.generate "correlated financial indicators" --correlation 0.8 --compare-lr
python -m naive_bayes.generate "binary feature presence dataset" --variant bernoulli
```

Options:

| Flag | Description |
| --- | --- |
| `--rows N` | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | Class overlap 0.0–1.0 (0=separated, 1=fully mixed) |
| `--difficulty-label LABEL` | Shorthand: `great`=0.1, `mid`=0.5, `bad`=0.9 |
| `--features N` | Number of features (overrides Gemini's choice) |
| `--force-2d` | Force exactly 2 features for a clean scatter plot |
| `--seed N` | Random seed for reproducibility |
| `--batch FILE` | Path to a JSON batch config file |
| `--variant STR` | NB variant: `gaussian` (default), `multinomial`, or `bernoulli` |
| `--correlation FLOAT` | Feature correlation 0.0=independent (NB's best case), 1.0=fully correlated |
| `--compare-lr` | Also fit a LogisticRegression and overlay its boundary on `boundary.png` |

If --variant is not provided, Gemini selects the most appropriate variant for the scenario.
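Gemini makes this choice from the scenario description. For comparison, a purely data-driven rule of thumb based on the feature types the notebook distinguishes (binary, counts, continuous) might look like the sketch below. The function `suggest_nb_variant` is hypothetical and is not part of the generator.

```python
def suggest_nb_variant(rows):
    """Suggest 'bernoulli', 'multinomial', or 'gaussian' from the feature values."""
    values = [v for row in rows for v in row]
    if all(v in (0, 1) for v in values):
        return "bernoulli"       # binary presence/absence features
    if all(v >= 0 and float(v).is_integer() for v in values):
        return "multinomial"     # non-negative counts
    return "gaussian"            # continuous measurements
```

For example, rows of word counts such as `[[3, 0], [1, 7]]` map to `multinomial`, while any negative or fractional value falls through to `gaussian`.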

Batch mode

```
python -m naive_bayes.generate --batch batch_nb_examples.json
```

The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, variant, correlation, compare_lr.

```
[
  { "prompt": "medical symptom classification with independent features", "variant": "gaussian", "seed": 1 },
  { "prompt": "spam detection from word counts", "variant": "multinomial", "seed": 2 },
  { "prompt": "binary keyword presence classifier", "variant": "bernoulli", "seed": 3, "compare_lr": true }
]
```

Usage: Neural Network (MLP) Dataset Generation

```
python -m neural_network.generate "a spiral classification dataset"
python -m neural_network.generate "XOR-like boundary dataset" --force-2d --layers 2 --neurons 32
python -m neural_network.generate "complex sensor data" --compare-all --seed 42
python -m neural_network.generate "medical diagnosis classification" --difficulty-label mid --layers 3
python -m neural_network.generate --batch batch_mlp_examples.json
```

| Argument | Type | Description |
| --- | --- | --- |
| `prompt` | string | Natural language description of the dataset to generate |
| `--rows INT` | int | Number of data rows (overrides Gemini's choice) |
| `--overlap FLOAT` | float | Class overlap from 0.0 (separated) to 1.0 (mixed). Overrides `--difficulty-label` |
| `--difficulty-label LABEL` | great/mid/bad | Difficulty alias: great=0.1, mid=0.5, bad=0.9 |
| `--features INT` | int | Number of features (overrides Gemini's choice) |
| `--force-2d` | flag | Force exactly 2 features for clean 2D boundary visualization |
| `--seed INT` | int | Random seed for reproducibility |
| `--batch FILE` | path | Path to a JSON batch config file |
| `--layers INT` | int | Number of hidden layers (1–5); overrides Gemini's suggestion |
| `--neurons INT` | int | Neurons per hidden layer; overrides Gemini's suggestion |
| `--compare-all` | flag | Fit all five prior algorithms and include comparison in metadata and datasheet |

Batch mode

```
python -m neural_network.generate --batch batch_mlp_examples.json
```

The batch config is a JSON array of objects. Each object accepts: prompt (required), overlap, difficulty_label, rows, features, force_2d, seed, layers, neurons, compare_all.

```
[
  { "prompt": "spiral classification dataset", "force_2d": true, "rows": 500, "seed": 1 },
  { "prompt": "XOR pattern data", "layers": 1, "neurons": 2, "force_2d": true, "seed": 2 },
  { "prompt": "interleaved crescents boundary", "force_2d": true, "compare_all": true, "seed": 7 }
]
```

Output Format

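Each generated dataset is saved to that algorithm's output/<dataset_slug>/ folder (for example svm/output/<dataset_slug>/ or naive_bayes/output/<dataset_slug>/). Every folder contains the first four files below; the image files depend on the algorithm:

| File | Description |
| --- | --- |
| `dataset.csv` | The dataset with feature columns and a `label` column |
| `metadata.json` | Full spec, model results, generation parameters (includes `knn_concept_focus` for KNN, `dt_concept_focus` for DT) |
| `regenerate.py` | Self-contained script to recreate `dataset.csv` exactly (numpy + pandas only, no API call) |
| `datasheet.md` | Educator-facing documentation: composition, difficulty, model results, teaching notes |
| `visualization.png` | 2×2 subplot grid (KNN) or scatter/pairplot (SVM) |
| `tree.png` | Decision tree diagram at `suggested_max_depth` (Decision Tree only) |
| `depth_and_pruning.png` | Depth curve + pruning curve (Decision Tree only) |
| `feature_importance.png` | Feature importance bar chart (Decision Tree only) |
| `boundary.png` | Decision Tree (2D only): axis-aligned decision regions with true split overlays. Logistic Regression, Naive Bayes, and Neural Network: decision boundary or decision regions |
| `rf_comparison.png` | Random Forest vs Decision Tree comparison (Decision Tree, `--compare-rf` only) |
| `coefficients.png` | Signed coefficient bar chart (Logistic Regression only) |
| `distributions.png` | Per-class feature distributions (Naive Bayes only) |
| `loss_curve.png` | Training loss over iterations (Neural Network only) |
| `architecture.png` | Network diagram (Neural Network only) |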

Each generator maintains a running index of its datasets (index.md and index.json) in its own output/ folder.
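Since every `dataset.csv` has a `label` column, a quick sanity check on a generated dataset needs nothing beyond the standard library. The sketch below counts rows per class; the function name `class_balance` is illustrative, and the labels come back as strings because the `csv` module does not convert types.

```python
import csv
from collections import Counter


def class_balance(csv_path):
    """Count rows per class using the `label` column of a dataset.csv file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row["label"] for row in reader)
```

Usage, with the slug left as a placeholder: `class_balance("svm/output/<dataset_slug>/dataset.csv")`.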

Interactive Notebooks

svm/svm_explorer.ipynb is a scaffolded, hands-on learning environment for students.

To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in svm/output/.

The notebook covers:

Data exploration and visualization
Train/test split and StandardScaler normalization
SVM hyperparameter tuning (kernel, C, gamma, degree)
Decision boundary visualization
Kernel comparison (linear, rbf, poly, sigmoid)
C × gamma accuracy heatmap

knn/knn_explorer.ipynb is a scaffolded notebook for KNN exploration.

To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in knn/output/.

The notebook covers:

Data exploration and class balance analysis
Train/test split with raw vs scaled feature range comparison
KNN fitting with adjustable K
Accuracy vs K curve with train/test comparison
Decision boundary visualization (2D) or PCA projection (3D+)
Nearest neighbor inspection with color-coded neighbor lines
Learning curve analysis
Conditional advanced sections: curse of dimensionality, feature scale, distance metric comparison
SVM comparison cell with discussion prompt

decision_tree/dt_explorer.ipynb is a scaffolded notebook for Decision Tree exploration.

To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in decision_tree/output/.

The notebook covers:

Data exploration and feature type identification (numerical vs categorical)
Train/test split with a note explaining why StandardScaler is NOT needed for decision trees
Decision tree fitting and classification report
Tree diagram generation with plot_tree() — students trace the path by hand
Depth curve: sweep max_depth, identify the overfitting cliff
Pruning with ccp_alpha: find the optimal pruning level
Feature importance bar chart with signal vs noise feature annotations
Boundary visualization (2D): staircase boundary with true split overlays
Tree instability demo: refit on 5 random splits, observe root split variation
Random Forest extension: compare accuracy and feature importances to a single tree
Algorithm comparison cell: Decision Tree vs SVM vs KNN accuracy bar chart

naive_bayes/nb_explorer.ipynb is a scaffolded notebook for Naive Bayes exploration.

To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in naive_bayes/output/.

The notebook covers:

Data exploration with class balance and feature type detection (continuous/counts/binary)
Train/test split with preprocessing notes by variant (GaussianNB: no scaling; Multinomial: non-negative; Bernoulli: binary)
Naive Bayes fitting with all three variants and classification report
Per-class distribution visualization: the signature NB output showing what the model knows internally
Independence assumption analysis: feature correlation matrix and discussion of double-counting
Probability output exploration: histogram comparison showing overconfidence when features are correlated
Accuracy vs calibration: calibration curve showing NB can be right for the wrong reasons
Laplace smoothing demo (MultinomialNB/BernoulliNB): trigger zero-probability error, then fix it
Variant comparison: fit all three NB variants on the same data
Prior probability manipulation: adjust class priors and observe accuracy changes per class
Logistic Regression comparison: probability histogram overlay showing NB overconfidence vs LR calibration
Decision Tree comparison: accuracy comparison with discussion of rules vs distributions
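The Laplace smoothing item in the list above can be previewed in plain Python. This is a toy sketch with made-up word counts rather than the notebook's code: without smoothing, a single word that never appeared in a class drives that class's likelihood to zero (log-likelihood of negative infinity), and additive smoothing with `alpha` repairs it.

```python
import math
from collections import Counter


def word_log_likelihood(words, class_counts, vocab_size, alpha):
    """Sum of log P(word | class) with additive (Laplace) smoothing alpha."""
    total = sum(class_counts.values())
    score = 0.0
    for word in words:
        numerator = class_counts.get(word, 0) + alpha
        if numerator == 0:
            return float("-inf")   # an unseen word zeroes out the whole product
        score += math.log(numerator / (total + alpha * vocab_size))
    return score


spam_counts = Counter({"free": 3, "win": 2})   # toy training counts for one class
message = ["free", "meeting"]                  # "meeting" never appeared in this class
unsmoothed = word_log_likelihood(message, spam_counts, vocab_size=3, alpha=0.0)
smoothed = word_log_likelihood(message, spam_counts, vocab_size=3, alpha=1.0)
```

With `alpha=1.0` the two word probabilities become 4/8 and 1/8, so the message keeps a small but finite likelihood.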

neural_network/mlp_explorer.ipynb is a scaffolded capstone notebook for Neural Network (MLP) exploration.

To use it, open the notebook in Jupyter and update the dataset path in the Configuration cell to point to any dataset.csv in neural_network/output/.

The notebook covers:

Data exploration and class balance analysis
Train/test split with StandardScaler — with a note explaining why scaling is required for MLP (unlike DT and NB)
MLP fitting with MLPClassifier and full classification report
Loss curve visualization: plot loss_curve_ and identify convergence point
Architecture diagram: count total weights and discuss parameter-to-data ratio
Decision boundary visualization (2D) or PCA projection (3D+)
Architecture sweep: from (4,) to (64,64,64); plot accuracy vs parameter count
Activation comparison: relu / tanh / sigmoid; compare loss curves and boundaries
Overfitting demo: large network on small data subset; observe train/test accuracy gap
Black box exploration: inspect mlp.coefs_[0] and compare interpretability to Decision Tree
All-algorithm comparison: fit all 6 algorithms; horizontal bar chart
Curriculum capstone reflection: 5 synthesis questions linking MLP back to all prior algorithms
Next steps: PyTorch, Keras, TensorFlow pointers for production-scale deep learning
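For the architecture items in the list above, the parameter count of a fully connected network follows directly from the layer sizes. A minimal sketch, assuming a single output unit for binary classification; the function name `count_parameters` is illustrative:

```python
def count_parameters(n_features, hidden_layers, n_outputs=1):
    """Total weights and biases of a fully connected network."""
    sizes = [n_features, *hidden_layers, n_outputs]
    # Each layer has (inputs * outputs) weights plus one bias per output unit.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


small = count_parameters(2, (4,))           # 2*4 + 4, then 4*1 + 1
large = count_parameters(2, (64, 64, 64))
```

On a 2-feature dataset the `(4,)` network has 17 parameters and the `(64, 64, 64)` network has 8,577, which is worth comparing against the number of rows in the dataset.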

Concept Focus Reference

Decision Tree — `dt_concept_focus`

The Decision Tree generator supports eight dt_concept_focus values. Each one drives the Gemini prompt design, DT-specific post-processing, pipeline comparisons, and datasheet content.

| Focus | What it teaches |
| --- | --- |
| `axis_aligned_boundary` | Tree's best case: the true boundary is a clean threshold on 1–2 features, and the tree finds it in minimal splits |
| `diagonal_boundary` | Tree's core weakness: a diagonal boundary requires a staircase approximation, visible in `boundary.png` |
| `overfitting_depth` | Unconstrained tree memorizes training data; the depth curve's overfitting cliff is directly observable |
| `feature_importance` | Strong contrast in feature signal strength; tree exposes which features matter and which are noise |
| `categorical_structure` | Hierarchical if-then structure matching real-world categorical logic; tree splits map to readable rules |
| `noisy_threshold` | Noise near the decision boundary causes unstable splits; labels are flipped near threshold values |
| `deep_structure` | True boundary requires 4–5 sequential splits; students observe why shallow trees fail |
| `ensemble_motivation` | Single tree is unstable; Random Forest comparison shows how averaging trees reduces instability |

batch_dt_examples.json contains one entry for each concept focus. Run all eight at once with:

```
python -m decision_tree.generate --batch batch_dt_examples.json
```

KNN — `knn_concept_focus`

The KNN generator supports ten knn_concept_focus values. Each one drives the Gemini prompt design, any KNN-specific post-processing applied to the dataset, what comparisons the pipeline runs, and what the educator's datasheet emphasises.

Quick reference

| Focus | What it teaches |
| --- | --- |
| `k_effect` | How K controls the bias-variance tradeoff; suggested K is intentionally suboptimal |
| `curse_of_dimensionality` | Distance becomes meaningless in high dimensions; use 6–10 features |
| `noise_sensitivity` | Deep-interior outliers poison local neighborhoods more than boundary noise |
| `class_imbalance` | Minority class gets outvoted by neighbors; one class is 3–4× larger |
| `nonlinear_boundary` | Irregular, jagged boundary that KNN handles naturally but a linear SVM cannot |
| `multimodal_classes` | Each class exists in 2+ disconnected regions; no global boundary works |
| `disconnected_regions` | One class forms an island entirely surrounded by the other class |
| `feature_scale` | Features at different scales distort Euclidean distance; scaling is critical |
| `irrelevant_features` | 1–2 features carry no class signal; students discover which features matter |
| `distance_metric` | Correlated features distort Euclidean distance but not Manhattan |

`batch_knn_examples.json` — one dataset per concept

batch_knn_examples.json contains one entry for each concept focus, in order. Run all ten at once with:

```
python -m knn.generate --batch batch_knn_examples.json
```

Each entry's prompt is written to naturally describe the target concept without naming it, so Gemini selects the right knn_concept_focus from context. The parameters (force_2d, features, rows, difficulty_label) are chosen to make the concept effect as clear as possible.

1. k_effect — seed 1, force_2d, mid

"a wine quality dataset classifying wines as premium or standard based on two chemical measurements, where using too few neighbors picks up noise from individual outlier bottles and using too many neighbors averages away the subtle local patterns that distinguish quality tiers — making the choice of K critical to accuracy"

force_2d keeps the decision boundary visible and the K-sweep curve interpretable. mid difficulty (overlap=0.4) creates a zone where K matters — too small and the model overfits to individual points; too large and it smooths away real signal. The prompt explicitly names both failure modes so Gemini understands the data needs enough noise that K=1 underperforms. Gemini is instructed to set a suggested_k that is intentionally suboptimal, so students can find a better K themselves using the accuracy-vs-K curve.
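The K-sweep idea can be tried without scikit-learn. The sketch below is a toy illustration, not the pipeline's code: it builds two overlapping Gaussian blobs with the standard library, implements majority-vote KNN by brute force, and records test accuracy for several odd values of K.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query."""
    nearest = sorted(train, key=lambda r: (r[0] - query[0]) ** 2 + (r[1] - query[1]) ** 2)[:k]
    return Counter(label for _, _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train, (x, y), k) == label for x, y, label in test)
    return hits / len(test)


rng = random.Random(1)
# Toy data: two overlapping unit-variance blobs, not a generated dataset.
points = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0), int(c > 0)) for c in (-1.0, 1.0) for _ in range(150)]
rng.shuffle(points)
train, test = points[:200], points[200:]
sweep = {k: accuracy(train, test, k) for k in (1, 5, 15, 51, 199)}
```

Scoring the training set itself with `accuracy(train, train, 1)` returns 1.0, because every point is its own nearest neighbour. That perfect score is the memorisation that makes K=1 misleading, and comparing it with `sweep[1]` shows the gap.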

2. curse_of_dimensionality — seed 2, features: 8, rows: 300, mid

"a drug screening dataset classifying compounds as active or inactive based on many molecular descriptors, where adding more features makes distance-based classification progressively less reliable because points become nearly equidistant from each other in high-dimensional space"

Eight features push KNN into the regime where Euclidean distance starts to lose meaning — all points are roughly equidistant in high dimensions, so the notion of "nearest neighbour" breaks down. The prompt explicitly names the equidistance effect to prevent Gemini from choosing irrelevant_features instead. rows: 300 is set deliberately larger than the default range to give the K-sweep enough data to show a meaningful trend, and to let students run the notebook's curse demo (adding features one at a time and watching accuracy fall).
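The equidistance effect is easy to measure directly. In this sketch (uniform random points, standard library only, names chosen for illustration) the ratio of the nearest to the farthest distance from a query point moves toward 1.0 as features are added, meaning the "nearest" neighbour is barely nearer than any other point.

```python
import math
import random


def distance_contrast(n_features, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one query point (1.0 = no contrast)."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_features)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return min(dists) / max(dists)


contrast = {d: round(distance_contrast(d), 3) for d in (2, 8, 100)}
```

Printing `contrast` shows the trend across 2, 8, and 100 features; the exact values depend on the seed.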

3. noise_sensitivity — seed 3, force_2d, great

"a quality control dataset for manufacturing where a few mislabeled defective parts were accidentally placed deep in the passing-product storage region"

great difficulty (overlap=0.1) makes the two classes well-separated overall, so the only hard-to-classify points are the deep-interior outliers inserted by the post-processor. This isolates the lesson: KNN is hypersensitive to outliers that land far inside the wrong class's territory, because they corrupt every query point's neighbourhood in that region. Boundary noise matters less — KNN can simply average it away.

4. class_imbalance — seed 4, rows: 400, mid

"a clinical trial dataset where one treatment group is three times larger than the control group, predicting whether patients respond well or poorly to a new drug"

rows: 400 ensures the minority class still has enough members (~100) to produce a meaningful class-level accuracy comparison. With a 3:1 imbalance, the majority class tends to win the vote for any query point near the boundary, so students see minority-class precision and recall collapse while overall accuracy looks deceptively high. No force_2d is needed because imbalance is a counting phenomenon, not a geometric one; the effect appears regardless of dimensionality.
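The gap between overall accuracy and minority-class recall can be shown with the limiting case, where the minority class is outvoted on every prediction. This is a worked illustration using the 300/100 split implied by the parameters above, not output from the pipeline.

```python
def accuracy_and_recall(y_true, y_pred, positive):
    """Overall accuracy plus recall for one class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    recall = sum(p == positive for p in positives) / len(positives)
    return correct / len(y_true), recall


# 3:1 split over 400 rows, as in this batch entry.
y_true = ["majority"] * 300 + ["minority"] * 100
y_pred = ["majority"] * 400        # limiting case: the minority is always outvoted
acc, minority_recall = accuracy_and_recall(y_true, y_pred, "minority")
```

Here `acc` is 0.75 while `minority_recall` is 0.0, which is why per-class metrics belong next to the headline accuracy.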

5. nonlinear_boundary — seed 5, force_2d, mid

"a geology dataset classifying two rock types based on two soil measurements, where the boundary between them follows a complex winding path rather than any straight line or smooth curve — the classes intermingle along a jagged frontier but neither forms separate disconnected clusters"

force_2d is essential here — the whole point is to visualise a boundary that no straight line or smooth curve could separate cleanly. The prompt explicitly rules out disconnected clusters to prevent Gemini from choosing multimodal_classes instead. KNN handles it naturally because it makes purely local decisions, never committing to a global shape. The mid difficulty adds some noise to prevent the boundary from being trivially jagged. The datasheet includes a note on how a linear SVM would fail and an rbf SVM might partially succeed.

6. multimodal_classes — seed 6, force_2d, great

"a geology dataset classifying rock formations where two mineral types each appear in multiple disconnected deposits across a survey region based on seismic and magnetic readings"

great difficulty (overlap=0.1) ensures each mode is a tight, clearly visible cluster. The pedagogical goal is to show that even when the classes are individually easy to identify, a global boundary (linear SVM, logistic regression) cannot work because each class occupies multiple disconnected regions of feature space. KNN handles it gracefully by consulting local neighbours. force_2d lets students see all four or more clusters at once.

7. disconnected_regions — seed 7, force_2d, mid

"a marine biology dataset where a rare reef fish species forms isolated colonies surrounded entirely by a dominant predator species, classified by water temperature and depth"

The post-processor physically relocates a pocket of Class B points to sit entirely inside Class A's territory. Unlike multimodal_classes (where both classes have multiple modes), here only one class forms an island inside the other, so no single linear decision boundary can succeed: the surrounding class's region is not convex. mid difficulty adds some background noise so the island is not surrounded by a perfectly clean moat. force_2d is required to see the island structure.
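
The island effect can be sketched in one dimension. The example below is an illustration under stated assumptions, not the post-processor itself: the nine points are hypothetical, a single threshold stands in for a global linear boundary, and leave-one-out 1-nearest-neighbour stands in for KNN.

```python
# Hypothetical 1-D data: class "B" forms an island inside class "A".
points = [(0, "A"), (1, "A"), (2, "A"),
          (4, "B"), (5, "B"), (6, "B"),
          (8, "A"), (9, "A"), (10, "A")]

def loo_1nn_accuracy(data):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, (x, label) in enumerate(data):
        others = data[:i] + data[i + 1:]
        nearest = min(others, key=lambda item: abs(item[0] - x))
        correct += nearest[1] == label
    return correct / len(data)

def best_threshold_accuracy(data):
    """Best accuracy of any single cut 'x < t' vs 'x >= t', trying both labelings."""
    xs = sorted(x for x, _ in data)
    cuts = [xs[0] - 1] + [x + 0.5 for x in xs]
    best = 0.0
    for t in cuts:
        for left, right in (("A", "B"), ("B", "A")):
            hits = sum((left if x < t else right) == label for x, label in data)
            best = max(best, hits / len(data))
    return best

knn_accuracy = loo_1nn_accuracy(points)
threshold_accuracy = best_threshold_accuracy(points)
print(f"1-NN: {knn_accuracy:.2f}, best single threshold: {threshold_accuracy:.2f}")
```

No single cut can get more than six of the nine points right, while the local rule classifies every point correctly.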

8. feature_scale — seed 8, force_2d, great

"a sensor dataset from two types of industrial machines where one sensor reports in millivolts and another reports in kilopascals, making raw distance-based classification unreliable"

great difficulty means the classes are genuinely separable, but only once features are on the same scale. The post-processor enforces that one feature's range is at least 10× the other's, so raw Euclidean distance is dominated by the large-scale feature and the small-scale feature is effectively ignored. The pipeline runs a feature_scale_comparison: unscaled KNN vs StandardScaler KNN. Students see a large accuracy gap that closes when scaling is applied. force_2d keeps the distortion visible.
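
A rough sketch of why scaling changes the outcome is shown below. It does not reproduce the pipeline's feature_scale_comparison; the eight readings are hypothetical, leave-one-out 1-nearest-neighbour stands in for KNN, and the standardisation is written by hand using the population standard deviation.

```python
import math
from statistics import mean, pstdev

# Hypothetical readings: (millivolts, kilopascals, class).
# Only the small-scale second feature separates the classes.
rows = [(0, 0.1, 0), (200, 0.1, 0), (400, 0.1, 0), (600, 0.1, 0),
        (100, 0.9, 1), (300, 0.9, 1), (500, 0.9, 1), (700, 0.9, 1)]

def standardize(rows):
    """Rescale each feature to zero mean and unit (population) standard deviation."""
    col_a = [a for a, _, _ in rows]
    col_b = [b for _, b, _ in rows]
    mu_a, sd_a = mean(col_a), pstdev(col_a)
    mu_b, sd_b = mean(col_b), pstdev(col_b)
    return [((a - mu_a) / sd_a, (b - mu_b) / sd_b, y) for a, b, y in rows]

def loo_1nn_accuracy(rows):
    """Leave-one-out accuracy of 1-nearest-neighbour with Euclidean distance."""
    correct = 0
    for i, (a, b, y) in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        nearest = min(others, key=lambda r: math.dist((a, b), (r[0], r[1])))
        correct += nearest[2] == y
    return correct / len(rows)

raw_accuracy = loo_1nn_accuracy(rows)
scaled_accuracy = loo_1nn_accuracy(standardize(rows))
print(f"unscaled: {raw_accuracy:.2f}, standardised: {scaled_accuracy:.2f}")
```

On this constructed data the unscaled classifier picks a wrong-class neighbour every time, because the millivolt axis dominates the distance; after standardisation every point finds a same-class neighbour.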

9. irrelevant_features — seed 9, features: 5, mid

"a patient health dataset with several physiological measurements for predicting hypertension risk, where some measurements are medically irrelevant noise features"

The spec defines five features, 1–2 of which are flagged is_irrelevant: true. The irrelevant features add random noise to every distance calculation, degrading KNN performance. Students are prompted to compare accuracy when those features are dropped. The educator's datasheet identifies which features are irrelevant and why. No force_2d here: more than two features are needed to demonstrate the effect, and the notebook's feature-selection section works across any dimensionality.

10. distance_metric — seed 10, force_2d, mid

"a financial dataset classifying loan applicants as low-risk or high-risk where income and credit utilization are strongly correlated, making Euclidean distance misleading"

Strongly correlated features create an elongated point cloud. Euclidean distance treats both axes as independent and equally informative, so it measures "closeness" along a diagonal that ignores the correlation structure. Manhattan distance (p=1) is less affected. The pipeline runs a distance_metric_comparison at suggested_k, showing Euclidean vs Manhattan accuracy. force_2d lets students see the correlation ellipse and understand why the metric matters.
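
To make the role of p concrete, here is a small sketch of the Minkowski distance, of which Manhattan (p=1) and Euclidean (p=2) are special cases. It is not the pipeline's distance_metric_comparison; the query and the two candidate neighbours are hypothetical points chosen so that the nearest neighbour changes with the metric.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

query = (0.0, 0.0)
# Hypothetical candidates: one along the diagonal, one along a single axis.
candidates = {"diagonal": (3.0, 3.0), "axis": (0.0, 5.0)}

def nearest(p):
    """Name of the candidate closest to the query under order-p distance."""
    return min(candidates, key=lambda name: minkowski(query, candidates[name], p))

print("Euclidean nearest:", nearest(2))
print("Manhattan nearest:", nearest(1))
```

Under Euclidean distance the diagonal point is closer (about 4.24 vs 5), while under Manhattan distance the axis point is closer (5 vs 6), so the same query can receive a different neighbour vote depending on the metric.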

Logistic Regression — `lr_concept_focus`

The Logistic Regression generator supports eight lr_concept_focus values. Each one drives the Gemini prompt design, LR-specific post-processing, pipeline comparisons, and datasheet content.

Focus	What it teaches
`linear_boundary`	LR's best case — true boundary is linear, coefficients are clean and interpretable
`nonlinear_boundary`	LR's core weakness — circular or XOR boundary; best-fit line is visibly inadequate
`probability_confidence`	Contrast between high-confidence (near 0/1) and low-confidence (near 0.5) predictions
`coefficient_interpretation`	Features with meaningful, domain-coherent signed weights; coefficients tell a story
`l1_regularization`	Several irrelevant features; L1 drives their coefficients to exactly zero
`l2_regularization`	Correlated features; L2 shrinks coefficients smoothly toward zero without eliminating any; contrast with L1 sparsity
`perfect_separation`	Perfectly separable data exposing coefficient divergence; regularization as the fix
`class_imbalance`	Majority class dominates probability outputs; minority class gets poor calibration
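
The `perfect_separation` row can be illustrated with a few lines of gradient ascent. This is a minimal sketch, not the generator's pipeline: the four data points, the learning rate, the step counts, and the L2 strength are all assumptions chosen for illustration, and the model has a single weight with no intercept.

```python
import math

# Hypothetical perfectly separable 1-D data: negatives left of zero, positives right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weight(steps, lr=0.5, l2=0.0):
    """Gradient ascent on the (optionally L2-penalised) log-likelihood, no intercept."""
    w = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys)) - l2 * w
        w += lr * grad
    return w

w_short = fit_weight(200)            # unregularised, few steps
w_long = fit_weight(20000)           # unregularised, many steps
w_reg = fit_weight(20000, l2=1.0)    # L2 penalty keeps the weight bounded
print(f"200 steps: {w_short:.2f}, 20000 steps: {w_long:.2f}, with L2: {w_reg:.2f}")
```

Without a penalty the weight keeps growing the longer training runs, because a larger weight always increases the likelihood on separable data; with the L2 term it settles at a finite value.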

batch_lr_examples.json contains one entry for each concept focus. Run all eight at once with:

python -m logistic_regression.generate --batch batch_lr_examples.json

Naive Bayes — `nb_concept_focus`

The Naive Bayes generator supports eight nb_concept_focus values. Each one drives the Gemini prompt design, variant-specific data generation, pipeline analysis, and datasheet content.

Focus	What it teaches
`independence_holds`	Features are truly independent; NB assumption is valid; fast, accurate, and well-calibrated
`independence_violated`	Strongly correlated features; NB double-counts; probabilities overconfident despite decent accuracy
`naive_works_anyway`	Moderate correlation; NB gets the right label despite technically wrong probabilities
`gaussian_fit`	GaussianNB only; per-class feature distributions are cleanly Gaussian; model fits them correctly
`distribution_mismatch`	GaussianNB on skewed or bimodal features; Gaussian assumption breaks; likelihoods miscalculated
`prior_dominance`	Class imbalance so strong the prior overwhelms the likelihood; minority class rarely predicted
`zero_frequency`	MultinomialNB or BernoulliNB only; some feature values absent from training; Laplace smoothing required
`high_dimensional`	Many features with sparse signal; demonstrates NB's strength relative to more complex models
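
For the `zero_frequency` row, the effect of Laplace (additive) smoothing can be worked through by hand. The sketch below uses a hypothetical three-word vocabulary and three training tokens for a single class; it shows only the likelihood formula, not the generator's code.

```python
from collections import Counter

# Hypothetical vocabulary and training tokens for one class; "refund" never appears.
vocabulary = ["free", "offer", "refund"]
class_tokens = ["free", "free", "offer"]
counts = Counter(class_tokens)

def likelihood(word, alpha):
    """P(word | class) with additive smoothing strength alpha (alpha=0 means none)."""
    return (counts[word] + alpha) / (len(class_tokens) + alpha * len(vocabulary))

unsmoothed = likelihood("refund", alpha=0.0)  # a zero here zeroes the whole product
smoothed = likelihood("refund", alpha=1.0)    # (0 + 1) / (3 + 3)
total = sum(likelihood(w, alpha=1.0) for w in vocabulary)
print(f"unsmoothed={unsmoothed:.3f}, smoothed={smoothed:.3f}, total={total:.3f}")
```

With alpha=1 the unseen word gets a small non-zero likelihood, and the smoothed likelihoods over the vocabulary still sum to 1.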

batch_nb_examples.json contains one entry for each concept focus. Run all eight at once with:

python -m naive_bayes.generate --batch batch_nb_examples.json

Neural Network (MLP) — `mlp_concept_focus`

The Neural Network generator supports eight mlp_concept_focus values. Each one drives the Gemini prompt design, boundary postprocessing, pipeline behavior, visualizations, and datasheet content.

Focus	What it teaches
`universal_approximation`	MLP can approximate any continuous boundary given sufficient capacity; non-linear dataset proves the point
`underfitting_architecture`	Too few neurons/layers cannot capture the true boundary; accuracy plateau is visible in the loss curve
`overfitting_small_data`	Large network on small data memorizes training labels; train/test accuracy gap is pronounced
`activation_comparison`	relu / tanh / sigmoid produce different loss curves and boundaries; relu is the practical default
`training_instability`	High overlap + large learning rate causes oscillating loss curve; convergence is not guaranteed
`black_box_contrast`	MLP accuracy vs Decision Tree interpretability; weights are unreadable, tree rules are not
`depth_necessity`	Complex boundary requires multiple layers; shallow network fails; deep network succeeds
`simple_data_overkill`	Linearly separable data where Logistic Regression matches MLP; extra capacity adds no benefit
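
Why a hidden layer matters for non-linear boundaries can be shown with XOR, which no single linear unit can represent. The sketch below is illustrative only: the two hidden ReLU units use hand-picked weights rather than learned ones, and nothing here comes from the generator's pipeline.

```python
def relu(z):
    return max(0.0, z)

def tiny_mlp(x1, x2):
    """Two hidden ReLU units with hand-picked weights (chosen for illustration, not learned)."""
    h1 = relu(x1 + x2)          # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are on
    return h1 - 2.0 * h2        # cancels the output when both are on

outputs = {(a, b): tiny_mlp(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, value in outputs.items():
    print(inputs, value)
```

The network outputs 1 for exactly one active input and 0 otherwise, so two hidden units already suffice for a boundary that a linear model cannot express.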

batch_mlp_examples.json contains one entry for each concept focus. Run all eight at once with:

python -m neural_network.generate --batch batch_mlp_examples.json

Curriculum Sequence

The six generators form a progressive curriculum. Each algorithm is introduced in a context where the previous one fails:

Step	Algorithm	Key limitation exposed	Next algorithm's answer
1	Logistic Regression	Linear boundary only	SVM: maximise margin, use kernels
2	SVM	Black-box kernel; distance-based	KNN: purely local, no global model
3	KNN	Sensitive to scale, irrelevant features, high dims	Decision Tree: rule-based, interpretable
4	Decision Tree	Unstable splits; staircase boundary	Naive Bayes: probabilistic, low variance
5	Naive Bayes	Independence assumption; overconfident probs	MLP: learns feature interactions, calibrated
6	Neural Network (MLP)	Black box; needs data and tuning	(open question: deep learning, ensembles)

The --compare-all flag on the Neural Network generator fits all five prior algorithms on the same dataset and writes the full comparison to metadata.json and datasheet.md, closing the curriculum loop.

Adding a New Generator

This project follows an algorithm-per-folder pattern. Each algorithm lives in its own top-level folder and is a self-contained Python package. The utilities/ folder holds logic shared across all generators.

Structure every new generator must follow

Create a new folder (e.g., knn/) with this layout:

knn/
├── __init__.py                 # Makes it a Python package
├── generate.py                 # CLI entry point; defines OUTPUT_ROOT pointing to knn/output/
├── gemini_client.py            # Gemini prompt + response schema tailored to KNN datasets
├── pipeline.py                 # KNN-specific model fitting; imports generate_dataset from utilities
├── writers.py                  # Output file generation; imports update_index from utilities
├── batch.py                    # Batch processing mode
├── knn_explorer.ipynb          # Interactive student notebook for KNN
├── dataset-gen-prompt.md       # The Gemini prompt spec used to design this generator
└── output/                     # All generated KNN datasets (created automatically)
    ├── index.md
    ├── index.json
    └── <dataset_slug>/

What to reuse from `utilities/`

Utility	What it does	How to use it
`utilities.data_generator.generate_dataset`	Generates a synthetic DataFrame from a Gemini spec using Gaussian sampling and an overlap parameter. Algorithm-agnostic.	`from utilities.data_generator import generate_dataset`
`utilities.index_manager.update_index`	Appends an entry to `index.md` and `index.json` in any output folder.	`from utilities.index_manager import update_index`
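
The real signature of generate_dataset is not shown in this README, so the following is only a hypothetical stand-in for the idea of "Gaussian sampling with an overlap parameter". The function name sample_two_classes, its arguments, the class centres, and the way overlap widens the spread are all assumptions, and it returns a list of tuples rather than a DataFrame.

```python
import random

def sample_two_classes(n_per_class, overlap, seed=0):
    """Hypothetical stand-in: two 2-D Gaussian blobs whose spread grows with `overlap`."""
    rng = random.Random(seed)                  # seeded for reproducible datasets
    centers = {0: (-1.0, -1.0), 1: (1.0, 1.0)}  # assumed class centres
    sigma = 0.2 + overlap                      # assumption: overlap widens each class
    rows = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            rows.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma), label))
    return rows

rows = sample_two_classes(n_per_class=50, overlap=0.1, seed=4)
print(len(rows), "rows;", sum(label for _, _, label in rows), "in class 1")
```

Fixing the seed makes repeated calls return identical rows, which mirrors the per-dataset seeds used in the batch examples.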

What belongs in the algorithm folder

gemini_client.py — The Gemini system instruction and JSON schema are algorithm-specific. Write a new one tailored to KNN or decision trees (e.g., ask Gemini to recommend a k value instead of an SVM kernel).
pipeline.py — Algorithm fitting logic is specific to each ML model. Import generate_dataset from utilities and add your own fitting function.
writers.py — Output file content varies per algorithm (e.g., a KNN datasheet discusses neighbor counts, not support vectors).

Entry point convention

All generators are run as Python modules from the project root:

python -m knn.generate "your prompt"
python -m decision_tree.generate --batch batch.json

In your generate.py, build OUTPUT_ROOT as an absolute path derived from the module file's own location, so it works from any working directory:

import os

OUTPUT_ROOT = os.path.join(os.path.dirname(os.path.abspath(__file__)), "output")

Reference implementation

See svm/ for a complete, working example of this pattern.
