1 change: 1 addition & 0 deletions README.md
@@ -44,6 +44,7 @@ The directory structure of your new project will look something like this (depen
├── LICENSE <- Open-source license if one is chosen
├── Makefile <- Makefile with convenience commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── AGENTS.md <- The top-level AGENTS file for AI coding agents.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
1 change: 1 addition & 0 deletions docs/docs/index.md
@@ -93,6 +93,7 @@ The directory structure of your new project will look something like this (depen
├── LICENSE <- Open-source license if one is chosen
├── Makefile <- Makefile with convenience commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── AGENTS.md <- The top-level AGENTS file for AI coding agents.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
4 changes: 4 additions & 0 deletions docs/docs/using-the-template.md
@@ -155,6 +155,10 @@ Now you'll be able to [create a Pull Request in GitHub](https://docs.github.com/

There's no magic in the `Makefile`. We often add project-specific commands or update the existing ones over the course of a project. For example, we've added scripts to generate reports with pandoc, build and serve documentation, publish static sites from assets, package code for distribution, and more.

## Changing the `AGENTS.md` file

There's no magic in the `AGENTS.md` file either (apart from the ✨magic✨ of AI). This is a place to put instructions for AI coding agents that you might use in your project. We often add project-specific instructions for how to use agents effectively with the codebase and data in our project. Best practices for instructing agents in a project are evolving rapidly, so we recommend updating this file as you learn what works best for your project.

## Installing Make on Windows

Unfortunately, GNU Make is not typically pre-installed on Windows. Here are a few different options for getting Make:
58 changes: 58 additions & 0 deletions tests/test_creation.py
@@ -41,6 +41,62 @@ def no_curlies(filepath):
    return not any(template_strings_in_file)


def verify_agents_md(root, config):
    """Test that AGENTS.md is correctly rendered for the given config."""
    agents_md = (root / "AGENTS.md").read_text()

Review comment on lines +44 to +47 (Copilot AI, Apr 9, 2026):

AGENTS.md includes non-ASCII characters (e.g., the arrow and em dashes in the new template). Reading it via Path.read_text() without an explicit encoding makes this test locale-dependent and can fail/garble under non-UTF-8 locales. Use read_text(encoding='utf-8') here (and consider doing the same in no_curlies, which also reads files without specifying encoding).

    # Project name and module name are always rendered
    assert config["project_name"] in agents_md
    assert config["module_name"] in agents_md

    # No unrendered Jinja2 template strings
    assert no_curlies(root / "AGENTS.md")

    # Code scaffold section conditionally included in project structure
    if config["include_code_scaffold"] == "Yes":
        assert "dataset.py" in agents_md
        assert "features.py" in agents_md
        assert "modeling/" in agents_md
    else:
        assert "dataset.py" not in agents_md
        assert "features.py" not in agents_md

    # Dataset storage conditionals
    has_storage = "none" not in config["dataset_storage"]
    if has_storage:
        assert "sync_data_down" in agents_md
        assert "sync_data_up" in agents_md
    else:
        assert "sync_data_down" not in agents_md
        assert "sync_data_up" not in agents_md

    # Environment manager
    env_manager = config["environment_manager"]
    if env_manager == "conda":
        assert "conda activate" in agents_md
    elif env_manager == "virtualenv":
        assert "workon" in agents_md
    elif env_manager == "pipenv":
        assert "pipenv shell" in agents_md
    elif env_manager == "uv":
        assert "source .venv/bin/activate" in agents_md
    elif env_manager == "pixi":
        assert "pixi shell" in agents_md
    elif env_manager == "poetry":
        assert "poetry env activate" in agents_md

    # Dependency file
    assert f"`{config['dependency_file']}`" in agents_md

    # Linting and formatting
    if config["linting_and_formatting"] == "ruff":
        assert "Uses ruff" in agents_md
        assert "flake8" not in agents_md
    elif config["linting_and_formatting"] == "flake8+black+isort":
        assert "flake8, black, and isort" in agents_md
        assert "Uses ruff" not in agents_md
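
The review comment on `verify_agents_md` suggests passing an explicit encoding when reading the rendered file. A minimal sketch of that suggestion, assuming the rendered `AGENTS.md` is UTF-8 encoded; `read_agents_md` is a hypothetical helper name used only for illustration and is not part of this PR:

```python
from pathlib import Path


def read_agents_md(root: Path) -> str:
    """Read the rendered AGENTS.md as UTF-8, regardless of the system locale.

    Hypothetical helper for illustration; the PR reads the file inline.
    """
    # An explicit encoding avoids falling back to the locale's default,
    # which may not be able to decode the arrows and dashes in the template.
    return (root / "AGENTS.md").read_text(encoding="utf-8")
```

Inside the test, this would stand in for the bare `read_text()` call; the same argument could be passed wherever `no_curlies` reads a file.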


def test_baking_configs(config, fast):
    """For every generated config in the config_generator, run all
    of the tests.
@@ -49,6 +105,7 @@ def test_baking_configs(config, fast):
    with bake_project(config) as project_directory:
        verify_folders(project_directory, config)
        verify_files(project_directory, config)
        verify_agents_md(project_directory, config)

        if fast < 2:
            verify_makefile_commands(project_directory, config)
@@ -96,6 +153,7 @@ def verify_folders(root, config):
def verify_files(root, config):
    """Test that expected files and only expected files exist."""
    expected_files = [
        "AGENTS.md",
        "Makefile",
        "README.md",
        "pyproject.toml",
153 changes: 153 additions & 0 deletions {{ cookiecutter.repo_name }}/AGENTS.md
@@ -0,0 +1,153 @@
# AGENTS.md

* **Project name:** {{ cookiecutter.project_name }}
* **Description:** {{ cookiecutter.description }}

This project was generated from the [Cookiecutter Data Science](https://cookiecutter-data-science.drivendata.org/) template. Follow the conventions below when working in this codebase.

## Key Commands

Run tasks through `make`. Key recipes:

* `make` — List all available commands
* `make requirements` — Install/update dependencies
* `make data` — Run the data processing pipeline (assumes data is present in `data/raw/`){% if cookiecutter.linting_and_formatting != 'none' %}
* `make lint` — Check code style
* `make format` — Auto-format code{% endif %}{% if cookiecutter.testing_framework != 'none' %}
* `make test` — Run the test suite{% endif %}{% if not cookiecutter.dataset_storage.none %}
* `make sync_data_down` — Pull data from cloud storage
* `make sync_data_up` — Push data to cloud storage{% endif %}
* `make create_environment` — Set up the Python environment
* `make clean` — Remove compiled Python files

Add project-specific recipes to the `Makefile` for commands that are run frequently or require multiple steps.

## Project Directory Structure

* {{ cookiecutter.repo_name }}/
    * data/ <- Data files (gitignored)
        * raw/ <- Original, immutable data. NEVER modify.
        * external/ <- Third-party data sources.
        * interim/ <- Intermediate transformed data.
        * processed/ <- Final, canonical datasets.
    * models/ <- Trained models, predictions, summaries (gitignored)
    * notebooks/ <- Jupyter notebooks for exploration
    * references/ <- Data dictionaries, manuals, documentation
    * reports/ <- Generated analysis outputs
        * figures/ <- Generated figures
    * {{ cookiecutter.module_name }}/ <- Source code for this project{%- if cookiecutter.include_code_scaffold == 'Yes' %}
        * config.py <- Project configuration and path definitions
        * dataset.py <- Data loading and generation
        * features.py <- Feature engineering code
        * plots.py <- Visualization code
        * modeling/
            * train.py <- Model training.
            * predict.py <- Model inference{% endif %}
    * tests/ <- Test suite

## Core Principles

Reproducibility is the most critical component of any data science project. The following principles should be followed to ensure that the project is reproducible and maintainable.

### Data analysis is a directed acyclic graph

Treat the data pipeline as a directed acyclic graph. Each step takes inputs and produces outputs with no circular dependencies. Anyone must be able to reproduce final outputs from code and raw data alone.
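
One way to make the dependency structure explicit is to write it down and let the standard library order it. This is a minimal sketch, assuming Python 3.9+ (for `graphlib`); the step names are hypothetical and are not defined by the template:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps mapped to the steps they depend on.
# These names are illustrative only.
pipeline = {
    "make_dataset": set(),
    "build_features": {"make_dataset"},
    "train_model": {"build_features"},
    "predict": {"train_model", "build_features"},
}

# static_order() yields each step only after all of its dependencies,
# and raises graphlib.CycleError if the graph contains a cycle.
execution_order = list(TopologicalSorter(pipeline).static_order())
print(execution_order)
```

In practice the `Makefile` can encode the same ordering through recipe prerequisites; the sketch only illustrates the "no circular dependencies" rule.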

### Raw data is immutable

Never edit, overwrite, or manually modify files in `data/raw/`. Data flows in one direction:

`data/raw/` → `data/interim/` → `data/processed/`

Intermediate outputs should be cached in `interim/`. Analysis-ready datasets that are end products in themselves or don't require any more preprocessing or feature engineering go in `processed/`.
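
As a rough illustration of a step that respects this flow, the sketch below reads a raw file and writes its output to a separate directory. The function name and the cleaning rule (dropping blank lines, stripping trailing whitespace) are hypothetical and not part of the template:

```python
from pathlib import Path


def clean_raw_file(raw_path: Path, interim_dir: Path) -> Path:
    """Write a cleaned copy of a raw text file into the interim directory.

    Hypothetical step for illustration: the raw file is only read,
    never modified, and the result lands under interim_dir.
    """
    interim_dir.mkdir(parents=True, exist_ok=True)
    out_path = interim_dir / raw_path.name
    lines = raw_path.read_text(encoding="utf-8").splitlines()
    # Drop blank lines and strip trailing whitespace from the rest.
    cleaned = [line.rstrip() for line in lines if line.strip()]
    out_path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return out_path
```

A call might look like `clean_raw_file(Path("data/raw/survey.csv"), Path("data/interim"))`, where `survey.csv` is a made-up file name.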

### Data is not in source control

The `data/` and `models/` directories are gitignored. Do not commit data files, trained models, or `.env` files to git.{% if not cookiecutter.dataset_storage.none %} Use `make sync_data_down` / `make sync_data_up` to sync data with cloud storage.{% endif %}

### Use Make as the task runner

Run all tasks through `make` — see the Key Commands section above for the full list of available recipes.

### Notebooks are for exploration; source code is for repetition

Use `notebooks/` for exploratory analysis notebooks. When code is reused across notebooks, refactor it into the `{{ cookiecutter.module_name }}/` package. The project is installed as a local package, so you can import with:

```python
from {{ cookiecutter.module_name }}.dataset import main
```

Notebook naming convention: `<step>.<order>-<identifier>-<description>.ipynb` (e.g., `0.3-bull-visualize-distributions.ipynb`). Step numbers: 0=exploration, 1=cleaning/features, 2=visualizations, 3=modeling, 4=publication. Order number defines execution order within each step.
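
The convention can be checked mechanically. The following is a sketch only, assuming the identifier contains no hyphens (so everything after the second hyphen is the description); `parse_notebook_name` and `STEP_NAMES` are hypothetical names, not part of the template:

```python
import re

# Pattern for <step>.<order>-<identifier>-<description>.ipynb.
# Assumption: the identifier itself contains no hyphens.
NOTEBOOK_NAME = re.compile(
    r"^(?P<step>\d+)\.(?P<order>\d+)-(?P<identifier>[^-]+)-(?P<description>.+)\.ipynb$"
)

# Step numbers as listed in the naming convention above.
STEP_NAMES = {
    0: "exploration",
    1: "cleaning/features",
    2: "visualizations",
    3: "modeling",
    4: "publication",
}


def parse_notebook_name(filename: str) -> dict:
    """Split a notebook filename into its parts, or raise ValueError."""
    match = NOTEBOOK_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unexpected notebook name: {filename!r}")
    step = int(match["step"])
    return dict(
        step=step,
        step_name=STEP_NAMES.get(step, "unknown"),
        order=int(match["order"]),
        identifier=match["identifier"],
        description=match["description"],
    )
```

For example, `parse_notebook_name("0.3-bull-visualize-distributions.ipynb")` yields step `0` (exploration), order `3`, identifier `bull`, and description `visualize-distributions`.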

### Secrets

**NEVER** read `.env` or any secrets files directly. Code should load secrets with `python-dotenv`. Use `{{ cookiecutter.module_name }}/config.py` for project paths and configuration. Never hardcode credentials, print secrets in logs, or add them to source control.

## Development Workflow

* **Python version:** {{ cookiecutter.python_version_number }}
{% if cookiecutter.environment_manager == 'conda' %}
* **Environment:** conda. Activate with `conda activate {{ cookiecutter.repo_name }}`.
{%- elif cookiecutter.environment_manager == 'virtualenv' %}
* **Environment:** virtualenv. Activate with `workon {{ cookiecutter.repo_name }}`.
{%- elif cookiecutter.environment_manager == 'pipenv' %}
* **Environment:** pipenv. Activate with `pipenv shell`.
{%- elif cookiecutter.environment_manager == 'uv' %}
* **Environment:** uv. Activate with `source .venv/bin/activate`.
{%- elif cookiecutter.environment_manager == 'pixi' %}
* **Environment:** pixi. Activate with `pixi shell`.
{%- elif cookiecutter.environment_manager == 'poetry' %}
* **Environment:** poetry. Activate with `$(poetry env activate)` or prefix commands with `poetry run`.
{%- endif %}
{% if cookiecutter.dependency_file == 'requirements.txt' %}
* **Dependencies:** Defined in `requirements.txt`. Install with `make requirements`.
{%- elif cookiecutter.dependency_file == 'pyproject.toml' %}
* **Dependencies:** Defined in `pyproject.toml`. Install with `make requirements`.
{%- elif cookiecutter.dependency_file == 'environment.yml' %}
* **Dependencies:** Defined in `environment.yml`. Install with `make requirements`.
{%- elif cookiecutter.dependency_file == 'Pipfile' %}
* **Dependencies:** Defined in `Pipfile`. Install with `make requirements`.
{%- elif cookiecutter.dependency_file == 'pixi.toml' %}
* **Dependencies:** Defined in `pixi.toml`. Install with `make requirements`.
{%- endif %}
{% if cookiecutter.linting_and_formatting == 'ruff' %}
* **Linting/Formatting:** Uses ruff. Run `make lint` to check, `make format` to fix.
{%- elif cookiecutter.linting_and_formatting == 'flake8+black+isort' %}
* **Linting/Formatting:** Uses flake8, black, and isort. Run `make lint` to check, `make format` to fix.
{%- endif %}
{% if cookiecutter.testing_framework == 'pytest' %}
* **Testing:** Uses pytest. Run `make test`.
{%- elif cookiecutter.testing_framework == 'unittest' %}
* **Testing:** Uses unittest. Run `make test`.
{%- endif %}

Linting and testing should succeed before committing work or at the end of each session. Run `make format` to format code if linting fails.

## Version Control

* Do not push to a remote repository without asking first.
* Write concise commit messages that describe the change and why it was made.{% if cookiecutter.linting_and_formatting != 'none' %}
* Run `make lint` and `make format` before committing changes.{% endif %}{% if cookiecutter.testing_framework != 'none' %}
* Run `make test` before committing changes. Fix any failures — do not skip or disable tests.{% endif %}

## Boundaries

### Always do

* Run all Python code within the project environment. {% if cookiecutter.environment_manager == 'conda' %}Use `conda run -n {{ cookiecutter.repo_name }}` to prefix commands, or activate with `conda activate {{ cookiecutter.repo_name }}` first.{% elif cookiecutter.environment_manager == 'uv' %}Use `uv run` to prefix commands, or activate with `source .venv/bin/activate` first.{% elif cookiecutter.environment_manager == 'pipenv' %}Use `pipenv run` to prefix commands, or activate with `pipenv shell` first.{% elif cookiecutter.environment_manager == 'pixi' %}Use `pixi run` to prefix commands, or activate with `pixi shell` first.{% elif cookiecutter.environment_manager == 'poetry' %}Use `poetry run` to prefix commands.{% elif cookiecutter.environment_manager == 'virtualenv' %}Activate with `workon {{ cookiecutter.repo_name }}` first.{% endif %}
* Update the dependency file when installing new packages
* Refactor reusable notebook code into the `{{ cookiecutter.module_name }}/` package

### Ask first

* Before deleting or overwriting files in `data/processed/`
* Before adding new dependencies to the project
* Before modifying the `Makefile`
* Before creating new notebooks

### Never do

* Delete, edit, or overwrite files in `data/raw/`
* Commit data files, trained models, or `.env` to version control
* Hardcode credentials or print secrets in logs
* Skip or disable lint rules or tests to get around failures
1 change: 1 addition & 0 deletions {{ cookiecutter.repo_name }}/README.md
@@ -12,6 +12,7 @@
├── LICENSE <- Open-source license if one is chosen
├── Makefile <- Makefile with convenience commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── AGENTS.md <- The top-level AGENTS file for AI coding agents.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.