Commit 2ed4d65

Copilot and EZoni authored
Add AGENTS.md for coding agent onboarding (#390)
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: EZoni <59625522+EZoni@users.noreply.github.com> Co-authored-by: Edoardo Zoni <ezoni@lbl.gov>
1 parent 2ecdda9 commit 2ed4d65

1 file changed

Lines changed: 136 additions & 0 deletions

AGENTS.md

@@ -0,0 +1,136 @@
# Coding Agent Instructions for Synapse

## Project Overview

Synapse (**Synergistic Software Platform for AI, Physics Simulations, and Experiments**) is a modular framework for building digital twin components at Lawrence Berkeley National Laboratory. It couples experimental data, simulations, and ML models trained on combined data. The platform targets NERSC infrastructure (Spin for cloud services, Superfacility API for HPC on Perlmutter).

## Repository Structure

```
synapse/
├── dashboard/ # Trame-based web GUI application
│ ├── app.py # Main entry point (Trame web app)
│ ├── *_manager.py # Feature managers (model, parameters, outputs, calibration, optimization, state, sfapi, error)
│ ├── utils.py # Shared utilities (DB access, plotting, config)
│ ├── environment.yml # Conda dependencies for GUI
│ └── environment-lock.yml
├── ml/ # ML training module
│ ├── train_model.py # Main training script (GP, NN, ensemble)
│ ├── Neural_Net_Classes.py # PyTorch neural network classes
│ ├── training_pm.sbatch # SLURM batch script for Perlmutter
│ ├── environment.yml # Conda dependencies for ML
│ └── environment-lock.yml
├── experiments/ # Experiment configs (cloned from private repos)
├── tests/ # Integration tests (ML pipeline)
│ ├── test_ml_pipeline.py # Full ML training pipeline test
│ └── check_model.py # Model checking utility
├── dashboard.Dockerfile # Docker image for the GUI
├── ml.Dockerfile # Docker image for ML training (CUDA 12.4)
├── publish_container.py # Script to build & push Docker containers to NERSC registry
├── .pre-commit-config.yaml # Ruff linter/formatter hooks
└── .github/workflows/codeql.yml # CodeQL security scanning
```

## Language and Dependencies

- **Language**: Python 3.12 (managed via Conda)
- **Dashboard dependencies**: trame (web framework), plotly, pymongo, botorch, pytorch, lume-model, sfapi_client, mlflow
- **ML dependencies**: pytorch (CUDA 12.4), gpytorch, botorch, lume-model, mlflow, pymongo, scikit-learn
- **Environment management**: Conda with `conda-lock` for reproducible environments. Each component (`dashboard/`, `ml/`) has its own `environment.yml` and `environment-lock.yml`.

## Linting and Formatting

This project uses **Ruff** for linting and formatting, configured via `.pre-commit-config.yaml`. There is no `ruff.toml` or `pyproject.toml`, so Ruff runs with default rules.

```bash
# Run the linter (with auto-fix)
ruff check --fix .

# Run the formatter
ruff format .

# Run both via pre-commit (if installed)
pre-commit run --all-files
```

Always run `ruff check` and `ruff format` before committing changes.

## Building

There is no traditional build step (no `setup.py`, `pyproject.toml`, or `Makefile`). The project runs directly as Python scripts within Conda environments and is containerized via Docker for deployment.

### Docker builds (from repository root)

```bash
# Build the dashboard container
docker build --platform linux/amd64 --output type=image,oci-mediatypes=true -t synapse-gui -f dashboard.Dockerfile .

# Build the ML training container
docker build --platform linux/amd64 --output type=image,oci-mediatypes=true -t synapse-ml -f ml.Dockerfile .

# Automated build and publish (interactive)
python publish_container.py --gui --ml
```

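
The two `docker build` invocations above differ only in the image tag and the Dockerfile. As a minimal sketch of that pattern, the argument list could be assembled as shown here. The helper name `docker_build_cmd` and the `COMPONENTS` mapping are hypothetical, introduced only for illustration; this is not a description of how `publish_container.py` works.

```python
import shlex

# Image tag -> Dockerfile, copied from the build commands above.
COMPONENTS = {
    "synapse-gui": "dashboard.Dockerfile",
    "synapse-ml": "ml.Dockerfile",
}


def docker_build_cmd(tag: str) -> list[str]:
    """Return the `docker build` argument list for one component (hypothetical helper)."""
    dockerfile = COMPONENTS[tag]  # raises KeyError for an unknown tag
    return [
        "docker", "build",
        "--platform", "linux/amd64",
        "--output", "type=image,oci-mediatypes=true",
        "-t", tag,
        "-f", dockerfile,
        ".",  # build context: the repository root
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for tag in COMPONENTS:
        print(shlex.join(docker_build_cmd(tag)))
```

Because the build context is `.`, the printed commands are only meaningful when run from the repository root.
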
## Testing

There is no pytest/unittest framework configured, but `tests/test_ml_pipeline.py` tests the full ML training pipeline (training → upload to MLflow → download from MLflow → check accuracy). It requires a local MLflow server:

```bash
# Start a local MLflow server
docker run -p 127.0.0.1:5000:5000 ghcr.io/mlflow/mlflow mlflow server --host 0.0.0.0

# Run the test from the repository root
python tests/test_ml_pipeline.py

# Optionally restrict to a specific model type or config
python tests/test_ml_pipeline.py --model NN --config_file experiments/synapse-bella-ip2
```

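
Since the pipeline test depends on a server at `127.0.0.1:5000`, it can help to check that something is listening there before starting a long run. The following is a minimal sketch using only the standard library; the function name `is_port_open` is hypothetical, and the check only confirms that the port accepts TCP connections, not that the listener is actually MLflow.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 5000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on 127.0.0.1:5000")
    else:
        print("Nothing is listening on 127.0.0.1:5000; start the MLflow server first")
```
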
Dashboard validation is done manually by running the application.

## CI/CD

The only CI workflow is **CodeQL Advanced** (`.github/workflows/codeql.yml`), which runs security scanning on Python code for pushes and PRs to `main`.

## Key Architecture Patterns

### Dashboard (Trame GUI)

- Built on [Trame](https://kitware.github.io/trame/), a Python framework for interactive web applications.
- Uses the **manager pattern**: each feature area has a dedicated `*_manager.py` class that handles its UI components and business logic.
- `state_manager.py` manages the global Trame server, state, and controller.
- Data flows through MongoDB (PyMongo) for experiment and simulation data.
- Data flows through MLflow for ML models.
- NERSC Superfacility API integration is in `sfapi_manager.py`.

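
To make the manager pattern concrete, here is a minimal, framework-free sketch. It deliberately does not use the Trame API: `ExampleManager`, its state keys, and the `SimpleNamespace` stand-in for the shared state are all hypothetical, and the real managers in `dashboard/` may be organized differently.

```python
from types import SimpleNamespace


class ExampleManager:
    """Hypothetical feature manager: owns one feature's state keys and logic."""

    def __init__(self, state):
        # `state` stands in for the shared application state object.
        self._state = state
        self._state.example_value = 0.0
        self._state.example_status = "idle"

    def update(self, value: float) -> None:
        """Feature logic: validate the input, then record it in shared state."""
        if value < 0:
            self._state.example_status = "error: negative value"
            return
        self._state.example_value = value
        self._state.example_status = "ok"

    def summary(self) -> str:
        """Text that a UI component could display for this feature."""
        return f"{self._state.example_status}: {self._state.example_value}"


if __name__ == "__main__":
    shared_state = SimpleNamespace()
    manager = ExampleManager(shared_state)
    manager.update(2.5)
    print(manager.summary())
```

The point of the pattern is that each manager touches only its own keys in the shared state, so features can be added or changed without editing the others.
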
### ML Training

- `train_model.py` supports three model types: Gaussian Process (GP), Neural Network (NN), and Ensemble.
- Uses PyTorch, BoTorch, and GPyTorch for model training.
- CUDA is auto-detected for GPU acceleration.
- Models are serialized and stored in an MLflow tracking server.

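
CUDA auto-detection in PyTorch code usually comes down to a single call to `torch.cuda.is_available()`. The sketch below illustrates that idiom; the function name `pick_device` is hypothetical, and this is an assumption about the general approach rather than a copy of what `train_model.py` does.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Training would run on: {pick_device()}")
```

The returned string can be passed to `torch.device(...)` or to `.to(...)` on tensors and modules.
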
### Data Storage

- **MongoDB** is used for persistent data from experiments and simulations.
- **MLflow** is used for persistent data from ML models.
- Database access requires SSH tunneling to NERSC when running locally.
- Environment variables: `SF_DB_HOST` (dashboard), `SF_DB_READONLY_PASSWORD` (dashboard and ML training), `AM_SC_API_KEY` (dashboard and ML training, required when MLflow tracking_uri is AmSC).

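
A quick way to catch a missing variable before a run fails midway is to check the environment up front. The sketch below encodes the variable list above; the helper `missing_env_vars`, the `REQUIRED` mapping, and the `use_amsc` flag are hypothetical names introduced for illustration.

```python
import os
from collections.abc import Mapping

# Variable names taken from the list above, grouped by component.
REQUIRED = {
    "dashboard": ["SF_DB_HOST", "SF_DB_READONLY_PASSWORD"],
    "ml": ["SF_DB_READONLY_PASSWORD"],
}


def missing_env_vars(
    component: str,
    env: Mapping[str, str] = os.environ,
    use_amsc: bool = False,
) -> list[str]:
    """Return the required variables that are unset or empty for `component`."""
    required = list(REQUIRED[component])
    if use_amsc:
        # Only needed when the MLflow tracking URI points at AmSC.
        required.append("AM_SC_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    for component in REQUIRED:
        missing = missing_env_vars(component)
        print(f"{component}: missing {missing}" if missing else f"{component}: ok")
```
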
## Common Pitfalls and Workarounds

1. **No `pyproject.toml` or `ruff.toml`**: Ruff uses default rules. Do not create these files unless the project explicitly adopts them.
2. **Conda, not pip**: Dependencies are managed via `conda` and `conda-lock`, not `pip`. Do not add a `requirements.txt` or create a `pyproject.toml` for dependencies. Update `environment.yml` in the relevant component directory and regenerate the lock file.
3. **Separate environments**: The dashboard and ML components have independent Conda environments (`synapse-gui` and `synapse-ml`). Changes to dependencies must be made in the correct `environment.yml`.
4. **Docker builds from root**: Dockerfiles reference paths relative to the repository root. Always run `docker build` from the repository root directory.
5. **Limited test infrastructure**: There is no pytest/unittest framework, but `tests/test_ml_pipeline.py` can validate ML changes end-to-end (requires a local MLflow server). Always run the linter (`ruff check .`) and verify logic through code review.
6. **Experiment configs are external**: The `experiments/` directory contains cloned private repositories. These are not checked into this repository (excluded via `.gitignore`).
7. **NERSC-specific infrastructure**: Much of the deployment depends on NERSC services (Spin, Superfacility API, Perlmutter). Code changes affecting deployment or data access should be tested against NERSC services when possible.

## Making Changes

- **Python code**: Edit files directly in `dashboard/` or `ml/`. Run `ruff check --fix .` and `ruff format .` after changes.
- **Dependencies**: Edit the appropriate `environment.yml` file. Regenerate the lock file with `conda-lock`.
- **Docker**: Modify `dashboard.Dockerfile` or `ml.Dockerfile`. Rebuild with the commands above.
- **New features**: Follow the manager pattern for dashboard features: create a new `*_manager.py` file and integrate it with `app.py` and `state_manager.py`.
