Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
137 changes: 137 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# scikit-matter Development Guide

## Overview
scikit-matter is a scikit-learn-compatible toolbox of methods from computational chemistry and materials science. All estimators follow sklearn API conventions (fit/transform/predict) and inherit from sklearn base classes.

## Architecture

### Core Selection Framework
The codebase centers on a unique **dual-selection architecture** split across `feature_selection/` and `sample_selection/` with shared implementation in `_selection.py`:

- **`_selection.py`**: Contains base classes (`GreedySelector`, `_CUR`, `_FPS`, `_PCovCUR`, `_PCovFPS`) that implement core algorithms independent of axis
- **`feature_selection/_base.py`**: Thin wrappers that instantiate base classes with `selection_type="feature"` and inherit from `SelectorMixin` (enables `transform()`)
- **`sample_selection/_base.py`**: Thin wrappers with `selection_type="sample"` - return indices via `selected_idx_` attribute (no `transform()`)

Example: `FPS` (Farthest Point Sampling) exists as both `feature_selection.FPS` and `sample_selection.FPS`, sharing the same `_FPS` implementation but differing only in which axis they select along.

### Module Organization
- **`decomposition/`**: PCovR (Principal Covariates Regression) and variants - supervised dimensionality reduction combining PCA-like and regression objectives
- **`linear_model/`**: `OrthogonalRegression`, `Ridge2FoldCV` (custom 2-fold CV for efficiency)
- **`metrics/`**: Reconstruction measures (GRE, GRD, LRE) and prediction rigidities (LPR, CPR)
- **`preprocessing/`**: `StandardFlexibleScaler`, `SparseKernelCenterer` with column-wise scaling options
- **`utils/`**: Orthogonalizers, PCovR utilities, progress bar helpers
- **`datasets/`**: Chemistry/materials datasets (CSD-1000r, CH4 manifolds, etc.)

## Development Workflows

Use 88 character line length limit for code and docstrings.

### Testing
```bash
# Run all tests with coverage
tox -e tests

# Run specific test file
tox -e tests -- tests/test_feature_simple_cur.py

# Run tests against sklearn dev version
tox -e tests-dev
```

Tests use pytest-style assertions and fixtures. Common patterns:
- Use `@pytest.fixture` for test data setup
- Use `assert` statements instead of `self.assertEqual()`
- Use `pytest.raises()` for exception testing always `match=` parameter. If match is too long that the `with` statement exceeds line length, define `match` variable before.
- Use `pytest.warns()` for warning testing
- Use `pytest.mark.parametrize` for parameterized tests
- Tests often load datasets via `skmatter.datasets.load_*()`

**Exception Testing Style:**
- Keep `with pytest.raises(...)` statement on one line (88 char limit)
- For long match strings, define a `match` variable before the with statement:
```python
match = "Long error message that would exceed line length limit"
with pytest.raises(ValueError, match=match):
some_function()
```
- Use `re.escape()` when matching messages with special regex characters:
```python
import re
match = f"Found array with shape={X.shape} ..."
with pytest.raises(ValueError, match=re.escape(match)):
selector.fit(X)
```

### Linting & Formatting
```bash
# Check only (CI mode)
tox -e lint

# Auto-format code
tox -e format

# More aggressive fixes (review changes carefully)
tox -e format-unsafe
```

Uses `ruff` for both formatting and linting. Configuration in `pyproject.toml` ignores F401 (unused imports in `__init__.py`).

### Building Docs
```bash
tox -e docs # Builds HTML docs, runs examples via sphinx-gallery
```

Documentation uses Sphinx with `.rst` format. Examples in `examples/` are executed during doc builds.

### Building Package
```bash
tox -e build # Builds wheel and sdist, runs check-manifest and twine check
```

Uses `setuptools_scm` for versioning from git tags. Version file auto-generated at `src/skmatter/_version.py`.

## Key Conventions

### scikit-learn Compliance
- All estimators inherit from appropriate sklearn mixins (`RegressorMixin`, `TransformerMixin`, `SelectorMixin`)
- Use `validate_data()` (not deprecated `check_X_y()`) for input validation
- Implement `fit()` returning `self`, store fitted attributes with trailing underscore (`selected_idx_`, `n_selected_`)
- Support `warm_start` parameter in selectors to continue from previous fit

### Selection Methods Patterns
```python
# Feature selection (returns transformed X)
from skmatter.feature_selection import CUR, FPS, PCovCUR, PCovFPS
selector = CUR(n_to_select=10, progress_bar=True)
X_reduced = selector.fit(X).transform(X)

# Sample selection (returns indices)
from skmatter.sample_selection import CUR
selector = CUR(n_to_select=10)
selector.fit(X)
X_subset = X[selector.selected_idx_]
```

### PCovR Methods
Always center and scale inputs (`StandardFlexibleScaler`) before using PCovR/PCovC - results change drastically near α→0 or α→1 otherwise. Use `column_wise=True` when features are comparable.

### Progress Bars
Optional `tqdm` progress bar via `progress_bar=True` parameter. Implementation uses utility functions `get_progress_bar()` / `no_progress_bar()` from `utils/`.

## Dependencies & Python Support
- **Python**: 3.11+ (as of v0.3.3)
- **Core**: scikit-learn 1.8.x, scipy ≥1.15
- **Optional**: matplotlib, pandas, tqdm (for examples)
- **Testing**: Requires Python 3.11 and 3.14 on Ubuntu, macOS, Windows

## Pull Request Requirements
- Update tests for new features/bugfixes
- Update documentation for new features
- Reference issue numbers in PR description
- Reviewer updates CHANGELOG for important changes (not contributor)

## Common Pitfalls
- Don't use deprecated sklearn APIs - check sklearn version in `pyproject.toml`
- Selection methods: feature vs sample selection use same algorithms but different interfaces
- PCovR requires pre-scaled data - document this in examples
- Test files use pytest - use fixtures, `assert`, `pytest.raises()`, not unittest classes
32 changes: 17 additions & 15 deletions docs/src/contributing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ supported tox environments please use
Running the tests
-----------------

The testsuite is implemented using Python's `unittest`_ framework and should be set-up
The testsuite is implemented using `pytest`_ framework and should be set-up
and run in an isolated virtual environment with `tox`_. All tests can be run with

.. code-block:: bash
Expand All @@ -51,15 +51,14 @@ If you wish to test only specific functionalities, for example:
tox -e tests # unit tests
tox -e examples # test the examples


You can also use ``tox -e format`` to use tox to do actual formatting instead of just
testing it. Also, you may want to setup your editor to automatically apply the `black
<https://black.readthedocs.io/en/stable/>`_ code formatter when saving your files, there
are plugins to do this with `all major editors
<https://black.readthedocs.io/en/stable/editor_integration.html>`_.

.. _unittest: https://docs.python.org/3/library/unittest.html
.. _tox: https://tox.readthedocs.io/en/latest
.. _pytest: https://pytest.org
.. _tox: https://tox.readthedocs.io

Contributing to the documentation
---------------------------------
Expand Down Expand Up @@ -195,20 +194,23 @@ properly. It should look something like this:

.. code-block:: python

class MyDatasetTests(unittest.TestCase):
@classmethod
def setUpClass(cls):
cls.my_data = load_my_data()
import pytest
from skmatter.datasets import load_my_dataset


@pytest.fixture
def my_data():
"""Load my dataset for testing."""
return load_my_dataset()


def test_load_my_data(self):
# test if representations and properties have commensurate shape
self.assertTrue(
self.my_data.data.X.shape[0] == self.my_data.data.y.shape[0]
)
def test_load_my_data(my_data):
# Test if representations and properties have commensurate shape
assert my_data.data.X.shape[0] == my_data.data.y.shape[0]

def test_load_my_data_descr(self):
self.my_data.DESCR

def test_load_my_data_descr(my_data):
assert my_data.DESCR

You're good to go! Time to submit a `pull request.
<https://github.com/lab-cosmo/scikit-matter/pulls>`_
Expand Down
3 changes: 3 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -122,3 +122,6 @@ convention = "numpy"
"D205",
"D400",
]
"tests/**" = [
"D103",
]
122 changes: 68 additions & 54 deletions tests/test_clustering.py
Original file line number Diff line number Diff line change
@@ -1,59 +1,73 @@
import unittest

import numpy as np
import pytest

from skmatter.clustering import QuickShift


class QuickShiftTests(unittest.TestCase):
@classmethod
def setUpClass(cls) -> None:
cls.points = np.array(
[
[-1.72779275, -1.32763554],
[-4.44991964, -2.13474901],
[0.54817734, -2.43319467],
[3.19881307, -0.49547222],
[-1.1335991, 2.33478428],
[0.55437388, 0.18745963],
]
)
cls.cuts = np.array(
[6.99485011, 8.80292681, 7.68486852, 9.5115009, 8.07736919, 6.22057056]
)
cls.weights = np.array(
[
-3.94008092,
-12.68095664,
-7.07512499,
-9.03064023,
-8.26529849,
-2.61132267,
]
)
cls.qs_labels_ = np.array([0, 0, 0, 5, 5, 5])
cls.qs_cluster_centers_idx_ = np.array([0, 5])
cls.gabriel_labels_ = np.array([5, 5, 5, 5, 5, 5])
cls.gabriel_cluster_centers_idx_ = np.array([5])
cls.cell = [3, 3]
cls.gabriel_shell = 2

def test_fit_qs(self):
model = QuickShift(dist_cutoff_sq=self.cuts)
model.fit(self.points, samples_weight=self.weights)
self.assertTrue(np.all(model.labels_ == self.qs_labels_))
self.assertTrue(
np.all(model.cluster_centers_idx_ == self.qs_cluster_centers_idx_)
)

def test_fit_garbriel(self):
model = QuickShift(gabriel_shell=self.gabriel_shell)
model.fit(self.points, samples_weight=self.weights)
self.assertTrue(np.all(model.labels_ == self.gabriel_labels_))
self.assertTrue(
np.all(model.cluster_centers_idx_ == self.gabriel_cluster_centers_idx_)
)

def test_dimension_check(self):
model = QuickShift(self.cuts, metric_params={"cell_length": self.cell})
self.assertRaises(ValueError, model.fit, np.array([[2]]))
@pytest.fixture(scope="module")
def test_data():
points = np.array(
[
[-1.72779275, -1.32763554],
[-4.44991964, -2.13474901],
[0.54817734, -2.43319467],
[3.19881307, -0.49547222],
[-1.1335991, 2.33478428],
[0.55437388, 0.18745963],
]
)
cuts = np.array(
[6.99485011, 8.80292681, 7.68486852, 9.5115009, 8.07736919, 6.22057056]
)
weights = np.array(
[
-3.94008092,
-12.68095664,
-7.07512499,
-9.03064023,
-8.26529849,
-2.61132267,
]
)
qs_labels_ = np.array([0, 0, 0, 5, 5, 5])
qs_cluster_centers_idx_ = np.array([0, 5])
gabriel_labels_ = np.array([5, 5, 5, 5, 5, 5])
gabriel_cluster_centers_idx_ = np.array([5])
cell = [3, 3]
gabriel_shell = 2

return {
"points": points,
"cuts": cuts,
"weights": weights,
"qs_labels_": qs_labels_,
"qs_cluster_centers_idx_": qs_cluster_centers_idx_,
"gabriel_labels_": gabriel_labels_,
"gabriel_cluster_centers_idx_": gabriel_cluster_centers_idx_,
"cell": cell,
"gabriel_shell": gabriel_shell,
}


def test_fit_qs(test_data):
model = QuickShift(dist_cutoff_sq=test_data["cuts"])
model.fit(test_data["points"], samples_weight=test_data["weights"])
assert np.all(model.labels_ == test_data["qs_labels_"])
assert np.all(model.cluster_centers_idx_ == test_data["qs_cluster_centers_idx_"])


def test_fit_garbriel(test_data):
model = QuickShift(gabriel_shell=test_data["gabriel_shell"])
model.fit(test_data["points"], samples_weight=test_data["weights"])
assert np.all(model.labels_ == test_data["gabriel_labels_"])
assert np.all(
model.cluster_centers_idx_ == test_data["gabriel_cluster_centers_idx_"]
)


def test_dimension_check(test_data):
model = QuickShift(
test_data["cuts"], metric_params={"cell_length": test_data["cell"]}
)
with pytest.raises(ValueError, match="Dimension.*does not match"):
model.fit(np.array([[2]]))
Loading