Add public API skeletons and export ml module#1081
Conversation
Reviewer's GuideIntroduce initial public ML API scaffolding by implementing a concrete MAFeatureDataset initializer with validation, adding placeholder methods for ML workflows, expanding tests to cover the new behavior and error cases, and exposing the ml module at the package level. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- MAFeatureDataset currently keeps direct references to inputs like map_features and sample_metadata; if callers might mutate those structures later, consider copying or documenting the expectation of immutability to avoid hard-to-debug side effects.
- The type annotations for many parameters (e.g., map_features, sample_metadata, descriptor_features, descriptor_fields, descriptor_transformers) are very broad (Any); tightening these to concrete interfaces (e.g., numpy.ndarray | scipy.sparse.spmatrix | Sequence[Sequence[float]], Mapping, etc.) would make the new public API easier to understand and use correctly.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- MAFeatureDataset currently keeps direct references to inputs like map_features and sample_metadata; if callers might mutate those structures later, consider copying or documenting the expectation of immutability to avoid hard-to-debug side effects.
- The type annotations for many parameters (e.g., map_features, sample_metadata, descriptor_features, descriptor_fields, descriptor_transformers) are very broad (Any); tightening these to concrete interfaces (e.g., numpy.ndarray | scipy.sparse.spmatrix | Sequence[Sequence[float]], Mapping, etc.) would make the new public API easier to understand and use correctly.
## Individual Comments
### Comment 1
<location path="nimare/ml.py" line_range="93-108" />
<code_context>
+ if len(target) != n_samples:
+ raise ValueError("target length must match number of rows in map_features")
+
+ if feature_names is not None:
+ if len(feature_names) != n_features:
+ raise ValueError(
+ "feature_names length must match number of columns in map_features"
+ )
+
+ self.map_features = map_features
+ self.sample_ids = list(sample_ids)
+ self.study_ids = list(study_ids)
+ self.sample_metadata = sample_metadata
+ self.masker = masker
+ self.descriptor_features = descriptor_features
+ self.target = target
+ self.feature_names = feature_names
+ self.exclusion_report = exclusion_report
+ self.provenance = provenance
</code_context>
<issue_to_address>
**suggestion:** Normalize `feature_names` to a list to match the documented type and avoid surprises with Index/array-like inputs.
The docstring declares `feature_names` as `list[str]`, but the value is stored unchanged. If callers pass an Index/ndarray/other sequence, code that assumes a plain list (e.g., for JSON serialization or mutation) may misbehave.
You can enforce the documented type with a small normalization step:
```python
if feature_names is not None:
feature_names = list(feature_names)
if len(feature_names) != n_features:
raise ValueError(
"feature_names length must match number of columns in map_features"
)
self.feature_names = feature_names
```
This keeps the public contract consistent with the implementation.
```suggestion
if feature_names is not None:
feature_names = list(feature_names)
if len(feature_names) != n_features:
raise ValueError(
"feature_names length must match number of columns in map_features"
)
self.map_features = map_features
self.sample_ids = list(sample_ids)
self.study_ids = list(study_ids)
self.sample_metadata = sample_metadata
self.masker = masker
self.descriptor_features = descriptor_features
self.target = target
self.feature_names = feature_names
self.exclusion_report = exclusion_report
self.provenance = provenance
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| if feature_names is not None: | ||
| if len(feature_names) != n_features: | ||
| raise ValueError( | ||
| "feature_names length must match number of columns in map_features" | ||
| ) | ||
|
|
||
| self.map_features = map_features | ||
| self.sample_ids = list(sample_ids) | ||
| self.study_ids = list(study_ids) | ||
| self.sample_metadata = sample_metadata | ||
| self.masker = masker | ||
| self.descriptor_features = descriptor_features | ||
| self.target = target | ||
| self.feature_names = feature_names | ||
| self.exclusion_report = exclusion_report | ||
| self.provenance = provenance |
There was a problem hiding this comment.
suggestion: Normalize feature_names to a list to match the documented type and avoid surprises with Index/array-like inputs.
The docstring declares feature_names as list[str], but the value is stored unchanged. If callers pass an Index/ndarray/other sequence, code that assumes a plain list (e.g., for JSON serialization or mutation) may misbehave.
You can enforce the documented type with a small normalization step:
if feature_names is not None:
feature_names = list(feature_names)
if len(feature_names) != n_features:
raise ValueError(
"feature_names length must match number of columns in map_features"
)
self.feature_names = feature_namesThis keeps the public contract consistent with the implementation.
| if feature_names is not None: | |
| if len(feature_names) != n_features: | |
| raise ValueError( | |
| "feature_names length must match number of columns in map_features" | |
| ) | |
| self.map_features = map_features | |
| self.sample_ids = list(sample_ids) | |
| self.study_ids = list(study_ids) | |
| self.sample_metadata = sample_metadata | |
| self.masker = masker | |
| self.descriptor_features = descriptor_features | |
| self.target = target | |
| self.feature_names = feature_names | |
| self.exclusion_report = exclusion_report | |
| self.provenance = provenance | |
| if feature_names is not None: | |
| feature_names = list(feature_names) | |
| if len(feature_names) != n_features: | |
| raise ValueError( | |
| "feature_names length must match number of columns in map_features" | |
| ) | |
| self.map_features = map_features | |
| self.sample_ids = list(sample_ids) | |
| self.study_ids = list(study_ids) | |
| self.sample_metadata = sample_metadata | |
| self.masker = masker | |
| self.descriptor_features = descriptor_features | |
| self.target = target | |
| self.feature_names = feature_names | |
| self.exclusion_report = exclusion_report | |
| self.provenance = provenance |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 001-ma-feature-dataset #1081 +/- ##
=========================================================
Coverage ? 85.58%
=========================================================
Files ? 57
Lines ? 11316
Branches ? 0
=========================================================
Hits ? 9685
Misses ? 1631
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9976aeb to
6d66ec2
Compare
…internel helper placeholders
f8849e2 to
110cabb
Compare
|
hi @jdkent, I've committed once again to meet the requirements of the new specs. I have a question on task T009 and I've asked you on slack. Please let me now if there's anything to change! |
Closes #1016 #1017 (T006 T007 T008 T009).
Changes proposed in this pull request:
Summary by Sourcery
Introduce initial public machine learning API scaffolding and export the ml module at the package level.
New Features:
Enhancements:
Tests: