clustering feature in napari deeplabcut #38
sabrinabenas wants to merge 15 commits into DeepLabCut:main from
Conversation
@deruyter92 Isn't this accessible from the main DLC package by opening the napari plugin from within?
Pull request overview
Adds a clustering-based workflow to help identify and refine outlier frames in the napari DeepLabCut plugin, as described in the linked recipe.
Changes:
- Introduces a new clustering implementation (kmeans.py) and wiring in the main widget to create a "cluster" points layer and preview frames/keypoints.
- Adds a new path-parsing helper (find_project_name) used to locate project assets from layer metadata.
- Updates README with a new suggested workflow for detecting outliers.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| src/napari_deeplabcut/napari.yaml | Adds a new (non-standard) manifest section intended for clustering integration. |
| src/napari_deeplabcut/misc.py | Adds find_project_name helper for DLC project path identification. |
| src/napari_deeplabcut/kmeans.py | New clustering logic (PCA + DBSCAN) for pose/keypoint distance features. |
| src/napari_deeplabcut/_widgets.py | Adds clustering UI/buttons, threading helper, and frame preview via matplotlib canvas. |
| src/napari_deeplabcut/_reader.py | Minor edits around stacking behavior (currently left as commented code). |
| README.md | Documents the new clustering/outlier refinement workflow. |
| src/napari_deeplabcut/_writer.py | Minor formatting/line-number correction only. |
| src/napari_deeplabcut/__init__.py | Whitespace-only change. |
Comments suppressed due to low confidence (1)
src/napari_deeplabcut/_widgets.py:26
- The import section now contains multiple duplicate/conflicting imports (e.g., defaultdict, partial, numpy, pandas, MethodType, and the typing imports are repeated; FigureCanvas is imported from both backend_qt5agg and backend_qtagg). This increases import time and makes it ambiguous which FigureCanvas is actually used. Please deduplicate and keep a single, consistent set of imports/backends.
import os
from collections import defaultdict
from functools import partial
import numpy as np
import pandas as pd
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.figure import Figure
from types import MethodType
from typing import Optional, Sequence, Union
from napari.layers import Image, Points
from collections import defaultdict, namedtuple
from copy import deepcopy
from datetime import datetime
from functools import partial, cached_property
from math import ceil, log10
import matplotlib.pyplot as plt
import matplotlib.style as mplstyle
import napari
import pandas as pd
from pathlib import Path
from types import MethodType
from typing import Optional, Sequence, Union
from matplotlib.backends.backend_qtagg import FigureCanvas, NavigationToolbar2QT
import numpy as np
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
This module introduces hard dependencies on scipy (pdist) and scikit-learn (DBSCAN, PCA), but the project’s declared install_requires doesn’t include them. Without adding these to the package dependencies, the plugin will fail to import in a clean environment. Please add the dependencies (or guard the imports and provide a clear error) and document the requirement.
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from napari_deeplabcut._writer import _conv_layer_to_df
from napari_deeplabcut.misc import DLCHeader


def _cluster(data):
    pca = PCA(n_components=2)
    principalComponents = pca.fit_transform(data)

    # putting components in a dataframe for later
    PCA_components = pd.DataFrame(principalComponents)

    dbscan = DBSCAN(eps=9.7, min_samples=20, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)

    # fit - perform DBSCAN clustering from features, or distance matrix.
    dbscan = dbscan.fit(PCA_components)
    cluster1 = dbscan.labels_
The file/module is named kmeans.py but the implementation uses DBSCAN (density-based clustering) rather than k-means. This mismatch is confusing for maintenance and discoverability. Consider renaming the module/functions to reflect DBSCAN (or implement actual k-means if that’s the intended algorithm).
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from napari_deeplabcut._writer import _conv_layer_to_df
from napari_deeplabcut.misc import DLCHeader


def _cluster(data):
    pca = PCA(n_components=2)
    principalComponents = pca.fit_transform(data)
    # putting components in a dataframe for later
    PCA_components = pd.DataFrame(principalComponents)
    dbscan = DBSCAN(eps=9.7, min_samples=20, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
    # fit - perform DBSCAN clustering from features, or distance matrix.
    dbscan = dbscan.fit(PCA_components)
    cluster1 = dbscan.labels_

Suggested change:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from napari_deeplabcut._writer import _conv_layer_to_df
from napari_deeplabcut.misc import DLCHeader


def _cluster(data, n_clusters: int = 8):
    pca = PCA(n_components=2)
    principalComponents = pca.fit_transform(data)
    # putting components in a dataframe for later
    PCA_components = pd.DataFrame(principalComponents)
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    # fit - perform k-means clustering from features.
    kmeans = kmeans.fit(PCA_components)
    cluster1 = kmeans.labels_
def cluster_data(points_layer):
    df = _conv_layer_to_df(
        points_layer.data, points_layer.metadata, points_layer.properties
    )
    try:
        df = df.drop('single', axis=1, level='individuals')
    except KeyError:
        pass
    df.dropna(inplace=True)
    header = DLCHeader(df.columns)
    try:
        df = df.stack('individuals').droplevel('individuals')
    except KeyError:
        pass
    df.index = ['/'.join(row) for row in df.index]
    xy = df.to_numpy().reshape((-1, len(header.bodyparts), 2))
    # TODO Normalize dists by longest length?
    dists = np.vstack([pdist(data, "euclidean") for data in xy])
    points = np.c_[_cluster(dists)]  # x, y, label
    return points, list(df.index)
cluster_data introduces non-trivial data reshaping and clustering logic but currently has no tests. Please add unit tests for expected shapes/labels (including noise label -1 from DBSCAN) using a small synthetic Points-layer-like input, similar to existing pytest coverage in src/napari_deeplabcut/_tests.
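A shape test for the reshaping step could look like the sketch below. It uses a NumPy-only stand-in for the per-frame `pdist` call so the test stays dependency-light; `_pairwise_dists` is a hypothetical helper, not code from this PR:

```python
import numpy as np


def _pairwise_dists(xy):
    """NumPy-only stand-in for the per-frame pdist step in cluster_data.

    xy has shape (n_frames, n_bodyparts, 2); the result has shape
    (n_frames, n_bodyparts * (n_bodyparts - 1) // 2), i.e. one condensed
    distance vector per frame, in the same pair order as scipy's pdist.
    """
    iu, ju = np.triu_indices(xy.shape[1], k=1)
    return np.linalg.norm(xy[:, iu, :] - xy[:, ju, :], axis=-1)


# Synthetic input: 5 frames, 3 bodyparts -> 3 condensed distances per frame.
xy = np.arange(5 * 3 * 2, dtype=float).reshape(5, 3, 2)
dists = _pairwise_dists(xy)
```

A real test would additionally feed `dists` through the clustering step and assert that the returned labels include DBSCAN's noise label `-1` for deliberately injected outlier frames.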
def find_project_name(s):
    pat = re.compile('.+-.+-\d{4}-\d{1,2}-\d{1,2}')
    for part in Path(s).parts[::-1]:
        if pat.search(part):
            return part
New helper find_project_name is used by the clustering workflow but has no test coverage. Given the path parsing/regex sensitivity across OSes, please add unit tests (e.g., POSIX + Windows style paths, and a case where no match is found) alongside existing test_misc.py coverage.
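A self-contained test sketch, re-implementing the helper with a `PurePath` parameter so both path flavors can be exercised on any OS (the `_path_cls` parameter is an addition for testability, and a raw string avoids the invalid-escape warning in the original pattern):

```python
import re
from pathlib import PurePosixPath, PureWindowsPath


def find_project_name(s, _path_cls=PurePosixPath):
    # Re-implementation of the helper for testing; DLC project directories
    # look like "<task>-<scorer>-YYYY-M-D".
    pat = re.compile(r'.+-.+-\d{4}-\d{1,2}-\d{1,2}')
    for part in _path_cls(s).parts[::-1]:
        if pat.search(part):
            return part
    return None


# POSIX-style path
assert find_project_name('/home/u/proj-me-2023-01-02/videos/a.mp4') == 'proj-me-2023-01-02'
# Windows-style path
assert find_project_name(r'C:\data\proj-me-2023-1-2\labeled-data', PureWindowsPath) == 'proj-me-2023-1-2'
# No match found
assert find_project_name('/tmp/nothing/here') is None
```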
self.viewer.layers[0].visible = True
self.viewer.layers[1].visible = False
self.viewer.dims.set_current_step(0, self.step)
self.viewer.add_image(self._im.get_array(), name='image refine label')
self.viewer.layers.move_selected(0, 2)

def on_click_close_img(self):
    self.viewer.layers.remove('image refine label')
    self.viewer.layers.move_selected(0, 1)
    self.viewer.layers[0].visible = False
    self.viewer.layers[1].visible = True
These handlers assume specific layer ordering (layers[0] is the source Points layer and layers[1] is the cluster layer) and use layers.remove('image refine label'), which likely won’t work because LayersList.remove typically expects a Layer instance (not a name). This is brittle if users have other layers open, or if the new image layer isn’t present. Please store explicit references to the created layers (cluster layer + refine image layer) and show/hide/remove them via those references (or by name indexing like del viewer.layers[name]).
self.viewer.layers[0].visible = True
self.viewer.layers[1].visible = False
self.viewer.dims.set_current_step(0, self.step)
self.viewer.add_image(self._im.get_array(), name='image refine label')
self.viewer.layers.move_selected(0, 2)

def on_click_close_img(self):
    self.viewer.layers.remove('image refine label')
    self.viewer.layers.move_selected(0, 1)
    self.viewer.layers[0].visible = False
    self.viewer.layers[1].visible = True

Suggested change:

# Lazily determine and cache source and cluster layers to avoid relying
# on fixed layer positions in the viewer.
source_layer = getattr(self, "_source_points_layer", None)
cluster_layer = getattr(self, "_cluster_points_layer", None)
if source_layer is None or cluster_layer is None:
    layers_list = list(self.viewer.layers)
    if len(layers_list) < 2:
        # Not enough layers to determine source/cluster; abort safely.
        return
    source_layer = layers_list[0]
    cluster_layer = layers_list[1]
    self._source_points_layer = source_layer
    self._cluster_points_layer = cluster_layer
source_layer.visible = True
cluster_layer.visible = False
self.viewer.dims.set_current_step(0, self.step)
# Store a reference to the refine image layer so it can be removed safely.
refine_layer = self.viewer.add_image(
    self._im.get_array(), name='image refine label'
)
self._refine_image_layer = refine_layer
self.viewer.layers.move_selected(0, 2)

def on_click_close_img(self):
    # Safely remove the refine image layer if it exists.
    refine_layer = getattr(self, "_refine_image_layer", None)
    if refine_layer is not None and refine_layer in self.viewer.layers:
        self.viewer.layers.remove(refine_layer)
    self._refine_image_layer = None
    self.viewer.layers.move_selected(0, 1)
    source_layer = getattr(self, "_source_points_layer", None)
    cluster_layer = getattr(self, "_cluster_points_layer", None)
    if source_layer is not None:
        source_layer.visible = False
    if cluster_layer is not None:
        cluster_layer.visible = True
self.viewer.layers[0].visible = False

self._df = pd.read_hdf(self.viewer.layers[0].source.path)
self._df.index = ['/'.join(row) for row in list(self._df.index)]
The frame-index normalization uses '/'.join(...) on self._df.index. This is not OS-agnostic (Windows paths will use \\ from os.path.join in read_hdf) and will also behave incorrectly if the index values are already strings (it will join characters). Prefer using the existing to_os_dir_sep() helper (or consistently use os.path.join/Path) and only join when the index value is a tuple/MultiIndex entry.
self._df.index = ['/'.join(row) for row in list(self._df.index)]

Suggested change:

# Normalize frame index to OS-appropriate path strings.
self._df.index = [
    to_os_dir_sep(os.path.join(*idx)) if isinstance(idx, tuple) else to_os_dir_sep(idx)
    for idx in self._df.index
]
df.index = ['/'.join(row) for row in df.index]
xy = df.to_numpy().reshape((-1, len(header.bodyparts), 2))
df.index = ['/'.join(row) for row in df.index] assumes each index entry is an iterable of path parts and forces POSIX separators. If the index entries are already strings (common after read_hdf) this will join characters, and on Windows it won’t match the os.path.join paths used elsewhere. Use to_os_dir_sep() / Path normalization and only join when dealing with tuples/MultiIndex entries.
# FIXME Is the following necessary?
if any(s in str(layer) for s in ('cluster', 'refine')):
Filtering inserted layers via any(s in str(layer) ...) is unreliable: str(layer) isn’t a stable API and may match unrelated layers, causing metadata propagation and store setup to be skipped unexpectedly. If this guard is needed, it should check explicit layer attributes (e.g., layer.name against exact names) or use a dedicated flag on layers created by this widget.
# FIXME Is the following necessary?
if any(s in str(layer) for s in ('cluster', 'refine')):

Suggested change:

# Skip auxiliary layers created by this widget (e.g. clustering/refinement results)
layer_name = getattr(layer, "name", "")
if isinstance(layer, Points) and layer_name in ("cluster", "refine"):
display_name: Keypoint controls
kmeans:
  - command: napari-deeplabcut.get_hdf_reader1
    accepts_directories: false
    filename_patterns: ['*.h5']
  - command: napari-deeplabcut.get_folder_parser1
    accepts_directories: true
    filename_patterns: ['*']
\ No newline at end of file
contributions only supports recognized extension points (e.g., commands, readers, writers, widgets). The new kmeans: section is not a valid napari manifest entry and references napari-deeplabcut.get_hdf_reader1/get_folder_parser1, which are not declared under commands (and don’t exist in the codebase). This will likely make the plugin manifest invalid and prevent the plugin from loading. Please remove this section or wire the feature through existing commands/widgets (and add any new commands under contributions.commands).
display_name: Keypoint controls
kmeans:
  - command: napari-deeplabcut.get_hdf_reader1
    accepts_directories: false
    filename_patterns: ['*.h5']
  - command: napari-deeplabcut.get_folder_parser1
    accepts_directories: true
    filename_patterns: ['*']

Suggested change:

display_name: Keypoint controls
from napari_deeplabcut._writer import _conv_layer_to_df
from napari_deeplabcut.misc import DLCHeader
_conv_layer_to_df is imported from napari_deeplabcut._writer, but that function doesn’t exist in _writer.py (only _form_df is defined). This import will raise at runtime and break clustering. Either import and use the existing _form_df (wrapping the layer metadata/properties like other code in _widgets.py), or add the missing conversion function to _writer.py.
New way to detect outlier frames in the napari DeepLabCut plugin by clustering keypoints.
Recipe: https://deeplabcut.github.io/DeepLabCut/docs/recipes/ClusteringNapari.html