
Commit 2447826

fedorov and claude committed
enhance digital pathology guide with tissue type, specimen prep, and analysis results
Add three new sections to the digital pathology guide:

- Identifying Tumor vs Normal Slides: two approaches using primaryAnatomicStructureModifier_CodeMeaning (DICOM-native, all collections) and ContainerIdentifier TCGA barcode parsing (catches metastatic edge cases)
- Filter by Specimen Preparation: query staining, embedding, fixative metadata with array_to_string() syntax for array-typed columns
- Finding Pre-Computed Analysis Results: discover derived datasets (nuclei segmentations, TIL maps) via analysis_results_index, with note about per-annotation measurements and link to IDC tutorial

Also updates idc-index version references from 0.11.9 to 0.11.10 (adds ContainerIdentifier column to sm_index) and expands the sm_index table description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 5492f5e commit 2447826

3 files changed

Lines changed: 170 additions & 5 deletions


CHANGELOG.md

Lines changed: 16 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
77

88
## [Unreleased]
99

10+
### Added
11+
12+
- New "Identifying Tumor vs Normal Slides" section in digital pathology guide with two approaches:
13+
- Structured DICOM tissue type via `primaryAnatomicStructureModifier_CodeMeaning` (works across all SM collections)
14+
- TCGA barcode parsing via `ContainerIdentifier` (TCGA collections only, catches metastatic edge cases)
15+
- TCGA-BRCA worked examples showing tumor vs normal slide counts
16+
- Documentation references to GDC TCGA barcode format and sample type codes
17+
- Specimen preparation query examples: filtering by staining (H&E), embedding medium (FFPE vs frozen), and fixative, with note about array column syntax (`array_to_string`, `list_contains`)
18+
- "Finding Pre-Computed Analysis Results" section: discovering derived datasets (nuclei segmentations, TIL maps) via `analysis_results_index`, with example joining annotations back to source slides
19+
- Note about per-annotation measurements in DICOM ANN objects (extractable via highdicom after download), with link to [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial
20+
21+
### Changed
22+
23+
- Updated to idc-index 0.11.10 (adds `ContainerIdentifier` column to `sm_index`)
24+
- Updated `sm_index` table description to reflect newly available columns (container/slide ID, tissue type, anatomic structure, diagnosis)
25+
1026
## [1.3.1] - 2026-02-11
1127

1228
### Added

SKILL.md

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ license: This skill is provided under the MIT License. IDC data itself has indiv
55
metadata:
66
version: 1.3.1
77
skill-author: Andrey Fedorov, @fedorov
8-
idc-index: "0.11.9"
8+
idc-index: "0.11.10"
99
idc-data-version: "v23"
1010
repository: https://github.com/ImagingDataCommons/idc-claude-skill
1111
---
@@ -25,7 +25,7 @@ Use the `idc-index` Python package to query and download public cancer imaging d
2525
```python
2626
import idc_index
2727

28-
REQUIRED_VERSION = "0.11.9" # Must match metadata.idc-index in this file
28+
REQUIRED_VERSION = "0.11.10" # Must match metadata.idc-index in this file
2929
installed = idc_index.__version__
3030

3131
if installed < REQUIRED_VERSION:
@@ -229,7 +229,7 @@ print(client.get_idc_version()) # Should return "v23"
229229
```
230230
If you see an older version, upgrade with: `pip install --upgrade idc-index`
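
One caveat about the version check shown in this file: it compares version strings with `<`, and Python compares strings character by character, so `"0.11.9" < "0.11.10"` evaluates to `False`. A minimal sketch of a numeric comparison follows. The helper names `version_tuple` and `is_older` are hypothetical (they are not part of `idc-index`), and the sketch assumes plain dotted release numbers.

```python
import re


def version_tuple(version):
    """Convert a dotted release string such as '0.11.10' into (0, 11, 10).

    Hypothetical helper: only the leading digits of each dotted field are
    used, and parsing stops at the first field that has none.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_older(installed, required):
    """Return True when `installed` is numerically older than `required`."""
    return version_tuple(installed) < version_tuple(required)


# Plain string comparison orders these two releases the wrong way round.
print("0.11.9" < "0.11.10")           # False
print(is_older("0.11.9", "0.11.10"))  # True
```

If the `packaging` library is available, `packaging.version.Version` offers the same ordering and also handles pre-release suffixes.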
231231

232-
**Tested with:** idc-index 0.11.9 (IDC data version v23)
232+
**Tested with:** idc-index 0.11.10 (IDC data version v23)
233233

234234
**Optional (for data analysis):**
235235
```bash

references/digital_pathology_guide.md

Lines changed: 151 additions & 2 deletions
@@ -1,6 +1,6 @@
11
# Digital Pathology Guide for IDC
22

3-
**Tested with:** IDC data version v23, idc-index 0.11.9
3+
**Tested with:** IDC data version v23, idc-index 0.11.10
44

55
For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
66

@@ -10,7 +10,7 @@ Five specialized index tables provide curated metadata without needing BigQuery:
1010

1111
| Table | Row Granularity | Description |
1212
|-------|-----------------|-------------|
13-
| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
13+
| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: container/slide ID, tissue type, anatomic structure, diagnosis, lens power, pixel spacing, image dimensions |
1414
| `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
1515
| `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
1616
| `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
@@ -57,6 +57,109 @@ client.sql_query("""
5757
""")
5858
```
5959

60+
### Filter by specimen preparation
61+
62+
The `sm_index` table includes staining, embedding, and fixative metadata. These columns are **arrays** (e.g., `[hematoxylin stain, water soluble eosin stain]` for H&E). To filter them, either combine `array_to_string()` with `LIKE` for substring matching, or use `list_contains()` for an exact element match.
63+
64+
```python
65+
# Find H&E-stained slides in a collection
66+
client.fetch_index("sm_index")
67+
client.sql_query("""
68+
SELECT
69+
i.PatientID,
70+
s.staining_usingSubstance_CodeMeaning as staining,
71+
s.embeddingMedium_CodeMeaning as embedding,
72+
s.tissueFixative_CodeMeaning as fixative
73+
FROM sm_index s
74+
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
75+
WHERE i.collection_id = 'tcga_brca'
76+
AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'
77+
LIMIT 10
78+
""")
79+
```
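
The two filtering styles named above behave differently: `array_to_string()` with `LIKE` matches a substring anywhere in the joined text, while `list_contains()` requires a whole element to match. The plain-Python analogy below is only an illustration of that difference, not the SQL engine itself; the function names are hypothetical, and the sample array is the H&E example quoted in this guide.

```python
# H&E staining values as they appear in the array-typed column.
stains = ["hematoxylin stain", "water soluble eosin stain"]


def like_on_joined(values, needle):
    """Analogue of: array_to_string(col, ', ') LIKE '%needle%'."""
    if values is None:  # a NULL column never matches
        return False
    return needle in ", ".join(values)


def contains_element(values, element):
    """Analogue of: list_contains(col, element)."""
    if values is None:
        return False
    return element in values


print(like_on_joined(stains, "hematoxylin"))          # True: substring match
print(contains_element(stains, "hematoxylin"))        # False: not a full element
print(contains_element(stains, "hematoxylin stain"))  # True: exact element
```

In practice this means a `list_contains()` filter needs the full code meaning (here `hematoxylin stain`), whereas the `LIKE` form tolerates partial terms.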
80+
81+
```python
82+
# Compare FFPE vs frozen slides across collections
83+
client.sql_query("""
84+
SELECT
85+
i.collection_id,
86+
s.embeddingMedium_CodeMeaning as embedding,
87+
COUNT(*) as slide_count
88+
FROM sm_index s
89+
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
90+
GROUP BY i.collection_id, embedding
91+
ORDER BY i.collection_id, slide_count DESC
92+
""")
93+
```
94+
95+
## Identifying Tumor vs Normal Slides
96+
97+
The `sm_index` table provides two ways to identify tissue type:
98+
99+
| Column | Use Case |
100+
|--------|----------|
101+
| `primaryAnatomicStructureModifier_CodeMeaning` | Structured tissue type from DICOM specimen metadata (e.g., `Neoplasm, Primary`, `Normal`, `Tumor`, `Neoplasm, Metastatic`). Works across all collections with SM data. |
102+
| `ContainerIdentifier` | Slide/container identifier. For TCGA collections, contains the [TCGA barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) where the [sample type code](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) (positions 14-15) encodes tissue origin: `01`-`09` = tumor, `10`-`19` = normal. |
103+
104+
### Using structured tissue type metadata
105+
106+
```python
107+
from idc_index import IDCClient
108+
client = IDCClient()
109+
client.fetch_index("sm_index")
110+
111+
# Discover tissue type values across all SM data
112+
client.sql_query("""
113+
SELECT
114+
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
115+
COUNT(*) as slide_count
116+
FROM sm_index s
117+
WHERE s.primaryAnatomicStructureModifier_CodeMeaning IS NOT NULL
118+
GROUP BY tissue_type
119+
ORDER BY slide_count DESC
120+
""")
121+
```
122+
123+
#### Example: Tumor vs normal slides in TCGA-BRCA
124+
125+
```python
126+
# Tissue type breakdown for TCGA-BRCA
127+
client.sql_query("""
128+
SELECT
129+
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
130+
COUNT(*) as slide_count,
131+
COUNT(DISTINCT i.PatientID) as patient_count
132+
FROM sm_index s
133+
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
134+
WHERE i.collection_id = 'tcga_brca'
135+
GROUP BY tissue_type
136+
ORDER BY slide_count DESC
137+
""")
138+
# Returns: Neoplasm, Primary (2704 slides), Normal (399 slides)
139+
```
140+
141+
### Using TCGA barcode (TCGA collections only)
142+
143+
For TCGA collections, `ContainerIdentifier` contains the slide barcode (e.g., `TCGA-E9-A3X8-01A-03-TSC`). Extract the sample type code to classify tissue:
144+
145+
```python
146+
# Parse sample type from TCGA barcode
147+
client.sql_query("""
148+
SELECT
149+
SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
150+
s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
151+
COUNT(*) as slide_count
152+
FROM sm_index s
153+
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
154+
WHERE i.collection_id = 'tcga_brca'
155+
GROUP BY sample_type_code, tissue_type
156+
ORDER BY sample_type_code
157+
""")
158+
# Returns: 01 → Neoplasm, Primary (2704), 06 → None (8), 11 → Normal (399)
159+
```
160+
161+
The barcode approach catches cases where the structured metadata is NULL. In TCGA-BRCA, for example, slides with sample type code `06` (Metastatic) have `primaryAnatomicStructureModifier_CodeMeaning` = NULL.
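
The same barcode rule can be applied in Python after a query returns, for instance to label rows of a result. This is a minimal sketch under the ranges stated in this guide (`01`-`09` = tumor, `10`-`19` = normal); the function names and the `"other"` label are hypothetical, and codes outside those two ranges are deliberately left unclassified.

```python
def sample_type_code(barcode):
    """Return the two-digit sample type code (barcode positions 14-15).

    Mirrors the SQL above: take the fourth '-'-separated field and keep
    its first two characters.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a TCGA sample-level barcode: {barcode!r}")
    code = fields[3][:2]
    if len(code) != 2 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"unexpected sample field in barcode: {barcode!r}")
    return code


def tissue_class(code):
    """Map a sample type code to 'tumor', 'normal', or 'other'."""
    value = int(code)
    if 1 <= value <= 9:
        return "tumor"
    if 10 <= value <= 19:
        return "normal"
    return "other"  # hypothetical label for codes outside both ranges


code = sample_type_code("TCGA-E9-A3X8-01A-03-TSC")
print(code, tissue_class(code))  # 01 tumor
```

Because the classification depends only on the barcode, it still works for the `06` slides whose structured tissue type is NULL.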
162+
60163
## Annotation Queries (ANN)
61164

62165
DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
@@ -134,6 +237,52 @@ client.sql_query("""
134237
""")
135238
```
136239

240+
## Finding Pre-Computed Analysis Results
241+
242+
IDC hosts derived datasets (nuclei segmentations, TIL maps, AI annotations) identified by `analysis_result_id` in the main `index` table. Use `analysis_results_index` to discover what's available for pathology.
243+
244+
```python
245+
from idc_index import IDCClient
246+
client = IDCClient()
247+
client.fetch_index("analysis_results_index")
248+
249+
# Find analysis results that include pathology annotations or segmentations
250+
client.sql_query("""
251+
SELECT
252+
ar.analysis_result_id,
253+
ar.analysis_result_title,
254+
ar.Modalities,
255+
ar.Subjects,
256+
ar.Collections
257+
FROM analysis_results_index ar
258+
WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
259+
ORDER BY ar.Subjects DESC
260+
""")
261+
```
262+
263+
### Find analysis results for slides in a collection
264+
265+
```python
266+
# Find annotation groups (and the analysis result each belongs to) for TCGA-BRCA slides
267+
client.fetch_index("ann_index")
268+
client.sql_query("""
269+
SELECT
270+
i.analysis_result_id,
271+
i.PatientID,
272+
a.referenced_SeriesInstanceUID as source_slide,
273+
g.AnnotationGroupLabel,
274+
g.NumberOfAnnotations,
275+
g.AlgorithmName
276+
FROM ann_group_index g
277+
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
278+
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
279+
WHERE i.collection_id = 'tcga_brca'
280+
LIMIT 10
281+
""")
282+
```
283+
284+
Annotation objects can also contain per-annotation **measurements** (e.g., nucleus area, eccentricity) stored within the DICOM file. These are not in the index tables — extract them after download using [highdicom](https://github.com/ImagingDataCommons/highdicom) (`ann.get_annotation_groups()`, `group.get_measurements()`). See the [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial for a worked example including spatial analysis and cellularity computation.
285+
137286
## Filter by AnnotationGroupLabel
138287

139288
`AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.
