enhance digital pathology guide with tissue type, specimen prep, and analysis results

fedorov · claude · fedorov · commit 2447826194cd · 2026-03-02T17:00:22.000-05:00
Add three new sections to the digital pathology guide:

- Identifying Tumor vs Normal Slides: two approaches using
  primaryAnatomicStructureModifier_CodeMeaning (DICOM-native, all collections)
  and ContainerIdentifier TCGA barcode parsing (catches metastatic edge cases)
- Filter by Specimen Preparation: query staining, embedding, fixative
  metadata with array_to_string() syntax for array-typed columns
- Finding Pre-Computed Analysis Results: discover derived datasets
  (nuclei segmentations, TIL maps) via analysis_results_index, with
  note about per-annotation measurements and link to IDC tutorial
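
For reference, the barcode rule behind the second approach can be sketched in
plain Python. This is only an illustration, not code from the guide: the helper
name `classify_tcga_sample` is hypothetical, and it assumes the sample field is
the fourth hyphen-separated segment, as in `TCGA-E9-A3X8-01A-03-TSC`.

```python
def classify_tcga_sample(barcode: str) -> str:
    """Map a TCGA barcode to 'tumor', 'normal', 'other', or 'unknown'.

    Hypothetical helper for illustration. Assumes the fourth hyphen-separated
    segment starts with the two-digit sample type code (e.g. '01A').
    """
    parts = barcode.split("-")
    code = parts[3][:2] if len(parts) >= 4 else ""
    if len(code) != 2 or not (code.isascii() and code.isdigit()):
        return "unknown"  # not a barcode with a readable sample type code
    value = int(code)
    if 1 <= value <= 9:
        return "tumor"    # 01-09: tumor types, including 06 (metastatic)
    if 10 <= value <= 19:
        return "normal"   # 10-19: normal types
    return "other"        # anything else, e.g. control samples


print(classify_tcga_sample("TCGA-E9-A3X8-01A-03-TSC"))  # tumor
```

The guide itself does the same extraction in SQL with `SPLIT_PART` and
`SUBSTRING`; the Python form may be handy when post-processing a DataFrame.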

Also updates idc-index version references from 0.11.9 to 0.11.10
(adds ContainerIdentifier column to sm_index) and expands the
sm_index table description.
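
One thing worth checking with this bump: plain string comparison orders
"0.11.9" after "0.11.10", so a check of the form `installed < REQUIRED_VERSION`
would not flag 0.11.9 as outdated. A minimal sketch of a numeric comparison
follows. `version_tuple` is a hypothetical helper, and it assumes purely
numeric dotted versions (no `rc` or `dev` suffixes).

```python
def version_tuple(version: str) -> tuple:
    """Turn '0.11.10' into (0, 11, 10) so components compare numerically."""
    return tuple(int(part) for part in version.split("."))


REQUIRED_VERSION = "0.11.10"
installed = "0.11.9"  # example value; in practice this would be idc_index.__version__

# String comparison is character by character, so '9' sorts after '1'.
print(installed < REQUIRED_VERSION)  # False

# Comparing integer tuples gives the intended ordering.
print(version_tuple(installed) < version_tuple(REQUIRED_VERSION))  # True
```

If the `packaging` library is available, `packaging.version.Version` is a more
general option that also handles pre-release suffixes.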

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
 
 ## [Unreleased]
 
+### Added
+
+- New "Identifying Tumor vs Normal Slides" section in digital pathology guide with two approaches:
+  - Structured DICOM tissue type via `primaryAnatomicStructureModifier_CodeMeaning` (works across all SM collections)
+  - TCGA barcode parsing via `ContainerIdentifier` (TCGA collections only, catches metastatic edge cases)
+- TCGA-BRCA worked examples showing tumor vs normal slide counts
+- Documentation references to GDC TCGA barcode format and sample type codes
+- Specimen preparation query examples: filtering by staining (H&E), embedding medium (FFPE vs frozen), and fixative, with note about array column syntax (`array_to_string`, `list_contains`)
+- "Finding Pre-Computed Analysis Results" section: discovering derived datasets (nuclei segmentations, TIL maps) via `analysis_results_index`, with example joining annotations back to source slides
+- Note about per-annotation measurements in DICOM ANN objects (extractable via highdicom after download), with link to [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial
+
+### Changed
+
+- Updated to idc-index 0.11.10 (adds `ContainerIdentifier` column to `sm_index`)
+- Updated `sm_index` table description to reflect newly available columns (container/slide ID, tissue type, anatomic structure, diagnosis)
+
 ## [1.3.1] - 2026-02-11
 
 ### Added
diff --git a/SKILL.md b/SKILL.md
@@ -5,7 +5,7 @@ license: This skill is provided under the MIT License. IDC data itself has indiv
 metadata:
     version: 1.3.1
     skill-author: Andrey Fedorov, @fedorov
-    idc-index: "0.11.9"
+    idc-index: "0.11.10"
     idc-data-version: "v23"
     repository: https://github.com/ImagingDataCommons/idc-claude-skill
 ---
@@ -25,7 +25,7 @@ Use the `idc-index` Python package to query and download public cancer imaging d
 ```python
 import idc_index
 
-REQUIRED_VERSION = "0.11.9"  # Must match metadata.idc-index in this file
+REQUIRED_VERSION = "0.11.10"  # Must match metadata.idc-index in this file
 installed = idc_index.__version__
 
 if installed < REQUIRED_VERSION:
@@ -229,7 +229,7 @@ print(client.get_idc_version())  # Should return "v23"
 ```
 If you see an older version, upgrade with: `pip install --upgrade idc-index`
 
-**Tested with:** idc-index 0.11.9 (IDC data version v23)
+**Tested with:** idc-index 0.11.10 (IDC data version v23)
 
 **Optional (for data analysis):**
 ```bash
diff --git a/references/digital_pathology_guide.md b/references/digital_pathology_guide.md
@@ -1,6 +1,6 @@
 # Digital Pathology Guide for IDC
 
-**Tested with:** IDC data version v23, idc-index 0.11.9
+**Tested with:** IDC data version v23, idc-index 0.11.10
 
 For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
 
@@ -10,7 +10,7 @@ Five specialized index tables provide curated metadata without needing BigQuery:
 
 | Table | Row Granularity | Description |
 |-------|-----------------|-------------|
-| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
+| `sm_index` | 1 row = 1 SM series | Slide Microscopy series metadata: container/slide ID, tissue type, anatomic structure, diagnosis, lens power, pixel spacing, image dimensions |
 | `sm_instance_index` | 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
 | `seg_index` | 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
 | `ann_index` | 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
@@ -57,6 +57,109 @@ client.sql_query("""
 """)
 ```
 
+### Filter by specimen preparation
+
+The `sm_index` includes staining, embedding, and fixative metadata. These columns are **arrays** (e.g., `[hematoxylin stain, water soluble eosin stain]` for H&E), so filter them either with `array_to_string()` combined with `LIKE`, or with `list_contains()`.
+
+```python
+# Find slides whose staining includes hematoxylin (e.g., H&E) in a collection
+client.fetch_index("sm_index")
+client.sql_query("""
+    SELECT
+        i.PatientID,
+        s.staining_usingSubstance_CodeMeaning as staining,
+        s.embeddingMedium_CodeMeaning as embedding,
+        s.tissueFixative_CodeMeaning as fixative
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+      AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'
+    LIMIT 10
+""")
+```
+
+```python
+# Compare FFPE vs frozen slides across collections
+client.sql_query("""
+    SELECT
+        i.collection_id,
+        s.embeddingMedium_CodeMeaning as embedding,
+        COUNT(*) as slide_count
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    GROUP BY i.collection_id, embedding
+    ORDER BY i.collection_id, slide_count DESC
+""")
+```
+
+## Identifying Tumor vs Normal Slides
+
+The `sm_index` table provides two ways to identify tissue type:
+
+| Column | Use Case |
+|--------|----------|
+| `primaryAnatomicStructureModifier_CodeMeaning` | Structured tissue type from DICOM specimen metadata (e.g., `Neoplasm, Primary`, `Normal`, `Tumor`, `Neoplasm, Metastatic`). Works across all collections with SM data. |
+| `ContainerIdentifier` | Slide/container identifier. For TCGA collections, contains the [TCGA barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) where the [sample type code](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) (positions 14-15) encodes tissue origin: `01`-`09` = tumor, `10`-`19` = normal. |
+
+### Using structured tissue type metadata
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("sm_index")
+
+# Discover tissue type values across all SM data
+client.sql_query("""
+    SELECT
+        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+        COUNT(*) as slide_count
+    FROM sm_index s
+    WHERE s.primaryAnatomicStructureModifier_CodeMeaning IS NOT NULL
+    GROUP BY tissue_type
+    ORDER BY slide_count DESC
+""")
+```
+
+#### Example: Tumor vs normal slides in TCGA-BRCA
+
+```python
+# Tissue type breakdown for TCGA-BRCA
+client.sql_query("""
+    SELECT
+        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+        COUNT(*) as slide_count,
+        COUNT(DISTINCT i.PatientID) as patient_count
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+    GROUP BY tissue_type
+    ORDER BY slide_count DESC
+""")
+# Returns: Neoplasm, Primary (2704 slides), Normal (399 slides), NULL (8 slides)
+```
+
+### Using TCGA barcode (TCGA collections only)
+
+For TCGA collections, `ContainerIdentifier` contains the slide barcode (e.g., `TCGA-E9-A3X8-01A-03-TSC`). Extract the sample type code to classify tissue:
+
+```python
+# Parse sample type from TCGA barcode
+client.sql_query("""
+    SELECT
+        SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
+        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+        COUNT(*) as slide_count
+    FROM sm_index s
+    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+    GROUP BY sample_type_code, tissue_type
+    ORDER BY sample_type_code
+""")
+# Returns: 01 → Neoplasm, Primary (2704), 06 → None (8), 11 → Normal (399)
+```
+
+The barcode approach catches cases where the structured metadata is NULL: in TCGA-BRCA, slides with sample type code `06` (Metastatic) have `primaryAnatomicStructureModifier_CodeMeaning` = NULL.
+
 ## Annotation Queries (ANN)
 
 DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
@@ -134,6 +237,52 @@ client.sql_query("""
 """)
 ```
 
+## Finding Pre-Computed Analysis Results
+
+IDC hosts derived datasets (nuclei segmentations, TIL maps, AI annotations) identified by `analysis_result_id` in the main `index` table. Use `analysis_results_index` to discover what's available for pathology.
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("analysis_results_index")
+
+# Find analysis results that include annotations (ANN) or segmentations (SEG)
+client.sql_query("""
+    SELECT
+        ar.analysis_result_id,
+        ar.analysis_result_title,
+        ar.Modalities,
+        ar.Subjects,
+        ar.Collections
+    FROM analysis_results_index ar
+    WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
+    ORDER BY ar.Subjects DESC
+""")
+```
+
+### Find annotation results for slides in a collection
+
+```python
+# Find annotation groups (ANN) derived from TCGA-BRCA slides
+client.fetch_index("ann_index")
+client.fetch_index("ann_group_index")
+client.sql_query("""
+    SELECT
+        i.analysis_result_id,
+        i.PatientID,
+        a.referenced_SeriesInstanceUID as source_slide,
+        g.AnnotationGroupLabel,
+        g.NumberOfAnnotations,
+        g.AlgorithmName
+    FROM ann_group_index g
+    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
+    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
+    WHERE i.collection_id = 'tcga_brca'
+    LIMIT 10
+""")
+```
+
+Annotation objects can also contain per-annotation **measurements** (e.g., nucleus area, eccentricity) stored within the DICOM file. These are not in the index tables — extract them after download using [highdicom](https://github.com/ImagingDataCommons/highdicom) (`ann.get_annotation_groups()`, `group.get_measurements()`). See the [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial for a worked example including spatial analysis and cellularity computation.
+
 ## Filter by AnnotationGroupLabel
 
 `AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.