enhance digital pathology guide with tissue type, specimen prep, and analysis results

Add three new sections to the digital pathology guide:

- Identifying Tumor vs Normal Slides: two approaches using
  primaryAnatomicStructureModifier_CodeMeaning (DICOM-native, all collections)
  and ContainerIdentifier TCGA barcode parsing (catches metastatic edge cases)
- Filter by Specimen Preparation: query staining, embedding, fixative
  metadata with array_to_string() syntax for array-typed columns
- Finding Pre-Computed Analysis Results: discover derived datasets
  (nuclei segmentations, TIL maps) via analysis_results_index, with
  note about per-annotation measurements and link to IDC tutorial

Also updates idc-index version references from 0.11.9 to 0.11.10
(adds ContainerIdentifier column to sm_index) and expands the
sm_index table description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+- TCGA-BRCA worked examples showing tumor vs normal slide counts
+- Documentation references to GDC TCGA barcode format and sample type codes
+- Specimen preparation query examples: filtering by staining (H&E), embedding medium (FFPE vs frozen), and fixative, with note about array column syntax (`array_to_string`, `list_contains`)
+- "Finding Pre-Computed Analysis Results" section: discovering derived datasets (nuclei segmentations, TIL maps) via `analysis_results_index`, with example joining annotations back to source slides
+- Note about per-annotation measurements in DICOM ANN objects (extractable via highdicom after download), with link to [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial
+
+### Changed
+
+- Updated to idc-index 0.11.10 (adds `ContainerIdentifier` column to `sm_index`)
+- Updated `sm_index` table description to reflect newly available columns (container/slide ID, tissue type, anatomic structure, diagnosis)
references/digital_pathology_guide.md
Lines changed: 151 additions & 2 deletions
@@ -1,6 +1,6 @@
 # Digital Pathology Guide for IDC

-**Tested with:** IDC data version v23, idc-index 0.11.9
+**Tested with:** IDC data version v23, idc-index 0.11.10

 For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.

@@ -10,7 +10,7 @@ Five specialized index tables provide curated metadata without needing BigQuery:

 | Table | Row Granularity | Description |
 |-------|-----------------|-------------|
-|`sm_index`| 1 row = 1 SM series | Slide Microscopy series metadata: lens power, pixel spacing, image dimensions |
+|`sm_index`| 1 row = 1 SM series | Slide Microscopy series metadata: container/slide ID, tissue type, anatomic structure, diagnosis, lens power, pixel spacing, image dimensions |
 |`sm_instance_index`| 1 row = 1 SM instance | Instance-level (SOPInstanceUID) metadata for individual slide images |
 |`seg_index`| 1 row = 1 SEG series | DICOM Segmentation metadata: algorithm, segment count, reference to source series. Used for both radiology and pathology — filter by source Modality to find pathology-specific segmentations |
 |`ann_index`| 1 row = 1 ANN series | Microscopy Bulk Simple Annotations series metadata; includes `referenced_SeriesInstanceUID` linking to the annotated slide |
@@ -57,6 +57,109 @@ client.sql_query("""
 """)
 ```

+### Filter by specimen preparation
+
+The `sm_index` includes staining, embedding, and fixative metadata. These columns are **arrays** (e.g., `[hematoxylin stain, water soluble eosin stain]` for H&E) — use `array_to_string()` with `LIKE` or `list_contains()` to filter.
+
+```python
+# Find H&E-stained slides in a collection
+client.fetch_index("sm_index")
+client.sql_query("""
+SELECT
+    i.PatientID,
+    s.staining_usingSubstance_CodeMeaning as staining,
+    s.embeddingMedium_CodeMeaning as embedding,
+    s.tissueFixative_CodeMeaning as fixative
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+WHERE i.collection_id = 'tcga_brca'
+    AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'
+LIMIT 10
+""")
+```
+
+```python
+# Compare FFPE vs frozen slides across collections
+client.sql_query("""
+SELECT
+    i.collection_id,
+    s.embeddingMedium_CodeMeaning as embedding,
+    COUNT(*) as slide_count
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+GROUP BY i.collection_id, embedding
+ORDER BY i.collection_id, slide_count DESC
+""")
+```
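The two array filters named in the specimen preparation section are not interchangeable. Assuming the SQL engine behind `sql_query` follows the usual semantics, `array_to_string(...) LIKE '%hematoxylin%'` is a case-sensitive substring match on the joined text, while `list_contains()` only matches when one array element equals the search value exactly. The pure-Python sketch below mimics that difference; the helper names `matches_like` and `matches_list_contains` are hypothetical and are not part of idc-index.

```python
def matches_like(values, fragment, sep=", "):
    """Mimic array_to_string(col, ', ') LIKE '%fragment%': substring match."""
    if values is None:  # a NULL column never satisfies the WHERE clause
        return False
    return fragment in sep.join(values)


def matches_list_contains(values, element):
    """Mimic list_contains(col, element): exact match on one array element."""
    if values is None:
        return False
    return element in values


# Array value quoted in the guide for an H&E-stained slide
he_staining = ["hematoxylin stain", "water soluble eosin stain"]

print(matches_like(he_staining, "hematoxylin"))                  # substring match
print(matches_list_contains(he_staining, "hematoxylin"))         # no element equals this
print(matches_list_contains(he_staining, "hematoxylin stain"))   # exact element
```

In practice this suggests using the `LIKE` form for exploratory searches and `list_contains()` once the exact code meaning is known.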
+
+## Identifying Tumor vs Normal Slides
+
+The `sm_index` table provides two ways to identify tissue type:
+
+| Column | Use Case |
+|--------|----------|
+|`primaryAnatomicStructureModifier_CodeMeaning`| Structured tissue type from DICOM specimen metadata (e.g., `Neoplasm, Primary`, `Normal`, `Tumor`, `Neoplasm, Metastatic`). Works across all collections with SM data. |
+|`ContainerIdentifier`| Slide/container identifier. For TCGA collections, contains the [TCGA barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) where the [sample type code](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) (positions 14-15) encodes tissue origin: `01`-`09` = tumor, `10`-`19` = normal. |
+
+### Using structured tissue type metadata
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("sm_index")
+
+# Discover tissue type values across all SM data
+client.sql_query("""
+SELECT
+    s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+    COUNT(*) as slide_count
+FROM sm_index s
+WHERE s.primaryAnatomicStructureModifier_CodeMeaning IS NOT NULL
+GROUP BY tissue_type
+ORDER BY slide_count DESC
+""")
+```
+
+#### Example: Tumor vs normal slides in TCGA-BRCA
+
+```python
+# Tissue type breakdown for TCGA-BRCA
+client.sql_query("""
+SELECT
+    s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+    COUNT(*) as slide_count,
+    COUNT(DISTINCT i.PatientID) as patient_count
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+WHERE i.collection_id = 'tcga_brca'
+GROUP BY tissue_type
+ORDER BY slide_count DESC
+""")
+# Returns: Neoplasm, Primary (2704 slides), Normal (399 slides)
+```
+
+### Using TCGA barcode (TCGA collections only)
+
+For TCGA collections, `ContainerIdentifier` contains the slide barcode (e.g., `TCGA-E9-A3X8-01A-03-TSC`). Extract the sample type code to classify tissue:
+
+```python
+# Parse sample type from TCGA barcode
+client.sql_query("""
+SELECT
+    SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
+    s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
+    COUNT(*) as slide_count
+FROM sm_index s
+JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
+The barcode approach also catches slides whose structured metadata is NULL. For example, in TCGA-BRCA, slides with sample type code `06` (Metastatic) have `primaryAnatomicStructureModifier_CodeMeaning` = NULL.
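One way to combine the two approaches after a query returns is to trust the structured tissue type when it is present and fall back to the barcode only when it is NULL. The sketch below is a minimal illustration under that assumption. The function names, the `"unknown"` label, and the two `TCGA-XX-XXXX-...` barcodes are hypothetical; the tissue type values and the `01`-`09` / `10`-`19` ranges are the ones quoted in the guide.

```python
# Structured values listed in the guide that indicate tumor tissue
TUMOR_LABELS = {"Neoplasm, Primary", "Neoplasm, Metastatic", "Tumor"}


def sample_type_code(barcode):
    """Return the two-digit sample type code from a TCGA barcode, or None."""
    parts = (barcode or "").split("-")
    # The fourth dash-separated field starts with the sample type, e.g. "01A"
    if len(parts) < 4 or not parts[3][:2].isdigit() or len(parts[3]) < 2:
        return None
    return parts[3][:2]


def classify_slide(tissue_type, container_identifier):
    """Classify a slide as 'tumor', 'normal', or 'unknown'."""
    if tissue_type == "Normal":
        return "normal"
    if tissue_type in TUMOR_LABELS:
        return "tumor"
    # Structured metadata missing or unrecognized: fall back to the barcode
    code = sample_type_code(container_identifier)
    if code is None:
        return "unknown"
    number = int(code)
    if 1 <= number <= 9:
        return "tumor"
    if 10 <= number <= 19:
        return "normal"
    return "unknown"


print(sample_type_code("TCGA-E9-A3X8-01A-03-TSC"))        # barcode from the guide
print(classify_slide(None, "TCGA-XX-XXXX-06A-01-TS1"))    # hypothetical metastatic slide
print(classify_slide(None, "TCGA-XX-XXXX-11A-01-TS1"))    # hypothetical normal slide
```

Applied row by row to a query result that selects both columns, this gives every TCGA slide a label even where the structured column is empty.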
+
 ## Annotation Queries (ANN)

 DICOM Microscopy Bulk Simple Annotations (Modality = 'ANN') are annotations **on** slide microscopy images. They appear in `ann_index` (series-level) and `ann_group_index` (group-level detail). Each ANN series references the slide it annotates via `referenced_SeriesInstanceUID`.
@@ -134,6 +237,52 @@ client.sql_query("""
 """)
 ```

+## Finding Pre-Computed Analysis Results
+
+IDC hosts derived datasets (nuclei segmentations, TIL maps, AI annotations) identified by `analysis_result_id` in the main `index` table. Use `analysis_results_index` to discover what's available for pathology.
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+client.fetch_index("analysis_results_index")
+
+# Find analysis results that include pathology annotations or segmentations
+client.sql_query("""
+SELECT
+    ar.analysis_result_id,
+    ar.analysis_result_title,
+    ar.Modalities,
+    ar.Subjects,
+    ar.Collections
+FROM analysis_results_index ar
+WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
+ORDER BY ar.Subjects DESC
+""")
+```
+
+### Find analysis results for a specific slide
+
+```python
+# Find all derived data (annotations, segmentations) for TCGA-BRCA slides
+client.fetch_index("ann_index")
+client.sql_query("""
+SELECT
+    i.analysis_result_id,
+    i.PatientID,
+    a.referenced_SeriesInstanceUID as source_slide,
+    g.AnnotationGroupLabel,
+    g.NumberOfAnnotations,
+    g.AlgorithmName
+FROM ann_group_index g
+JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
+JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
+WHERE i.collection_id = 'tcga_brca'
+LIMIT 10
+""")
+```
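The query in this hunk filters by collection, although its heading mentions a specific slide. A minimal sketch of narrowing the same join to one slide is shown below. The helper `annotations_for_slide_query` is hypothetical; it reuses only the table and column names from the query above, and it assumes both `ann_index` and `ann_group_index` have been fetched before the query runs. Because the UID is interpolated into the SQL text, the sketch first checks that it looks like a DICOM UID (digits and dots, at most 64 characters).

```python
import re

# DICOM UIDs are dot-separated numeric components, at most 64 characters
_UID_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)*")


def annotations_for_slide_query(slide_series_uid):
    """Build SQL listing annotation groups that reference one slide series."""
    if len(slide_series_uid) > 64 or not _UID_PATTERN.fullmatch(slide_series_uid):
        raise ValueError(f"not a valid DICOM UID: {slide_series_uid!r}")
    return f"""
    SELECT
        i.analysis_result_id,
        a.referenced_SeriesInstanceUID AS source_slide,
        g.AnnotationGroupLabel,
        g.NumberOfAnnotations,
        g.AlgorithmName
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE a.referenced_SeriesInstanceUID = '{slide_series_uid}'
    """


# "1.2.840.1" is a placeholder UID for illustration, not a real slide
query = annotations_for_slide_query("1.2.840.1")
print(query)
# With a client set up as in the guide: client.sql_query(query)
```

Validating the UID before formatting keeps the string-built query safe to reuse in scripts that take slide identifiers from user input.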
+
+Annotation objects can also contain per-annotation **measurements** (e.g., nucleus area, eccentricity) stored within the DICOM file. These are not in the index tables — extract them after download using [highdicom](https://github.com/ImagingDataCommons/highdicom) (`ann.get_annotation_groups()`, `group.get_measurements()`). See the [microscopy_dicom_ann_intro](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb) tutorial for a worked example including spatial analysis and cellularity computation.
+
 ## Filter by AnnotationGroupLabel

 `AnnotationGroupLabel` is the most direct column for finding annotation groups by name or semantic content. Use `LIKE` with wildcards for text search.