Commit b90664d

Merge pull request #20 from ImagingDataCommons/skill-md-reduce-redundancy
Update idc-index version, improve capabilities after debugging
2 parents 7f7a48b + 7d8b8c3 commit b90664d

5 files changed

Lines changed: 195 additions & 76 deletions

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -5,6 +5,27 @@ All notable changes to the IDC Claude Skill are documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/),
66
and this project adheres to [Semantic Versioning](https://semver.org/).
77

8+
## [1.6.3] - 2026-05-09
9+
10+
### Added
11+
12+
- `ct_index`, `mr_index`, `pt_index` tables (idc-index 0.12.3 / idc-index-data 24.2.0): modality-specific acquisition and reconstruction parameter indices, one row per series, all joining on `SeriesInstanceUID`
13+
- `ct_index` (21 columns): pixel spacing, slice thickness, kVp, convolution kernel, tube current min/max (dose-modulated), exposure, spiral pitch, scan options
14+
- `mr_index` (22 columns): field strength, scanning sequence, TE (array for multi-echo), TR, flip angle, DiffusionBValue (array for DWI), pixel bandwidth, receive coil, number of temporal positions
15+
- `pt_index` (21 columns): radionuclide, injected dose, reconstruction method, decay/scatter/attenuation correction, frame duration (array for dynamic PET), number of time slices
16+
- SQL query patterns for all three new tables in `references/sql_patterns.md`
17+
- Join column entries for `ct_index`, `mr_index`, `pt_index` in `references/index_tables_guide.md` and SKILL.md
18+
- Parquet file entries for `ct_index.parquet`, `mr_index.parquet`, `pt_index.parquet` in `references/parquet_access_guide.md`
19+
20+
### Changed
21+
22+
- Added concrete `indices_overview` code example showing how to search for a column across all tables and read column schemas without fetching the table; directly addresses the failure mode where agents query `index` for modality-specific parameters (SliceThickness, KVP, etc.) instead of using `ct_index`/`mr_index`/`pt_index`
23+
- Added troubleshooting entry "Column not found in `index` table" with a working `indices_overview` search snippet and join example, covering common acquisition/reconstruction parameters that live in the modality-specific index tables
24+
- Updated idc-index reference to 0.12.3
25+
- Clarified `download_from_selection` API: added explicit warning that it takes filter keyword arguments (not a DataFrame), comparison table vs `download_dicom_series` (which has a different first-argument order), and restructured the download example as a step-by-step query → extract UIDs → pass list flow
26+
- Documented `download_dicom_series` as an alternative download method with its own signature (`seriesInstanceUID` as first arg, then `downloadDir`)
27+
- Reduced redundancy and duplication in SKILL.md for cleaner reading
28+
829
## [1.6.2] - 2026-05-08
930

1031
### Changed

SKILL.md

Lines changed: 87 additions & 74 deletions
@@ -3,9 +3,9 @@ name: imaging-data-commons
33
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
44
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
55
metadata:
6-
version: 1.6.2
6+
version: 1.6.3
77
skill-author: Andrey Fedorov, @fedorov
8-
idc-index: "0.12.2"
8+
idc-index: "0.12.3"
99
idc-data-version: "v24"
1010
repository: https://github.com/ImagingDataCommons/idc-claude-skill
1111
---
@@ -82,7 +82,7 @@ print(stats)
8282
- IDC Data Model - Collection and analysis result hierarchy
8383
- Index Tables - Available tables and joining patterns
8484
- Installation - Package setup and version verification
85-
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch)
85+
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations)
8686
- Best Practices - Usage guidelines
8787
- Troubleshooting - Common issues and solutions
8888

@@ -91,7 +91,7 @@ print(stats)
9191
| Guide | When to Load |
9292
|-------|--------------|
9393
| `index_tables_guide.md` | Complex JOINs, schema discovery, DataFrame access |
94-
| `use_cases.md` | End-to-end workflow examples (training datasets, batch downloads) |
94+
| `use_cases.md` | End-to-end workflows: training datasets, batch downloads, DICOM reading with pydicom/SimpleITK, pipeline integration |
9595
| `sql_patterns.md` | Quick SQL patterns for filter discovery, annotations, size estimation |
9696
| `clinical_data_guide.md` | Clinical/tabular data, imaging+clinical joins, value mapping |
9797
| `cloud_storage_guide.md` | Direct S3/GCS access, versioning, UUID mapping |
@@ -126,6 +126,25 @@ The `idc-index` package provides multiple metadata index tables, accessible via
126126

127127
**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.
128128

129+
```python
130+
from idc_index import IDCClient
131+
132+
client = IDCClient()
133+
134+
# Find which table(s) contain a specific column (no fetch required)
135+
target = "SliceThickness"
136+
for table_name, info in client.indices_overview.items():
137+
    if any(c["name"] == target for c in info["schema"]["columns"]):
138+
        print(f"'{target}' is in: {table_name}")
139+
# → 'SliceThickness' is in: ct_index
140+
141+
# List all columns in a table from the schema (no fetch required)
142+
ct_cols = [c["name"] for c in client.indices_overview["ct_index"]["schema"]["columns"]]
143+
print("ct_index columns:", ct_cols)
144+
# → ['SeriesInstanceUID', 'PixelSpacing_row_mm', 'PixelSpacing_col_mm', 'Rows',
145+
# 'Columns', 'SliceThickness', 'KVP', 'ConvolutionKernel', ...]
146+
```
147+
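The column lookup shown in the snippet above can be wrapped in a small reusable helper. The sketch below is an illustration, not part of idc-index: `find_tables_with_column` is a hypothetical name, and it assumes only the nested layout used in that snippet (`overview[table]["schema"]["columns"]`, each column a dict with a `"name"` key). The small `sample_overview` dict is a made-up stand-in for `client.indices_overview` so the example runs without the package installed.

```python
def find_tables_with_column(overview, column):
    """Return the sorted names of tables whose schema lists `column`.

    `overview` is assumed to be shaped like client.indices_overview:
    {table_name: {"schema": {"columns": [{"name": ...}, ...]}}}
    """
    matches = []
    for table_name, info in overview.items():
        columns = info.get("schema", {}).get("columns", [])
        if any(c.get("name") == column for c in columns):
            matches.append(table_name)
    return sorted(matches)


# Hypothetical stand-in for client.indices_overview (two tables, trimmed schemas)
sample_overview = {
    "index": {"schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "Modality"}]}},
    "ct_index": {"schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "SliceThickness"}]}},
}

print(find_tables_with_column(sample_overview, "SliceThickness"))     # → ['ct_index']
print(find_tables_with_column(sample_overview, "SeriesInstanceUID"))  # → ['ct_index', 'index']
```

With a real client, passing `client.indices_overview` as the first argument should behave the same way, provided the installed version exposes the schema in this layout.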
129148
### Available Tables
130149

131150
Always call `client.fetch_index("table_name")` before querying any index table — it is safe and idempotent for all tables, including those loaded automatically at startup.
@@ -145,6 +164,9 @@ Always call `client.fetch_index("table_name")` before querying any index table
145164
| `contrast_index` | 1 row = 1 series with contrast info | Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF) |
146165
| `volume_geometry_index` | 1 row = 1 CT/MR/PT series | 3D volume geometry validation for single-frame CT, MR, and PT series; boolean checks for orientation, spacing, dimensions, and slice positions; composite `regularly_spaced_3d_volume` flag |
147166
| `rtstruct_index` | 1 row = 1 RTSTRUCT series | RT Structure Set metadata: total ROI count, ROI names, generation algorithms, interpreted types, and the referenced image series UID |
167+
| `ct_index` | 1 row = 1 CT series | CT acquisition/reconstruction parameters: pixel spacing, slice thickness, kVp, convolution kernel, tube current (min/max for dose-modulated), exposure, spiral pitch, scan options |
168+
| `mr_index` | 1 row = 1 MR series | MR acquisition/sequence parameters: field strength, scanning sequence, TE (array for multi-echo), TR, flip angle, DiffusionBValue (array for DWI), pixel bandwidth, receive coil, number of temporal positions |
169+
| `pt_index` | 1 row = 1 PET series | PET acquisition/reconstruction/radiopharmaceutical parameters: series type, units, decay/scatter/attenuation correction, reconstruction method, radionuclide, injected dose, frame duration (array for dynamic PET) |
148170
| `prior_versions_index` | 1 row = 1 DICOM series | Series that have been removed or superseded in previous IDC releases; use only to download deprecated/historical data — do not query for current data |
149171

150172
### Joining Tables
@@ -161,11 +183,13 @@ Always call `client.fetch_index("table_name")` before querying any index table
161183
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
162184
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
163185
| `Modality` | index, prior_versions_index | Filter by imaging modality |
164-
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index | Link segmentation/annotation/contrast series to its index metadata |
186+
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index, volume_geometry_index | Link series to seg/ann/contrast/geometry index tables |
165187
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
166188
| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
167-
| `SeriesInstanceUID` | index, volume_geometry_index | Link series to its 3D geometry validation result (join index.SeriesInstanceUID = volume_geometry_index.SeriesInstanceUID) |
168189
| `SeriesInstanceUID` / `referenced_SeriesInstanceUID` | index, rtstruct_index | Join RTSTRUCT series to its metadata (index.SeriesInstanceUID = rtstruct_index.SeriesInstanceUID); use rtstruct_index.referenced_SeriesInstanceUID to find the source image series |
190+
| `SeriesInstanceUID` | index, ct_index | Link CT series to acquisition/reconstruction parameters |
191+
| `SeriesInstanceUID` | index, mr_index | Link MR series to sequence/acquisition parameters |
192+
| `SeriesInstanceUID` | index, pt_index | Link PET series to acquisition/radiopharmaceutical parameters |
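Because each modality-specific table joins to `index` on `SeriesInstanceUID`, the join query has the same shape for all three. The sketch below only builds that SQL string. `build_modality_join_sql` and its parameters are hypothetical names introduced for illustration, not part of idc-index, and any column names passed in should first be checked against `client.indices_overview`.

```python
import re

# Modality-specific tables named in this document
MODALITY_TABLES = ("ct_index", "mr_index", "pt_index")
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_modality_join_sql(table, columns, collection_id):
    """Build a query joining `index` to one modality table on SeriesInstanceUID."""
    if table not in MODALITY_TABLES:
        raise ValueError(f"unknown modality table: {table!r}")
    if not columns:
        raise ValueError("at least one column is required")
    for name in columns:
        # Accept plain identifiers only, so nothing unexpected reaches the SQL text
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain column name: {name!r}")
    selected = ", ".join(f"m.{name}" for name in columns)
    safe_collection = collection_id.replace("'", "''")  # escape single quotes
    return (
        f"SELECT i.SeriesInstanceUID, i.Modality, {selected}\n"
        f"FROM index i\n"
        f"JOIN {table} m USING (SeriesInstanceUID)\n"
        f"WHERE i.collection_id = '{safe_collection}'"
    )


print(build_modality_join_sql("ct_index", ["SliceThickness", "KVP"], "your_collection"))
```

The returned string is intended for `client.sql_query()` after calling `client.fetch_index()` on the chosen table; `'your_collection'` is a placeholder, as in the rest of this document.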
169193

170194
**Note:** `subjects`, `updated`, and `description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
171195

@@ -237,14 +261,6 @@ pip install --upgrade idc-index
237261

238262
**Important:** Every new IDC data release triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.
239263

240-
**IMPORTANT:** IDC data version v24 is current. Always verify your version:
241-
```python
242-
print(client.get_idc_version()) # Should return "v24"
243-
```
244-
If you see an older version, upgrade with: `pip install --upgrade idc-index`
245-
246-
**Tested with:** idc-index 0.12.2 (IDC data version v24)
247-
248264
**Optional (for data analysis):**
249265
```bash
250266
# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
@@ -372,7 +388,16 @@ results = client.sql_query("""
372388

373389
### 3. Downloading DICOM Files
374390

375-
Download imaging data efficiently from IDC's cloud storage:
391+
Download imaging data efficiently from IDC's cloud storage.
392+
393+
**IMPORTANT — two download methods with different signatures:**
394+
395+
| Method | First arg | Second arg | Use when |
396+
|--------|-----------|------------|----------|
397+
| `download_from_selection` | `downloadDir` (required) | filter kwargs (optional) | Filtering by collection, patient, study, or series |
398+
| `download_dicom_series` | `seriesInstanceUID` (required) | `downloadDir` (required) | Downloading specific series by UID only |
399+
400+
**`download_from_selection` takes filter keyword arguments, NOT a DataFrame.** The name "from_selection" refers to filtering the IDC index by criteria — not accepting a pandas DataFrame. To download the results of a query, extract UIDs from the DataFrame and pass them as a list.
376401

377402
**Download entire collection:**
378403
```python
@@ -381,15 +406,16 @@ from idc_index import IDCClient
381406
client = IDCClient()
382407

383408
# Download small collection (RIDER Pilot ~1GB)
409+
# downloadDir is the FIRST positional argument
384410
client.download_from_selection(
385-
collection_id="rider_pilot",
386-
downloadDir="./data/rider"
411+
downloadDir="./data/rider",
412+
collection_id="rider_pilot"
387413
)
388414
```
389415

390-
**Download specific series:**
416+
**Download specific series (from a query result):**
391417
```python
392-
# First, query for series UIDs
418+
# Step 1: Query for series UIDs
393419
series_df = client.sql_query("""
394420
SELECT SeriesInstanceUID
395421
FROM index
@@ -399,11 +425,27 @@ series_df = client.sql_query("""
399425
LIMIT 5
400426
""")
401427

402-
# Download only those series
428+
# Step 2: Extract UIDs as a list from the DataFrame
429+
uids = list(series_df['SeriesInstanceUID'].values)
430+
431+
# Step 3: Pass the list to download_from_selection (NOT the DataFrame itself)
403432
client.download_from_selection(
404-
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
433+
downloadDir="./data/lung_ct",
434+
seriesInstanceUID=uids # list of strings, not a DataFrame
435+
)
436+
437+
# Alternative: download_dicom_series has seriesInstanceUID as FIRST arg (different order!)
438+
client.download_dicom_series(
439+
seriesInstanceUID=uids, # FIRST arg here
405440
downloadDir="./data/lung_ct"
406441
)
442+
443+
# Download from Google Storage instead of AWS
444+
client.download_from_selection(
445+
downloadDir="./data/lung_ct",
446+
seriesInstanceUID=uids,
447+
source_bucket_location="gcs"
448+
)
407449
```
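The extraction of UIDs from a query result (Step 2 in the example above) can be made a little more defensive. The sketch below is an illustration, not part of idc-index: `to_uid_list` is a hypothetical helper that accepts any iterable of UIDs (for example the `SeriesInstanceUID` column of a query result), drops missing entries and duplicates, and returns plain Python strings in their original order. The UIDs in the example are made up.

```python
def to_uid_list(uid_column):
    """Return unique, non-empty UIDs as a list of str, preserving order."""
    if isinstance(uid_column, str):
        # A single UID string would otherwise be iterated character by character
        return [uid_column]
    seen = {}
    for uid in uid_column:
        # Skip None and NaN (NaN is the only value that is not equal to itself)
        if uid is None or uid != uid:
            continue
        text = str(uid).strip()
        if text:
            seen.setdefault(text, None)  # dicts keep insertion order
    return list(seen)


# Hypothetical UIDs, for illustration only
raw = ["1.2.3", "1.2.4", "1.2.3", None, " 1.2.5 "]
print(to_uid_list(raw))  # → ['1.2.3', '1.2.4', '1.2.5']
```

The resulting list can be passed as `seriesInstanceUID=` in the download calls shown above.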
408450

409451
**Custom directory structure:**
@@ -413,16 +455,16 @@ Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%S
413455
```python
414456
# Simplified hierarchy (omit StudyInstanceUID level)
415457
client.download_from_selection(
416-
collection_id="tcga_luad",
417458
downloadDir="./data",
459+
collection_id="tcga_luad",
418460
dirTemplate="%collection_id/%PatientID/%Modality"
419461
)
420462
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/
421463

422464
# Flat structure (all files in one directory)
423465
client.download_from_selection(
424-
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
425466
downloadDir="./data/flat",
467+
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
426468
dirTemplate=""
427469
)
428470
# Results in: ./data/flat/*.dcm
@@ -606,13 +648,7 @@ bibtex_citations = client.citations_from_selection(
606648

607649
**Best practice:** When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.
608650

609-
### 6. Batch Processing and Filtering
610-
611-
For large downloads, query first to build a manifest, save it to CSV for reproducibility, then iterate over slices of the result DataFrame with `download_from_selection()` using a `batch_size` of 10–20 series to avoid timeouts.
612-
613-
See `references/use_cases.md` (Use Case 5) for a complete worked example with manufacturer filtering, manifest saving, and batched downloads.
614-
615-
### 7. Advanced Queries with BigQuery
651+
### 6. Advanced Queries with BigQuery
616652

617653
For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. Requires GCP account with billing enabled.
618654

@@ -638,7 +674,7 @@ Common specialized indices: `seg_index` (segmentations), `ann_index` / `ann_grou
638674

639675
See `references/bigquery_guide.md` for schemas, column descriptions, and query examples for these tables.
640676

641-
### 8. Tool Selection Guide
677+
### 7. Tool Selection Guide
642678

643679
| Task | Tool | Reference |
644680
|------|------|-----------|
@@ -649,20 +685,6 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
649685

650686
**Default choice:** Use `idc-index` for most tasks (no auth, easy API, batch downloads).
651687

652-
### 9. Integration with Analysis Pipelines
653-
654-
After downloading DICOM files, use `pydicom` to read individual files or build 3D numpy arrays sorted by `ImagePositionPatient`. For a more robust reader with automatic series sorting and ITK image output, use `SimpleITK.ImageSeriesReader`.
655-
656-
See `references/use_cases.md` (Use Case 6) for code examples reading DICOM with pydicom, building 3D CT volumes, and integrating with SimpleITK.
657-
658-
## Common Use Cases
659-
660-
See `references/use_cases.md` for complete end-to-end workflow examples including:
661-
- Building deep learning training datasets from lung CT scans
662-
- Comparing image quality across scanner manufacturers
663-
- Previewing data in browser before downloading
664-
- License-aware batch downloads for commercial use
665-
666688
## Best Practices
667689

668690
- **Never use web search for IDC data content questions** - Always query the idc-index directly using `client.sql_query()`. Web sources (release notes, blog posts, documentation pages) are frequently out of date and will produce incorrect answers. The local DuckDB index is the authoritative source; use it even when web search is available.
@@ -700,6 +722,25 @@ See `references/use_cases.md` for complete end-to-end workflow examples includin
700722
- Use `LIMIT 5` to test query first
701723
- Check field names against metadata schema documentation
702724

725+
**Issue: Column not found in `index` table (e.g., `SliceThickness`, `PixelSpacing`, `KVP`, `EchoTime`, `InjectedDose`)**
726+
- **Cause:** The `index` table contains series-level metadata only; modality-specific acquisition and reconstruction parameters live in dedicated tables (`ct_index`, `mr_index`, `pt_index`)
727+
- **Solution:** Search `client.indices_overview` to find the right table, then fetch and join on `SeriesInstanceUID`:
728+
```python
729+
target = "SliceThickness"
730+
for table_name, info in client.indices_overview.items():
731+
    if any(c["name"] == target for c in info["schema"]["columns"]):
732+
        print(f"Found in: {table_name}")
733+
# → Found in: ct_index
734+
735+
client.fetch_index("ct_index")
736+
result = client.sql_query("""
737+
SELECT i.SeriesInstanceUID, i.Modality, c.SliceThickness, c.KVP, c.PixelSpacing_row_mm
738+
FROM index i
739+
JOIN ct_index c USING (SeriesInstanceUID)
740+
WHERE i.collection_id = 'your_collection'
741+
""")
742+
```
743+
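When the modality of a series is already known, the table to join can be picked directly instead of searched for. The mapping below is a sketch based on the table descriptions in this document (`ct_index` for CT, `mr_index` for MR, `pt_index` for PET series, which carry the modality code `PT`); `table_for_modality` is a hypothetical helper name, not an idc-index function.

```python
# Modality code -> modality-specific index table, per the tables listed in this document
MODALITY_TO_TABLE = {"CT": "ct_index", "MR": "mr_index", "PT": "pt_index"}


def table_for_modality(modality):
    """Return the modality-specific index table, or None if there is none."""
    return MODALITY_TO_TABLE.get(modality.strip().upper())


for code in ("CT", "mr", "SM"):
    print(code, "->", table_for_modality(code))
# → CT -> ct_index
# → mr -> mr_index
# → SM -> None
```

A `None` result means the parameter of interest is unlikely to live in one of these three tables, and a search of `client.indices_overview` is the safer route.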
703744
**Issue: Downloaded DICOM files won't open**
704745
- **Cause:** Corrupted download or incompatible viewer
705746
- **Solution:**
@@ -718,38 +759,10 @@ See `references/sql_patterns.md` for quick-reference SQL patterns including:
718759
- Download size estimation
719760
- Clinical data linking
720761

721-
For segmentation and annotation details, also see `references/digital_pathology_guide.md`.
722-
723-
## Related Skills
724-
725-
The following skills complement IDC workflows for downstream analysis and visualization:
726-
727-
### DICOM Processing
728-
- **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).
729-
730-
### Pathology and Slide Microscopy
731-
See `references/digital_pathology_guide.md` for DICOM-compatible tools (highdicom, wsidicom, TIA-Toolbox, Slim viewer).
732-
733-
### Metadata Visualization
734-
- **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
735-
- **seaborn** - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults.
736-
- **plotly** - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics.
737-
738-
### Data Exploration
739-
- **exploratory-data-analysis** - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis.
762+
For digital pathology topics, see `references/digital_pathology_guide.md`.
740763

741764
## Resources
742765

743-
### Schema Reference (Primary Source)
744-
745-
**Always use `client.indices_overview` for current column schemas.** This ensures accuracy with the installed idc-index version:
746-
747-
```python
748-
# Get all column names and types for any table
749-
schema = client.indices_overview["index"]["schema"]
750-
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
751-
```
752-
753766
### Reference Documentation
754767

755768
See the Quick Navigation section at the top for the full list of reference guides with decision triggers.
