Merged
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,27 @@ All notable changes to the IDC Claude Skill are documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/),
and this project adheres to [Semantic Versioning](https://semver.org/).

## [1.6.3] - 2026-05-09

### Added

- `ct_index`, `mr_index`, `pt_index` tables (idc-index 0.12.3 / idc-index-data 24.2.0): modality-specific acquisition and reconstruction parameter indices, one row per series, all joining on `SeriesInstanceUID`
- `ct_index` (21 columns): pixel spacing, slice thickness, kVp, convolution kernel, tube current min/max (dose-modulated), exposure, spiral pitch, scan options
- `mr_index` (22 columns): field strength, scanning sequence, TE (array for multi-echo), TR, flip angle, DiffusionBValue (array for DWI), pixel bandwidth, receive coil, number of temporal positions
- `pt_index` (21 columns): radionuclide, injected dose, reconstruction method, decay/scatter/attenuation correction, frame duration (array for dynamic PET), number of time slices
- SQL query patterns for all three new tables in `references/sql_patterns.md`
- Join column entries for `ct_index`, `mr_index`, `pt_index` in `references/index_tables_guide.md` and SKILL.md
- Parquet file entries for `ct_index.parquet`, `mr_index.parquet`, `pt_index.parquet` in `references/parquet_access_guide.md`

### Changed

- Added concrete `indices_overview` code example showing how to search for a column across all tables and read column schemas without fetching the table; directly addresses the failure mode where agents query `index` for modality-specific parameters (SliceThickness, KVP, etc.) instead of using `ct_index`/`mr_index`/`pt_index`
- Added troubleshooting entry "Column not found in `index` table" with a working `indices_overview` search snippet and join example, covering common acquisition/reconstruction parameters that live in the modality-specific index tables
- Updated idc-index reference to 0.12.3
- Clarified the `download_from_selection` API: added an explicit warning that it takes filter keyword arguments (not a DataFrame), added a comparison table against `download_dicom_series` (whose first argument differs), and restructured the download example as a step-by-step flow: query → extract UIDs → pass list
- Documented `download_dicom_series` as an alternative download method with its own signature (`seriesInstanceUID` as first arg, then `downloadDir`)
- Reduced duplication in SKILL.md for easier reading

## [1.6.2] - 2026-05-08

### Changed
161 changes: 87 additions & 74 deletions SKILL.md
@@ -3,9 +3,9 @@ name: imaging-data-commons
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
metadata:
version: 1.6.2
version: 1.6.3
skill-author: Andrey Fedorov, @fedorov
idc-index: "0.12.2"
idc-index: "0.12.3"
idc-data-version: "v24"
repository: https://github.com/ImagingDataCommons/idc-claude-skill
---
@@ -82,7 +82,7 @@ print(stats)
- IDC Data Model - Collection and analysis result hierarchy
- Index Tables - Available tables and joining patterns
- Installation - Package setup and version verification
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch)
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations)
- Best Practices - Usage guidelines
- Troubleshooting - Common issues and solutions

@@ -91,7 +91,7 @@ print(stats)
| Guide | When to Load |
|-------|--------------|
| `index_tables_guide.md` | Complex JOINs, schema discovery, DataFrame access |
| `use_cases.md` | End-to-end workflow examples (training datasets, batch downloads) |
| `use_cases.md` | End-to-end workflows: training datasets, batch downloads, DICOM reading with pydicom/SimpleITK, pipeline integration |
| `sql_patterns.md` | Quick SQL patterns for filter discovery, annotations, size estimation |
| `clinical_data_guide.md` | Clinical/tabular data, imaging+clinical joins, value mapping |
| `cloud_storage_guide.md` | Direct S3/GCS access, versioning, UUID mapping |
@@ -126,6 +126,25 @@ The `idc-index` package provides multiple metadata index tables, accessible via

**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.

```python
from idc_index import IDCClient

client = IDCClient()

# Find which table(s) contain a specific column (no fetch required)
target = "SliceThickness"
for table_name, info in client.indices_overview.items():
    if any(c["name"] == target for c in info["schema"]["columns"]):
        print(f"'{target}' is in: {table_name}")
# → 'SliceThickness' is in: ct_index

# List all columns in a table from the schema (no fetch required)
ct_cols = [c["name"] for c in client.indices_overview["ct_index"]["schema"]["columns"]]
print("ct_index columns:", ct_cols)
# → ['SeriesInstanceUID', 'PixelSpacing_row_mm', 'PixelSpacing_col_mm', 'Rows',
# 'Columns', 'SliceThickness', 'KVP', 'ConvolutionKernel', ...]
```
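
The same lookup extends to several columns at once. A minimal sketch, assuming `indices_overview` has the shape used above (`info["schema"]["columns"]` is a list of dicts with a `name` key). `locate_columns` and the `sample_overview` stand-in are hypothetical names for illustration, not part of idc-index:

```python
def locate_columns(overview, targets):
    """Map each target column name to the list of tables whose schema contains it."""
    found = {target: [] for target in targets}
    for table_name, info in overview.items():
        names = {c["name"] for c in info["schema"]["columns"]}
        for target in targets:
            if target in names:
                found[target].append(table_name)
    return found


# Hypothetical stand-in with the same shape as client.indices_overview;
# with a real client, pass client.indices_overview instead.
sample_overview = {
    "index": {"schema": {"columns": [{"name": "SeriesInstanceUID"}, {"name": "Modality"}]}},
    "ct_index": {"schema": {"columns": [{"name": "SeriesInstanceUID"},
                                        {"name": "SliceThickness"},
                                        {"name": "KVP"}]}},
}

located = locate_columns(sample_overview, ["SliceThickness", "Modality", "EchoTime"])
print(located)
# → {'SliceThickness': ['ct_index'], 'Modality': ['index'], 'EchoTime': []}
```

An empty list signals that the column is not in any table of the overview that was passed in, which is a cue to check the spelling or fetch a different table.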

### Available Tables

Always call `client.fetch_index("table_name")` before querying any index table — it is safe and idempotent for all tables, including those loaded automatically at startup.
@@ -145,6 +164,9 @@ Always call `client.fetch_index("table_name")` before querying any index table
| `contrast_index` | 1 row = 1 series with contrast info | Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF) |
| `volume_geometry_index` | 1 row = 1 CT/MR/PT series | 3D volume geometry validation for single-frame CT, MR, and PT series; boolean checks for orientation, spacing, dimensions, and slice positions; composite `regularly_spaced_3d_volume` flag |
| `rtstruct_index` | 1 row = 1 RTSTRUCT series | RT Structure Set metadata: total ROI count, ROI names, generation algorithms, interpreted types, and the referenced image series UID |
| `ct_index` | 1 row = 1 CT series | CT acquisition/reconstruction parameters: pixel spacing, slice thickness, kVp, convolution kernel, tube current (min/max for dose-modulated), exposure, spiral pitch, scan options |
| `mr_index` | 1 row = 1 MR series | MR acquisition/sequence parameters: field strength, scanning sequence, TE (array for multi-echo), TR, flip angle, DiffusionBValue (array for DWI), pixel bandwidth, receive coil, number of temporal positions |
| `pt_index` | 1 row = 1 PET series | PET acquisition/reconstruction/radiopharmaceutical parameters: series type, units, decay/scatter/attenuation correction, reconstruction method, radionuclide, injected dose, frame duration (array for dynamic PET) |
| `prior_versions_index` | 1 row = 1 DICOM series | Series that have been removed or superseded in previous IDC releases; use only to download deprecated/historical data — do not query for current data |
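
Choosing the parameter table from a series' `Modality` value can be sketched as below. The mapping follows the `ct_index`, `mr_index`, and `pt_index` rows above (`PT` is the DICOM modality code for PET). `PARAMETER_TABLES` and `parameter_table_for` are hypothetical helper names, not idc-index API:

```python
# Modality code -> modality-specific parameter table (per the table above)
PARAMETER_TABLES = {"CT": "ct_index", "MR": "mr_index", "PT": "pt_index"}


def parameter_table_for(modality):
    """Return the modality-specific index table name, or None if there is none."""
    return PARAMETER_TABLES.get(modality.strip().upper())


print(parameter_table_for("mr"))
# → mr_index

# With a real client (not run here):
#   table = parameter_table_for("CT")
#   if table is not None:
#       client.fetch_index(table)
```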

### Joining Tables
@@ -161,11 +183,13 @@ Always call `client.fetch_index("table_name")` before querying any index table
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
| `Modality` | index, prior_versions_index | Filter by imaging modality |
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index | Link segmentation/annotation/contrast series to its index metadata |
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index, volume_geometry_index | Link series to seg/ann/contrast/geometry index tables |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
| `SeriesInstanceUID` | index, volume_geometry_index | Link series to its 3D geometry validation result (join index.SeriesInstanceUID = volume_geometry_index.SeriesInstanceUID) |
| `SeriesInstanceUID` / `referenced_SeriesInstanceUID` | index, rtstruct_index | Join RTSTRUCT series to its metadata (index.SeriesInstanceUID = rtstruct_index.SeriesInstanceUID); use rtstruct_index.referenced_SeriesInstanceUID to find the source image series |
| `SeriesInstanceUID` | index, ct_index | Link CT series to acquisition/reconstruction parameters |
| `SeriesInstanceUID` | index, mr_index | Link MR series to sequence/acquisition parameters |
| `SeriesInstanceUID` | index, pt_index | Link PET series to acquisition/radiopharmaceutical parameters |

**Note:** `subjects`, `updated`, and `description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
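
The `SeriesInstanceUID` joins in the table above all follow one pattern, so the query text can be assembled by a small helper. This is a sketch only: `build_series_join` is a hypothetical function, and it assumes the requested columns exist in the target table (check `client.indices_overview` first) and that the table has already been fetched with `client.fetch_index`:

```python
def build_series_join(table, columns, collection_id):
    """Build SQL joining `index` to another index table on SeriesInstanceUID."""
    if not columns:
        raise ValueError("at least one column from the joined table is required")
    selected = ", ".join(f"t.{name}" for name in columns)
    safe_collection = collection_id.replace("'", "''")  # escape single quotes
    return (
        f"SELECT i.SeriesInstanceUID, i.Modality, {selected}\n"
        f"FROM index i\n"
        f"JOIN {table} t USING (SeriesInstanceUID)\n"
        f"WHERE i.collection_id = '{safe_collection}'"
    )


query = build_series_join("ct_index", ["KVP", "ConvolutionKernel"], "your_collection")
print(query)

# With a real client (not run here):
#   client.fetch_index("ct_index")
#   result = client.sql_query(query)
```

Only the collection value is escaped here; table and column names are interpolated as-is, so they should come from the schema rather than from user input.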

@@ -237,14 +261,6 @@ pip install --upgrade idc-index

**Important:** Each new IDC data release triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.
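
One way to check the installed package against the release this skill references (0.12.3) is sketched below. It uses the standard-library `importlib.metadata`; `parse_version` and `is_at_least` are hypothetical helpers that assume plain dotted numeric versions and stop at the first non-numeric part, so a pre-release such as `0.12.3rc1` is compared as `0.12`:

```python
from importlib import metadata


def parse_version(text):
    """Turn a dotted release string such as '0.12.3' into a tuple of ints."""
    parts = []
    for part in text.split("."):
        if not part.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(part))
    return tuple(parts)


def is_at_least(installed, required):
    """True if the installed version is the required version or newer."""
    return parse_version(installed) >= parse_version(required)


try:
    installed = metadata.version("idc-index")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("idc-index is not installed")
elif is_at_least(installed, "0.12.3"):
    print(f"idc-index {installed} is recent enough")
else:
    print(f"idc-index {installed} is older than 0.12.3; run pip install --upgrade idc-index")
```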

**IMPORTANT:** IDC data version v24 is current. Always verify your version:
```python
print(client.get_idc_version()) # Should return "v24"
```
If you see an older version, upgrade with: `pip install --upgrade idc-index`

**Tested with:** idc-index 0.12.2 (IDC data version v24)

**Optional (for data analysis):**
```bash
# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
@@ -372,7 +388,16 @@ results = client.sql_query("""

### 3. Downloading DICOM Files

Download imaging data efficiently from IDC's cloud storage:
Download imaging data efficiently from IDC's cloud storage.

**IMPORTANT — two download methods with different signatures:**

| Method | First arg | Second arg | Use when |
|--------|-----------|------------|----------|
| `download_from_selection` | `downloadDir` (required) | filter kwargs (optional) | Filtering by collection, patient, study, or series |
| `download_dicom_series` | `seriesInstanceUID` (required) | `downloadDir` (required) | Downloading specific series by UID only |

**`download_from_selection` takes filter keyword arguments, NOT a DataFrame.** The name "from_selection" refers to filtering the IDC index by criteria — not accepting a pandas DataFrame. To download the results of a query, extract UIDs from the DataFrame and pass them as a list.
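
The "extract UIDs and pass them as a list" step can be sketched as a small helper that accepts any iterable, including a DataFrame column. `to_uid_list` is a hypothetical name and the UIDs shown are made-up placeholders:

```python
def to_uid_list(values):
    """Return unique, non-empty UIDs as a list of str, preserving first-seen order."""
    seen = {}
    for value in values:
        if value is None:
            continue
        uid = str(value).strip()
        if uid:
            seen.setdefault(uid, None)  # dict keys keep insertion order
    return list(seen)


# With a query result: uids = to_uid_list(series_df["SeriesInstanceUID"])
uids = to_uid_list(["1.2.3", "1.2.4", "1.2.3"])
print(uids)
# → ['1.2.3', '1.2.4']
```

The resulting plain list of strings is what the `seriesInstanceUID` keyword described above expects, rather than the DataFrame itself.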

**Download entire collection:**
```python
@@ -381,15 +406,16 @@ from idc_index import IDCClient
client = IDCClient()

# Download small collection (RIDER Pilot ~1GB)
# downloadDir is the FIRST positional argument
client.download_from_selection(
collection_id="rider_pilot",
downloadDir="./data/rider"
downloadDir="./data/rider",
collection_id="rider_pilot"
)
```

**Download specific series:**
**Download specific series (from a query result):**
```python
# First, query for series UIDs
# Step 1: Query for series UIDs
series_df = client.sql_query("""
SELECT SeriesInstanceUID
FROM index
@@ -399,11 +425,27 @@ series_df = client.sql_query("""
LIMIT 5
""")

# Download only those series
# Step 2: Extract UIDs as a list from the DataFrame
uids = list(series_df['SeriesInstanceUID'].values)

# Step 3: Pass the list to download_from_selection (NOT the DataFrame itself)
client.download_from_selection(
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
downloadDir="./data/lung_ct",
seriesInstanceUID=uids # list of strings, not a DataFrame
)

# Alternative: download_dicom_series has seriesInstanceUID as FIRST arg (different order!)
client.download_dicom_series(
seriesInstanceUID=uids, # FIRST arg here
downloadDir="./data/lung_ct"
)

# Download from Google Storage instead of AWS
client.download_from_selection(
downloadDir="./data/lung_ct",
seriesInstanceUID=uids,
source_bucket_location="gcs"
)
```

**Custom directory structure:**
@@ -413,16 +455,16 @@ Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%S
```python
# Simplified hierarchy (omit StudyInstanceUID level)
client.download_from_selection(
collection_id="tcga_luad",
downloadDir="./data",
collection_id="tcga_luad",
dirTemplate="%collection_id/%PatientID/%Modality"
)
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/

# Flat structure (all files in one directory)
client.download_from_selection(
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
downloadDir="./data/flat",
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
dirTemplate=""
)
# Results in: ./data/flat/*.dcm
@@ -606,13 +648,7 @@ bibtex_citations = client.citations_from_selection(

**Best practice:** When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.

### 6. Batch Processing and Filtering

For large downloads, query first to build a manifest, save it to CSV for reproducibility, then iterate over slices of the result DataFrame with `download_from_selection()` using a `batch_size` of 10–20 series to avoid timeouts.

See `references/use_cases.md` (Use Case 5) for a complete worked example with manufacturer filtering, manifest saving, and batched downloads.

### 7. Advanced Queries with BigQuery
### 6. Advanced Queries with BigQuery

For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. This requires a GCP account with billing enabled.

@@ -638,7 +674,7 @@ Common specialized indices: `seg_index` (segmentations), `ann_index` / `ann_grou

See `references/bigquery_guide.md` for schemas, column descriptions, and query examples for these tables.

### 8. Tool Selection Guide
### 7. Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
@@ -649,20 +685,6 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e

**Default choice:** Use `idc-index` for most tasks (no auth, easy API, batch downloads).

### 9. Integration with Analysis Pipelines

After downloading DICOM files, use `pydicom` to read individual files or build 3D numpy arrays sorted by `ImagePositionPatient`. For a more robust reader with automatic series sorting and ITK image output, use `SimpleITK.ImageSeriesReader`.

See `references/use_cases.md` (Use Case 6) for code examples reading DICOM with pydicom, building 3D CT volumes, and integrating with SimpleITK.

## Common Use Cases

See `references/use_cases.md` for complete end-to-end workflow examples including:
- Building deep learning training datasets from lung CT scans
- Comparing image quality across scanner manufacturers
- Previewing data in browser before downloading
- License-aware batch downloads for commercial use

## Best Practices

- **Never use web search for IDC data content questions** - Always query the idc-index directly using `client.sql_query()`. Web sources (release notes, blog posts, documentation pages) are frequently out of date and will produce incorrect answers. The local DuckDB index is the authoritative source; use it even when web search is available.
@@ -700,6 +722,25 @@ See `references/use_cases.md` for complete end-to-end workflow examples includin
- Use `LIMIT 5` to test query first
- Check field names against metadata schema documentation

**Issue: Column not found in `index` table (e.g., `SliceThickness`, `PixelSpacing`, `KVP`, `EchoTime`, `InjectedDose`)**
- **Cause:** The `index` table contains series-level metadata only; modality-specific acquisition and reconstruction parameters live in dedicated tables (`ct_index`, `mr_index`, `pt_index`)
- **Solution:** Search `client.indices_overview` to find the right table, then fetch and join on `SeriesInstanceUID`:
```python
# Assumes `client = IDCClient()` was created as in the earlier examples
target = "SliceThickness"
for table_name, info in client.indices_overview.items():
    if any(c["name"] == target for c in info["schema"]["columns"]):
        print(f"Found in: {table_name}")
# → Found in: ct_index

client.fetch_index("ct_index")
result = client.sql_query("""
SELECT i.SeriesInstanceUID, i.Modality, c.SliceThickness, c.KVP, c.PixelSpacing_row_mm
FROM index i
JOIN ct_index c USING (SeriesInstanceUID)
WHERE i.collection_id = 'your_collection'
""")
```

**Issue: Downloaded DICOM files won't open**
- **Cause:** Corrupted download or incompatible viewer
- **Solution:**
@@ -718,38 +759,10 @@ See `references/sql_patterns.md` for quick-reference SQL patterns including:
- Download size estimation
- Clinical data linking

For segmentation and annotation details, also see `references/digital_pathology_guide.md`.

## Related Skills

The following skills complement IDC workflows for downstream analysis and visualization:

### DICOM Processing
- **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).

### Pathology and Slide Microscopy
See `references/digital_pathology_guide.md` for DICOM-compatible tools (highdicom, wsidicom, TIA-Toolbox, Slim viewer).

### Metadata Visualization
- **matplotlib** - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
- **seaborn** - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults.
- **plotly** - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics.

### Data Exploration
- **exploratory-data-analysis** - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis.
For digital pathology topics, see `references/digital_pathology_guide.md`.

## Resources

### Schema Reference (Primary Source)

**Always use `client.indices_overview` for current column schemas.** This ensures accuracy with the installed idc-index version:

```python
# Get all column names and types for any table
schema = client.indices_overview["index"]["schema"]
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
```

### Reference Documentation

See the Quick Navigation section at the top for the full list of reference guides with decision triggers.