Merged
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,33 @@ All notable changes to the IDC Claude Skill are documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/),
and this project adheres to [Semantic Versioning](https://semver.org/).

+## [1.6.0] - 2026-05-07
+
+### Added
+
+- `tests/test_bq_snippets.py`: BigQuery snippet validation using `bq query --dry_run` — 33 tests covering all SQL examples in `references/bigquery_guide.md` (dicom_all, original_collections_metadata, segmentations, quantitative_measurements, qualitative_measurements, private elements, and clinical tables); skips automatically when `bq` CLI is unavailable or unauthenticated
+
+### Security
+
+- Fixed auto-upgrade subprocess call to pin `idc-index` to `REQUIRED_VERSION` (was `"idc-index"`, now `f"idc-index=={REQUIRED_VERSION}"`), ensuring the installed version always matches the tested version declared in the frontmatter
+- Added network access transparency note to Overview documenting expected external endpoints (GCS, S3, BigQuery, DICOMweb proxy, Google Healthcare API) and clarifying that no credentials or environment variables are accessed by the skill
+- Added tested-with version comment to optional dependency install block (`pandas>=1.5, numpy>=1.23, pydicom>=2.3`)
+
+### Changed
+
+- Updated frontmatter description to be directive about skill triggering: now explicitly instructs invocation for IDC-related queries even without the word "IDC" in the prompt
+- Extracted "Batch Processing and Filtering" (section 6) from SKILL.md to `references/use_cases.md` (Use Case 5); replaced inline code block with a 2-sentence summary and pointer
+- Extracted "Integration with Analysis Pipelines" (section 9) from SKILL.md to `references/use_cases.md` (Use Case 6); replaced inline pydicom/SimpleITK code blocks with a 2-sentence summary and pointer
+- SKILL.md reduced from 865 → 775 lines (−90 lines); `references/use_cases.md` expanded from 187 → 278 lines
+- Updated to idc-index 0.12.1 (idc-index-data 24.0.4, IDC data version v24)
+- IDC v24 adds 15 new collections (161 → 176), ~39K new series, ~4 TB new data (99.27 TB total, 85,682 cases)
+- Updated `collections_index` column names to snake_case (idc-index-data 24.0.0 breaking change):
+  `CancerTypes` → `cancer_types`, `TumorLocations` → `tumor_locations`,
+  `Subjects` → `subjects`, `Species` → `species`, `Sources` → `sources`,
+  `SupportingData` → `supporting_data`, `Program` → `program_id`
+- Updated `analysis_results_index` column names to snake_case (idc-index-data 24.0.4 breaking change):
+  `Subjects` → `subjects`, `Collections` → `collections`, `Modalities` → `modalities`
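
Queries written against the old column names will stop matching once the renamed indices are in use. A minimal migration sketch follows. The two mappings are copied from the renames listed above, while `RENAMES_BY_TABLE` and `migrate_column_names` are hypothetical names introduced for illustration and are not part of `idc-index`.

```python
# Old -> new column names, copied from the changelog entries above.
COLLECTIONS_INDEX_RENAMES = {
    "CancerTypes": "cancer_types",
    "TumorLocations": "tumor_locations",
    "Subjects": "subjects",
    "Species": "species",
    "Sources": "sources",
    "SupportingData": "supporting_data",
    "Program": "program_id",
}
ANALYSIS_RESULTS_INDEX_RENAMES = {
    "Subjects": "subjects",
    "Collections": "collections",
    "Modalities": "modalities",
}

# Hypothetical lookup keyed by index table name.
RENAMES_BY_TABLE = {
    "collections_index": COLLECTIONS_INDEX_RENAMES,
    "analysis_results_index": ANALYSIS_RESULTS_INDEX_RENAMES,
}


def migrate_column_names(table, columns):
    """Return `columns` with old names replaced by their snake_case names.

    Hypothetical helper: tables and columns not listed in the changelog
    are passed through unchanged.
    """
    renames = RENAMES_BY_TABLE.get(table, {})
    return [renames.get(name, name) for name in columns]


print(migrate_column_names("collections_index", ["collection_id", "CancerTypes", "Program"]))
```

The same dictionaries can be passed to pandas as `df.rename(columns=COLLECTIONS_INDEX_RENAMES)` when a DataFrame produced by older code needs to line up with the new schema.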

## [1.5.0] - 2026-04-08

### Added
139 changes: 25 additions & 114 deletions SKILL.md
@@ -1,12 +1,12 @@
---
name: imaging-data-commons
-description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
+description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
metadata:
-version: 1.4.0
+version: 1.6.0
skill-author: Andrey Fedorov, @fedorov
-idc-index: "0.11.14"
-idc-data-version: "v23"
+idc-index: "0.12.1"
+idc-data-version: "v24"
repository: https://github.com/ImagingDataCommons/idc-claude-skill
---

@@ -16,7 +16,9 @@ metadata:

Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.

-**Current IDC Data Version: v23** (always verify with `IDCClient().get_idc_version()`)
+**Expected network access:** `idc-index` queries a local DuckDB index (no network for metadata). File downloads use public GCS (`storage.googleapis.com`) and AWS S3 (`s3.amazonaws.com`) — no authentication required. DICOMweb access uses either the public IDC proxy (`proxy.imaging.datacommons.cancer.gov`, no auth) or the Google Cloud Healthcare API (`healthcare.googleapis.com`, requires GCP authentication). Optional BigQuery queries (`bigquery.googleapis.com`) also require GCP authentication. No credentials or environment variables are accessed by this skill.
+
+**Current IDC Data Version: v24** (always verify with `IDCClient().get_idc_version()`)

**Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))

@@ -25,13 +27,13 @@ Use the `idc-index` Python package to query and download public cancer imaging d
```python
import idc_index

-REQUIRED_VERSION = "0.11.14" # Must match metadata.idc-index in this file
+REQUIRED_VERSION = "0.12.1" # Must match metadata.idc-index in this file
installed = idc_index.__version__

if installed < REQUIRED_VERSION:
    print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
    import subprocess
-    subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
+    subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", f"idc-index=={REQUIRED_VERSION}"], check=True)
    print("Upgrade complete. Restart Python to use new version.")
else:
    print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")
@@ -43,7 +45,7 @@ else:
from idc_index import IDCClient
client = IDCClient()

-# Verify IDC data version (should be "v23")
+# Verify IDC data version (should be "v24")
print(f"IDC data version: {client.get_idc_version()}")

# Get collection count and total series
@@ -130,8 +132,8 @@ The `idc-index` package provides multiple metadata index tables, accessible via
|-------|-----------------|--------|-------------|
| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
-| `collections_index` | 1 row = 1 collection | Auto | Collection-level metadata and descriptions |
-| `analysis_results_index` | 1 row = 1 analysis result collection | Auto | Metadata about derived datasets (annotations, segmentations) |
+| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
+| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
| `clinical_index` | 1 row = 1 (collection, table, column) triple | fetch_index() | Dictionary mapping clinical data table columns to collections |
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
@@ -235,16 +237,17 @@ pip install --upgrade idc-index

**Important:** New IDC data release will always trigger a new version of `idc-index`. Always use `--upgrade` flag while installing, unless an older version is needed for reproducibility.

-**IMPORTANT:** IDC data version v23 is current. Always verify your version:
+**IMPORTANT:** IDC data version v24 is current. Always verify your version:
```python
-print(client.get_idc_version()) # Should return "v23"
+print(client.get_idc_version()) # Should return "v24"
```
If you see an older version, upgrade with: `pip install --upgrade idc-index`
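
A script that decides whether an upgrade is needed has to compare version numbers numerically, because comparing version strings with `<` is lexicographic and would treat `"0.9.0"` as newer than `"0.12.1"`. A minimal sketch, assuming plain dotted release versions: `version_tuple` is a hypothetical helper, it ignores pre-release suffixes such as `rc1`, and `packaging.version.Version` is likely the more robust choice wherever the `packaging` library is available.

```python
import re


def version_tuple(version):
    """Parse the leading numeric components of a dotted version string."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component (e.g. "dev0")
        parts.append(int(match.group()))
    return tuple(parts)


REQUIRED_VERSION = "0.12.1"

for installed in ("0.9.0", "0.11.14", "0.12.1"):
    needs_upgrade = version_tuple(installed) < version_tuple(REQUIRED_VERSION)
    print(installed, "needs upgrade" if needs_upgrade else "ok")
```

Tuples of integers compare element by element, so `(0, 9, 0)` sorts before `(0, 12, 1)` even though the string `"0.9.0"` sorts after `"0.12.1"`.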

-**Tested with:** idc-index 0.11.14 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)

**Optional (for data analysis):**
```bash
+# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
pip install pandas numpy pydicom
```
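
To compare an environment against the tested-with versions, the standard-library `importlib.metadata` module can report what is installed. A minimal sketch; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["pandas", "numpy", "pydicom"]).items():
    print(f"{name}: {ver or 'not installed'}")
```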

@@ -275,14 +278,14 @@ collections_summary = client.sql_query(query)
# For richer collection metadata, use collections_index
client.fetch_index("collections_index")
collections_info = client.sql_query("""
-SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
+SELECT collection_id, cancer_types, tumor_locations, species, subjects, supporting_data
FROM collections_index
""")

# For analysis results (annotations, segmentations), use analysis_results_index
client.fetch_index("analysis_results_index")
analysis_info = client.sql_query("""
-SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
+SELECT analysis_result_id, analysis_result_title, subjects, collections, modalities
FROM analysis_results_index
""")
```
@@ -351,7 +354,7 @@ results = client.sql_query("""
SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
FROM index i
JOIN collections_index c ON i.collection_id = c.collection_id
-WHERE c.CancerTypes LIKE '%Breast%'
+WHERE c.cancer_types LIKE '%Breast%'
AND i.Modality = 'MR'
LIMIT 20
""")
@@ -364,7 +367,7 @@ results = client.sql_query("""
- Descriptions: StudyDescription, SeriesDescription
- Licensing: license_short_name

-**Note:** Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.
+**Note:** Cancer type is in `collections_index.cancer_types`, not in the primary `index` table.

### 3. Downloading DICOM Files

@@ -604,43 +607,9 @@ bibtex_citations = client.citations_from_selection(

### 6. Batch Processing and Filtering

-Process large datasets efficiently with filtering:
+For large downloads, query first to build a manifest, save it to CSV for reproducibility, then iterate over slices of the result DataFrame with `download_from_selection()` using a `batch_size` of 10–20 series to avoid timeouts.

-```python
-from idc_index import IDCClient
-import pandas as pd
-
-client = IDCClient()
-
-# Find chest CT scans from GE scanners
-query = """
-SELECT
-  SeriesInstanceUID,
-  PatientID,
-  collection_id,
-  ManufacturerModelName
-FROM index
-WHERE Modality = 'CT'
-  AND BodyPartExamined = 'CHEST'
-  AND Manufacturer = 'GE MEDICAL SYSTEMS'
-  AND license_short_name = 'CC BY 4.0'
-LIMIT 100
-"""
-
-results = client.sql_query(query)
-
-# Save manifest for later
-results.to_csv('lung_ct_manifest.csv', index=False)
-
-# Download in batches to avoid timeout
-batch_size = 10
-for i in range(0, len(results), batch_size):
-    batch = results.iloc[i:i+batch_size]
-    client.download_from_selection(
-        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
-        downloadDir=f"./data/batch_{i//batch_size}"
-    )
-```
+See `references/use_cases.md` (Use Case 5) for a complete worked example with manufacturer filtering, manifest saving, and batched downloads.

### 7. Advanced Queries with BigQuery

@@ -681,67 +650,9 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e

### 9. Integration with Analysis Pipelines

-Integrate IDC data into imaging analysis workflows:
-
-**Read downloaded DICOM files:**
-```python
-import pydicom
-import os
-
-# Read DICOM files from downloaded series
-series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
-
-dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
-               if f.endswith('.dcm')]
-
-# Load first image
-ds = pydicom.dcmread(dicom_files[0])
-print(f"Patient ID: {ds.PatientID}")
-print(f"Modality: {ds.Modality}")
-print(f"Image shape: {ds.pixel_array.shape}")
-```
-
-**Build 3D volume from CT series:**
-```python
-import pydicom
-import numpy as np
-from pathlib import Path
-
-def load_ct_series(series_path):
-    """Load CT series as 3D numpy array"""
-    files = sorted(Path(series_path).glob('*.dcm'))
-    slices = [pydicom.dcmread(str(f)) for f in files]
-
-    # Sort by slice location
-    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
-
-    # Stack into 3D array
-    volume = np.stack([s.pixel_array for s in slices])
+After downloading DICOM files, use `pydicom` to read individual files or build 3D numpy arrays sorted by `ImagePositionPatient`. For a more robust reader with automatic series sorting and ITK image output, use `SimpleITK.ImageSeriesReader`.

-    return volume, slices[0] # Return volume and first slice for metadata
-
-volume, metadata = load_ct_series("./data/lung_ct/series_dir")
-print(f"Volume shape: {volume.shape}") # (z, y, x)
-```
-
-**Integrate with SimpleITK:**
-```python
-import SimpleITK as sitk
-from pathlib import Path
-
-# Read DICOM series
-series_path = "./data/ct_series"
-reader = sitk.ImageSeriesReader()
-dicom_names = reader.GetGDCMSeriesFileNames(series_path)
-reader.SetFileNames(dicom_names)
-image = reader.Execute()
-
-# Apply processing
-smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)
-
-# Save as NIfTI
-sitk.WriteImage(smoothed, "processed_volume.nii.gz")
-```
+See `references/use_cases.md` (Use Case 6) for code examples reading DICOM with pydicom, building 3D CT volumes, and integrating with SimpleITK.

## Common Use Cases

@@ -753,7 +664,7 @@ See `references/use_cases.md` for complete end-to-end workflow examples includin

## Best Practices

-- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v23). If using an older version, recommend `pip install --upgrade idc-index`
+- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v24). If using an older version, recommend `pip install --upgrade idc-index`
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
- **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure
2 changes: 1 addition & 1 deletion references/bigquery_guide.md
@@ -1,6 +1,6 @@
# BigQuery Guide for IDC

-**Tested with:** IDC data version v23
+**Tested with:** idc-index 0.12.1 (IDC data version v24)

For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.

2 changes: 1 addition & 1 deletion references/clinical_data_guide.md
@@ -1,6 +1,6 @@
# Clinical Data Guide for IDC

-**Tested with:** idc-index 0.11.7 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)

Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`.

4 changes: 2 additions & 2 deletions references/cloud_storage_guide.md
@@ -205,7 +205,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r

### How Versioning Works

-1. **Snapshots**: Each IDC version (v1, v2, ..., v23, etc.) represents a complete snapshot of all data at release time
+1. **Snapshots**: Each IDC version (v1, v2, ..., v24, etc.) represents a complete snapshot of all data at release time
2. **UUID-based**: When data changes, new CRDC UUIDs are assigned; old UUIDs remain accessible
3. **Cumulative buckets**: All versions coexist in the same buckets—old series folders

@@ -223,7 +223,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r

For querying version-specific metadata, BigQuery provides versioned tables. See `bigquery_guide.md` for details.
- `bigquery-public-data.idc_current` — alias to latest version
-- `bigquery-public-data.idc_v23` — specific version (replace 23 with desired version)
+- `bigquery-public-data.idc_v24` — specific version (replace 24 with desired version)
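
The dataset naming convention in the list above can be sketched as a small helper. This is a minimal illustration: `idc_bigquery_dataset` is a hypothetical name, not part of `idc-index` or the BigQuery client, and it only builds the dataset identifier without contacting BigQuery.

```python
def idc_bigquery_dataset(version=None):
    """Return the BigQuery dataset ID for an IDC release.

    `version` may be an int (24), a string ("24" or "v24"), or None for
    the `idc_current` alias. Hypothetical helper for illustration.
    """
    if version is None:
        return "bigquery-public-data.idc_current"
    number = int(str(version).lower().lstrip("v"))
    return f"bigquery-public-data.idc_v{number}"


print(idc_bigquery_dataset())
print(idc_bigquery_dataset("v24"))
print(idc_bigquery_dataset(23))
```

Pinning an explicit version rather than `idc_current` keeps a query reproducible after the next IDC release.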

### Reproducing a Previous Analysis

4 changes: 2 additions & 2 deletions references/dicomweb_guide.md
@@ -39,7 +39,7 @@ Replace `{VERSION}` with the IDC release number. To find the current version:
```python
from idc_index import IDCClient
client = IDCClient()
-print(client.get_idc_version()) # e.g., "23" for v23
+print(client.get_idc_version()) # e.g., "v24" for current version
```

- **~96% data coverage** - Only replicates data from `idc-open-data` bucket (missing ~4% from other buckets)
@@ -334,7 +334,7 @@ credentials, project = default()
credentials.refresh(Request())

# Build authenticated request
-base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v23/dicomWeb"
+base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v24/dicomWeb"

response = requests.get(
f"{base_url}/studies",
12 changes: 6 additions & 6 deletions references/digital_pathology_guide.md
@@ -1,6 +1,6 @@
# Digital Pathology Guide for IDC

-**Tested with:** IDC data version v23, idc-index 0.11.10
+**Tested with:** idc-index 0.12.1 (IDC data version v24)

For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.

@@ -251,12 +251,12 @@ client.sql_query("""
SELECT
ar.analysis_result_id,
ar.analysis_result_title,
-ar.Modalities,
-ar.Subjects,
-ar.Collections
+ar.modalities,
+ar.subjects,
+ar.collections
FROM analysis_results_index ar
-WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
-ORDER BY ar.Subjects DESC
+WHERE ar.modalities LIKE '%ANN%' OR ar.modalities LIKE '%SEG%'
+ORDER BY ar.subjects DESC
""")
```

4 changes: 2 additions & 2 deletions references/index_tables_guide.md
@@ -1,6 +1,6 @@
# Index Tables Guide for IDC

-**Tested with:** idc-index 0.11.14 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)

This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md.

@@ -34,7 +34,7 @@ results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

# Fetch and query additional indices
client.fetch_index("collections_index")
-collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
+collections = client.sql_query("SELECT collection_id, cancer_types, tumor_locations FROM collections_index")

client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
4 changes: 2 additions & 2 deletions references/sql_patterns.md
@@ -1,6 +1,6 @@
# SQL Query Patterns for IDC

-**Tested with:** idc-index 0.11.14 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)

Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.

@@ -74,7 +74,7 @@ client.sql_query("""
# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
-SELECT analysis_result_id, analysis_result_title, Collections, Modalities
+SELECT analysis_result_id, analysis_result_title, collections, modalities
FROM analysis_results_index
""")
