
Commit 1cf36bc

Merge pull request #16 from ImagingDataCommons/idc-v24
Update to IDC v24 + cleanup/improvements
2 parents e75d15f + 1d9495a commit 1cf36bc

13 files changed

Lines changed: 749 additions & 141 deletions

CHANGELOG.md

Lines changed: 27 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -5,6 +5,33 @@ All notable changes to the IDC Claude Skill are documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/),
66
and this project adheres to [Semantic Versioning](https://semver.org/).
77

8+
## [1.6.0] - 2026-05-07
9+
10+
### Added
11+
12+
- `tests/test_bq_snippets.py`: BigQuery snippet validation using `bq query --dry_run` — 33 tests covering all SQL examples in `references/bigquery_guide.md` (dicom_all, original_collections_metadata, segmentations, quantitative_measurements, qualitative_measurements, private elements, and clinical tables); skips automatically when `bq` CLI is unavailable or unauthenticated
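A minimal sketch of how a dry-run check of this kind could be wired up. This is not the repository's actual test code. It assumes the Google Cloud `bq` CLI with its `query --dry_run --use_legacy_sql=false` flags, and the helper names `bq_available`, `dry_run_command`, and `validate_sql` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def bq_available() -> bool:
    """Return True if a `bq` executable is on PATH (says nothing about auth)."""
    return shutil.which("bq") is not None


def dry_run_command(sql: str) -> List[str]:
    """Build the argv that asks BigQuery to validate `sql` without running it."""
    return ["bq", "query", "--dry_run", "--use_legacy_sql=false", sql]


def validate_sql(sql: str, timeout: float = 60.0) -> Optional[bool]:
    """Return True/False for a passing/failing dry run, or None if `bq` is missing.

    Assumption: a non-zero exit code means the query was rejected. An
    unauthenticated CLI also exits non-zero, so a real test would need to
    detect that case separately and skip instead of failing.
    """
    if not bq_available():
        return None  # caller should treat this as "skipped"
    proc = subprocess.run(
        dry_run_command(sql), capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode == 0
```

In a pytest suite, a `None` result would typically map to `pytest.skip(...)`, which matches the "skips automatically" behavior described in the entry above.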
13+
14+
### Security
15+
16+
- Fixed auto-upgrade subprocess call to pin `idc-index` to `REQUIRED_VERSION` (was `"idc-index"`, now `f"idc-index=={REQUIRED_VERSION}"`), ensuring the installed version always matches the tested version declared in the frontmatter
17+
- Added network access transparency note to Overview documenting expected external endpoints (GCS, S3, BigQuery, DICOMweb proxy, Google Healthcare API) and clarifying that no credentials or environment variables are accessed by the skill
18+
- Added tested-with version comment to optional dependency install block (`pandas>=1.5, numpy>=1.23, pydicom>=2.3`)
19+
20+
### Changed
21+
22+
- Updated frontmatter description to be directive about skill triggering: now explicitly instructs invocation for IDC-related queries even without the word "IDC" in the prompt
23+
- Extracted "Batch Processing and Filtering" (section 6) from SKILL.md to `references/use_cases.md` (Use Case 5); replaced inline code block with a 2-sentence summary and pointer
24+
- Extracted "Integration with Analysis Pipelines" (section 9) from SKILL.md to `references/use_cases.md` (Use Case 6); replaced inline pydicom/SimpleITK code blocks with a 2-sentence summary and pointer
25+
- SKILL.md reduced from 865 → 775 lines (−90 lines); `references/use_cases.md` expanded from 187 → 278 lines
26+
- Updated to idc-index 0.12.1 (idc-index-data 24.0.4, IDC data version v24)
27+
- IDC v24 adds 15 new collections (161 → 176), ~39K new series, ~4 TB new data (99.27 TB total, 85,682 cases)
28+
- Updated `collections_index` column names to snake_case (idc-index-data 24.0.0 breaking change):
29+
`CancerTypes` → `cancer_types`, `TumorLocations` → `tumor_locations`,
30+
`Subjects` → `subjects`, `Species` → `species`, `Sources` → `sources`,
31+
`SupportingData` → `supporting_data`, `Program` → `program_id`
32+
- Updated `analysis_results_index` column names to snake_case (idc-index-data 24.0.4 breaking change):
33+
`Subjects` → `subjects`, `Collections` → `collections`, `Modalities` → `modalities`
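Queries written against the old column names need the same renames applied. The sketch below shows one way to migrate a SQL string, using only the mappings listed in the two entries above. The function name `migrate_columns` is hypothetical, and the approach is a plain whole-word text substitution: it would also rewrite a matching word inside a string literal or alias, so review the output before running it.

```python
import re
from typing import Dict

# Old -> new column names, copied from the changelog entries above.
COLLECTIONS_INDEX_RENAMES: Dict[str, str] = {
    "CancerTypes": "cancer_types",
    "TumorLocations": "tumor_locations",
    "Subjects": "subjects",
    "Species": "species",
    "Sources": "sources",
    "SupportingData": "supporting_data",
    "Program": "program_id",
}
ANALYSIS_RESULTS_INDEX_RENAMES: Dict[str, str] = {
    "Subjects": "subjects",
    "Collections": "collections",
    "Modalities": "modalities",
}


def migrate_columns(sql: str, renames: Dict[str, str]) -> str:
    """Replace each old column name (case-sensitive, whole word) with its new name."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], sql)


old_query = (
    "SELECT collection_id, CancerTypes, TumorLocations "
    "FROM collections_index WHERE CancerTypes LIKE '%Breast%'"
)
print(migrate_columns(old_query, COLLECTIONS_INDEX_RENAMES))
```

Because the match is case-sensitive, table names such as `collections_index` are left untouched even when `Collections` is in the mapping.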
34+
835
## [1.5.0] - 2026-04-08
936

1037
### Added

SKILL.md

Lines changed: 25 additions & 114 deletions
@@ -1,12 +1,12 @@
11
---
22
name: imaging-data-commons
3-
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
3+
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
44
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
55
metadata:
6-
version: 1.4.0
6+
version: 1.6.0
77
skill-author: Andrey Fedorov, @fedorov
8-
idc-index: "0.11.14"
9-
idc-data-version: "v23"
8+
idc-index: "0.12.1"
9+
idc-data-version: "v24"
1010
repository: https://github.com/ImagingDataCommons/idc-claude-skill
1111
---
1212

@@ -16,7 +16,9 @@ metadata:
1616

1717
Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
1818

19-
**Current IDC Data Version: v23** (always verify with `IDCClient().get_idc_version()`)
19+
**Expected network access:** `idc-index` queries a local DuckDB index (no network for metadata). File downloads use public GCS (`storage.googleapis.com`) and AWS S3 (`s3.amazonaws.com`) — no authentication required. DICOMweb access uses either the public IDC proxy (`proxy.imaging.datacommons.cancer.gov`, no auth) or the Google Cloud Healthcare API (`healthcare.googleapis.com`, requires GCP authentication). Optional BigQuery queries (`bigquery.googleapis.com`) also require GCP authentication. No credentials or environment variables are accessed by this skill.
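As an illustration of the note above, the sketch below sorts a URL into "public", "gcp-auth", or "unexpected" based on the hostnames it lists. It is not part of `idc-index`. The function name `classify_endpoint` is hypothetical, and treating subdomains (for example bucket-style S3 hosts) as matches is an assumption.

```python
from typing import Tuple
from urllib.parse import urlparse

# Hostnames taken from the network access note above.
NO_AUTH_HOSTS: Tuple[str, ...] = (
    "storage.googleapis.com",
    "s3.amazonaws.com",
    "proxy.imaging.datacommons.cancer.gov",
)
GCP_AUTH_HOSTS: Tuple[str, ...] = (
    "healthcare.googleapis.com",
    "bigquery.googleapis.com",
)


def _matches(host: str, allowed: Tuple[str, ...]) -> bool:
    """True if host equals an allowed name or is a subdomain of one (assumption)."""
    return any(host == name or host.endswith("." + name) for name in allowed)


def classify_endpoint(url: str) -> str:
    """Return 'public', 'gcp-auth', or 'unexpected' for the URL's hostname."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, NO_AUTH_HOSTS):
        return "public"
    if _matches(host, GCP_AUTH_HOSTS):
        return "gcp-auth"
    return "unexpected"
```

A check like this could back an egress allowlist or an audit log filter in a sandboxed environment.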
20+
21+
**Current IDC Data Version: v24** (always verify with `IDCClient().get_idc_version()`)
2022

2123
**Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))
2224

@@ -25,13 +27,13 @@ Use the `idc-index` Python package to query and download public cancer imaging d
2527
```python
2628
import idc_index
2729

28-
REQUIRED_VERSION = "0.11.14" # Must match metadata.idc-index in this file
30+
REQUIRED_VERSION = "0.12.1" # Must match metadata.idc-index in this file
2931
installed = idc_index.__version__
3032

3133
if installed < REQUIRED_VERSION:
3234
print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
3335
import subprocess
34-
subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
36+
subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", f"idc-index=={REQUIRED_VERSION}"], check=True)
3537
print("Upgrade complete. Restart Python to use new version.")
3638
else:
3739
print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")
@@ -43,7 +45,7 @@ else:
4345
from idc_index import IDCClient
4446
client = IDCClient()
4547

46-
# Verify IDC data version (should be "v23")
48+
# Verify IDC data version (should be "v24")
4749
print(f"IDC data version: {client.get_idc_version()}")
4850

4951
# Get collection count and total series
@@ -130,8 +132,8 @@ The `idc-index` package provides multiple metadata index tables, accessible via
130132
|-------|-----------------|--------|-------------|
131133
| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
132134
| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
133-
| `collections_index` | 1 row = 1 collection | Auto | Collection-level metadata and descriptions |
134-
| `analysis_results_index` | 1 row = 1 analysis result collection | Auto | Metadata about derived datasets (annotations, segmentations) |
135+
| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
136+
| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
135137
| `clinical_index` | 1 row = 1 (collection, table, column) triple | fetch_index() | Dictionary mapping clinical data table columns to collections |
136138
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
137139
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
@@ -235,16 +237,17 @@ pip install --upgrade idc-index
235237

236238
**Important:** New IDC data release will always trigger a new version of `idc-index`. Always use `--upgrade` flag while installing, unless an older version is needed for reproducibility.
237239

238-
**IMPORTANT:** IDC data version v23 is current. Always verify your version:
240+
**IMPORTANT:** IDC data version v24 is current. Always verify your version:
239241
```python
240-
print(client.get_idc_version()) # Should return "v23"
242+
print(client.get_idc_version()) # Should return "v24"
241243
```
242244
If you see an older version, upgrade with: `pip install --upgrade idc-index`
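Deciding whether an upgrade is needed means comparing package versions, and the check earlier in this file compares `installed < REQUIRED_VERSION` as plain strings. String comparison is lexicographic, so a version such as `"0.9.0"` sorts after `"0.12.1"`. The sketch below compares numeric tuples instead. The helper names are hypothetical, and it assumes purely numeric dotted versions; for pre-release or local version tags, `packaging.version.Version` is the more robust choice.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Convert '0.12.1' into (0, 12, 1). Assumes numeric components only."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(installed: str, required: str) -> bool:
    """True if the installed version is numerically older than the required one."""
    return version_tuple(installed) < version_tuple(required)


# Lexicographic comparison gets this case wrong; the tuple comparison does not.
print("0.9.0" < "0.12.1")                # string comparison
print(needs_upgrade("0.9.0", "0.12.1"))  # numeric comparison
```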
243245

244-
**Tested with:** idc-index 0.11.14 (IDC data version v23)
246+
**Tested with:** idc-index 0.12.1 (IDC data version v24)
245247

246248
**Optional (for data analysis):**
247249
```bash
250+
# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
248251
pip install pandas numpy pydicom
249252
```
250253

@@ -275,14 +278,14 @@ collections_summary = client.sql_query(query)
275278
# For richer collection metadata, use collections_index
276279
client.fetch_index("collections_index")
277280
collections_info = client.sql_query("""
278-
SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
281+
SELECT collection_id, cancer_types, tumor_locations, species, subjects, supporting_data
279282
FROM collections_index
280283
""")
281284

282285
# For analysis results (annotations, segmentations), use analysis_results_index
283286
client.fetch_index("analysis_results_index")
284287
analysis_info = client.sql_query("""
285-
SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
288+
SELECT analysis_result_id, analysis_result_title, subjects, collections, modalities
286289
FROM analysis_results_index
287290
""")
288291
```
@@ -351,7 +354,7 @@ results = client.sql_query("""
351354
SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
352355
FROM index i
353356
JOIN collections_index c ON i.collection_id = c.collection_id
354-
WHERE c.CancerTypes LIKE '%Breast%'
357+
WHERE c.cancer_types LIKE '%Breast%'
355358
AND i.Modality = 'MR'
356359
LIMIT 20
357360
""")
@@ -364,7 +367,7 @@ results = client.sql_query("""
364367
- Descriptions: StudyDescription, SeriesDescription
365368
- Licensing: license_short_name
366369

367-
**Note:** Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.
370+
**Note:** Cancer type is in `collections_index.cancer_types`, not in the primary `index` table.
368371

369372
### 3. Downloading DICOM Files
370373

@@ -604,43 +607,9 @@ bibtex_citations = client.citations_from_selection(
604607

605608
### 6. Batch Processing and Filtering
606609

607-
Process large datasets efficiently with filtering:
610+
For large downloads, query first to build a manifest, save it to CSV for reproducibility, then iterate over slices of the result DataFrame with `download_from_selection()` using a `batch_size` of 10–20 series to avoid timeouts.
608611

609-
```python
610-
from idc_index import IDCClient
611-
import pandas as pd
612-
613-
client = IDCClient()
614-
615-
# Find chest CT scans from GE scanners
616-
query = """
617-
SELECT
618-
SeriesInstanceUID,
619-
PatientID,
620-
collection_id,
621-
ManufacturerModelName
622-
FROM index
623-
WHERE Modality = 'CT'
624-
AND BodyPartExamined = 'CHEST'
625-
AND Manufacturer = 'GE MEDICAL SYSTEMS'
626-
AND license_short_name = 'CC BY 4.0'
627-
LIMIT 100
628-
"""
629-
630-
results = client.sql_query(query)
631-
632-
# Save manifest for later
633-
results.to_csv('lung_ct_manifest.csv', index=False)
634-
635-
# Download in batches to avoid timeout
636-
batch_size = 10
637-
for i in range(0, len(results), batch_size):
638-
batch = results.iloc[i:i+batch_size]
639-
client.download_from_selection(
640-
seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
641-
downloadDir=f"./data/batch_{i//batch_size}"
642-
)
643-
```
612+
See `references/use_cases.md` (Use Case 5) for a complete worked example with manufacturer filtering, manifest saving, and batched downloads.
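The batching step itself can be sketched without any IDC dependency. In the version below, `download` is a caller-supplied callable, and the names `batches` and `download_in_batches` are hypothetical. Only the slicing pattern and the `batch_{n}` directory layout follow the example removed in this diff.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def download_in_batches(
    series_uids: Sequence[str],
    download: Callable[[List[str], str], None],
    batch_size: int = 10,
    base_dir: str = "./data",
) -> List[str]:
    """Call `download(uids, target_dir)` once per batch and return the target dirs."""
    target_dirs = []
    for batch_number, batch in enumerate(batches(series_uids, batch_size)):
        target_dir = f"{base_dir}/batch_{batch_number}"
        download(batch, target_dir)
        target_dirs.append(target_dir)
    return target_dirs
```

With `idc-index`, `download` could be a small wrapper such as `lambda uids, d: client.download_from_selection(seriesInstanceUID=uids, downloadDir=d)`, mirroring the call in the removed example.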
644613

645614
### 7. Advanced Queries with BigQuery
646615

@@ -681,67 +650,9 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
681650

682651
### 9. Integration with Analysis Pipelines
683652

684-
Integrate IDC data into imaging analysis workflows:
685-
686-
**Read downloaded DICOM files:**
687-
```python
688-
import pydicom
689-
import os
690-
691-
# Read DICOM files from downloaded series
692-
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
693-
694-
dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
695-
if f.endswith('.dcm')]
696-
697-
# Load first image
698-
ds = pydicom.dcmread(dicom_files[0])
699-
print(f"Patient ID: {ds.PatientID}")
700-
print(f"Modality: {ds.Modality}")
701-
print(f"Image shape: {ds.pixel_array.shape}")
702-
```
703-
704-
**Build 3D volume from CT series:**
705-
```python
706-
import pydicom
707-
import numpy as np
708-
from pathlib import Path
709-
710-
def load_ct_series(series_path):
711-
"""Load CT series as 3D numpy array"""
712-
files = sorted(Path(series_path).glob('*.dcm'))
713-
slices = [pydicom.dcmread(str(f)) for f in files]
714-
715-
# Sort by slice location
716-
slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
717-
718-
# Stack into 3D array
719-
volume = np.stack([s.pixel_array for s in slices])
653+
After downloading DICOM files, use `pydicom` to read individual files or build 3D numpy arrays sorted by `ImagePositionPatient`. For a more robust reader with automatic series sorting and ITK image output, use `SimpleITK.ImageSeriesReader`.
720654

721-
return volume, slices[0] # Return volume and first slice for metadata
722-
723-
volume, metadata = load_ct_series("./data/lung_ct/series_dir")
724-
print(f"Volume shape: {volume.shape}") # (z, y, x)
725-
```
726-
727-
**Integrate with SimpleITK:**
728-
```python
729-
import SimpleITK as sitk
730-
from pathlib import Path
731-
732-
# Read DICOM series
733-
series_path = "./data/ct_series"
734-
reader = sitk.ImageSeriesReader()
735-
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
736-
reader.SetFileNames(dicom_names)
737-
image = reader.Execute()
738-
739-
# Apply processing
740-
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)
741-
742-
# Save as NIfTI
743-
sitk.WriteImage(smoothed, "processed_volume.nii.gz")
744-
```
655+
See `references/use_cases.md` (Use Case 6) for code examples reading DICOM with pydicom, building 3D CT volumes, and integrating with SimpleITK.
745656

746657
## Common Use Cases
747658

@@ -753,7 +664,7 @@ See `references/use_cases.md` for complete end-to-end workflow examples includin
753664

754665
## Best Practices
755666

756-
- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v23). If using an older version, recommend `pip install --upgrade idc-index`
667+
- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v24). If using an older version, recommend `pip install --upgrade idc-index`
757668
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
758669
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
759670
- **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure

references/bigquery_guide.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
# BigQuery Guide for IDC
22

3-
**Tested with:** IDC data version v23
3+
**Tested with:** idc-index 0.12.1 (IDC data version v24)
44

55
For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.
66

references/clinical_data_guide.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
# Clinical Data Guide for IDC
22

3-
**Tested with:** idc-index 0.11.7 (IDC data version v23)
3+
**Tested with:** idc-index 0.12.1 (IDC data version v24)
44

55
Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`.
66

references/cloud_storage_guide.md

Lines changed: 2 additions & 2 deletions
@@ -205,7 +205,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r
205205

206206
### How Versioning Works
207207

208-
1. **Snapshots**: Each IDC version (v1, v2, ..., v23, etc.) represents a complete snapshot of all data at release time
208+
1. **Snapshots**: Each IDC version (v1, v2, ..., v24, etc.) represents a complete snapshot of all data at release time
209209
2. **UUID-based**: When data changes, new CRDC UUIDs are assigned; old UUIDs remain accessible
210210
3. **Cumulative buckets**: All versions coexist in the same buckets—old series folders
211211

@@ -223,7 +223,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r
223223

224224
For querying version-specific metadata, BigQuery provides versioned tables. See `bigquery_guide.md` for details.
225225
- `bigquery-public-data.idc_current` — alias to latest version
226-
- `bigquery-public-data.idc_v23` — specific version (replace 23 with desired version)
226+
- `bigquery-public-data.idc_v24` — specific version (replace 24 with desired version)
227227

228228
### Reproducing a Previous Analysis
229229

references/dicomweb_guide.md

Lines changed: 2 additions & 2 deletions
@@ -39,7 +39,7 @@ Replace `{VERSION}` with the IDC release number. To find the current version:
3939
```python
4040
from idc_index import IDCClient
4141
client = IDCClient()
42-
print(client.get_idc_version()) # e.g., "23" for v23
42+
print(client.get_idc_version()) # e.g., "v24" for current version
4343
```
4444

4545
- **~96% data coverage** - Only replicates data from `idc-open-data` bucket (missing ~4% from other buckets)
@@ -334,7 +334,7 @@ credentials, project = default()
334334
credentials.refresh(Request())
335335

336336
# Build authenticated request
337-
base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v23/dicomWeb"
337+
base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v24/dicomWeb"
338338

339339
response = requests.get(
340340
f"{base_url}/studies",

references/digital_pathology_guide.md

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
11
# Digital Pathology Guide for IDC
22

3-
**Tested with:** IDC data version v23, idc-index 0.11.10
3+
**Tested with:** idc-index 0.12.1 (IDC data version v24)
44

55
For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
66

@@ -251,12 +251,12 @@ client.sql_query("""
251251
SELECT
252252
ar.analysis_result_id,
253253
ar.analysis_result_title,
254-
ar.Modalities,
255-
ar.Subjects,
256-
ar.Collections
254+
ar.modalities,
255+
ar.subjects,
256+
ar.collections
257257
FROM analysis_results_index ar
258-
WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
259-
ORDER BY ar.Subjects DESC
258+
WHERE ar.modalities LIKE '%ANN%' OR ar.modalities LIKE '%SEG%'
259+
ORDER BY ar.subjects DESC
260260
""")
261261
```
262262

references/index_tables_guide.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
11
# Index Tables Guide for IDC
22

3-
**Tested with:** idc-index 0.11.14 (IDC data version v23)
3+
**Tested with:** idc-index 0.12.1 (IDC data version v24)
44

55
This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md.
66

@@ -34,7 +34,7 @@ results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
3434

3535
# Fetch and query additional indices
3636
client.fetch_index("collections_index")
37-
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
37+
collections = client.sql_query("SELECT collection_id, cancer_types, tumor_locations FROM collections_index")
3838

3939
client.fetch_index("analysis_results_index")
4040
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")

references/sql_patterns.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
11
# SQL Query Patterns for IDC
22

3-
**Tested with:** idc-index 0.11.14 (IDC data version v23)
3+
**Tested with:** idc-index 0.12.1 (IDC data version v24)
44

55
Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.
66

@@ -74,7 +74,7 @@ client.sql_query("""
7474
# List analysis result collections (curated derived datasets)
7575
client.fetch_index("analysis_results_index")
7676
client.sql_query("""
77-
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
77+
SELECT analysis_result_id, analysis_result_title, collections, modalities
7878
FROM analysis_results_index
7979
""")
8080
