Merge pull request #16 from ImagingDataCommons/idc-v24

fedorov · web-flow · commit 1cf36bcbe9aa · 2026-05-07T17:16:43.000-04:00
Update to IDC v24 + cleanup/improvements
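
Code written against the pre-v24 column names may need updating for the `collections_index` and `analysis_results_index` renames listed in the CHANGELOG hunk of this commit. A minimal migration sketch follows, assuming old names appear as whole identifiers in SQL strings. The helper `migrate_columns` and the two dictionary names are hypothetical and not part of `idc-index`; the mappings themselves are copied from the CHANGELOG entry.

```python
import re

# Old -> new column names, copied from the CHANGELOG entry in this commit.
COLLECTIONS_INDEX_RENAMES = {
    "CancerTypes": "cancer_types",
    "TumorLocations": "tumor_locations",
    "Subjects": "subjects",
    "Species": "species",
    "Sources": "sources",
    "SupportingData": "supporting_data",
    "Program": "program_id",
}
ANALYSIS_RESULTS_INDEX_RENAMES = {
    "Subjects": "subjects",
    "Collections": "collections",
    "Modalities": "modalities",
}


def migrate_columns(sql: str, renames: dict) -> str:
    """Rewrite whole-word occurrences of old column names in a SQL string.

    Hypothetical helper: it does a plain text substitution, so it will also
    rewrite a matching word that appears inside a quoted string literal.
    """
    # Longest names first so a longer name is never shadowed by a shorter one.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], sql)


old_query = "SELECT c.CancerTypes, Program FROM collections_index c"
print(migrate_columns(old_query, COLLECTIONS_INDEX_RENAMES))
```

Because the substitution is textual, review the rewritten query before running it, particularly when a `LIKE` pattern or other literal contains one of the old names.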
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,33 @@ All notable changes to the IDC Claude Skill are documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/),
 and this project adheres to [Semantic Versioning](https://semver.org/).
 
+## [1.6.0] - 2026-05-07
+
+### Added
+
+- `tests/test_bq_snippets.py`: BigQuery snippet validation using `bq query --dry_run` — 33 tests covering all SQL examples in `references/bigquery_guide.md` (dicom_all, original_collections_metadata, segmentations, quantitative_measurements, qualitative_measurements, private elements, and clinical tables); skips automatically when `bq` CLI is unavailable or unauthenticated
+
+### Security
+
+- Fixed auto-upgrade subprocess call to pin `idc-index` to `REQUIRED_VERSION` (was `"idc-index"`, now `f"idc-index=={REQUIRED_VERSION}"`), ensuring the installed version always matches the tested version declared in the frontmatter
+- Added network access transparency note to Overview documenting expected external endpoints (GCS, S3, BigQuery, DICOMweb proxy, Google Healthcare API) and clarifying that no credentials or environment variables are accessed by the skill
+- Added tested-with version comment to optional dependency install block (`pandas>=1.5, numpy>=1.23, pydicom>=2.3`)
+
+### Changed
+
+- Updated frontmatter description to be directive about skill triggering: now explicitly instructs invocation for IDC-related queries even without the word "IDC" in the prompt
+- Extracted "Batch Processing and Filtering" (section 6) from SKILL.md to `references/use_cases.md` (Use Case 5); replaced inline code block with a 2-sentence summary and pointer
+- Extracted "Integration with Analysis Pipelines" (section 9) from SKILL.md to `references/use_cases.md` (Use Case 6); replaced inline pydicom/SimpleITK code blocks with a 2-sentence summary and pointer
+- SKILL.md reduced from 865 → 775 lines (−90 lines); `references/use_cases.md` expanded from 187 → 278 lines
+- Updated to idc-index 0.12.1 (idc-index-data 24.0.4, IDC data version v24)
+- IDC v24 adds 15 new collections (161 → 176), ~39K new series, ~4 TB new data (99.27 TB total, 85,682 cases)
+- Updated `collections_index` column names to snake_case (idc-index-data 24.0.0 breaking change):
+  `CancerTypes` → `cancer_types`, `TumorLocations` → `tumor_locations`,
+  `Subjects` → `subjects`, `Species` → `species`, `Sources` → `sources`,
+  `SupportingData` → `supporting_data`, `Program` → `program_id`
+- Updated `analysis_results_index` column names to snake_case (idc-index-data 24.0.4 breaking change):
+  `Subjects` → `subjects`, `Collections` → `collections`, `Modalities` → `modalities`
+
 ## [1.5.0] - 2026-04-08
 
 ### Added
diff --git a/SKILL.md b/SKILL.md
@@ -1,12 +1,12 @@
 ---
 name: imaging-data-commons
-description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
+description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
 license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
 metadata:
-    version: 1.4.0
+    version: 1.6.0
     skill-author: Andrey Fedorov, @fedorov
-    idc-index: "0.11.14"
-    idc-data-version: "v23"
+    idc-index: "0.12.1"
+    idc-data-version: "v24"
     repository: https://github.com/ImagingDataCommons/idc-claude-skill
 ---
 
@@ -16,7 +16,9 @@ metadata:
 
 Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
 
-**Current IDC Data Version: v23** (always verify with `IDCClient().get_idc_version()`)
+**Expected network access:** `idc-index` queries a local DuckDB index (no network for metadata). File downloads use public GCS (`storage.googleapis.com`) and AWS S3 (`s3.amazonaws.com`) — no authentication required. DICOMweb access uses either the public IDC proxy (`proxy.imaging.datacommons.cancer.gov`, no auth) or the Google Cloud Healthcare API (`healthcare.googleapis.com`, requires GCP authentication). Optional BigQuery queries (`bigquery.googleapis.com`) also require GCP authentication. No credentials or environment variables are accessed by this skill.
+
+**Current IDC Data Version: v24** (always verify with `IDCClient().get_idc_version()`)
 
 **Primary tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index))
 
@@ -25,13 +27,13 @@ Use the `idc-index` Python package to query and download public cancer imaging d
 ```python
 import idc_index
 
-REQUIRED_VERSION = "0.11.14"  # Must match metadata.idc-index in this file
+REQUIRED_VERSION = "0.12.1"  # Must match metadata.idc-index in this file
 installed = idc_index.__version__
 
 if installed < REQUIRED_VERSION:
     print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
     import subprocess
-    subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
+    subprocess.run(["pip3", "install", "--upgrade", "--break-system-packages", f"idc-index=={REQUIRED_VERSION}"], check=True)
     print("Upgrade complete. Restart Python to use new version.")
 else:
     print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")
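
One caveat on the check in this hunk: `installed < REQUIRED_VERSION` compares the two strings character by character. That happens to give the right answer for `0.11.14` versus `0.12.1`, but it would misorder, for example, `0.9.0` and `0.12.1`. A minimal sketch of a numeric comparison, assuming plain dotted-integer versions with no pre-release or `.dev` suffixes; `version_tuple` is a hypothetical helper, not part of `idc-index`:

```python
def version_tuple(version: str) -> tuple:
    """Convert a dotted version such as "0.12.1" into (0, 12, 1)."""
    return tuple(int(part) for part in version.split("."))


REQUIRED_VERSION = "0.12.1"

# String comparison misorders these two versions; tuple comparison does not.
print("0.9.0" < REQUIRED_VERSION)                                # False
print(version_tuple("0.9.0") < version_tuple(REQUIRED_VERSION))  # True
```

For version strings that carry suffixes, `packaging.version.Version` handles the full PEP 440 syntax and may be the safer choice.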
@@ -43,7 +45,7 @@ else:
 from idc_index import IDCClient
 client = IDCClient()
 
-# Verify IDC data version (should be "v23")
+# Verify IDC data version (should be "v24")
 print(f"IDC data version: {client.get_idc_version()}")
 
 # Get collection count and total series
@@ -130,8 +132,8 @@ The `idc-index` package provides multiple metadata index tables, accessible via
 |-------|-----------------|--------|-------------|
 | `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
 | `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
-| `collections_index` | 1 row = 1 collection | Auto | Collection-level metadata and descriptions |
-| `analysis_results_index` | 1 row = 1 analysis result collection | Auto | Metadata about derived datasets (annotations, segmentations) |
+| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
+| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
 | `clinical_index` | 1 row = 1 (collection, table, column) triple | fetch_index() | Dictionary mapping clinical data table columns to collections |
 | `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
 | `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
@@ -235,16 +237,17 @@ pip install --upgrade idc-index
 
 **Important:** A new IDC data release always triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.
 
-**IMPORTANT:** IDC data version v23 is current. Always verify your version:
+**IMPORTANT:** IDC data version v24 is current. Always verify your version:
 ```python
-print(client.get_idc_version())  # Should return "v23"
+print(client.get_idc_version())  # Should return "v24"
 ```
 If you see an older version, upgrade with: `pip install --upgrade idc-index`
 
-**Tested with:** idc-index 0.11.14 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)
 
 **Optional (for data analysis):**
 ```bash
+# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
 pip install pandas numpy pydicom
 ```
 
@@ -275,14 +278,14 @@ collections_summary = client.sql_query(query)
 # For richer collection metadata, use collections_index
 client.fetch_index("collections_index")
 collections_info = client.sql_query("""
-    SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
+    SELECT collection_id, cancer_types, tumor_locations, species, subjects, supporting_data
     FROM collections_index
 """)
 
 # For analysis results (annotations, segmentations), use analysis_results_index
 client.fetch_index("analysis_results_index")
 analysis_info = client.sql_query("""
-    SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
+    SELECT analysis_result_id, analysis_result_title, subjects, collections, modalities
     FROM analysis_results_index
 """)
 ```
@@ -351,7 +354,7 @@ results = client.sql_query("""
     SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
     FROM index i
     JOIN collections_index c ON i.collection_id = c.collection_id
-    WHERE c.CancerTypes LIKE '%Breast%'
+    WHERE c.cancer_types LIKE '%Breast%'
       AND i.Modality = 'MR'
     LIMIT 20
 """)
@@ -364,7 +367,7 @@ results = client.sql_query("""
 - Descriptions: StudyDescription, SeriesDescription
 - Licensing: license_short_name
 
-**Note:** Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.
+**Note:** Cancer type is in `collections_index.cancer_types`, not in the primary `index` table.
 
 ### 3. Downloading DICOM Files
 
@@ -604,43 +607,9 @@ bibtex_citations = client.citations_from_selection(
 
 ### 6. Batch Processing and Filtering
 
-Process large datasets efficiently with filtering:
+For large downloads, query first to build a manifest and save it to CSV for reproducibility, then download in batches: slice the result DataFrame into groups of 10–20 series and call `download_from_selection()` on each slice to avoid timeouts.
 
-```python
-from idc_index import IDCClient
-import pandas as pd
-
-client = IDCClient()
-
-# Find chest CT scans from GE scanners
-query = """
-SELECT
-  SeriesInstanceUID,
-  PatientID,
-  collection_id,
-  ManufacturerModelName
-FROM index
-WHERE Modality = 'CT'
-  AND BodyPartExamined = 'CHEST'
-  AND Manufacturer = 'GE MEDICAL SYSTEMS'
-  AND license_short_name = 'CC BY 4.0'
-LIMIT 100
-"""
-
-results = client.sql_query(query)
-
-# Save manifest for later
-results.to_csv('lung_ct_manifest.csv', index=False)
-
-# Download in batches to avoid timeout
-batch_size = 10
-for i in range(0, len(results), batch_size):
-    batch = results.iloc[i:i+batch_size]
-    client.download_from_selection(
-        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
-        downloadDir=f"./data/batch_{i//batch_size}"
-    )
-```
+See `references/use_cases.md` (Use Case 5) for a complete worked example with manufacturer filtering, manifest saving, and batched downloads.
 
 ### 7. Advanced Queries with BigQuery
 
@@ -681,67 +650,9 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
 
 ### 9. Integration with Analysis Pipelines
 
-Integrate IDC data into imaging analysis workflows:
-
-**Read downloaded DICOM files:**
-```python
-import pydicom
-import os
-
-# Read DICOM files from downloaded series
-series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
-
-dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
-               if f.endswith('.dcm')]
-
-# Load first image
-ds = pydicom.dcmread(dicom_files[0])
-print(f"Patient ID: {ds.PatientID}")
-print(f"Modality: {ds.Modality}")
-print(f"Image shape: {ds.pixel_array.shape}")
-```
-
-**Build 3D volume from CT series:**
-```python
-import pydicom
-import numpy as np
-from pathlib import Path
-
-def load_ct_series(series_path):
-    """Load CT series as 3D numpy array"""
-    files = sorted(Path(series_path).glob('*.dcm'))
-    slices = [pydicom.dcmread(str(f)) for f in files]
-
-    # Sort by slice location
-    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
-
-    # Stack into 3D array
-    volume = np.stack([s.pixel_array for s in slices])
+After downloading DICOM files, use `pydicom` to read individual files or to stack slices, sorted by `ImagePositionPatient`, into a 3D NumPy array. For a more robust reader with automatic series sorting and a SimpleITK image as output, use `SimpleITK.ImageSeriesReader`.
 
-    return volume, slices[0]  # Return volume and first slice for metadata
-
-volume, metadata = load_ct_series("./data/lung_ct/series_dir")
-print(f"Volume shape: {volume.shape}")  # (z, y, x)
-```
-
-**Integrate with SimpleITK:**
-```python
-import SimpleITK as sitk
-from pathlib import Path
-
-# Read DICOM series
-series_path = "./data/ct_series"
-reader = sitk.ImageSeriesReader()
-dicom_names = reader.GetGDCMSeriesFileNames(series_path)
-reader.SetFileNames(dicom_names)
-image = reader.Execute()
-
-# Apply processing
-smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)
-
-# Save as NIfTI
-sitk.WriteImage(smoothed, "processed_volume.nii.gz")
-```
+See `references/use_cases.md` (Use Case 6) for code examples reading DICOM with pydicom, building 3D CT volumes, and integrating with SimpleITK.
 
 ## Common Use Cases
 
@@ -753,7 +664,7 @@ See `references/use_cases.md` for complete end-to-end workflow examples includin
 
 ## Best Practices
 
-- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v23). If using an older version, recommend `pip install --upgrade idc-index`
+- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v24). If using an older version, recommend `pip install --upgrade idc-index`
 - **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
 - **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
 - **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and understand data structure
diff --git a/references/bigquery_guide.md b/references/bigquery_guide.md
@@ -1,6 +1,6 @@
 # BigQuery Guide for IDC
 
-**Tested with:** IDC data version v23
+**Tested with:** idc-index 0.12.1 (IDC data version v24)
 
 For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.
 
diff --git a/references/clinical_data_guide.md b/references/clinical_data_guide.md
@@ -1,6 +1,6 @@
 # Clinical Data Guide for IDC
 
-**Tested with:** idc-index 0.11.7 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)
 
 Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`.
 
diff --git a/references/cloud_storage_guide.md b/references/cloud_storage_guide.md
@@ -205,7 +205,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r
 
 ### How Versioning Works
 
-1. **Snapshots**: Each IDC version (v1, v2, ..., v23, etc.) represents a complete snapshot of all data at release time
+1. **Snapshots**: Each IDC version (v1, v2, ..., v24, etc.) represents a complete snapshot of all data at release time
 2. **UUID-based**: When data changes, new CRDC UUIDs are assigned; old UUIDs remain accessible
 3. **Cumulative buckets**: All versions coexist in the same buckets—old series folders
 
@@ -223,7 +223,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r
 
 For querying version-specific metadata, BigQuery provides versioned tables. See `bigquery_guide.md` for details.
 - `bigquery-public-data.idc_current` — alias to latest version
-- `bigquery-public-data.idc_v23` — specific version (replace 23 with desired version)
+- `bigquery-public-data.idc_v24` — specific version (replace 24 with desired version)
 
 ### Reproducing a Previous Analysis
 
diff --git a/references/dicomweb_guide.md b/references/dicomweb_guide.md
@@ -39,7 +39,7 @@ Replace `{VERSION}` with the IDC release number. To find the current version:
 ```python
 from idc_index import IDCClient
 client = IDCClient()
-print(client.get_idc_version())  # e.g., "23" for v23
+print(client.get_idc_version())  # e.g., "v24" for current version
 ```
 
 - **~96% data coverage** - Only replicates data from `idc-open-data` bucket (missing ~4% from other buckets)
@@ -334,7 +334,7 @@ credentials, project = default()
 credentials.refresh(Request())
 
 # Build authenticated request
-base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v23/dicomWeb"
+base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v24/dicomWeb"
 
 response = requests.get(
     f"{base_url}/studies",
diff --git a/references/digital_pathology_guide.md b/references/digital_pathology_guide.md
@@ -1,6 +1,6 @@
 # Digital Pathology Guide for IDC
 
-**Tested with:** IDC data version v23, idc-index 0.11.10
+**Tested with:** idc-index 0.12.1 (IDC data version v24)
 
 For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.
 
@@ -251,12 +251,12 @@ client.sql_query("""
     SELECT
         ar.analysis_result_id,
         ar.analysis_result_title,
-        ar.Modalities,
-        ar.Subjects,
-        ar.Collections
+        ar.modalities,
+        ar.subjects,
+        ar.collections
     FROM analysis_results_index ar
-    WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
-    ORDER BY ar.Subjects DESC
+    WHERE ar.modalities LIKE '%ANN%' OR ar.modalities LIKE '%SEG%'
+    ORDER BY ar.subjects DESC
 """)
 ```
 
diff --git a/references/index_tables_guide.md b/references/index_tables_guide.md
@@ -1,6 +1,6 @@
 # Index Tables Guide for IDC
 
-**Tested with:** idc-index 0.11.14 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)
 
 This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md.
 
@@ -34,7 +34,7 @@ results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
 
 # Fetch and query additional indices
 client.fetch_index("collections_index")
-collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
+collections = client.sql_query("SELECT collection_id, cancer_types, tumor_locations FROM collections_index")
 
 client.fetch_index("analysis_results_index")
 analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
diff --git a/references/sql_patterns.md b/references/sql_patterns.md
@@ -1,6 +1,6 @@
 # SQL Query Patterns for IDC
 
-**Tested with:** idc-index 0.11.14 (IDC data version v23)
+**Tested with:** idc-index 0.12.1 (IDC data version v24)
 
 Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.
 
@@ -74,7 +74,7 @@ client.sql_query("""
 # List analysis result collections (curated derived datasets)
 client.fetch_index("analysis_results_index")
 client.sql_query("""
-    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
+    SELECT analysis_result_id, analysis_result_title, collections, modalities
     FROM analysis_results_index
 """)
 
diff --git a/references/use_cases.md b/references/use_cases.md
diff --git a/tests/requirements-test.txt b/tests/requirements-test.txt
diff --git a/tests/test_bq_snippets.py b/tests/test_bq_snippets.py
diff --git a/tests/test_snippets.py b/tests/test_snippets.py