diff --git a/CHANGELOG.md b/CHANGELOG.md
index 00680e5..beb819d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,15 @@ All notable changes to the IDC Claude Skill are documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/),
 and this project adheres to [Semantic Versioning](https://semver.org/).
 
+## [1.6.4] - 2026-05-22
+
+### Changed
+
+- Added version tracking guidance: "what's new in vX" workflow using `series_init_idc_version`/`series_revised_idc_version` in `index`; clarified `prior_versions_index` is for reproducibility only (zero overlap with `index`, column names differ from main index version columns)
+- Collapsed five `SeriesInstanceUID` join rows into a single universal-key statement; table now covers only non-obvious join columns
+- Removed Installation and Setup section (duplicated the CRITICAL version-check block); folded optional deps into `ModuleNotFoundError` Troubleshooting entry
+- Trimmed "Command-Line Download" inline section from ~60 lines to 5; full CLI coverage (`download-from-manifest`, `download-from-selection`, all options) remains in `references/cli_guide.md`
+
 ## [1.6.3] - 2026-05-09
 
 ### Added
diff --git a/SKILL.md b/SKILL.md
index 07f7e3e..32d955c 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -3,7 +3,7 @@ name: imaging-data-commons
 description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
 license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
 metadata:
-  version: 1.6.3
+  version: 1.6.4
   skill-author: Andrey Fedorov, @fedorov
   idc-index: "0.12.3"
   idc-data-version: "v24"
@@ -81,7 +81,6 @@ print(stats)
 **Core Sections (inline):**
 - IDC Data Model - Collection and analysis result hierarchy
 - Index Tables - Available tables and joining patterns
-- Installation - Package setup and version verification
 - Core Capabilities - Essential API patterns (query, download, visualize, license, citations)
 - Best Practices - Usage guidelines
 - Troubleshooting - Common issues and solutions
@@ -167,32 +166,24 @@ Always call `client.fetch_index("table_name")` before querying any index table
 | `ct_index` | 1 row = 1 CT series | CT acquisition/reconstruction parameters: pixel spacing, slice thickness, kVp, convolution kernel, tube current (min/max for dose-modulated), exposure, spiral pitch, scan options |
 | `mr_index` | 1 row = 1 MR series | MR acquisition/sequence parameters: field strength, scanning sequence, TE (array for multi-echo), TR, flip angle, DiffusionBValue (array for DWI), pixel bandwidth, receive coil, number of temporal positions |
 | `pt_index` | 1 row = 1 PET series | PET acquisition/reconstruction/radiopharmaceutical parameters: series type, units, decay/scatter/attenuation correction, reconstruction method, radionuclide, injected dose, frame duration (array for dynamic PET) |
-| `prior_versions_index` | 1 row = 1 DICOM series | Series that have been removed or superseded in previous IDC releases; use only to download deprecated/historical data — do not query for current data |
+| `prior_versions_index` | 1 row = 1 DICOM series | **Reproducibility only.** Contains series permanently removed from IDC (all `max_idc_version` < current version; zero overlap with `index`). Use ONLY when a user explicitly needs to reproduce work from a prior IDC version using data no longer in the current release. Do NOT use for version history or "what's new" questions — those use `series_init_idc_version`/`series_revised_idc_version` in the main `index` table. Column names `min_idc_version`/`max_idc_version` here are NOT equivalent to `series_init_idc_version`/`series_revised_idc_version` in `index`. |
 
 ### Joining Tables
 
-**Key columns are not explicitly labeled, the following is a subset that can be used in joins.**
+**`SeriesInstanceUID` is the universal join key** for all series-level specialized tables: `sm_index`, `sm_instance_index`, `seg_index`, `ann_index`, `ann_group_index`, `contrast_index`, `volume_geometry_index`, `rtstruct_index`, `ct_index`, `mr_index`, `pt_index`. Always join these to `index` on `SeriesInstanceUID`. The exceptions below use different column names.
 
 | Join Column | Tables | Use Case |
 |-------------|--------|----------|
 | `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
-| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
-| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
-| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
 | `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
 | `source_DOI` | index, analysis_results_index | Link by publication DOI |
-| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
-| `Modality` | index, prior_versions_index | Filter by imaging modality |
-| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index, volume_geometry_index | Link series to seg/ann/contrast/geometry index tables |
-| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
-| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
-| `SeriesInstanceUID` / `referenced_SeriesInstanceUID` | index, rtstruct_index | Join RTSTRUCT series to its metadata (index.SeriesInstanceUID = rtstruct_index.SeriesInstanceUID); use rtstruct_index.referenced_SeriesInstanceUID to find the source image series |
-| `SeriesInstanceUID` | index, ct_index | Link CT series to acquisition/reconstruction parameters |
-| `SeriesInstanceUID` | index, mr_index | Link MR series to sequence/acquisition parameters |
-| `SeriesInstanceUID` | index, pt_index | Link PET series to acquisition/radiopharmaceutical parameters |
+| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (`seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID`) |
+| `referenced_SeriesInstanceUID` | ann_index → index, rtstruct_index → index | Link annotation or RTSTRUCT to its source image series |
 
 **Note:** `subjects`, `updated`, and `description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
 
+**Note on `prior_versions_index`:** Joining `prior_versions_index` with `index` on `SeriesInstanceUID` always returns zero rows — there is no overlap. This table is for historical reproducibility only; never join it with `index` to answer questions about current data or version history.
+
 For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see `references/index_tables_guide.md`.
 
 ### Clinical Data Access
@@ -252,21 +243,6 @@ All idc-index metadata tables are published as Parquet files to a public GCS buc
 
 See `references/parquet_access_guide.md` for URL patterns, available files, and DuckDB query examples.
 
-## Installation and Setup
-
-**Required (for basic access):**
-```bash
-pip install --upgrade idc-index
-```
-
-**Important:** New IDC data release will always trigger a new version of `idc-index`. Always use `--upgrade` flag while installing, unless an older version is needed for reproducibility.
-
-**Optional (for data analysis):**
-```bash
-# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
-pip install pandas numpy pydicom
-```
-
 ## Core Capabilities
 
 ### 1. Data Discovery and Exploration
@@ -386,6 +362,55 @@ results = client.sql_query("""
 
 **Note:** Cancer type is in `collections_index.cancer_types`, not in the primary `index` table.
 
+**Version tracking — "what's new in IDC vX?"**
+
+Use `series_init_idc_version` and `series_revised_idc_version` in the main `index` table. Do NOT use `prior_versions_index` for this — it contains only removed series.
+
+```python
+from idc_index import IDCClient
+client = IDCClient()
+
+VERSION = 24  # Replace with target version
+
+# Series added for the first time in vVERSION
+new_series = client.sql_query(f"""
+    SELECT collection_id,
+           COUNT(DISTINCT SeriesInstanceUID) as new_series,
+           ROUND(SUM(series_size_MB)/1000, 2) as size_GB
+    FROM index
+    WHERE series_init_idc_version = {VERSION}
+    GROUP BY collection_id
+    ORDER BY new_series DESC
+""")
+
+# Series revised (updated content) in vVERSION but originally added earlier
+revised_series = client.sql_query(f"""
+    SELECT collection_id,
+           COUNT(DISTINCT SeriesInstanceUID) as revised_series
+    FROM index
+    WHERE series_revised_idc_version = {VERSION}
+      AND series_init_idc_version < {VERSION}
+    GROUP BY collection_id
+    ORDER BY revised_series DESC
+""")
+
+# When was each collection first added to IDC?
+client.fetch_index("version_metadata_index")
+first_appearance = client.sql_query("""
+    WITH first_versions AS (
+        SELECT collection_id, MIN(series_init_idc_version) as first_version
+        FROM index
+        GROUP BY collection_id
+    )
+    SELECT f.collection_id, f.first_version, v.version_timestamp as first_release_date
+    FROM first_versions f
+    JOIN version_metadata_index v ON f.first_version = v.idc_version
+    ORDER BY f.first_version DESC
+""")
+```
+
+To verify column names and descriptions before writing queries, use `client.get_index_schema('index')` or `client.indices_overview` — see Best Practices.
+
 ### 3. Downloading DICOM Files
 
 Download imaging data efficiently from IDC's cloud storage.
@@ -481,69 +506,14 @@ To identify files, use the `crdc_instance_uuid` column in queries or read DICOM
 
 ### Command-Line Download
 
-The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
-
-**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).
+`idc download` is available after installing `idc-index`. Auto-detects input type: collection ID, series UID, or manifest file path.
 
 ```bash
-# Download entire collection
 idc download rider_pilot --download-dir ./data
-
-# Download specific series by UID
-idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data
-
-# Download multiple items (comma-separated)
-idc download "tcga_luad,tcga_lusc" --download-dir ./data
-
-# Download from manifest file (auto-detected)
 idc download manifest.txt --download-dir ./data
 ```
 
-**Options:**
-
-| Option | Description |
-|--------|-------------|
-| `--download-dir` | Output directory (default: current directory) |
-| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
-| `--log-level` | Verbosity: debug, info, warning, error, critical |
-
-**Manifest files:**
-
-Manifest files contain S3 URLs (one per line) and can be:
-- Exported from the IDC Portal after cohort selection
-- Shared by collaborators for reproducible data access
-- Generated programmatically from query results
-
-Format (one S3 URL per line):
-```
-s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
-s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
-```
-
-**Example: Generate manifest from Python query:**
-
-```python
-from idc_index import IDCClient
-
-client = IDCClient()
-
-# Query for series URLs
-results = client.sql_query("""
-    SELECT series_aws_url
-    FROM index
-    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
-""")
-
-# Save as manifest file
-with open('ct_manifest.txt', 'w') as f:
-    for url in results['series_aws_url']:
-        f.write(url + '\n')
-```
-
-Then download:
-```bash
-idc download ct_manifest.txt --download-dir ./ct_data
-```
+See `references/cli_guide.md` for full options, `idc download-from-manifest` (resume support), and `idc download-from-selection` (filter-based).
 
 ### 4. Visualizing IDC Images
 
@@ -687,6 +657,7 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
 
 ## Best Practices
 
+- **Check schema before writing queries** — Use `client.get_index_schema('index')` (reads cached metadata, no SQL executed) or `client.indices_overview` to see all available columns and their descriptions. The version-tracking columns `series_init_idc_version` and `series_revised_idc_version` in the main `index` table directly answer "what's new / when was this added" questions without touching `prior_versions_index`.
 - **Never use web search for IDC data content questions** - Always query the idc-index directly using `client.sql_query()`. Web sources (release notes, blog posts, documentation pages) are frequently out of date and will produce incorrect answers. The local DuckDB index is the authoritative source; use it even when web search is available.
 - **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v24). If using an older version, recommend `pip install --upgrade idc-index`
 - **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
@@ -701,7 +672,7 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
 
 **Issue: `ModuleNotFoundError: No module named 'idc_index'`**
 - **Cause:** idc-index package not installed
-- **Solution:** Install with `pip install --upgrade idc-index`
+- **Solution:** Install with `pip install --upgrade idc-index`; for data analysis also install `pip install pandas numpy pydicom` (tested with pandas>=1.5, numpy>=1.23, pydicom>=2.3)
 
 **Issue: Download fails with connection timeout**
 - **Cause:** Network instability or large download size
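
Review note on the version-tracking hunk: the new-versus-revised filters added to SKILL.md can be sanity-checked offline. The sketch below is not part of the patch. It runs the same two `WHERE` clauses against an in-memory SQLite table standing in for the DuckDB-backed `index` table. The toy rows, the helper name `split_new_and_revised`, and the convention that an unrevised series carries equal init and revised versions are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical rows, NOT real IDC data:
# (collection_id, SeriesInstanceUID, series_init_idc_version, series_revised_idc_version)
ROWS = [
    ("coll_a", "1.1", 22, 22),  # added in v22, never revised (assumed convention)
    ("coll_a", "1.2", 24, 24),  # first added in v24
    ("coll_a", "1.3", 23, 24),  # added in v23, revised in v24
    ("coll_b", "2.1", 24, 24),  # first added in v24
    ("coll_b", "2.2", 24, 24),  # first added in v24
]


def split_new_and_revised(rows, version):
    """Return (new, revised) per-collection series counts for one version."""
    conn = sqlite3.connect(":memory:")
    # `index` is a reserved word in SQLite, so it is quoted here; the queries
    # in SKILL.md use it unquoted through client.sql_query().
    conn.execute(
        'CREATE TABLE "index" ('
        "collection_id TEXT, SeriesInstanceUID TEXT, "
        "series_init_idc_version INTEGER, series_revised_idc_version INTEGER)"
    )
    conn.executemany('INSERT INTO "index" VALUES (?, ?, ?, ?)', rows)

    # Series whose first appearance is the target version.
    new = conn.execute(
        "SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID) "
        'FROM "index" WHERE series_init_idc_version = ? '
        "GROUP BY collection_id ORDER BY collection_id",
        (version,),
    ).fetchall()

    # Series revised in the target version but first added earlier.
    revised = conn.execute(
        "SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID) "
        'FROM "index" WHERE series_revised_idc_version = ? '
        "AND series_init_idc_version < ? "
        "GROUP BY collection_id ORDER BY collection_id",
        (version, version),
    ).fetchall()

    conn.close()
    return new, revised


new, revised = split_new_and_revised(ROWS, 24)
print("new:", new)
print("revised:", revised)
```

The `series_init_idc_version < VERSION` condition is what keeps the two result sets disjoint: without it, every series first added in the target version would also be counted as revised under the assumed convention.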