Commit 26d4fa5

Merge pull request #22 from ImagingDataCommons/improve-versioning-related
Improve versioning-related capabilities and reduce redundancy
2 parents bd31529 + 0a30e16 commit 26d4fa5

2 files changed

Lines changed: 69 additions & 89 deletions

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -5,6 +5,15 @@ All notable changes to the IDC Claude Skill are documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/),
66
and this project adheres to [Semantic Versioning](https://semver.org/).
77

8+
## [1.6.4] - 2026-05-22
9+
10+
### Changed
11+
12+
- Added version tracking guidance: "what's new in vX" workflow using `series_init_idc_version`/`series_revised_idc_version` in `index`; clarified `prior_versions_index` is for reproducibility only (zero overlap with `index`, column names differ from main index version columns)
13+
- Collapsed five `SeriesInstanceUID` join rows into a single universal-key statement; table now covers only non-obvious join columns
14+
- Removed Installation and Setup section (duplicated the CRITICAL version-check block); folded optional deps into `ModuleNotFoundError` Troubleshooting entry
15+
- Trimmed "Command-Line Download" inline section from ~60 lines to 5; full CLI coverage (`download-from-manifest`, `download-from-selection`, all options) remains in `references/cli_guide.md`
16+
817
## [1.6.3] - 2026-05-09
918

1019
### Added

SKILL.md

Lines changed: 60 additions & 89 deletions
@@ -3,7 +3,7 @@ name: imaging-data-commons
33
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
44
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
55
metadata:
6-
version: 1.6.3
6+
version: 1.6.4
77
skill-author: Andrey Fedorov, @fedorov
88
idc-index: "0.12.3"
99
idc-data-version: "v24"
@@ -81,7 +81,6 @@ print(stats)
8181
**Core Sections (inline):**
8282
- IDC Data Model - Collection and analysis result hierarchy
8383
- Index Tables - Available tables and joining patterns
84-
- Installation - Package setup and version verification
8584
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations)
8685
- Best Practices - Usage guidelines
8786
- Troubleshooting - Common issues and solutions
@@ -167,32 +166,24 @@ Always call `client.fetch_index("table_name")` before querying any index table
167166
| `ct_index` | 1 row = 1 CT series | CT acquisition/reconstruction parameters: pixel spacing, slice thickness, kVp, convolution kernel, tube current (min/max for dose-modulated), exposure, spiral pitch, scan options |
168167
| `mr_index` | 1 row = 1 MR series | MR acquisition/sequence parameters: field strength, scanning sequence, TE (array for multi-echo), TR, flip angle, DiffusionBValue (array for DWI), pixel bandwidth, receive coil, number of temporal positions |
169168
| `pt_index` | 1 row = 1 PET series | PET acquisition/reconstruction/radiopharmaceutical parameters: series type, units, decay/scatter/attenuation correction, reconstruction method, radionuclide, injected dose, frame duration (array for dynamic PET) |
170-
| `prior_versions_index` | 1 row = 1 DICOM series | Series that have been removed or superseded in previous IDC releases; use only to download deprecated/historical data — do not query for current data |
169+
| `prior_versions_index` | 1 row = 1 DICOM series | **Reproducibility only.** Contains series permanently removed from IDC (all `max_idc_version` < current version; zero overlap with `index`). Use ONLY when a user explicitly needs to reproduce work from a prior IDC version using data no longer in the current release. Do NOT use for version history or "what's new" questions — those use `series_init_idc_version`/`series_revised_idc_version` in the main `index` table. Column names `min_idc_version`/`max_idc_version` here are NOT equivalent to `series_init_idc_version`/`series_revised_idc_version` in `index`. |
171170

172171
### Joining Tables
173172

174-
**Key columns are not explicitly labeled, the following is a subset that can be used in joins.**
173+
**`SeriesInstanceUID` is the universal join key** for all series-level specialized tables: `sm_index`, `sm_instance_index`, `seg_index`, `ann_index`, `ann_group_index`, `contrast_index`, `volume_geometry_index`, `rtstruct_index`, `ct_index`, `mr_index`, `pt_index`. Always join these to `index` on `SeriesInstanceUID`. The exceptions below use different column names.
175174

176175
| Join Column | Tables | Use Case |
177176
|-------------|--------|----------|
178177
| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
179-
| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
180-
| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
181-
| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
182178
| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
183179
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
184-
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
185-
| `Modality` | index, prior_versions_index | Filter by imaging modality |
186-
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index, volume_geometry_index | Link series to seg/ann/contrast/geometry index tables |
187-
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
188-
| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
189-
| `SeriesInstanceUID` / `referenced_SeriesInstanceUID` | index, rtstruct_index | Join RTSTRUCT series to its metadata (index.SeriesInstanceUID = rtstruct_index.SeriesInstanceUID); use rtstruct_index.referenced_SeriesInstanceUID to find the source image series |
190-
| `SeriesInstanceUID` | index, ct_index | Link CT series to acquisition/reconstruction parameters |
191-
| `SeriesInstanceUID` | index, mr_index | Link MR series to sequence/acquisition parameters |
192-
| `SeriesInstanceUID` | index, pt_index | Link PET series to acquisition/radiopharmaceutical parameters |
180+
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (`seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID`) |
181+
| `referenced_SeriesInstanceUID` | ann_index → index, rtstruct_index → index | Link annotation or RTSTRUCT to its source image series |
193182

194183
**Note:** `subjects`, `updated`, and `description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
195184

185+
**Note on `prior_versions_index`:** Joining `prior_versions_index` with `index` on `SeriesInstanceUID` always returns zero rows — there is no overlap. This table is for historical reproducibility only; never join it with `index` to answer questions about current data or version history.
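
The join rules above can be sketched on toy data. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for the DuckDB engine behind `client.sql_query()`, every UID and row is made up, and the zero overlap holds here because the toy tables were built that way. Note that SQLite requires the table name `"index"` to be quoted.

```python
import sqlite3

# Toy stand-ins for the real tables; all UIDs and rows are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "index" (SeriesInstanceUID TEXT, Modality TEXT);
CREATE TABLE seg_index (SeriesInstanceUID TEXT, segmented_SeriesInstanceUID TEXT);
CREATE TABLE prior_versions_index (SeriesInstanceUID TEXT, max_idc_version INTEGER);

INSERT INTO "index" VALUES ('1.2.3.1', 'CT'), ('1.2.3.2', 'SEG');
INSERT INTO seg_index VALUES ('1.2.3.2', '1.2.3.1');
INSERT INTO prior_versions_index VALUES ('1.2.3.9', 17);
""")

# Universal key: the SEG row joins to its own entry in index on SeriesInstanceUID.
# Exception column: segmented_SeriesInstanceUID points at the source image series.
seg_to_source = con.execute("""
    SELECT seg.SeriesInstanceUID, src.SeriesInstanceUID, src.Modality
    FROM seg_index AS seg
    JOIN "index" AS seg_row ON seg.SeriesInstanceUID = seg_row.SeriesInstanceUID
    JOIN "index" AS src ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
""").fetchall()

# Removed series never appear in index, so this join matches nothing.
overlap = con.execute("""
    SELECT COUNT(*)
    FROM prior_versions_index AS p
    JOIN "index" AS i ON p.SeriesInstanceUID = i.SeriesInstanceUID
""").fetchone()[0]

print(seg_to_source)
print("prior_versions_index rows also in index:", overlap)
```

Against real data, the same two queries would go through `client.sql_query()` after `client.fetch_index()` has loaded the tables involved.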
186+
196187
For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see `references/index_tables_guide.md`.
197188

198189
### Clinical Data Access
@@ -252,21 +243,6 @@ All idc-index metadata tables are published as Parquet files to a public GCS buc
252243

253244
See `references/parquet_access_guide.md` for URL patterns, available files, and DuckDB query examples.
254245

255-
## Installation and Setup
256-
257-
**Required (for basic access):**
258-
```bash
259-
pip install --upgrade idc-index
260-
```
261-
262-
**Important:** New IDC data release will always trigger a new version of `idc-index`. Always use `--upgrade` flag while installing, unless an older version is needed for reproducibility.
263-
264-
**Optional (for data analysis):**
265-
```bash
266-
# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
267-
pip install pandas numpy pydicom
268-
```
269-
270246
## Core Capabilities
271247

272248
### 1. Data Discovery and Exploration
@@ -386,6 +362,55 @@ results = client.sql_query("""
386362

387363
**Note:** Cancer type is in `collections_index.cancer_types`, not in the primary `index` table.
388364

365+
**Version tracking — "what's new in IDC vX?"**
366+
367+
Use `series_init_idc_version` and `series_revised_idc_version` in the main `index` table. Do NOT use `prior_versions_index` for this — it contains only removed series.
368+
369+
```python
370+
from idc_index import IDCClient
371+
client = IDCClient()
372+
373+
VERSION = 24 # Replace with target version
374+
375+
# Series added for the first time in vVERSION
376+
new_series = client.sql_query(f"""
377+
SELECT collection_id,
378+
COUNT(DISTINCT SeriesInstanceUID) as new_series,
379+
ROUND(SUM(series_size_MB)/1000, 2) as size_GB
380+
FROM index
381+
WHERE series_init_idc_version = {VERSION}
382+
GROUP BY collection_id
383+
ORDER BY new_series DESC
384+
""")
385+
386+
# Series revised (updated content) in vVERSION but originally added earlier
387+
revised_series = client.sql_query(f"""
388+
SELECT collection_id,
389+
COUNT(DISTINCT SeriesInstanceUID) as revised_series
390+
FROM index
391+
WHERE series_revised_idc_version = {VERSION}
392+
AND series_init_idc_version < {VERSION}
393+
GROUP BY collection_id
394+
ORDER BY revised_series DESC
395+
""")
396+
397+
# When was each collection first added to IDC?
398+
client.fetch_index("version_metadata_index")
399+
first_appearance = client.sql_query("""
400+
WITH first_versions AS (
401+
SELECT collection_id, MIN(series_init_idc_version) as first_version
402+
FROM index
403+
GROUP BY collection_id
404+
)
405+
SELECT f.collection_id, f.first_version, v.version_timestamp as first_release_date
406+
FROM first_versions f
407+
JOIN version_metadata_index v ON f.first_version = v.idc_version
408+
ORDER BY f.first_version DESC
409+
""")
410+
```
411+
412+
To verify column names and descriptions before writing queries, use `client.get_index_schema('index')` or `client.indices_overview` — see Best Practices.
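
To make the "new" versus "revised" distinction concrete, here is a minimal sketch of the same two filters on a three-row toy table. It uses the standard-library `sqlite3` module instead of `idc-index`, the collection names and series IDs are hypothetical, and it assumes that a newly added series carries the same value in both version columns, which is why the revised filter also requires `series_init_idc_version < VERSION`.

```python
import sqlite3

VERSION = 24  # target release, matching the queries above

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "index" (
    collection_id TEXT,
    SeriesInstanceUID TEXT,
    series_init_idc_version INTEGER,
    series_revised_idc_version INTEGER
);
-- Hypothetical rows: s1 is new in v24, s2 was added in v10 and revised in v24,
-- s3 was last touched in v12 and should match neither filter.
INSERT INTO "index" VALUES
    ('coll_a', 's1', 24, 24),
    ('coll_a', 's2', 10, 24),
    ('coll_b', 's3', 10, 12);
""")

# Series first added in the target version.
new_rows = con.execute(
    """
    SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID)
    FROM "index"
    WHERE series_init_idc_version = ?
    GROUP BY collection_id
    ORDER BY collection_id
    """,
    (VERSION,),
).fetchall()

# Series revised in the target version but first added earlier.
revised_rows = con.execute(
    """
    SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID)
    FROM "index"
    WHERE series_revised_idc_version = ?
      AND series_init_idc_version < ?
    GROUP BY collection_id
    ORDER BY collection_id
    """,
    (VERSION, VERSION),
).fetchall()

print("new:", new_rows)
print("revised:", revised_rows)
```

Without the `series_init_idc_version < VERSION` condition, `s1` would be counted in both result sets under the assumption stated above.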
413+
389414
### 3. Downloading DICOM Files
390415

391416
Download imaging data efficiently from IDC's cloud storage.
@@ -481,69 +506,14 @@ To identify files, use the `crdc_instance_uuid` column in queries or read DICOM
481506

482507
### Command-Line Download
483508

484-
The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.
485-
486-
**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).
509+
`idc download` is available after installing `idc-index`. It auto-detects the input type: collection ID, series UID, or manifest file path.
487510

488511
```bash
489-
# Download entire collection
490512
idc download rider_pilot --download-dir ./data
491-
492-
# Download specific series by UID
493-
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data
494-
495-
# Download multiple items (comma-separated)
496-
idc download "tcga_luad,tcga_lusc" --download-dir ./data
497-
498-
# Download from manifest file (auto-detected)
499513
idc download manifest.txt --download-dir ./data
500514
```
501515

502-
**Options:**
503-
504-
| Option | Description |
505-
|--------|-------------|
506-
| `--download-dir` | Output directory (default: current directory) |
507-
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
508-
| `--log-level` | Verbosity: debug, info, warning, error, critical |
509-
510-
**Manifest files:**
511-
512-
Manifest files contain S3 URLs (one per line) and can be:
513-
- Exported from the IDC Portal after cohort selection
514-
- Shared by collaborators for reproducible data access
515-
- Generated programmatically from query results
516-
517-
Format (one S3 URL per line):
518-
```
519-
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
520-
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
521-
```
522-
523-
**Example: Generate manifest from Python query:**
524-
525-
```python
526-
from idc_index import IDCClient
527-
528-
client = IDCClient()
529-
530-
# Query for series URLs
531-
results = client.sql_query("""
532-
SELECT series_aws_url
533-
FROM index
534-
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
535-
""")
536-
537-
# Save as manifest file
538-
with open('ct_manifest.txt', 'w') as f:
539-
for url in results['series_aws_url']:
540-
f.write(url + '\n')
541-
```
542-
543-
Then download:
544-
```bash
545-
idc download ct_manifest.txt --download-dir ./ct_data
546-
```
516+
See `references/cli_guide.md` for full options, `idc download-from-manifest` (resume support), and `idc download-from-selection` (filter-based).
547517

548518
### 4. Visualizing IDC Images
549519

@@ -687,6 +657,7 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
687657

688658
## Best Practices
689659

660+
- **Check schema before writing queries** — Use `client.get_index_schema('index')` (reads cached metadata, no SQL executed) or `client.indices_overview` to see all available columns and their descriptions. The version-tracking columns `series_init_idc_version` and `series_revised_idc_version` in the main `index` table directly answer "what's new / when was this added" questions without touching `prior_versions_index`.
690661
- **Never use web search for IDC data content questions** - Always query the idc-index directly using `client.sql_query()`. Web sources (release notes, blog posts, documentation pages) are frequently out of date and will produce incorrect answers. The local DuckDB index is the authoritative source; use it even when web search is available.
691662
- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v24). If using an older version, recommend `pip install --upgrade idc-index`
692663
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
@@ -701,7 +672,7 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
701672

702673
**Issue: `ModuleNotFoundError: No module named 'idc_index'`**
703674
- **Cause:** idc-index package not installed
704-
- **Solution:** Install with `pip install --upgrade idc-index`
675+
- **Solution:** Install with `pip install --upgrade idc-index`; for data analysis, also run `pip install pandas numpy pydicom` (tested with pandas>=1.5, numpy>=1.23, pydicom>=2.3)
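
A quick way to see which of these modules are missing before running anything is sketched below. It relies only on the standard library; `missing_modules` is a hypothetical helper name, not part of `idc-index`. The names passed in are import names, so `idc_index` corresponds to the `idc-index` package on PyPI.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Required module first, then the optional data-analysis dependencies.
missing = missing_modules(["idc_index", "pandas", "numpy", "pydicom"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules found.")
```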
705676

706677
**Issue: Download fails with connection timeout**
707678
- **Cause:** Network instability or large download size
