9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,15 @@ All notable changes to the IDC Claude Skill are documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/),
and this project adheres to [Semantic Versioning](https://semver.org/).

## [1.6.4] - 2026-05-22

### Changed

- Added version tracking guidance: "what's new in vX" workflow using `series_init_idc_version`/`series_revised_idc_version` in `index`; clarified `prior_versions_index` is for reproducibility only (zero overlap with `index`, column names differ from main index version columns)
- Collapsed five `SeriesInstanceUID` join rows into a single universal-key statement; table now covers only non-obvious join columns
- Removed Installation and Setup section (duplicated the CRITICAL version-check block); folded optional deps into `ModuleNotFoundError` Troubleshooting entry
- Trimmed "Command-Line Download" inline section from ~60 lines to 5; full CLI coverage (`download-from-manifest`, `download-from-selection`, all options) remains in `references/cli_guide.md`

## [1.6.3] - 2026-05-09

### Added
149 changes: 60 additions & 89 deletions SKILL.md
@@ -3,7 +3,7 @@ name: imaging-data-commons
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
metadata:
version: 1.6.3
version: 1.6.4
skill-author: Andrey Fedorov, @fedorov
idc-index: "0.12.3"
idc-data-version: "v24"
@@ -81,7 +81,6 @@ print(stats)
**Core Sections (inline):**
- IDC Data Model - Collection and analysis result hierarchy
- Index Tables - Available tables and joining patterns
- Installation - Package setup and version verification
- Core Capabilities - Essential API patterns (query, download, visualize, license, citations)
- Best Practices - Usage guidelines
- Troubleshooting - Common issues and solutions
@@ -167,32 +166,24 @@ Always call `client.fetch_index("table_name")` before querying any index table
| `ct_index` | 1 row = 1 CT series | CT acquisition/reconstruction parameters: pixel spacing, slice thickness, kVp, convolution kernel, tube current (min/max for dose-modulated), exposure, spiral pitch, scan options |
| `mr_index` | 1 row = 1 MR series | MR acquisition/sequence parameters: field strength, scanning sequence, TE (array for multi-echo), TR, flip angle, DiffusionBValue (array for DWI), pixel bandwidth, receive coil, number of temporal positions |
| `pt_index` | 1 row = 1 PET series | PET acquisition/reconstruction/radiopharmaceutical parameters: series type, units, decay/scatter/attenuation correction, reconstruction method, radionuclide, injected dose, frame duration (array for dynamic PET) |
| `prior_versions_index` | 1 row = 1 DICOM series | Series that have been removed or superseded in previous IDC releases; use only to download deprecated/historical data — do not query for current data |
| `prior_versions_index` | 1 row = 1 DICOM series | **Reproducibility only.** Contains series permanently removed from IDC (all `max_idc_version` < current version; zero overlap with `index`). Use ONLY when a user explicitly needs to reproduce work from a prior IDC version using data no longer in the current release. Do NOT use for version history or "what's new" questions — those use `series_init_idc_version`/`series_revised_idc_version` in the main `index` table. Column names `min_idc_version`/`max_idc_version` here are NOT equivalent to `series_init_idc_version`/`series_revised_idc_version` in `index`. |

### Joining Tables

**Key columns are not explicitly labeled; the following is a subset that can be used in joins.**
**`SeriesInstanceUID` is the universal join key** for all series-level specialized tables: `sm_index`, `sm_instance_index`, `seg_index`, `ann_index`, `ann_group_index`, `contrast_index`, `volume_geometry_index`, `rtstruct_index`, `ct_index`, `mr_index`, `pt_index`. Always join these to `index` on `SeriesInstanceUID`. The exceptions below use different column names.

| Join Column | Tables | Use Case |
|-------------|--------|----------|
| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
| `Modality` | index, prior_versions_index | Filter by imaging modality |
| `SeriesInstanceUID` | index, seg_index, ann_index, ann_group_index, contrast_index, volume_geometry_index | Link series to seg/ann/contrast/geometry index tables |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
| `referenced_SeriesInstanceUID` | ann_index → index | Link annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID) |
| `SeriesInstanceUID` / `referenced_SeriesInstanceUID` | index, rtstruct_index | Join RTSTRUCT series to its metadata (index.SeriesInstanceUID = rtstruct_index.SeriesInstanceUID); use rtstruct_index.referenced_SeriesInstanceUID to find the source image series |
| `SeriesInstanceUID` | index, ct_index | Link CT series to acquisition/reconstruction parameters |
| `SeriesInstanceUID` | index, mr_index | Link MR series to sequence/acquisition parameters |
| `SeriesInstanceUID` | index, pt_index | Link PET series to acquisition/radiopharmaceutical parameters |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (`seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID`) |
| `referenced_SeriesInstanceUID` | ann_index → index, rtstruct_index → index | Link annotation or RTSTRUCT to its source image series |

**Note:** `subjects`, `updated`, and `description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).

**Note on `prior_versions_index`:** Joining `prior_versions_index` with `index` on `SeriesInstanceUID` always returns zero rows — there is no overlap. This table is for historical reproducibility only; never join it with `index` to answer questions about current data or version history.

For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see `references/index_tables_guide.md`.
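
The two join styles described above (the universal `SeriesInstanceUID` key, and the `segmented_SeriesInstanceUID` exception) can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module and toy in-memory tables rather than the real idc-index tables: the UID values are invented, and only the column names are taken from the join table. SQLite needs `index` quoted because it is a reserved word there.

```python
import sqlite3

# Toy in-memory tables. Column names follow the join table above;
# every row value is made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "index" (SeriesInstanceUID TEXT, Modality TEXT);
    CREATE TABLE seg_index (SeriesInstanceUID TEXT,
                            segmented_SeriesInstanceUID TEXT);
    INSERT INTO "index" VALUES ('1.2.1', 'CT'), ('1.2.2', 'SEG');
    INSERT INTO seg_index VALUES ('1.2.2', '1.2.1');
""")

rows = conn.execute("""
    SELECT seg.SeriesInstanceUID AS seg_series,
           src.SeriesInstanceUID AS source_series,
           src.Modality          AS source_modality
    FROM seg_index AS seg
    -- universal key: the segmentation series' own row in index
    JOIN "index" AS meta
      ON seg.SeriesInstanceUID = meta.SeriesInstanceUID
    -- exception column: the image series that was segmented
    JOIN "index" AS src
      ON seg.segmented_SeriesInstanceUID = src.SeriesInstanceUID
""").fetchall()

print(rows)
```

The first join attaches general metadata to the segmentation series itself; the second follows the exception column back to the source image series.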

### Clinical Data Access
@@ -252,21 +243,6 @@ All idc-index metadata tables are published as Parquet files to a public GCS buc

See `references/parquet_access_guide.md` for URL patterns, available files, and DuckDB query examples.

## Installation and Setup

**Required (for basic access):**
```bash
pip install --upgrade idc-index
```

**Important:** Each new IDC data release triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.

**Optional (for data analysis):**
```bash
# Tested with: pandas>=1.5, numpy>=1.23, pydicom>=2.3
pip install pandas numpy pydicom
```

## Core Capabilities

### 1. Data Discovery and Exploration
@@ -386,6 +362,55 @@ results = client.sql_query("""

**Note:** Cancer type is in `collections_index.cancer_types`, not in the primary `index` table.

**Version tracking — "what's new in IDC vX?"**

Use `series_init_idc_version` and `series_revised_idc_version` in the main `index` table. Do NOT use `prior_versions_index` for this — it contains only removed series.

```python
from idc_index import IDCClient
client = IDCClient()

VERSION = 24  # Replace with target version

# Series added for the first time in vVERSION
new_series = client.sql_query(f"""
    SELECT collection_id,
           COUNT(DISTINCT SeriesInstanceUID) as new_series,
           ROUND(SUM(series_size_MB)/1000, 2) as size_GB
    FROM index
    WHERE series_init_idc_version = {VERSION}
    GROUP BY collection_id
    ORDER BY new_series DESC
""")

# Series revised (updated content) in vVERSION but originally added earlier
revised_series = client.sql_query(f"""
    SELECT collection_id,
           COUNT(DISTINCT SeriesInstanceUID) as revised_series
    FROM index
    WHERE series_revised_idc_version = {VERSION}
      AND series_init_idc_version < {VERSION}
    GROUP BY collection_id
    ORDER BY revised_series DESC
""")

# When was each collection first added to IDC?
client.fetch_index("version_metadata_index")
first_appearance = client.sql_query("""
    WITH first_versions AS (
        SELECT collection_id, MIN(series_init_idc_version) as first_version
        FROM index
        GROUP BY collection_id
    )
    SELECT f.collection_id, f.first_version, v.version_timestamp as first_release_date
    FROM first_versions f
    JOIN version_metadata_index v ON f.first_version = v.idc_version
    ORDER BY f.first_version DESC
""")
```

To verify column names and descriptions before writing queries, use `client.get_index_schema('index')` or `client.indices_overview` — see Best Practices.
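
Because the version queries above interpolate `VERSION` into SQL with an f-string, it may be worth validating the value first when it comes from user input. The following is a minimal sketch of that idea, not part of idc-index: `new_series_query` is a hypothetical helper that only builds the query string, which could then be passed to `client.sql_query()`.

```python
def new_series_query(version: int) -> str:
    """Build SQL counting series first added in the given IDC version.

    Hypothetical helper: rejects anything that is not a positive integer
    so that arbitrary text cannot be interpolated into the query.
    """
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        raise ValueError(f"IDC version must be a positive integer, got {version!r}")
    return (
        "SELECT collection_id, "
        "COUNT(DISTINCT SeriesInstanceUID) AS new_series "
        "FROM index "
        f"WHERE series_init_idc_version = {version} "
        "GROUP BY collection_id "
        "ORDER BY new_series DESC"
    )


print(new_series_query(24))
```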

### 3. Downloading DICOM Files

Download imaging data efficiently from IDC's cloud storage.
@@ -481,69 +506,14 @@ To identify files, use the `crdc_instance_uuid` column in queries or read DICOM

### Command-Line Download

The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`.

**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).
`idc download` is available after installing `idc-index`. It auto-detects the input type: collection ID, series UID, or manifest file path.

```bash
# Download entire collection
idc download rider_pilot --download-dir ./data

# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data

# Download from manifest file (auto-detected)
idc download manifest.txt --download-dir ./data
```

**Options:**

| Option | Description |
|--------|-------------|
| `--download-dir` | Output directory (default: current directory) |
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
| `--log-level` | Verbosity: debug, info, warning, error, critical |
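
To make the default `--dir-template` value concrete, here is a rough sketch of how such a template could expand into a directory path. This is an illustration only, not idc-index's actual substitution logic: `expand_dir_template` is a hypothetical function, and the metadata values are invented.

```python
DEFAULT_TEMPLATE = (
    "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID"
)


def expand_dir_template(template: str, values: dict) -> str:
    """Replace each %name placeholder with its value (hypothetical helper).

    Longer names are substituted first so that one placeholder can never
    be mistaken for the prefix of another.
    """
    result = template
    for name in sorted(values, key=len, reverse=True):
        result = result.replace("%" + name, values[name])
    return result


# Invented example metadata for a single series
series = {
    "collection_id": "rider_pilot",
    "PatientID": "P001",
    "StudyInstanceUID": "1.2.3",
    "Modality": "CT",
    "SeriesInstanceUID": "1.2.3.4",
}

path = expand_dir_template(DEFAULT_TEMPLATE, series)
print(path)
```

With these toy values the series would land in `rider_pilot/P001/1.2.3/CT_1.2.3.4` under the download directory.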

**Manifest files:**

Manifest files contain S3 URLs (one per line) and can be:
- Exported from the IDC Portal after cohort selection
- Shared by collaborators for reproducible data access
- Generated programmatically from query results

Format (one S3 URL per line):
```
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
```
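
A manifest in this format can be sanity-checked before downloading. The sketch below uses only the standard library; `parse_manifest_line` is a hypothetical helper, and it assumes every line has exactly the `s3://<bucket>/<folder>/*` shape shown above.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_manifest_line(line: str) -> Tuple[str, str]:
    """Split one manifest line into (bucket, series_folder).

    Hypothetical helper; assumes the s3://<bucket>/<folder>/* shape.
    """
    parsed = urlparse(line.strip())
    path = parsed.path.strip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not path.endswith("/*"):
        raise ValueError(f"not a valid manifest line: {line!r}")
    # Drop the trailing "/*" wildcard to get the folder name
    return parsed.netloc, path[:-2]


manifest_lines = [
    "s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*",
    "s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*",
]

for entry in manifest_lines:
    print(parse_manifest_line(entry))
```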

**Example: Generate manifest from Python query:**

```python
from idc_index import IDCClient

client = IDCClient()

# Query for series URLs
results = client.sql_query("""
SELECT series_aws_url
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")

# Save as manifest file
with open('ct_manifest.txt', 'w') as f:
    for url in results['series_aws_url']:
        f.write(url + '\n')
```

Then download:
```bash
idc download ct_manifest.txt --download-dir ./ct_data
```
See `references/cli_guide.md` for full options, `idc download-from-manifest` (resume support), and `idc download-from-selection` (filter-based).

### 4. Visualizing IDC Images

@@ -687,6 +657,7 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e

## Best Practices

- **Check schema before writing queries** — Use `client.get_index_schema('index')` (reads cached metadata, no SQL executed) or `client.indices_overview` to see all available columns and their descriptions. The version-tracking columns `series_init_idc_version` and `series_revised_idc_version` in the main `index` table directly answer "what's new / when was this added" questions without touching `prior_versions_index`.
- **Never use web search for IDC data content questions** - Always query the idc-index directly using `client.sql_query()`. Web sources (release notes, blog posts, documentation pages) are frequently out of date and will produce incorrect answers. The local DuckDB index is the authoritative source; use it even when web search is available.
- **Verify IDC version before generating responses** - Always call `client.get_idc_version()` at the start of a session to confirm you're using the expected data version (currently v24). If using an older version, recommend `pip install --upgrade idc-index`
- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
@@ -701,7 +672,7 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e

**Issue: `ModuleNotFoundError: No module named 'idc_index'`**
- **Cause:** idc-index package not installed
- **Solution:** Install with `pip install --upgrade idc-index`
- **Solution:** Install with `pip install --upgrade idc-index`; for data analysis, also run `pip install pandas numpy pydicom` (tested with pandas>=1.5, numpy>=1.23, pydicom>=2.3)
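
Before running analysis code, it can help to check which of these packages are importable. The following is a small sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module list simply mirrors the packages named in the solution above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the required and optional packages mentioned above
missing = missing_modules(["idc_index", "pandas", "numpy", "pydicom"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required and optional modules are importable.")
```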

**Issue: Download fails with connection timeout**
- **Cause:** Network instability or large download size