Commit fd8314b

fedorov and claude committed
Improve skill structure: description, allowed-tools, size reduction
- Tighten frontmatter description to be directive about triggering
- Add compatibility/allowed-tools: [Bash, Read, WebFetch, WebSearch]
- Extract batch processing and analysis pipeline sections to use_cases.md
- SKILL.md: 865 → 775 lines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 826237e commit fd8314b

3 files changed

Lines changed: 104 additions & 97 deletions

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -19,6 +19,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
 
 ### Changed
 
+- Updated frontmatter description to be directive about skill triggering: now explicitly instructs invocation for IDC-related queries even without the word "IDC" in the prompt
+- Added `compatibility: allowed-tools: [Bash, Read, WebFetch, WebSearch]` to frontmatter to declare required tool scope: Bash for running code, Read for loading reference guides, WebFetch for documentation pages, WebSearch for current information
+- Extracted "Batch Processing and Filtering" (section 6) from SKILL.md to `references/use_cases.md` (Use Case 5); replaced inline code block with a 2-sentence summary and pointer
+- Extracted "Integration with Analysis Pipelines" (section 9) from SKILL.md to `references/use_cases.md` (Use Case 6); replaced inline pydicom/SimpleITK code blocks with a 2-sentence summary and pointer
+- SKILL.md reduced from 865 → 775 lines (−90 lines); `references/use_cases.md` expanded from 187 → 278 lines
 - Updated to idc-index 0.12.1 (idc-index-data 24.0.4, IDC data version v24)
 - IDC v24 adds 15 new collections (161 → 176), ~39K new series, ~4 TB new data (99.27 TB total, 85,682 cases)
 - Updated `collections_index` column names to snake_case (idc-index-data 24.0.0 breaking change):

SKILL.md

Lines changed: 7 additions & 97 deletions
@@ -1,7 +1,9 @@
 ---
 name: imaging-data-commons
-description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
+description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required.
 license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
+compatibility:
+  allowed-tools: [Bash, Read, WebFetch, WebSearch]
 metadata:
   version: 1.6.0
   skill-author: Andrey Fedorov, @fedorov
@@ -607,43 +609,9 @@ bibtex_citations = client.citations_from_selection(
 
 ### 6. Batch Processing and Filtering
 
-Process large datasets efficiently with filtering:
+For large downloads, query first to build a manifest, save it to CSV for reproducibility, then iterate over slices of the result DataFrame with `download_from_selection()` using a `batch_size` of 10–20 series to avoid timeouts.
 
-```python
-from idc_index import IDCClient
-import pandas as pd
-
-client = IDCClient()
-
-# Find chest CT scans from GE scanners
-query = """
-SELECT
-  SeriesInstanceUID,
-  PatientID,
-  collection_id,
-  ManufacturerModelName
-FROM index
-WHERE Modality = 'CT'
-  AND BodyPartExamined = 'CHEST'
-  AND Manufacturer = 'GE MEDICAL SYSTEMS'
-  AND license_short_name = 'CC BY 4.0'
-LIMIT 100
-"""
-
-results = client.sql_query(query)
-
-# Save manifest for later
-results.to_csv('lung_ct_manifest.csv', index=False)
-
-# Download in batches to avoid timeout
-batch_size = 10
-for i in range(0, len(results), batch_size):
-    batch = results.iloc[i:i+batch_size]
-    client.download_from_selection(
-        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
-        downloadDir=f"./data/batch_{i//batch_size}"
-    )
-```
+See `references/use_cases.md` (Use Case 5) for a complete worked example with manufacturer filtering, manifest saving, and batched downloads.
 
 ### 7. Advanced Queries with BigQuery
 
@@ -684,67 +652,9 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e
 
 ### 9. Integration with Analysis Pipelines
 
-Integrate IDC data into imaging analysis workflows:
-
-**Read downloaded DICOM files:**
-```python
-import pydicom
-import os
-
-# Read DICOM files from downloaded series
-series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
+After downloading DICOM files, use `pydicom` to read individual files or build 3D numpy arrays sorted by `ImagePositionPatient`. For a more robust reader with automatic series sorting and ITK image output, use `SimpleITK.ImageSeriesReader`.
 
-dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
-               if f.endswith('.dcm')]
-
-# Load first image
-ds = pydicom.dcmread(dicom_files[0])
-print(f"Patient ID: {ds.PatientID}")
-print(f"Modality: {ds.Modality}")
-print(f"Image shape: {ds.pixel_array.shape}")
-```
-
-**Build 3D volume from CT series:**
-```python
-import pydicom
-import numpy as np
-from pathlib import Path
-
-def load_ct_series(series_path):
-    """Load CT series as 3D numpy array"""
-    files = sorted(Path(series_path).glob('*.dcm'))
-    slices = [pydicom.dcmread(str(f)) for f in files]
-
-    # Sort by slice location
-    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
-
-    # Stack into 3D array
-    volume = np.stack([s.pixel_array for s in slices])
-
-    return volume, slices[0]  # Return volume and first slice for metadata
-
-volume, metadata = load_ct_series("./data/lung_ct/series_dir")
-print(f"Volume shape: {volume.shape}")  # (z, y, x)
-```
-
-**Integrate with SimpleITK:**
-```python
-import SimpleITK as sitk
-from pathlib import Path
-
-# Read DICOM series
-series_path = "./data/ct_series"
-reader = sitk.ImageSeriesReader()
-dicom_names = reader.GetGDCMSeriesFileNames(series_path)
-reader.SetFileNames(dicom_names)
-image = reader.Execute()
-
-# Apply processing
-smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)
-
-# Save as NIfTI
-sitk.WriteImage(smoothed, "processed_volume.nii.gz")
-```
+See `references/use_cases.md` (Use Case 6) for code examples reading DICOM with pydicom, building 3D CT volumes, and integrating with SimpleITK.
 
 ## Common Use Cases

references/use_cases.md

Lines changed: 92 additions & 0 deletions
@@ -178,6 +178,98 @@ client.download_from_selection(
 cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
 ```
 
+## Use Case 5: Batch Download with Filtering
+
+**Objective:** Download a large filtered dataset in batches to avoid timeouts
+
+**Steps:**
+```python
+from idc_index import IDCClient
+import pandas as pd
+
+client = IDCClient()
+
+# Find chest CT scans from GE scanners with a permissive license
+query = """
+SELECT
+  SeriesInstanceUID,
+  PatientID,
+  collection_id,
+  ManufacturerModelName
+FROM index
+WHERE Modality = 'CT'
+  AND BodyPartExamined = 'CHEST'
+  AND Manufacturer = 'GE MEDICAL SYSTEMS'
+  AND license_short_name = 'CC BY 4.0'
+LIMIT 100
+"""
+
+results = client.sql_query(query)
+
+# Save manifest for reproducibility
+results.to_csv('lung_ct_manifest.csv', index=False)
+
+# Download in batches to avoid timeout
+batch_size = 10
+for i in range(0, len(results), batch_size):
+    batch = results.iloc[i:i+batch_size]
+    client.download_from_selection(
+        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
+        downloadDir=f"./data/batch_{i//batch_size}"
+    )
+```
+
+## Use Case 6: Integration with Analysis Pipelines
+
+**Objective:** Load downloaded DICOM files into Python for processing
+
+**Read individual DICOM files with pydicom:**
+```python
+import pydicom
+import os
+
+series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
+
+dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
+               if f.endswith('.dcm')]
+
+ds = pydicom.dcmread(dicom_files[0])
+print(f"Patient ID: {ds.PatientID}")
+print(f"Modality: {ds.Modality}")
+print(f"Image shape: {ds.pixel_array.shape}")
+```
+
+**Build 3D volume from CT series:**
+```python
+import pydicom
+import numpy as np
+from pathlib import Path
+
+def load_ct_series(series_path):
+    files = sorted(Path(series_path).glob('*.dcm'))
+    slices = [pydicom.dcmread(str(f)) for f in files]
+    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
+    volume = np.stack([s.pixel_array for s in slices])
+    return volume, slices[0]
+
+volume, metadata = load_ct_series("./data/lung_ct/series_dir")
+print(f"Volume shape: {volume.shape}")  # (z, y, x)
+```
+
+**Load DICOM series with SimpleITK (recommended for correct geometry):**
+```python
+import SimpleITK as sitk
+
+series_path = "./data/ct_series"
+reader = sitk.ImageSeriesReader()
+dicom_names = reader.GetGDCMSeriesFileNames(series_path)
+reader.SetFileNames(dicom_names)
+image = reader.Execute()
+
+smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)
+sitk.WriteImage(smoothed, "processed_volume.nii.gz")
+```
+
 ## Resources
 
 - Main SKILL.md for core API patterns (query, download, visualize)
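
Note on Use Case 5: the batching loop this commit moves into `references/use_cases.md` is a generic chunking pattern. The sketch below isolates that pattern so it can be checked offline. It is not part of the commit: `chunked` and `download_in_batches` are hypothetical helper names, and the `download` argument is a stand-in callable that replaces the real client here.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def download_in_batches(
    series_uids: Sequence[str],
    download: Callable[[List[str], str], None],
    batch_size: int = 10,
    root: str = "./data",
) -> List[str]:
    """Call `download(uids, directory)` once per batch; return the directories used."""
    directories = []
    for n, batch in enumerate(chunked(series_uids, batch_size)):
        target = f"{root}/batch_{n}"  # same naming scheme as the diff
        download(batch, target)
        directories.append(target)
    return directories


# Stub downloader (hypothetical) that only records what it was asked to fetch.
calls = []
dirs = download_in_batches(
    [f"uid-{i}" for i in range(25)],
    lambda uids, directory: calls.append((directory, len(uids))),
)
print(calls)
# [('./data/batch_0', 10), ('./data/batch_1', 10), ('./data/batch_2', 5)]
```

With idc-index, the callable could wrap the call shown in the diff, for example `lambda uids, d: client.download_from_selection(seriesInstanceUID=uids, downloadDir=d)`. Whether the target directory has to exist beforehand is an assumption worth checking against the installed version.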

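Note on Use Case 6: `load_ct_series` in the diff sorts slices by `ImagePositionPatient[2]`, which orders them correctly only when the series is acquired roughly axially. A more general ordering projects each slice position onto the slice normal, which is the cross product of the row and column direction cosines in `ImageOrientationPatient`. The sketch below is pure Python and not part of the commit. The helper names are hypothetical, and the input is a list of plain tuples, not pydicom datasets.

```python
from typing import Any, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def slice_normal(iop: Sequence[float]) -> Vec3:
    """Cross product of the row and column direction cosines (6 values)."""
    rx, ry, rz, cx, cy, cz = (float(v) for v in iop)
    return (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)


def slice_position(ipp: Sequence[float], iop: Sequence[float]) -> float:
    """Signed distance of the slice origin along the slice normal."""
    normal = slice_normal(iop)
    return sum(float(p) * n for p, n in zip(ipp, normal))


def sort_slices(slices: List[Tuple[Sequence[float], Sequence[float], Any]]):
    """Sort (ImagePositionPatient, ImageOrientationPatient, payload) tuples."""
    return sorted(slices, key=lambda s: slice_position(s[0], s[1]))


# Coronal example: every slice has z == 0, so a z-only sort cannot order them.
coronal = [1, 0, 0, 0, 0, -1]
slices = [
    ((0, 30, 0), coronal, "c"),
    ((0, 10, 0), coronal, "a"),
    ((0, 20, 0), coronal, "b"),
]
ordered = sort_slices(slices)
print([payload for _, _, payload in ordered])  # ['a', 'b', 'c']
```

With pydicom datasets, the same key could be built from `ds.ImagePositionPatient` and `ds.ImageOrientationPatient`. `SimpleITK.ImageSeriesReader`, which the diff recommends for correct geometry, handles this ordering without a manual sort.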