|
1 | 1 | --- |
2 | 2 | name: imaging-data-commons |
3 | | -description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses. |
| 3 | +description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Invoke for any question about IDC collections, cancer imaging datasets, DICOM data access, radiology (CT, MR, PET) or pathology AI training sets, metadata queries, visualization, or license checks — even when the user doesn't explicitly mention "IDC". No authentication required. |
4 | 4 | license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data. |
| 5 | +compatibility: |
| 6 | + allowed-tools: [Bash, Read, WebFetch, WebSearch] |
5 | 7 | metadata: |
6 | 8 | version: 1.6.0 |
7 | 9 | skill-author: Andrey Fedorov, @fedorov |
@@ -607,43 +609,9 @@ bibtex_citations = client.citations_from_selection( |
607 | 609 |
|
608 | 610 | ### 6. Batch Processing and Filtering |
609 | 611 |
|
610 | | -Process large datasets efficiently with filtering: |
| 612 | +For large downloads, query first to build a manifest and save it to CSV for reproducibility. Then iterate over slices of the result DataFrame and pass each slice to `download_from_selection()`, keeping each batch to 10–20 series (the `batch_size` used when slicing) to avoid timeouts. |
611 | 613 |
|
612 | | -```python |
613 | | -from idc_index import IDCClient |
614 | | -import pandas as pd |
615 | | - |
616 | | -client = IDCClient() |
617 | | - |
618 | | -# Find chest CT scans from GE scanners |
619 | | -query = """ |
620 | | -SELECT |
621 | | - SeriesInstanceUID, |
622 | | - PatientID, |
623 | | - collection_id, |
624 | | - ManufacturerModelName |
625 | | -FROM index |
626 | | -WHERE Modality = 'CT' |
627 | | - AND BodyPartExamined = 'CHEST' |
628 | | - AND Manufacturer = 'GE MEDICAL SYSTEMS' |
629 | | - AND license_short_name = 'CC BY 4.0' |
630 | | -LIMIT 100 |
631 | | -""" |
632 | | - |
633 | | -results = client.sql_query(query) |
634 | | - |
635 | | -# Save manifest for later |
636 | | -results.to_csv('lung_ct_manifest.csv', index=False) |
637 | | - |
638 | | -# Download in batches to avoid timeout |
639 | | -batch_size = 10 |
640 | | -for i in range(0, len(results), batch_size): |
641 | | - batch = results.iloc[i:i+batch_size] |
642 | | - client.download_from_selection( |
643 | | - seriesInstanceUID=list(batch['SeriesInstanceUID'].values), |
644 | | - downloadDir=f"./data/batch_{i//batch_size}" |
645 | | - ) |
646 | | -``` |
| 614 | +See `references/use_cases.md` (Use Case 5) for a complete worked example with manufacturer filtering, manifest saving, and batched downloads. |
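The batching step described in this section can be sketched without any IDC dependency. This is a minimal illustration, not the skill's own code: `iter_batches`, `download_in_batches`, the `download` callback, and the `root` directory are hypothetical names introduced here for the example.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def download_in_batches(
    series_uids: Sequence[str],
    download: Callable[[List[str], str], None],
    batch_size: int = 10,
    root: str = "./data",
) -> List[str]:
    """Call `download(uids, target_dir)` once per batch.

    `download` is a hypothetical callback supplied by the caller;
    the function returns the target directories it used, in order.
    """
    target_dirs = []
    for index, batch in enumerate(iter_batches(series_uids, batch_size)):
        target_dir = f"{root}/batch_{index}"
        download(batch, target_dir)
        target_dirs.append(target_dir)
    return target_dirs
```

With idc-index, the callback could wrap the call shown in the removed example, `client.download_from_selection(seriesInstanceUID=uids, downloadDir=target_dir)`. Keeping the download behind a callback makes the batching logic testable without network access.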
647 | 615 |
|
648 | 616 | ### 7. Advanced Queries with BigQuery |
649 | 617 |
|
@@ -684,67 +652,9 @@ See `references/bigquery_guide.md` for schemas, column descriptions, and query e |
684 | 652 |
|
685 | 653 | ### 9. Integration with Analysis Pipelines |
686 | 654 |
|
687 | | -Integrate IDC data into imaging analysis workflows: |
688 | | - |
689 | | -**Read downloaded DICOM files:** |
690 | | -```python |
691 | | -import pydicom |
692 | | -import os |
693 | | - |
694 | | -# Read DICOM files from downloaded series |
695 | | -series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..." |
| 655 | +After downloading DICOM files, use `pydicom` to read individual files or to build 3D NumPy arrays, with slices sorted by the z component of `ImagePositionPatient`. For a more robust reader with automatic series sorting and ITK image output, use `SimpleITK.ImageSeriesReader`. |
696 | 656 |
|
697 | | -dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir) |
698 | | - if f.endswith('.dcm')] |
699 | | - |
700 | | -# Load first image |
701 | | -ds = pydicom.dcmread(dicom_files[0]) |
702 | | -print(f"Patient ID: {ds.PatientID}") |
703 | | -print(f"Modality: {ds.Modality}") |
704 | | -print(f"Image shape: {ds.pixel_array.shape}") |
705 | | -``` |
706 | | - |
707 | | -**Build 3D volume from CT series:** |
708 | | -```python |
709 | | -import pydicom |
710 | | -import numpy as np |
711 | | -from pathlib import Path |
712 | | - |
713 | | -def load_ct_series(series_path): |
714 | | - """Load CT series as 3D numpy array""" |
715 | | - files = sorted(Path(series_path).glob('*.dcm')) |
716 | | - slices = [pydicom.dcmread(str(f)) for f in files] |
717 | | - |
718 | | - # Sort by slice location |
719 | | - slices.sort(key=lambda x: float(x.ImagePositionPatient[2])) |
720 | | - |
721 | | - # Stack into 3D array |
722 | | - volume = np.stack([s.pixel_array for s in slices]) |
723 | | - |
724 | | - return volume, slices[0] # Return volume and first slice for metadata |
725 | | - |
726 | | -volume, metadata = load_ct_series("./data/lung_ct/series_dir") |
727 | | -print(f"Volume shape: {volume.shape}") # (z, y, x) |
728 | | -``` |
729 | | - |
730 | | -**Integrate with SimpleITK:** |
731 | | -```python |
732 | | -import SimpleITK as sitk |
733 | | -from pathlib import Path |
734 | | - |
735 | | -# Read DICOM series |
736 | | -series_path = "./data/ct_series" |
737 | | -reader = sitk.ImageSeriesReader() |
738 | | -dicom_names = reader.GetGDCMSeriesFileNames(series_path) |
739 | | -reader.SetFileNames(dicom_names) |
740 | | -image = reader.Execute() |
741 | | - |
742 | | -# Apply processing |
743 | | -smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5) |
744 | | - |
745 | | -# Save as NIfTI |
746 | | -sitk.WriteImage(smoothed, "processed_volume.nii.gz") |
747 | | -``` |
| 657 | +See `references/use_cases.md` (Use Case 6) for code examples reading DICOM with pydicom, building 3D CT volumes, and integrating with SimpleITK. |
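The slice-ordering step mentioned in this section can be sketched as follows. Note that this is a generalization rather than the approach in the original example: instead of sorting on `ImagePositionPatient[2]` alone, which assumes an axial acquisition, it projects each slice position onto the image-plane normal derived from `ImageOrientationPatient`. The helper names `slice_distance` and `sort_slices` are hypothetical, and `SimpleNamespace` objects stand in for `pydicom` datasets so the sketch runs without any DICOM files.

```python
from types import SimpleNamespace


def slice_distance(ds):
    """Return the slice position along the normal of its image plane."""
    rx, ry, rz, cx, cy, cz = (float(v) for v in ds.ImageOrientationPatient)
    # The normal is the cross product of the row and column direction cosines.
    normal = (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)
    position = [float(v) for v in ds.ImagePositionPatient]
    return sum(n * p for n, p in zip(normal, position))


def sort_slices(slices):
    """Order slices by their distance along the plane normal."""
    return sorted(slices, key=slice_distance)


# Stand-in objects (an assumption for illustration): three axial slices
# listed out of order, carrying only the two attributes the sort needs.
demo = [
    SimpleNamespace(
        ImageOrientationPatient=[1, 0, 0, 0, 1, 0],
        ImagePositionPatient=[0.0, 0.0, z],
    )
    for z in (5.0, -2.5, 0.0)
]
ordered_z = [s.ImagePositionPatient[2] for s in sort_slices(demo)]
```

For purely axial series the projection reduces to the z coordinate, so the result matches a plain sort on `ImagePositionPatient[2]`. The sorted datasets can then be stacked with `np.stack([s.pixel_array for s in slices])`, as in the removed example.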
748 | 658 |
|
749 | 659 | ## Common Use Cases |
750 | 660 |
|
|