| 1 | +# ENCODE Toolkit |
| 2 | + |
| 3 | +MCP server for the ENCODE Project (encodeproject.org) — the largest public catalog of functional genomic elements. Version 0.3.0-beta. |
| 4 | + |
| 5 | +## Quick Start |
| 6 | + |
| 7 | +```bash |
| 8 | +python -m venv .venv && source .venv/bin/activate |
| 9 | +pip install -e ".[dev]" |
| 10 | +pytest # 506 tests, 98% coverage |
| 11 | +ruff check src/ # lint |
| 12 | +ruff format src/ # auto-format |
| 13 | +encode-toolkit # run MCP server |
| 14 | +``` |
| 15 | + |
| 16 | +## Source Architecture |
| 17 | + |
| 18 | +``` |
| 19 | +src/encode_connector/ |
| 20 | + server/main.py # MCP server — 20 tools, ~1500 lines (entry point) |
| 21 | + client/encode_client.py # Async ENCODE API client, ~585 lines, 1-hour TTL cache |
| 22 | + client/downloader.py # File download manager, ~305 lines, MD5 verification |
| 23 | + client/auth.py # OS keyring + Fernet credential storage, ~262 lines |
| 24 | + client/models.py # Pydantic models for API responses, ~332 lines |
| 25 | + client/constants.py # API URLs, filter values, ~348 lines |
| 26 | + client/tracker.py # SQLite experiment tracker, ~1129 lines |
| 27 | + client/validation.py # Input validation, ~188 lines |
| 28 | +skills/ # 47 skills, each with SKILL.md + references/ + scripts/ |
| 29 | +tests/ # 506 tests (pytest-asyncio, asyncio_mode=auto), 98% coverage |
| 30 | +``` |
| 31 | + |
| 32 | +## Package Identity |
| 33 | + |
| 34 | +- **PyPI / console command**: `encode-toolkit` |
| 35 | +- **npm**: `encode-toolkit` (thin wrapper → uvx encode-toolkit) |
| 36 | +- **Plugin marketplace**: `encode-toolkit` |
| 37 | +- **Python module**: `encode_connector` |
| 38 | + |
| 39 | +## Development Gotchas |
| 40 | + |
| 41 | +- Use `.venv/bin/python` on macOS (`python` may not exist) |
| 42 | +- `asyncio_mode = "auto"` in pytest — no need for `@pytest.mark.asyncio` |
| 43 | +- `main.py` is ~53KB — don't send full contents to subagents (causes timeouts) |
| 44 | +- MCP SDK: use `instructions=` parameter, not `description=` |
| 45 | +- Integration tests hit live ENCODE API — deselect with `-m "not integration"` |
| 46 | +- `server.json` is for MCP registry; `.claude-plugin/plugin.json` is for Claude marketplace — keep both in sync |
| 47 | +- `conftest.py` sets up shared fixtures (tmp_path, mock tracker) — read before adding tests |
| 48 | +- npm `package.json` + `index.js` are thin wrappers that call `uvx encode-toolkit` — no JS logic |
| 49 | + |
| 50 | +## What This Server Does |
| 51 | + |
| 52 | +Provides 20 tools to search, download, and track ENCODE data: |
| 53 | +- **Search**: Find experiments by assay, organ, biosample, target, organism |
| 54 | +- **Download**: Get BED, FASTQ, BAM, bigWig files with MD5 verification |
| 55 | +- **Track**: Local experiment tracking with publications, citations, provenance |
| 56 | +- **Cross-reference**: Link to PubMed, bioRxiv, ClinicalTrials, GEO |
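The MD5 verification mentioned in the Download bullet can be illustrated with a minimal sketch. This is not the code in `client/downloader.py`; the function names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the expected checksum comes from the file's ENCODE metadata.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_md5: str) -> bool:
    """Compare the local digest with the checksum reported for the file."""
    # Normalize case and whitespace so a copy-pasted checksum still matches.
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte BAM and FASTQ files.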
| 57 | + |
| 58 | +## Key Concepts |
| 59 | + |
| 60 | +**Assay types**: Histone ChIP-seq, TF ChIP-seq, ATAC-seq, DNase-seq, RNA-seq, WGBS, Hi-C, scRNA-seq, scATAC-seq, CRISPR screen, STARR-seq, MPRA, eCLIP, CUT&RUN, CUT&Tag |
| 61 | + |
| 62 | +**Biosample hierarchy**: tissue > primary cell > cell line > in vitro differentiated > organoid |
| 63 | + |
| 64 | +**Tier 1 cell lines** (most data): K562, GM12878, H1-hESC |
| 65 | + |
| 66 | +**File selection priority**: preferred_default=True > IDR thresholded peaks > fold change over control |
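One way to apply the file selection priority above is to rank candidate files and take the best one. The sketch below is illustrative, not the toolkit's implementation: it assumes each file is a plain dict with `preferred_default` and `output_type` keys, and `selection_rank` and `pick_best_file` are hypothetical helper names.

```python
from typing import Any, Optional

# Lower rank wins; the order mirrors the stated priority.
OUTPUT_TYPE_RANK = {
    "IDR thresholded peaks": 1,
    "fold change over control": 2,
}


def selection_rank(file_record: dict[str, Any]) -> int:
    """Rank a file record: preferred_default first, then by output type."""
    if file_record.get("preferred_default") is True:
        return 0
    # Any output type not named in the priority list sorts last.
    return OUTPUT_TYPE_RANK.get(file_record.get("output_type"), 3)


def pick_best_file(files: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the highest-priority file, or None if the list is empty."""
    if not files:
        return None
    # min() keeps the first record among ties, so input order breaks ties.
    return min(files, key=selection_rank)
```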
| 67 | + |
| 68 | +**Assembly**: Use GRCh38 for human, mm10 for mouse. Never mix assemblies. |
| 69 | + |
| 70 | +## Tool Selection Guide |
| 71 | + |
| 72 | +| User wants to... | Use tool | |
| 73 | +|---|---| |
| 74 | +| Find experiments | `encode_search_experiments` | |
| 75 | +| Explore what data exists (live counts) | `encode_get_facets` | |
| 76 | +| Get valid filter strings (static list) | `encode_get_metadata` | |
| 77 | +| Get experiment details | `encode_get_experiment` | |
| 78 | +| Find specific file types | `encode_search_files` | |
| 79 | +| List files for experiment | `encode_list_files` | |
| 80 | +| Get file details | `encode_get_file_info` | |
| 81 | +| Download specific files by accession | `encode_download_files` | |
| 82 | +| Search + download in one step | `encode_batch_download` | |
| 83 | +| Track experiments locally | `encode_track_experiment` | |
| 84 | +| Compare experiments | `encode_compare_experiments` | |
| 85 | +| Get citations | `encode_get_citations` | |
| 86 | +| Log derived files | `encode_log_derived_file` | |
| 87 | +| Link to PubMed/GEO | `encode_link_reference` | |
| 88 | +| List tracked experiments | `encode_list_tracked` | |
| 89 | +| Export tracking data | `encode_export_data` | |
| 90 | +| View file provenance | `encode_get_provenance` | |
| 91 | +| View linked references | `encode_get_references` | |
| 92 | +| Get collection summary | `encode_summarize_collection` | |
| 93 | +| Manage API credentials | `encode_manage_credentials` | |
| 94 | + |
| 95 | +### Example Queries |
| 96 | + |
| 97 | +**Search**: `encode_search_experiments(assay_title="Histone ChIP-seq", organ="pancreas", target="H3K27ac")` → finds H3K27ac ChIP-seq experiments in pancreas |
| 98 | + |
| 99 | +**Download**: `encode_download_files(file_accessions=["ENCFF123ABC"], download_dir="/data/encode")` → downloads with MD5 verification |
| 100 | + |
| 101 | +**Track**: `encode_track_experiment(accession="ENCSR000ABC", notes="Liver H3K4me3 for enhancer analysis")` → saves to local SQLite with publications |
| 102 | + |
| 103 | +**Explore**: `encode_get_facets(assay_title="Histone ChIP-seq", organ="pancreas")` → shows available targets, labs, biosample types |
| 104 | + |
| 105 | +**Batch download**: `encode_batch_download(assay_title="ATAC-seq", organ="liver", file_format="bed", output_type="IDR thresholded peaks", dry_run=True)` → previews matching files before download |
| 106 | + |
| 107 | +**Compare**: `encode_compare_experiments(accession1="ENCSR123ABC", accession2="ENCSR456DEF")` → checks compatibility for combined analysis |
| 108 | + |
| 109 | +## 47 Skills Available |
| 110 | + |
| 111 | +**Core**: setup, search-encode, download-encode, track-experiments, cross-reference |
| 112 | + |
| 113 | +**Analysis**: quality-assessment, integrative-analysis, regulatory-elements, epigenome-profiling, compare-biosamples, visualization-workflow, motif-analysis, peak-annotation, batch-analysis |
| 114 | + |
| 115 | +**Functional Genomics**: functional-screen-analysis |
| 116 | + |
| 117 | +**Data Aggregation**: histone-aggregation, accessibility-aggregation, hic-aggregation, methylation-aggregation |
| 118 | + |
| 119 | +**External Databases**: gtex-expression, clinvar-annotation, cellxgene-context, gwas-catalog, jaspar-motifs, ensembl-annotation, geo-connector, gnomad-variants, ucsc-browser |
| 120 | + |
| 121 | +**Workflows**: data-provenance, cite-encode, variant-annotation, pipeline-guide, single-cell-encode, disease-research, publication-trust, bioinformatics-installer, scientific-writing, liftover-coordinates |
| 122 | + |
| 123 | +**Pipeline Execution**: pipeline-chipseq, pipeline-atacseq, pipeline-rnaseq, pipeline-wgbs, pipeline-hic, pipeline-dnaseseq, pipeline-cutandrun |
| 124 | + |
| 125 | +**Meta-Analysis**: scrna-meta-analysis, multi-omics-integration |
| 126 | + |
| 127 | +## Reference Files |
| 128 | + |
| 129 | +- `skills/histone-aggregation/references/histone-marks-reference.md` — Comprehensive chromatin biology catalog (1,442 lines, 74 references, 12 sections, including histone marks, ChromHMM states, functional categories, contradictions, TF combinations, chromatin remodeling, DNA methylation interplay, nucleosome dynamics, 3D genome organization, and chromatin in disease) |
| 130 | +- `skills/*/references/literature.md` — 34 literature reference documents (33 per-skill + 1 chromatin biology catalog, ~320 papers with DOI, PMID, citation counts, key findings) |
| 131 | + |
| 132 | +## Quality Awareness |
| 133 | + |
| 134 | +- ENCODE audits: ERROR > NOT_COMPLIANT > WARNING > INTERNAL_ACTION |
| 135 | +- ChIP-seq metrics: FRiP ≥1%, NSC >1.05, RSC >0.8, NRF ≥0.8 (Landt et al. 2012) |
| 136 | +- ATAC-seq metrics: TSS enrichment ≥6, fragment size nucleosomal ladder (Buenrostro et al. 2013) |
| 137 | +- RNA-seq: Mapping rate >80%, rRNA <10%, replicate correlation ≥0.9 (Conesa et al. 2016) |
| 138 | +- WGBS: Bisulfite conversion >99%, CpG coverage ≥10× for DMRs (Foox et al. 2021) |
| 139 | +- Hi-C: Cis/trans ratio >60%, long-range cis >40% (Yardimci et al. 2019) |
| 140 | +- CUT&RUN/CUT&Tag: Different QC profiles from ChIP-seq; use suspect list (Nordin et al. 2023) |
| 141 | +- Always use 2+ biological replicates |
| 142 | +- Always apply ENCODE Blacklist v2 (Amemiya et al. 2019) |
| 143 | +- No single metric is sufficient — interpret collectively |
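The ChIP-seq cutoffs listed above can be checked together, in line with the advice to interpret metrics collectively. This is a minimal sketch under stated assumptions: FRiP is expressed as a fraction (1% = 0.01), the metric keys are lowercase names chosen for illustration, and `failing_metrics` is a hypothetical helper rather than part of the toolkit.

```python
import operator

# Thresholds from the ChIP-seq line above: FRiP >= 1%, NSC > 1.05,
# RSC > 0.8, NRF >= 0.8. FRiP is stored as a fraction here (assumption).
CHIP_SEQ_THRESHOLDS = {
    "frip": (operator.ge, 0.01),
    "nsc": (operator.gt, 1.05),
    "rsc": (operator.gt, 0.8),
    "nrf": (operator.ge, 0.8),
}


def failing_metrics(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or miss their cutoff."""
    failures = []
    for name, (passes, cutoff) in CHIP_SEQ_THRESHOLDS.items():
        value = metrics.get(name)
        # A missing metric is reported as a failure rather than skipped.
        if value is None or not passes(value, cutoff):
            failures.append(name)
    return failures
```

Returning every failing metric, instead of stopping at the first, lets a reviewer judge the whole QC profile at once.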
| 144 | + |
| 145 | +## Provenance Standard |
| 146 | + |
| 147 | +Every operation should log: tool name + version, exact command, input accessions + MD5, reference files + source + MD5, output descriptions + counts, and statistics. Scripts are stored with sequential numbering. Together, these records enable auto-generation of publication-ready methods sections. |
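A provenance record covering the fields listed above might look like the following sketch. The `ProvenanceRecord` class, its field names, and the `methods_sentence` helper are assumptions made for illustration; they do not describe the schema used by `client/tracker.py`.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """One logged operation (hypothetical structure, not the tracker schema)."""

    tool: str
    version: str
    command: str
    inputs: dict[str, str]                 # accession -> MD5
    references: dict[str, dict[str, str]]  # name -> {"source": ..., "md5": ...}
    outputs: dict[str, int]                # description -> record count
    statistics: dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize with sorted keys so logs diff cleanly between runs."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def methods_sentence(self) -> str:
        """Draft one methods-section sentence from the logged fields."""
        inputs = ", ".join(sorted(self.inputs))
        return f"{self.tool} v{self.version} was run on {inputs} with: {self.command}"


# Usage with placeholder accession, checksums, and counts.
record = ProvenanceRecord(
    tool="bedtools",
    version="2.31.0",
    command="bedtools intersect -a peaks.bed -b blacklist.bed -v",
    inputs={"ENCFF123ABC": "0" * 32},
    references={"blacklist.bed": {"source": "ENCODE Blacklist v2", "md5": "1" * 32}},
    outputs={"filtered peaks": 41250},
)
```

Keeping the exact command and checksums in one record is what makes a later methods paragraph reproducible rather than reconstructed from memory.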
| 148 | + |
| 149 | +## Cross-Database Integration |
| 150 | + |
| 151 | +This plugin works alongside the following MCP servers: |
| 152 | +- **PubMed** (search_articles) — Literature search and citation |
| 153 | +- **bioRxiv** (search_preprints) — Preprint discovery |
| 154 | +- **ClinicalTrials.gov** (search_trials) — Clinical trial cross-reference |
| 155 | +- **Open Targets** (query_open_targets_graphql) — Drug target identification |
| 156 | +- **Consensus** (search) — Academic paper search across 200M+ papers |
| 157 | + |
| 158 | +And via skills (REST API/CLI): |
| 159 | +- **UCSC Genome Browser** — cCRE tracks, TF binding, sequence retrieval via REST API |
| 160 | +- **NCBI GEO** — Complementary expression/epigenomic datasets via E-utilities |
| 161 | +- **gnomAD** — Population allele frequencies and gene constraint via GraphQL |
| 162 | +- **Ensembl** — VEP variant annotation, Regulatory Build, coordinate liftover via REST API |
| 163 | +- **NCBI SRA** — Raw sequencing reads linked from GEO (via E-utilities elink) |
| 164 | +- **GTEx** — Tissue-specific gene expression for ENCODE regulatory element interpretation via REST API |
| 165 | +- **ClinVar** — Clinical variant significance for ENCODE-identified regulatory variants via E-utilities |
| 166 | +- **CELLxGENE** — Single-cell expression context for ENCODE bulk data via REST API |
| 167 | +- **GWAS Catalog** — GWAS associations in ENCODE regulatory regions via REST API |
| 168 | +- **JASPAR** — Transcription factor binding motifs for ENCODE ChIP-seq peak analysis via REST API |