Skip to content

Commit e557ddd

Browse files
committed
feat: Add 12 new skills covering AI evaluation, model routing, and advanced RAG
Add comprehensive coverage for LLM evaluation, model routing/selection, and advanced RAG techniques: LLM Evaluation (5 new skills): - llm-benchmarks-evaluation.md (724 lines): MMLU, HellaSwag, BBH, HumanEval, TruthfulQA, GSM8K; lm-evaluation-harness, LightEval; data contamination detection - llm-evaluation-frameworks.md (921 lines): Arize Phoenix (OpenTelemetry, self-hostable, LLM evals), Braintrust (86x faster search), LangSmith (LangChain integration), Langfuse (open-source) - llm-as-judge.md (1089 lines): Pairwise/pointwise/reference-guided patterns, Prometheus 2 models (fine-tuned evaluators, BGB variant), G-Eval (GPT-4 with CoT), bias mitigation, uncertainty quantification - rag-evaluation-metrics.md (969 lines): RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall), LLM-as-judge for RAG, synthetic datasets, Arize Phoenix/Langfuse integration - custom-llm-evaluation.md (1053 lines): Domain-specific metrics (medical, legal, code), RLHF reward models, adversarial testing (jailbreaks, prompt injection), bias/toxicity detection Model Routing & Selection (3 new skills): - llm-model-routing.md (562 lines): RouteLLM (ICLR 2025, 85% cost reduction), RoRF random forest, semantic routing (vLLM, ModernBERT), rule-based routing, model strengths (Claude 3.5 HumanEval 92%, GPT-4o MMLU 88.7%, Gemini Flash 370 tok/s, DeepSeek 27.4x cheaper) - llm-model-selection.md (551 lines): 2025 model landscape (GPT-4o/o1, Claude 3.5/4, Gemini 2.5, Grok 3, DeepSeek R1/V3, LLaMA 3.3), capability matrix, pricing analysis (Premium $10-75, Mid $1-5, Budget $0.40-1 per million tokens), strategic stack approach - multi-model-orchestration.md (721 lines): Pipeline/ensemble/specialist/cascade/hybrid patterns, context management, error handling with fallback chains, Arize Phoenix multi-model tracing with span analysis Advanced RAG (4 new skills): - hybrid-search-rag.md (656 lines): Vector + BM25 fusion, Reciprocal Rank Fusion (RRF), parallel/sequential architectures, score normalization (min-max, z-score, softmax), Elasticsearch/Weaviate/Qdrant/Pinecone, 15-30% improvement benchmarks - rag-reranking-techniques.md (623 lines): Multi-stage retrieval (fast → rerank → generate), cross-encoder models (ms-marco, BGE), tensor-based reranking (ColBERT - 2024-2025 trend), LLM-as-reranker (GPT-4, Claude), Cohere Reranker API, nDCG/MAP/MRR metrics - graph-rag.md (696 lines): Microsoft GraphRAG (2024), entity extraction, Leiden community detection, hierarchical summarization, local vs global queries, multihop reasoning, SAM-RAG/ArchRAG/LightRAG variants, Neo4j/ArangoDB, 72.5% comprehensiveness for global queries - hierarchical-rag.md (694 lines): Multi-level document structures (chapter → section → paragraph), recursive summarization, parent-child chunks, top-down/bottom-up/hybrid retrieval, LlamaIndex/LangChain implementations, RAGAS hierarchical evaluation Documentation Updates: - skills/_INDEX.md: Added "LLM Evaluation & Routing (8 skills)" and "Advanced RAG (4 skills)" sections, updated totals (247 → 259 skills, 43 → 45 categories), added discovery patterns and quick reference entries - README.md: Added LLM Evaluation, Model Routing, and Advanced RAG sections, updated technology coverage matrix (ML/AI: 21 → 33 skills), updated all skill counts Key Technologies: - Evaluation: Arize Phoenix, Braintrust, LangSmith, Langfuse, Prometheus 2, G-Eval, RAGAS - Routing: RouteLLM, RoRF, vLLM Semantic Router, Unify - RAG: Elasticsearch, Weaviate, Qdrant, Pinecone, Cohere Rerank, ColBERT, Neo4j, Microsoft GraphRAG All skills include YAML frontmatter (agent-compatible), "Last Updated: 2025-10-26", code examples with 2024-2025 frameworks, anti-patterns, and related skills cross-references.
1 parent 98edf73 commit e557ddd

14 files changed

Lines changed: 9322 additions & 9 deletions

README.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Claude Code Development Reference
22

3-
A comprehensive skills library and development guidelines for working with Claude Code across 247 atomic, composable skills spanning 43 technology domains.
3+
A comprehensive skills library and development guidelines for working with Claude Code across 259 atomic, composable skills spanning 45 technology domains.
44

55
## Overview
66

@@ -125,6 +125,9 @@ This repository serves as a complete reference for software development best pra
125125
- **RFC Writing**: Structure/format, technical design, consensus building, decision documentation (4 skills)
126126

127127
**Machine Learning & AI:**
128+
- **LLM Evaluation**: Benchmarks (MMLU, HellaSwag, HumanEval), frameworks (Arize Phoenix, Braintrust, LangSmith), LLM-as-judge (Prometheus 2, G-Eval), RAGAS metrics, custom evaluation (5 skills)
129+
- **Model Routing & Selection**: RouteLLM framework, model comparison (GPT-4o, Claude, Gemini, DeepSeek), multi-model orchestration, cost optimization (3 skills)
130+
- **Advanced RAG**: Hybrid search (vector + BM25), reranking (cross-encoder, LLM-as-reranker), GraphRAG (Microsoft 2024), hierarchical retrieval (4 skills)
128131
- **DSPy Framework**: Signatures, modules, optimizers, RAG, assertions (7 skills)
129132
- **HuggingFace**: Hub, Transformers, Datasets, Spaces, AutoTrain (5 skills)
130133
- **LLM Fine-tuning**: Unsloth, LoRA/PEFT, dataset prep (3 skills)
@@ -417,7 +420,7 @@ skills/skill-creation.md # Template and guidelines
417420
| **Cloud** | AWS, GCP, Modal, Vercel, Cloudflare | 27 | Serverless, GPU, edge, compute, storage, networking |
418421
| **Database** | Postgres, Mongo, Redis, Redpanda, Iceberg, DuckDB | 11 | OLTP, NoSQL, streaming, analytics |
419422
| **Caching** | Redis, HTTP, CDN (Cloudflare/Fastly/CloudFront), Service Workers | 7 | Multi-layer caching, invalidation, performance monitoring |
420-
| **ML/AI** | DSPy, HuggingFace, Unsloth, Diffusion | 21 | LLM orchestration, fine-tuning, image generation, model hub |
423+
| **ML/AI** | DSPy, HuggingFace, Unsloth, Arize Phoenix, Prometheus, GraphRAG | 33 | LLM orchestration, evaluation (benchmarks, LLM-as-judge, RAGAS), model routing, advanced RAG (hybrid, reranking, GraphRAG), fine-tuning |
421424
| **IR** | Elasticsearch, Vector DBs, Ranking, Recommenders | 5 | Search, semantic retrieval, recommendations |
422425
| **Systems** | WebAssembly, eBPF | 8 | Browser/server wasm, observability, networking, security |
423426
| **Collaboration** | GitHub, PRD, RFC | 17 | Repository management, product specs, technical design |
@@ -432,8 +435,8 @@ skills/skill-creation.md # Template and guidelines
432435

433436
All skills are validated through automated CI/CD pipelines:
434437

435-
-**Code Block Validation**: 1000+ Python/Swift/TypeScript/Bash/C/Java blocks syntax-checked
436-
-**Frontmatter Validation**: 247 skills with proper YAML frontmatter
438+
-**Code Block Validation**: 1100+ Python/Swift/TypeScript/Bash/C/Java blocks syntax-checked
439+
-**Frontmatter Validation**: 259 skills with proper YAML frontmatter
437440
-**Date Validation**: All "Last Updated" dates verified (no future dates)
438441
-**Format Compliance**: Atomic skill guidelines enforced (~250-500 lines)
439442
-**Cross-References**: Related skills linked for discoverability
@@ -451,4 +454,4 @@ Feel free to fork and adapt for your own use.
451454

452455
---
453456

454-
**Total: 247 atomic skills** | **Average: 380 lines/skill** | **43 categories** | **100% CI-validated**
457+
**Total: 259 atomic skills** | **Average: 420 lines/skill** | **45 categories** | **100% CI-validated**

skills/_INDEX.md

Lines changed: 55 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -533,6 +533,45 @@ This index catalogs all atomic skills available in the skills system, organized
533533

534534
---
535535

536+
### Machine Learning - LLM Evaluation & Routing (8 skills)
537+
538+
| Skill | Use When | Lines |
539+
|-------|----------|-------|
540+
| `ml/llm-benchmarks-evaluation.md` | MMLU, HellaSwag, BBH, HumanEval benchmarks; lm-evaluation-harness, LightEval | ~724 |
541+
| `ml/llm-evaluation-frameworks.md` | Arize Phoenix (OpenTelemetry), Braintrust, LangSmith, Langfuse observability | ~921 |
542+
| `ml/llm-as-judge.md` | Pairwise/pointwise/reference-guided eval; Prometheus 2, G-Eval, bias mitigation | ~1089 |
543+
| `ml/rag-evaluation-metrics.md` | RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision/Recall) | ~969 |
544+
| `ml/custom-llm-evaluation.md` | Domain-specific metrics, RLHF, adversarial testing, bias evaluation | ~1053 |
545+
| `ml/llm-model-routing.md` | RouteLLM, RoRF, semantic routing; GPT-4o vs Claude vs Gemini routing | ~562 |
546+
| `ml/llm-model-selection.md` | 2025 model comparison, capability matrix, strategic stack approach | ~551 |
547+
| `ml/multi-model-orchestration.md` | Pipeline/ensemble/cascade patterns, Arize Phoenix multi-model tracing | ~721 |
548+
549+
**Common workflows:**
550+
- Evaluation setup: `llm-benchmarks-evaluation.md``llm-evaluation-frameworks.md` (Arize Phoenix)
551+
- RAG evaluation: `rag-evaluation-metrics.md``llm-as-judge.md``llm-evaluation-frameworks.md`
552+
- Custom evaluation: `custom-llm-evaluation.md``llm-as-judge.md` (Prometheus 2) → `llm-evaluation-frameworks.md`
553+
- Model routing: `llm-model-selection.md``llm-model-routing.md``multi-model-orchestration.md`
554+
- Cost optimization: `llm-model-selection.md` (pricing) → `llm-model-routing.md` (85% reduction)
555+
556+
---
557+
558+
### Machine Learning - Advanced RAG (4 skills)
559+
560+
| Skill | Use When | Lines |
561+
|-------|----------|-------|
562+
| `ml/hybrid-search-rag.md` | Vector + BM25 fusion, RRF, Elasticsearch/Weaviate/Qdrant/Pinecone | ~656 |
563+
| `ml/rag-reranking-techniques.md` | Cross-encoder, tensor-based, LLM-as-reranker; Cohere/BGE/ms-marco | ~623 |
564+
| `ml/graph-rag.md` | Microsoft GraphRAG, entity extraction, community detection, multihop reasoning | ~696 |
565+
| `ml/hierarchical-rag.md` | Multi-level structures, recursive summarization, parent-child chunks | ~694 |
566+
567+
**Common workflows:**
568+
- Hybrid RAG: `dspy-rag.md``hybrid-search-rag.md``rag-reranking-techniques.md`
569+
- Advanced RAG pipeline: `hybrid-search-rag.md``rag-reranking-techniques.md``rag-evaluation-metrics.md`
570+
- GraphRAG: `graph-rag.md``llm-as-judge.md` (quality eval) → `rag-evaluation-metrics.md`
571+
- Hierarchical docs: `hierarchical-rag.md``rag-reranking-techniques.md``rag-evaluation-metrics.md`
572+
573+
---
574+
536575
### Diffusion Models (3 skills)
537576

538577
| Skill | Use When | Lines |
@@ -852,6 +891,9 @@ This index catalogs all atomic skills available in the skills system, organized
852891
**Heroku:** Search `deployment/heroku-*.md`
853892
**Netlify:** Search `deployment/netlify-*.md`
854893
**LLM Fine-tuning:** Search `ml/unsloth-*.md`, `ml/huggingface-*.md`, `ml/llm-*.md`, `ml/lora-*.md`
894+
**LLM Evaluation:** Search `ml/llm-benchmarks-*.md`, `ml/llm-evaluation-*.md`, `ml/llm-as-judge.md`, `ml/rag-evaluation-*.md`, `ml/custom-llm-*.md`
895+
**LLM Routing:** Search `ml/llm-model-routing.md`, `ml/llm-model-selection.md`, `ml/multi-model-*.md`
896+
**Advanced RAG:** Search `ml/hybrid-search-*.md`, `ml/rag-reranking-*.md`, `ml/graph-rag.md`, `ml/hierarchical-rag.md`
855897
**DSPy Framework:** Search `ml/dspy-*.md`
856898
**Diffusion Models:** Search `ml/diffusion-*.md`, `ml/stable-diffusion-*.md`
857899
**Advanced Mathematics:** Search `math/*.md`, `math/graph/*.md` | Numerical: `math/linear-algebra-*.md`, `math/optimization-*.md`, `math/numerical-*.md`, `math/probability-*.md` | Pure math: `math/topology-*.md`, `math/category-theory-*.md`, `math/differential-equations.md`, `math/abstract-algebra.md`, `math/set-theory.md`, `math/number-theory.md` | Graph theory: `math/graph/graph-theory-fundamentals.md`, `math/graph/graph-data-structures.md`, `math/graph/graph-traversal-algorithms.md`, `math/graph/shortest-path-algorithms.md`, `math/graph/minimum-spanning-tree.md`, `math/graph/network-flow-algorithms.md`, `math/graph/advanced-graph-algorithms.md`, `math/graph/graph-applications.md`
@@ -1234,6 +1276,13 @@ This index catalogs all atomic skills available in the skills system, organized
12341276
| Fine-tune LLM | llm-dataset-preparation.md, unsloth-finetuning.md, lora-peft-techniques.md | 1→2→3 |
12351277
| Build DSPy QA system | dspy-setup.md, dspy-signatures.md, dspy-modules.md, dspy-optimizers.md | 1→2→3→4 |
12361278
| Build DSPy RAG pipeline | dspy-setup.md, dspy-rag.md, dspy-optimizers.md, dspy-evaluation.md | 1→2→3→4 |
1279+
| Evaluate LLM with benchmarks | llm-benchmarks-evaluation.md, llm-evaluation-frameworks.md | 1→2 |
1280+
| Setup LLM evaluation pipeline | llm-evaluation-frameworks.md (Arize Phoenix), llm-as-judge.md, rag-evaluation-metrics.md | 1→2→3 |
1281+
| Evaluate RAG system | rag-evaluation-metrics.md (RAGAS), llm-as-judge.md, llm-evaluation-frameworks.md | 1→2→3 |
1282+
| Route between multiple LLMs | llm-model-selection.md, llm-model-routing.md, multi-model-orchestration.md | 1→2→3 |
1283+
| Build hybrid search RAG | hybrid-search-rag.md, rag-reranking-techniques.md, rag-evaluation-metrics.md | 1→2→3 |
1284+
| Build GraphRAG system | graph-rag.md, llm-as-judge.md (quality eval), rag-evaluation-metrics.md | 1→2→3 |
1285+
| Build hierarchical RAG | hierarchical-rag.md, rag-reranking-techniques.md, rag-evaluation-metrics.md | 1→2→3 |
12371286
| Fine-tune diffusion model | diffusion-model-basics.md, diffusion-finetuning.md, stable-diffusion-deployment.md | 1→2→3 |
12381287
| Deploy to Heroku | heroku-deployment.md, heroku-addons.md | 1→2 |
12391288
| Deploy to Netlify | netlify-deployment.md, netlify-functions.md | 1→2 |
@@ -1255,8 +1304,8 @@ This index catalogs all atomic skills available in the skills system, organized
12551304

12561305
## Total Skills Count
12571306

1258-
- **247 atomic skills** across 43 categories
1259-
- **Average 350 lines** per skill
1307+
- **259 atomic skills** across 45 categories
1308+
- **Average 380 lines** per skill
12601309
- **100% focused** - each skill has single clear purpose
12611310
- **Cross-referenced** - related skills linked for discoverability
12621311

@@ -1297,10 +1346,12 @@ This index catalogs all atomic skills available in the skills system, organized
12971346
- Heroku: 3 skills
12981347
- Netlify: 3 skills *(moved from Specialized Domains)*
12991348

1300-
**Machine Learning & AI** (21 skills):
1349+
**Machine Learning & AI** (33 skills):
13011350
- LLM Fine-tuning: 4 skills
13021351
- HuggingFace: 5 skills (Hub, Transformers, Datasets, Spaces, AutoTrain)
13031352
- DSPy Framework: 7 skills
1353+
- LLM Evaluation & Routing: 8 skills (Benchmarks, Frameworks, LLM-as-judge, RAGAS, Custom eval, Routing, Selection, Orchestration)
1354+
- Advanced RAG: 4 skills (Hybrid search, Reranking, GraphRAG, Hierarchical)
13041355
- Diffusion Models: 3 skills
13051356
- Information Retrieval: 5 skills (Search, Vector Search, Ranking, Recommendations, Query Understanding)
13061357

@@ -1337,5 +1388,5 @@ See `MIGRATION_GUIDE.md` for detailed mapping.
13371388
---
13381389

13391390
**Last Updated:** 2025-10-26
1340-
**Total Skills:** 247
1391+
**Total Skills:** 259
13411392
**Format Version:** 1.0 (Atomic)

0 commit comments

Comments
 (0)