docs(retrieval): add fine-tuning guide #2306

Draft: oliverholworthy wants to merge 25 commits into main from oholworthy/docs/retrieval-finetuning-guide

@oliverholworthy
Contributor

What does this PR do?

Adds a retrieval fine-tuning guide and supporting retrieval utilities/tests for bi-encoder and cross-encoder workflows, including custom data formats, hard-negative mining, export/reload handoff, cache validation, and operational safety checks.
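
For readers unfamiliar with hard-negative mining, the idea can be sketched as follows. This is a minimal pure-Python illustration, not the recipe added in this PR: the function name `mine_hard_negatives`, its parameters, and the toy embeddings are all hypothetical. For one query, it ranks every non-positive document by cosine similarity and keeps the top few as "hard" negatives.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vec, doc_vecs, positive_ids, num_negatives=2):
    """Return ids of the highest-scoring documents that are not labeled positive.

    Hypothetical helper for illustration only.
    doc_vecs maps doc_id -> embedding; positive_ids is a set of doc ids.
    """
    candidates = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in positive_ids
    ]
    # Highest similarity first; break ties by doc id so output is deterministic.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in candidates[:num_negatives]]


# Toy data (made up): d1 is the labeled positive for this query.
query = [1.0, 0.0]
docs = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.7, 0.7],
}
hard_negatives = mine_hard_negatives(query, docs, positive_ids={"d1"})
print(hard_negatives)
```

Real pipelines typically encode queries and documents in batches with the fine-tuned encoder and often filter out candidates that score too close to the positive, since those may be unlabeled positives.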

Changelog

  • Add a retrieval fine-tuning guide covering bi-encoder and cross-encoder configs, custom data, validation, mining, retraining, export, and troubleshooting.
  • Expand retrieval dataset docs and examples for corpus-backed JSON, inline JSONL, Hugging Face sources, and mined outputs.
  • Harden hard-negative mining cache validation, relative corpus path handling, atomic output writes, distributed cache reuse, and saved encoder loading.
  • Add retrieval data utility coverage for materialization, mined-negative audits, unrolled positives, encoder metadata export, and mining cache safety.
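
As an illustration of the "atomic output writes" item above, one common pattern is to write to a temporary file in the destination directory and then rename it into place. The sketch below assumes that pattern; it is not taken from this PR's code, and `atomic_write_jsonl` is a hypothetical name.

```python
import json
import os
import tempfile


def atomic_write_jsonl(path, records):
    """Write records as JSONL so readers never observe a partially written file.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the destination directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push data to disk before the rename
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is interrupted mid-write, the destination either keeps its previous contents or does not exist yet, so a later run does not pick up a truncated mined-negatives file.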

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

Validated locally:

  • uv run ruff format ...
  • uv run ruff check --fix ...
  • uv run pytest tests/unit_tests/recipes/test_mine_hard_negatives.py tests/unit_tests/datasets/llm/test_materialize_hf_retrieval_subset.py -q
  • uv run pytest tests/unit_tests/_transformers/test_retrieval.py tests/unit_tests/recipes/test_mine_hard_negatives.py tests/unit_tests/datasets/llm/test_audit_mined_negatives.py tests/unit_tests/datasets/llm/test_materialize_hf_retrieval_subset.py tests/unit_tests/models/bi_encoder/test_llama_bidirectional_model.py -q
  • uv run --group docs sphinx-build -b dummy docs /tmp/automodel-docs-review-final-final -W --keep-going
  • git diff --check

Additional Information

  • Opened as draft while reviewers take a look.
  • Final bounded persona review found no confirmed P0/P1 blockers.
  • CI may require /ok to test <commit-sha> depending on branch trust/signature policy.

rnyak and others added 25 commits May 22, 2026 22:07
Signed-off-by: Ronay Ak <ronaya@nvidia.com>
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented May 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


2 participants