Releases: justi/ruby_llm-contract

v0.10.1 - gem packaging hygiene (supersedes 0.10.0)

01 Jun 17:28

Patch release fixing gem packaging. 0.10.0 was yanked from rubygems.org; 0.10.1 is the recommended upgrade target. No code behavior change vs 0.10.0.

Fixed

  • Gem no longer ships internal tracker / dev configs. Excluded from spec.files: TODO.md, .rspec, .rubycritic.yml, .simplecov, and the .revive/ directory. Pre-0.10.1 the published gem contained these files; adopters who already extracted 0.10.0 can safely delete them.

Also in this release (from 0.10.0)

This is the first publish to rubygems since 0.8.0. The 0.10.0 changelog entry consolidates work that was tagged but never published as 0.9.0 (multimodal input) and 0.9.1 (internal quality refactor). Highlights:

  • Breaking: validate("") / invariant("") raise ArgumentError at definition
  • Added: multimodal input via context: { attachment: ... } + attachment_token_estimate(n) macro
  • Behavioural change: max_cost / max_input with attachment but no attachment_token_estimate → :limit_exceeded (fail-closed)
  • Anti-facade audit: 89/89 spec files under full per-test walk; +30 strengthened tests

Full details: CHANGELOG.md

Full diff since 0.8.0: v0.8.0...v0.10.1

v0.10.0 - validate/invariant require non-empty descriptions

01 Jun 15:21

⚠️ This version was yanked from rubygems.org due to a gem packaging issue (internal files leaked into the gem). Use v0.10.1 instead — it has the same code behavior, just clean packaging.

The release notes below are kept for historical reference; the consolidated entry covering all changes since 0.8.0 lives in the v0.10.1 CHANGELOG.


Breaking changes

  • validate(description, &block) and Definition#invariant(description, &block) now raise ArgumentError when description is nil or empty. Pre-0.10.0 the empty descriptor was silently accepted and produced "" entries in result.validation_errors, making debugging impossible. Codex audit found zero production use sites across lib/, examples/, README - only the regression-marker test certifying the bug.

Migration

Ensure every validate / invariant call has a non-empty descriptor (this is already how every README example writes them):

# Before (silently accepted, produced "" in validation_errors):
validate("") { |o| o[:score].between?(0, 100) }

# After (required):
validate("score in range 0-100") { |o| o[:score].between?(0, 100) }
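The definition-time guard can be pictured with a small stand-in DSL. This is only a sketch: `TinyDefinition` and its internals are hypothetical and are not the gem's implementation. Only the `validate(description, &block)` shape, the `ArgumentError`, and the nil-or-empty rule come from the notes above.

```ruby
# Hypothetical stand-in for a contract definition; not the gem's actual class.
class TinyDefinition
  attr_reader :validations

  def initialize
    @validations = []
  end

  # Rejects nil and empty descriptions when the rule is defined,
  # so a blank label can never reach the error list later.
  def validate(description, &block)
    if description.nil? || description.to_s.empty?
      raise ArgumentError, "validate requires a non-empty description"
    end

    @validations << [description, block]
  end

  # Returns the descriptions of every validation the output fails.
  def validation_errors(output)
    @validations.reject { |_, check| check.call(output) }.map(&:first)
  end
end

definition = TinyDefinition.new
definition.validate("score in range 0-100") { |o| o[:score].between?(0, 100) }

errors = definition.validation_errors({ score: 140 })
# errors => ["score in range 0-100"]

begin
  definition.validate("") { |o| o[:score] }
rescue ArgumentError => e
  blank_error = e.message
end
# blank_error => "validate requires a non-empty description"
```

Failing at definition time means the mistake surfaces when the class loads rather than as an unlabeled entry in a later result.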

Changed

  • run_eval (no args) return shape pinned to Hash<String, Report> keyed by eval name. Documents the existing contract used by RubyLLM::Contract::RakeTask#collect_host_reports and adopters. No runtime change vs 0.8.0 / 0.9.x - only the spec assertion now locks the shape.
  • Parser.parse(text, strategy: :json) first-bracket-wins boundary documented. Extraction commits to the first balanced { or [ structure and does NOT retry on later candidates. Empty {} followed by real JSON parses as the empty Hash; non-JSON {braces} before real JSON raises ParseError. No runtime change - this codifies long-standing behavior with explicit boundary tests.
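The first-bracket-wins boundary described in the last item can be sketched as follows. This is an illustration under assumptions, not the gem's parser: `first_balanced_span`, `parse_first_json`, and `SketchParseError` are hypothetical names, and the string-aware scanning is one plausible way to find a balanced structure.

```ruby
require "json"

# Hypothetical error class standing in for the gem's ParseError.
class SketchParseError < StandardError; end

# Returns the first balanced {...} or [...] span in text, or nil if none closes.
def first_balanced_span(text)
  start = text.index(/[{\[]/)
  return nil unless start

  depth = 0
  in_string = false
  escaped = false

  text[start..].each_char.with_index do |char, offset|
    if in_string
      # Inside a JSON string, brackets do not count toward nesting depth.
      if escaped
        escaped = false
      elsif char == "\\"
        escaped = true
      elsif char == '"'
        in_string = false
      end
      next
    end

    case char
    when '"' then in_string = true
    when "{", "[" then depth += 1
    when "}", "]"
      depth -= 1
      return text[start, offset + 1] if depth.zero?
    end
  end

  nil
end

# Commits to the first candidate: a parse failure raises instead of
# retrying on a later bracket.
def parse_first_json(text)
  span = first_balanced_span(text)
  raise SketchParseError, "no JSON structure found" unless span

  JSON.parse(span)
rescue JSON::ParserError => e
  raise SketchParseError, "first candidate is not valid JSON: #{e.message}"
end

parse_first_json('Result: {} then {"a": 1}') # => {}
```

Under this sketch, `parse_first_json('see {braces} then {"a": 1}')` raises `SketchParseError`, because the first candidate `{braces}` is not valid JSON and the later object is never tried.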

Internal

  • Suite-wide anti-facade audit complete: 89/89 spec files under per-test 17-mode walk (Phase A: 26 specs, Phase C: 63 specs via parallel Codex fan-out). Net +30 strengthened tests against mutation-blind assertions, zero public API change beyond the breaking entry above.

Tests

  • Suite: 1401 examples / 0 failures / 7 pending (was 1371/0/8 at 0.9.1).

v0.9.1 — internal quality refactor

31 May 17:46

Pre-release

⚠️ This version was never published to rubygems.org. The code from this tag shipped as part of v0.10.1 (which consolidates 0.9.0 + 0.9.1 + 0.10.0 work into a single published release). Use v0.10.1 — the content below is kept for historical reference.


Internal quality refactor — zero public API change.

Patch-version bump. Adopters pinning ~> 0.9.0 pick up the safer code path automatically on next bundle update.

Fixed

  • Concurrency hazard in optimize_retry_policy: with_retry_disabled used to mutate the step class's singleton retry_policy method around compare_models, then restored it in an ensure. Two parallel optimizer calls on the same step class would race. Refactored to pass retry_policy_override: nil through context: (the existing well-supported override path). [Codex finding #3]

  • CostCalculator.find_model exposed as public: Step::Base#estimate_cost used to bypass private_class_method with CostCalculator.send(:find_model, name). Visibility-bypass via send is invisible to grep-for-callers and fragile under refactor. Now public; the estimated_cost_for helper is gone, and estimate_cost routes through the existing public CostCalculator.calculate(model_name:, usage:). [Codex finding #2]

  • stub_step unified on a single storage path — Block and non-block forms used to write to different state (thread-local hash vs RSpec allow/receive). Both now use RubyLLM::Contract.step_adapter_overrides (thread-local); cleanup between examples is handled by the existing around(:each) hook in rspec.rb. [Codex finding #8]
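The concurrency fix in the first item above replaces shared class-level mutation with a per-call value. A minimal sketch of that idea, assuming a hypothetical `SketchStep` class and assuming that a present `:retry_policy_override` key with a nil value means "no retries for this call" (only the context key name comes from the notes):

```ruby
# Hypothetical step class; not the gem's Step::Base.
class SketchStep
  def self.retry_policy
    { attempts: 3 }
  end

  # Resolves the policy per call. A present key wins even when its value is
  # nil, which this sketch treats as "retries disabled for this call".
  def self.effective_retry_policy(context)
    return context[:retry_policy_override] if context.key?(:retry_policy_override)

    retry_policy
  end

  def self.run(input, context: {})
    policy = effective_retry_policy(context)
    { input: input, attempts: policy ? policy[:attempts] : 1 }
  end
end

# Two parallel callers: one disables retries, one uses the class default.
# Neither touches class-level state, so there is nothing to race on.
threads = [
  Thread.new { SketchStep.run("a", context: { retry_policy_override: nil }) },
  Thread.new { SketchStep.run("b") }
]
results = threads.map(&:value)
# results => [{ input: "a", attempts: 1 }, { input: "b", attempts: 3 }]
```

Because the override travels with the call, there is no restore step that could run out of order.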

Internal

  • Dead code removal in concerns/eval_host.rb — The ObjectSpace.each_object(Class) fallback in register_subclasses was unreachable on every supported runtime: the gemspec requires Ruby >= 3.2.0 and Class#subclasses ships in 3.1+. Dropped. [Codex finding #7]

Test discipline

  • Characterization-first — every refactor wrote tests pinning the current behaviour BEFORE touching production code, then replaced with new contract tests after.
  • Suite: 1346 examples / 0 failures / 8 pending (+12 net new tests vs 0.9.0).

Refactor backlog (deferred to 0.10.0+)

Documented in TODO.md. Remaining Codex findings:

  • Runner.new 17 kwargs → RunnerConfig factory (Long Parameter List / Shotgun Surgery)
  • DSL inheritance walk DRY + UNSET sentinel (~200 LOC duplication across DSL accessors; one coordinated PR for step/dsl.rb)
  • RakeTask#define_task god method → SuiteGate value object extraction

Each requires more invasive surgery on shared surfaces and benefits from dedicated focus rather than tail-end inclusion in a patch.


Full changelog: CHANGELOG.md
Diff: v0.9.0 → v0.9.1

v0.9.0 — multimodal input

31 May 17:47

Pre-release

⚠️ This version was never published to rubygems.org. The code from this tag shipped as part of v0.10.1 (which consolidates 0.9.0 + 0.9.1 + 0.10.0 work into a single published release). Use v0.10.1 — the content below is kept for historical reference.


Multimodal input — route PDFs/images/audio through your contract.

First adopter-driven feature after the 0.8 narrative repositioning. Attachments now travel via Step.run(input, context: { attachment: ... }); max_cost, validate, retry_policy escalate(...), and trace observability still apply.

Added

  • Step.run(input, context: { attachment: file_or_io_or_url }) — adapter forwards chat.ask(content, with: attachment). RubyLLM ≥ 1.15 normalises wire format per provider (Anthropic url/base64, OpenAI image_url/file/input_audio, Gemini inline_data). Multi-attachment supported natively (with: [...] or with: { images: [...], pdfs: [...] }).

  • attachment_token_estimate(n) class macro — adopter-declared conservative estimate of attachment input tokens. Applied to BOTH runtime check_limits AND pre-flight estimate_cost — single source of truth, no estimate/runtime drift. Inherits from superclass, supports :default reset.

  • on_unknown_attachment_size(:refuse | :warn) class macro — mirrors on_unknown_pricing opt-out semantics. Default :refuse. Per-step only — never settable as global default.

  • estimate_cost(input:, model:, attachment: nil) — new kwarg, adds attachment tokens to input_tokens, same fail-closed behaviour as runtime.

Behavioural change — read before upgrading

Contracts with max_cost or max_input set AND now receiving context[:attachment] AND no attachment_token_estimate declared will refuse with :limit_exceeded. The gem cannot bound vision/PDF cost without an adopter-declared estimate. Opt out per-step with on_unknown_attachment_size :warn. Text-only contracts are unaffected.
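The fail-closed rule above can be sketched as a small decision function. This is an illustration, not the gem's `check_limits`: the function name, its keyword arguments, and the returned hash shape are hypothetical, while `:limit_exceeded`, `:refuse`, and `:warn` follow the notes.

```ruby
# Hypothetical limit check illustrating the fail-closed behaviour.
def check_attachment_limits(max_input:, text_tokens:, attachment:,
                            attachment_token_estimate: nil,
                            on_unknown_attachment_size: :refuse)
  warnings = []
  return { status: :ok, input_tokens: text_tokens, warnings: warnings } if max_input.nil?

  attachment_tokens = 0
  if attachment
    if attachment_token_estimate
      attachment_tokens = attachment_token_estimate
    elsif on_unknown_attachment_size == :refuse
      # No declared estimate: the cost cannot be bounded, so refuse.
      return { status: :limit_exceeded, reason: "attachment size unknown", warnings: warnings }
    else
      warnings << "attachment size unknown; limit checked against text tokens only"
    end
  end

  total = text_tokens + attachment_tokens
  { status: total > max_input ? :limit_exceeded : :ok, input_tokens: total, warnings: warnings }
end

check_attachment_limits(max_input: 1000, text_tokens: 200, attachment: "report.pdf")[:status]
# => :limit_exceeded
```

Declaring an estimate (here a made-up `attachment_token_estimate: 500`) turns the same call into a normal comparison of 700 tokens against the 1000-token limit.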

Docs

  • New guide: Multimodal input (PDF / image / audio) — adopter walkthrough, attachment_token_estimate calibration table per provider, fail-closed semantics, testing recipe.
  • New README FAQ entry: "I upgraded to 0.9 and my contract started refusing — why?"

Deferred (not in 0.9.0)

  • add_history multi-turn replay of prior attachments (single-turn supported; follow-up on same document deferred).
  • Streaming + attachment (contract steps remain synchronous).
  • Provider-specific attachment size caps (surface only via attachment_token_estimate calibration).

Tests

  • 22 new specs in spec/ruby_llm/contract/step/multimodal_input_spec.rb covering DSL inheritance/reset/validation, adapter pass-through, runtime fail-closed (refuse + warn modes), estimate parity, multi-attachment array/hash routing, and RubyLLM 1.15 with: nil no-op contract.
  • 4 existing adapter specs updated to expect chat.ask(..., with: nil).
  • Full suite: 1336 examples / 0 failures.

Full changelog: CHANGELOG.md
Diff: v0.8.0 → v0.9.0

0.8.0 — Contracts + Evals for RubyLLM

26 Apr 16:47
3ada33a

Narrative repositioning + small API additions. Internal architecture unchanged: no Step::Base refactor, no breaking changes to existing DSL.

Added

  • thinking(effort:, budget:) class macro on Step::Base — mirrors RubyLLM::Agent.thinking signature exactly. Stored as { effort:, budget: } hash; reader returns the hash; supports :default reset semantics; superclass inheritance like model/temperature. The convenience alias reasoning_effort(:low) is implemented as thinking(effort: :low) — single normalized state, not separate ivar.
  • Adapter wiring for with_thinking — when thinking is set on the Step class, OR when reasoning_effort: is passed through context, OR when an attempt config in retry_policy escalate(...) carries reasoning_effort:, the RubyLLM adapter resolves the effective { effort:, budget: } hash and forwards it via chat.with_thinking(**) — provider-agnostic (supports OpenAI reasoning_effort AND Anthropic extended-thinking budget). Precedence: per-attempt / context reasoning_effort overrides class-level thinking[:effort]; budget is taken from class-level thinking[:budget]. Behavioural change vs 0.7.x: reasoning_effort is now forwarded via with_thinking instead of with_params. Same wire-level OpenAI parameter; provider-agnostic Anthropic support is now automatic.
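The precedence rule in the adapter-wiring item can be sketched as a pure function. `resolve_thinking` is a hypothetical helper, not part of the gem. The notes do not say whether a per-attempt value beats a context value when both are present, so checking the attempt first is an assumption of this sketch.

```ruby
# Hypothetical resolver for the effective { effort:, budget: } hash.
# Assumption: a per-attempt reasoning_effort is checked before the context one.
def resolve_thinking(class_thinking:, context: {}, attempt: {})
  effort = attempt[:reasoning_effort] ||
           context[:reasoning_effort] ||
           class_thinking&.dig(:effort)
  budget = class_thinking&.dig(:budget) # budget only ever comes from the class level

  resolved = { effort: effort, budget: budget }.compact
  resolved.empty? ? nil : resolved
end

resolve_thinking(
  class_thinking: { effort: :low, budget: 2048 },
  context: { reasoning_effort: :high }
)
# => { effort: :high, budget: 2048 }
```

Returning nil when nothing is configured lets a caller skip the `with_thinking` call entirely instead of forwarding an empty hash.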

Dependencies

  • ruby_llm constraint bumped from ~> 1.0 to ~> 1.12: Chat#with_thinking is the canonical path for reasoning effort + extended thinking; it shipped in RubyLLM 1.12. Adopters on ruby_llm < 1.12 need to bump RubyLLM before upgrading this gem to 0.8.0.

Changed

  • Tagline + README opening — repositioned around "Contracts + Evals for RubyLLM". New "Relation to RubyLLM::Agent" section explicitly frames Step as a sibling abstraction (same niche as Agent, wider contract), not an alternative or foundation. README does not claim "Step uses Agent under the hood" — current call path is Step → Runner → Adapters::RubyLLM → RubyLLM.chat directly.
  • TokenEstimator documented as heuristic — module docstring expanded with explicit "±30% accuracy" framing. Refusal messages from LimitChecker now include (heuristic ±30%) suffix so adopters know the pre-flight number is estimated, not measured. RubyLLM 1.14 also has no pre-flight tokenizer; RubyLLM::Tokens is post-hoc only.
  • CostCalculator repositioned in docs — module narrative reframed from "cost calculator" to "fine-tune pricing registry + lookup with fallback chain". Math methods (compute_cost, token_cost, etc.) were already private; this release makes the docs match. Public API surface unchanged: register_model, unregister_model, reset_custom_models!, calculate.
  • output_schema reframed in docs — described as "wrapper around RubyLLM::Schema + client-side validation step", not a standalone feature. The schema language is identical to what RubyLLM::Agent.schema accepts; the difference is what wraps it.
  • README retry framing: retry_policy escalate(...) (model escalation on validation failure) is the marketed default. retry_policy attempts: N (same-model retry) stays in the API for backward compat and niche cases (subjective criteria, multi-step pipelines, weaker models) but is no longer marketed as a recommended default. Empirical basis: four small experiments across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation found no useful lift for nano-class models on tasks with clear correctness criteria.

Documentation

  • New disambiguation paragraphs in prompt_ast.md (Step.input_type vs RubyLLM::Agent.inputs; Prompt::Builder multi-role DSL vs Agent ERB single-string template loader), testing.md (Step.observe vs Chat#on_end_message / on_tool_call), output_schema.md (relation to Agent.schema), and optimizing_retry_policy.md (orthogonal model + thinking dimensions).
  • getting_started.md refusal message example updated to include the new (heuristic ±30%) suffix.

Issues closed

  • #11 (Optimizer is blind to same-model attempts) — closed after empirical experiments. attempts: N retry stays in API; not marketed as a default.
  • #6 (Production cost reporting) — already implemented in 0.7.x; close confirmed.

Not in this release (deferred)

  • output_schema Proc form for runtime-input-aware schemas (parity with Agent.schema Proc form). Additive, low-risk; deferred to 0.9 to keep 0.8 scope tight.
  • H4 (Step composing RubyLLM::Agent internally as config holder) — verified feasible but ROI insufficient for current adopter base; trigger-based revisit, no calendar commitment.

0.7.3 — Adoption-friction release (docs + examples consolidation)

24 Apr 04:59
c101cfa

0.7.3 (2026-04-24)

Adoption-friction release. No runtime behavior changes — every delta is in docs/, examples/, or spec/integration/ (plus the version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.

Documentation

  • New guide: docs/guide/why.md — four production failure modes the gem exists for (schema-valid logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
  • New guide: docs/guide/rails_integration.md — seven Rails-specific FAQs with runnable snippets: where step classes live (app/contracts/), initializer setup, background jobs, around_call observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
  • README adoption-friction pass — added a short "Do I need this?" block after Install, a reading-order hint (README → why.md → getting_started.md), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
  • TL;DR box at the top of every guide — single-sentence orientation for readers who land via search; "Skip if" clause added where real confusion exists (eval_first.md, testing.md, migration.md).
  • API coverage gaps closed: estimate_cost / estimate_eval_cost, max_cost on_unknown_pricing: :warn, run_eval(..., concurrency:), around_call testing patterns now documented in getting_started.md, eval_first.md, testing.md.
  • Industry-standard terminology: temperature-locked → fixed-temperature, variance-induced → sampling variance, severity signals → severity keywords, takeaway drift → tone/takeaways mismatch.
  • docs/architecture.md refresh — diagram now reflects the current class layout: added Step::RetryPolicy, Pipeline::Result, Eval::AggregatedReport, Eval::BaselineDiff, Eval::PromptDiffComparator, Eval::EvalHistory, Eval::RetryOptimizer, OptimizeRakeTask. Replaced the outdated Eval::TraitEvaluator entry with Eval::ExpectationEvaluator.
  • Business framing added to guides — every guide opens with a concrete production scenario or "why it matters" hook before the API reference.

Examples — consolidated on SummarizeArticle, renumbered 00-06

The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's SummarizeArticle case.

#   File                            Answers
00  00_basics.rb                    How do I start? (seven incremental layers + real-LLM pointer)
01  01_fallback_showcase.rb         Show me the gem in 30 seconds (zero API keys)
02  02_real_llm_minimal.rb          How do I plug in a real LLM? (~30 lines)
03  03_summarize_with_keywords.rb   How does the contract evolve? (growing prompt)
04  04_summarize_and_translate.rb   Pipeline composition + pipeline-level run_eval
05  05_eval_dataset.rb              How do I stop silent prompt regressions?
06  06_retry_variants.rb            attempts: 3, reasoning_effort escalation, cross-provider (Ollama → Anthropic → OpenAI)

Every file carries an "Expected output" block in its header so readers see the result without running the script. The docs/ideas/ directory is now fully untracked (already in .gitignore; one stray file removed from version control).

Examples — bug fixes carried along

  • Schema pitfall fixed in 5 files: array :x do; string :y; ...; end silently produces items: string and drops every declaration after the first, matching the documented pitfall in spec/ruby_llm/contract/nested_schema_spec.rb:71. Every affected array block is now wrapped in object do...end.
  • examples/05_eval_dataset.rb (pre-renumber: 09_eval_dataset.rb): result[:passed] → result.passed?. The previous code called [] on an Eval::CaseResult and raised NoMethodError at runtime.

Testing

  • New spec/integration/pipeline_eval_spec.rb — three cases guaranteeing pipeline-level run_eval stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate validate rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts step_status == :validation_failed and the validate's label in details, so a regression that short-circuits on schema instead of validate would fail loudly.

Deleted (private-project cleanup)

  • examples/01_classify_threads.rb, 02_generate_comment.rb, 03_target_audience.rb, 10_reddit_full_showcase.rb, spec/integration/reddit_pipeline_spec.rb — Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
  • examples/02_output_schema.rb — fully covered by docs/guide/output_schema.md; deleting avoids duplication.

0.7.1 — Narrow run_once ArgumentError rescue

22 Apr 09:13
3fe3c86

Behavioral change (follow-up to v0.7.0)

Closes the known limitation called out in the v0.7.0 CHANGELOG.

Before: Step::Base#run_once wrapped the entire Runner chain in rescue ArgumentError to convert DSL misconfiguration (e.g. prompt has not been set) into :input_error. Side effect: any ArgumentError raised from adapter code during Runner#call — wrong arity, bad config arg, any programmer bug — was silently coerced into :input_error and re-tried as if the user had supplied bad input.

After: the rescue is scoped to the Runner-construction phase only. DSL configuration errors still produce :input_error (the prompt has not been set case is regression-tested). ArgumentError raised during Runner#call propagates to the caller.

Input-type validation failures continue to produce :input_error via InputValidator's own scoped rescue (Dry::Types::CoercionError, TypeError, ArgumentError around the type-check boundary) — unchanged.
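The before/after difference comes down to where the rescue sits. A minimal sketch, assuming a hypothetical `SketchRunner` and a free-standing `run_once` (the gem's real classes are more involved; only the `:input_error` status and the "prompt has not been set" message come from the notes):

```ruby
# Hypothetical runner; construction validates DSL configuration.
class SketchRunner
  def initialize(prompt:, adapter:)
    raise ArgumentError, "prompt has not been set" if prompt.nil?

    @prompt = prompt
    @adapter = adapter
  end

  def call(input)
    { status: :ok, output: @adapter.call(@prompt, input) }
  end
end

def run_once(input, prompt:, adapter:)
  begin
    # Only the construction phase is covered by the rescue.
    runner = SketchRunner.new(prompt: prompt, adapter: adapter)
  rescue ArgumentError => e
    return { status: :input_error, error: e.message }
  end

  # An ArgumentError raised in here is a programmer bug and propagates.
  runner.call(input)
end

run_once("hi", prompt: nil, adapter: ->(_prompt, _input) { "ok" })
# => { status: :input_error, error: "prompt has not been set" }
```

With a prompt set and an adapter that raises `ArgumentError`, the same `run_once` lets the exception reach the caller instead of returning `:input_error`.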

Why it matters

v0.7.0's narrative was "programmer errors propagate, provider errors become :adapter_error". AdapterCaller already respected that (narrowed to RubyLLM::Error + Faraday::Error). But run_once's broader rescue ArgumentError was a backdoor that let adapter-code ArgumentError bugs slip back into :input_error and become retry targets.

This release closes that backdoor. Programmer bugs raised during an adapter call now surface loudly instead of being disguised as "user gave bad input".

Compatibility

Technically a behavioral change — callers previously relying on adapter-code ArgumentError to produce :input_error results will now see the exception propagate. If your adapter deliberately raises ArgumentError for expected validation flows, wrap that in RubyLLM::Error (becomes :adapter_error, respected by retry) or add explicit handling at the call site.

Test plan

  • bundle exec rspec — 1341 examples, 0 failures, 8 pending (all pending are API-key-gated live LLM tests).
  • New regression specs in retry_integration_spec.rb:
    • propagates ArgumentError from adapter code (programmer bug, not bad input) — adapter raising ArgumentError now propagates.
    • still converts DSL misconfiguration ArgumentError to :input_error (prompt missing): prompt has not been set still becomes :input_error.
  • Existing BUG 48 adversarial spec (step without prompt → :input_error) continues to pass.

0.7.0 — Remove :adapter_error default retry, narrow AdapterCaller rescue

21 Apr 13:53
0d6ed4b

Breaking changes

Both changes target redundancy between ruby_llm-contract and upstream ruby_llm 1.14.x.

1. :adapter_error removed from DEFAULT_RETRY_ON

New default: [:validation_failed, :parse_error].

ruby_llm's Faraday middleware already retries transport errors (RateLimitError, ServerError, ServiceUnavailableError, OverloadedError, timeouts) with backoff. Retrying on :adapter_error against the same model re-ran what transport had already retried — retry × retry with no change in context.

:adapter_error remains available as explicit opt-in. It is meaningful primarily paired with escalate "model_a", "model_b" — a different model/provider can bypass what transport retry could not.

2. AdapterCaller narrows rescue from StandardError to RubyLLM::Error + Faraday::Error

Provider errors (the RubyLLM::Error hierarchy) and transport errors that escape ruby_llm's Faraday retry middleware after exhaustion (Faraday::TimeoutError, Faraday::ConnectionFailed) still produce :adapter_error as before.

Programmer errors that are neither — NoMethodError, adapter-code bugs — now propagate instead of being silently converted to :adapter_error and retried. Bugs should be fixed, not retried.

Known limitation: adapter code raising ArgumentError is still coerced into :input_error by Step::Base#run_once (which rescues ArgumentError for input-type validation). Disambiguating adapter-ArgumentError vs input-validation-ArgumentError requires a run_once refactor; tracked as a follow-up.
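The narrowed rescue in change 2 can be sketched without loading either library. `Sketch::ProviderError` and `Sketch::TransportError` are hypothetical stand-ins for `RubyLLM::Error` and `Faraday::Error`, and `call_adapter` is an illustrative helper rather than the gem's `AdapterCaller`.

```ruby
module Sketch
  class ProviderError < StandardError; end  # stands in for RubyLLM::Error
  class TransportError < StandardError; end # stands in for Faraday::Error
end

# Converts only provider and transport failures into :adapter_error.
# Anything else (NoMethodError, TypeError, ...) is left to propagate.
def call_adapter
  output = yield
  { status: :ok, output: output }
rescue Sketch::ProviderError, Sketch::TransportError => e
  { status: :adapter_error, error: e.message }
end

call_adapter { raise Sketch::ProviderError, "rate limited" }
# => { status: :adapter_error, error: "rate limited" }
```

A block such as `call_adapter { nil.upcase }` raises `NoMethodError` straight through, which is the point: a rescue of `StandardError` would have hidden that bug behind an `:adapter_error` result.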

Migration

Restore pre-0.7 behavior:

retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end

Preferred — pair with a model fallback chain:

retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end
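To see why `:adapter_error` is more useful with a fallback chain than with same-model retry, here is a rough sketch of an escalation loop. `run_with_escalation` and `SKETCH_DEFAULT_RETRY_ON` are hypothetical; the constant only mirrors the new default listed above, and the loop is not the gem's retry implementation.

```ruby
# Mirrors the post-0.7 default described above; the name is illustrative.
SKETCH_DEFAULT_RETRY_ON = %i[validation_failed parse_error].freeze

# Tries each model in order and stops at the first result whose status
# is not listed in retry_on. The block performs one attempt per model.
def run_with_escalation(models, retry_on: SKETCH_DEFAULT_RETRY_ON)
  result = nil
  models.each do |model|
    result = yield(model)
    break unless retry_on.include?(result[:status])
  end
  result
end

models = ["gpt-4.1-nano", "gpt-4.1-mini"]

# Simulated attempt: the first model hits a provider failure, the second succeeds.
attempt = lambda do |model|
  model == "gpt-4.1-nano" ? { status: :adapter_error, model: model } : { status: :ok, model: model }
end

default_result = run_with_escalation(models, &attempt)
# => { status: :adapter_error, model: "gpt-4.1-nano" }

opt_in_result = run_with_escalation(models, retry_on: SKETCH_DEFAULT_RETRY_ON + [:adapter_error], &attempt)
# => { status: :ok, model: "gpt-4.1-mini" }
```

With the default list the first `:adapter_error` is returned as-is; with the opt-in list the loop moves to a different model, which is the case where a second attempt changes the context.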

Why the narrative matters

Post-0.7, DEFAULT_RETRY_ON = [:validation_failed, :parse_error] reads cleanly as the gem's core value proposition: retry in ruby_llm-contract is against LLM output variance (malformed JSON, business-rule violations), not against transport or infrastructure. Transport concerns live in ruby_llm/Faraday where they belong; programmer bugs propagate for quick detection.

v0.6.4 — production_mode: retry-aware cost

19 Apr 19:11
34a4697

Highlights

  • production_mode: { fallback: "..." } on compare_models / optimize_retry_policy — measures retry-aware, end-to-end cost per successful output. Each candidate runs with a runtime-injected [candidate, fallback] retry chain.
  • New metrics: escalation_rate, single_shot_cost, effective_cost, single_shot_latency_ms, effective_latency_ms, latency_percentiles — on both Report and AggregatedReport (averaged across runs:).
  • Extended ModelComparison#table: Chain / single-shot / escalation / effective cost / latency / score. Edge case candidate == fallback → em-dash (not 0%), retry injection skipped so effective == single-shot by construction.
  • context[:retry_policy_override] — new context key for transient per-call retry-policy overrides without mutating the step class.

Scope

  • Single-fallback (2-tier) chains only.
  • Step-only: raises ArgumentError if used on Pipeline::Base subclasses (pipeline-wide fallback semantics are a separate design question).

Docs