Releases: justi/ruby_llm-contract
v0.10.1 - gem packaging hygiene (supersedes 0.10.0)
Patch release fixing gem packaging. 0.10.0 was yanked from rubygems.org; 0.10.1 is the recommended upgrade target. No code behavior change vs 0.10.0.
Fixed
- Gem no longer ships internal tracker / dev configs. Excluded from spec.files: TODO.md, .rspec, .rubycritic.yml, .simplecov, and the .revive/ directory. Before 0.10.1 the published gem contained these files; adopters who already extracted 0.10.0 can safely delete them.
Also in this release (from 0.10.0)
This is the first publish to rubygems since 0.8.0. The 0.10.0 changelog entry consolidates work that was tagged but never published as 0.9.0 (multimodal input) and 0.9.1 (internal quality refactor). Highlights:
- Breaking: validate("") / invariant("") raise ArgumentError at definition
- Added: multimodal input via context: { attachment: ... } + attachment_token_estimate(n) macro
- Behavioural change: max_cost / max_input with attachment but no attachment_token_estimate → :limit_exceeded (fail-closed)
- Anti-facade audit: 89/89 spec files under full per-test walk; +30 strengthened tests
Full details: CHANGELOG.md
Full diff since 0.8.0: v0.8.0...v0.10.1
v0.10.0 - validate/invariant require non-empty descriptions
⚠️ This version was yanked from rubygems.org due to a gem packaging issue (internal files leaked into the gem). Use v0.10.1 instead: it has the same code behavior, just clean packaging. The release notes below are kept for historical reference; the consolidated entry covering all changes since 0.8.0 lives in the v0.10.1 CHANGELOG.
Breaking changes
validate(description, &block) and Definition#invariant(description, &block) now raise ArgumentError when description is nil or empty. Before 0.10.0 the empty descriptor was silently accepted and produced "" entries in result.validation_errors, which made a failure impossible to trace back to its rule. A Codex audit found zero production use sites across lib/, examples/, and the README; the only occurrence was the regression-marker test certifying the bug.
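The definition-time check described above can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: the SketchDefinition class and the error message are hypothetical, and only the nil-or-empty rule comes from the release note.

```ruby
# Hypothetical mini-DSL; names and message are illustrative only.
class SketchDefinition
  attr_reader :validations

  def initialize
    @validations = []
  end

  # Rejects nil or empty descriptions when the rule is declared,
  # instead of storing a rule that would later report "" as its label.
  def validate(description, &block)
    if description.nil? || description.to_s.empty?
      raise ArgumentError, "validate requires a non-empty description"
    end

    @validations << [description, block]
    self
  end
end

definition = SketchDefinition.new
definition.validate("score in range 0-100") { |o| o[:score].between?(0, 100) }

begin
  definition.validate("") { |o| o[:score].between?(0, 100) }
rescue ArgumentError => e
  rejected_message = e.message
end
```

Failing at declaration means a misconfigured rule surfaces when the class loads rather than in the middle of a run.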
Migration
Ensure every validate / invariant call has a non-empty descriptor (this is already how every README example writes them):
# Before (silently accepted, produced "" in validation_errors):
validate("") { |o| o[:score].between?(0, 100) }

# After (required):
validate("score in range 0-100") { |o| o[:score].between?(0, 100) }

Changed
- run_eval (no args) return shape pinned to Hash<String, Report> keyed by eval name. Documents the existing contract used by RubyLLM::Contract::RakeTask#collect_host_reports and adopters. No runtime change vs 0.8.0 / 0.9.x: only the spec assertion now locks the shape.
- Parser.parse(text, strategy: :json) first-bracket-wins boundary documented. Extraction commits to the first balanced { or [ structure and does NOT retry on later candidates. Empty {} followed by real JSON parses as the empty Hash; non-JSON {braces} before real JSON raises ParseError. No runtime change: this codifies long-standing behavior with explicit boundary tests.
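The first-bracket-wins boundary can be illustrated with a small standalone extractor. This is a sketch under stated assumptions, not the gem's Parser: first_balanced_structure and parse_first_structure are hypothetical helpers, and a failed parse surfaces here as JSON::ParserError from the standard library rather than the gem's ParseError.

```ruby
require "json"

# Hypothetical helper: returns the first balanced {...} or [...] span, or nil.
# Simplification: it counts nesting depth only and does not check that a
# closing bracket matches the kind of bracket that opened it.
def first_balanced_structure(text)
  start = text.index(/[{\[]/)
  return nil unless start

  depth = 0
  in_string = false
  escaped = false

  text[start..].each_char.with_index do |char, offset|
    if in_string
      # Inside a JSON string, brackets do not affect nesting depth.
      if escaped
        escaped = false
      elsif char == "\\"
        escaped = true
      elsif char == '"'
        in_string = false
      end
      next
    end

    case char
    when '"' then in_string = true
    when "{", "[" then depth += 1
    when "}", "]"
      depth -= 1
      return text[start, offset + 1] if depth.zero?
    end
  end

  nil
end

# Commits to the first candidate; later candidates are never tried.
def parse_first_structure(text)
  candidate = first_balanced_structure(text)
  raise ArgumentError, "no balanced JSON structure found" unless candidate

  JSON.parse(candidate)
end
```

With this sketch, an empty {} ahead of real JSON yields an empty Hash, and a non-JSON {braces} prefix fails the parse instead of falling through to the later structure.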
Internal
- Suite-wide anti-facade audit complete: 89/89 spec files under per-test 17-mode walk (Phase A: 26 specs, Phase C: 63 specs via parallel Codex fan-out). Net +30 strengthened tests against mutation-blind assertions, zero public API change beyond the breaking entry above.
Tests
- Suite: 1401 examples / 0 failures / 7 pending (was 1371/0/8 at 0.9.1).
v0.9.1 — internal quality refactor
⚠️ This version was never published to rubygems.org. The code from this tag shipped as part of v0.10.1 (which consolidates 0.9.0 + 0.9.1 + 0.10.0 work into a single published release). Use v0.10.1 — the content below is kept for historical reference.
Internal quality refactor — zero public API change.
Patch-version bump. Adopters pinning ~> 0.9.0 pick up the safer code path automatically on next bundle update.
Fixed
- Concurrency hazard in optimize_retry_policy: with_retry_disabled used to mutate the step class's singleton retry_policy method around compare_models, then restore it in an ensure. Two parallel optimizer calls on the same step class would race. Refactored to pass retry_policy_override: nil through context: (the existing well-supported override path). [Codex finding #3]
- CostCalculator.find_model exposed as public: Step::Base#estimate_cost used to bypass private_class_method with CostCalculator.send(:find_model, name). Visibility bypass via send is invisible to grep-for-callers and fragile under refactor. Now public; the estimated_cost_for helper is gone, and estimate_cost routes through the existing public CostCalculator.calculate(model_name:, usage:). [Codex finding #2]
- stub_step unified on a single storage path: block and non-block forms used to write to different state (thread-local hash vs RSpec allow/receive). Both now use RubyLLM::Contract.step_adapter_overrides (thread-local); cleanup between examples is handled by the existing around(:each) hook in rspec.rb. [Codex finding #8]
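The first fix above replaces shared-state mutation with a per-call override. A minimal sketch of that idea follows; SketchStep and its return shape are hypothetical, and only the retry_policy_override context key is taken from these notes.

```ruby
# Hypothetical step class; only the override-through-context idea is the point.
class SketchStep
  def self.retry_policy
    :class_level_policy
  end

  # The override travels with the call, so no class-level state is touched
  # and parallel callers cannot observe each other's overrides.
  def self.run(input, context: {})
    policy =
      if context.key?(:retry_policy_override)
        context[:retry_policy_override] # nil here means "retry disabled"
      else
        retry_policy
      end

    { input: input, policy: policy }
  end
end

threads = Array.new(4) do |i|
  Thread.new do
    context = i.even? ? { retry_policy_override: nil } : {}
    SketchStep.run(i, context: context)
  end
end
results = threads.map(&:value)
```

Using key? rather than a truthiness check is what lets an explicit nil override differ from "no override given".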
Internal
- Dead code removal in concerns/eval_host.rb: the ObjectSpace.each_object(Class) fallback in register_subclasses was unreachable on every supported runtime, because the gemspec requires Ruby >= 3.2.0 and Class#subclasses ships in 3.1+. Dropped. [Codex finding #7]
Test discipline
- Characterization-first — every refactor wrote tests pinning the current behaviour BEFORE touching production code, then replaced with new contract tests after.
- Suite: 1346 examples / 0 failures / 8 pending (+12 net new tests vs 0.9.0).
Refactor backlog (deferred to 0.10.0+)
Documented in TODO.md. Remaining Codex findings:
- Runner.new 17 kwargs → RunnerConfig factory (Long Parameter List / Shotgun Surgery)
- DSL inheritance walk DRY + UNSET sentinel (~200 LOC duplication across DSL accessors; one coordinated PR for step/dsl.rb)
- RakeTask#define_task god method → SuiteGate value object extraction
Each requires more invasive surgery on shared surfaces and benefits from dedicated focus rather than tail-end inclusion in a patch.
Full changelog: CHANGELOG.md
Diff: v0.9.0 → v0.9.1
v0.9.0 — multimodal input
⚠️ This version was never published to rubygems.org. The code from this tag shipped as part of v0.10.1 (which consolidates 0.9.0 + 0.9.1 + 0.10.0 work into a single published release). Use v0.10.1 — the content below is kept for historical reference.
Multimodal input — route PDFs/images/audio through your contract.
First adopter-driven feature after the 0.8 narrative repositioning. Attachments now travel via Step.run(input, context: { attachment: ... }); max_cost, validate, retry_policy escalate(...), and trace observability still apply.
Added
- Step.run(input, context: { attachment: file_or_io_or_url }): the adapter forwards chat.ask(content, with: attachment). RubyLLM ≥ 1.15 normalises the wire format per provider (Anthropic url/base64, OpenAI image_url/file/input_audio, Gemini inline_data). Multi-attachment is supported natively (with: [...] or with: { images: [...], pdfs: [...] }).
- attachment_token_estimate(n) class macro: adopter-declared conservative estimate of attachment input tokens. Applied to BOTH runtime check_limits AND pre-flight estimate_cost, so there is a single source of truth and no estimate/runtime drift. Inherits from superclass, supports :default reset.
- on_unknown_attachment_size(:refuse | :warn) class macro: mirrors on_unknown_pricing opt-out semantics. Default :refuse. Per-step only; never settable as a global default.
- estimate_cost(input:, model:, attachment: nil): new kwarg, adds attachment tokens to input_tokens, with the same fail-closed behaviour as runtime.
Behavioural change — read before upgrading
Contracts with max_cost or max_input set AND now receiving context[:attachment] AND no attachment_token_estimate declared will refuse with :limit_exceeded. The gem cannot bound vision/PDF cost without an adopter-declared estimate. Opt out per-step with on_unknown_attachment_size :warn. Text-only contracts are unaffected.
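The fail-closed rule above reduces to a small decision table. The sketch below is illustrative only: attachment_limit_decision is a hypothetical helper, and every returned symbol except :limit_exceeded is invented for the example rather than taken from the gem.

```ruby
# Hypothetical decision helper mirroring the rule described above.
def attachment_limit_decision(limits_declared:, attachment_present:, token_estimate:, on_unknown: :refuse)
  # Text-only calls and steps without max_cost / max_input are unaffected.
  return :unaffected unless limits_declared && attachment_present

  # A declared estimate lets the normal limit check account for the attachment.
  return :check_with_estimate unless token_estimate.nil?

  # No estimate: refuse by default, or continue with a warning when opted out.
  on_unknown == :warn ? :warn_and_continue : :limit_exceeded
end

refused = attachment_limit_decision(
  limits_declared: true, attachment_present: true, token_estimate: nil
)
opted_out = attachment_limit_decision(
  limits_declared: true, attachment_present: true, token_estimate: nil, on_unknown: :warn
)
text_only = attachment_limit_decision(
  limits_declared: true, attachment_present: false, token_estimate: nil
)
```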
Docs
- New guide: Multimodal input (PDF / image / audio). Adopter walkthrough, attachment_token_estimate calibration table per provider, fail-closed semantics, testing recipe.
- New README FAQ entry: "I upgraded to 0.9 and my contract started refusing — why?"
Deferred (not in 0.9.0)
- add_history multi-turn replay of prior attachments (single-turn supported; follow-up on same document deferred).
- Streaming + attachment (contract steps remain synchronous).
- Provider-specific attachment size caps (surface only via attachment_token_estimate calibration).
Tests
- 22 new specs in spec/ruby_llm/contract/step/multimodal_input_spec.rb covering DSL inheritance/reset/validation, adapter pass-through, runtime fail-closed (refuse + warn modes), estimate parity, multi-attachment array/hash routing, and the RubyLLM 1.15 with: nil no-op contract.
- 4 existing adapter specs updated to expect chat.ask(..., with: nil).
- Full suite: 1336 examples / 0 failures.
Full changelog: CHANGELOG.md
Diff: v0.8.0 → v0.9.0
0.8.0 — Contracts + Evals for RubyLLM
Narrative repositioning + small API additions. Internal architecture unchanged: no Step::Base refactor, no breaking changes to existing DSL.
Added
- thinking(effort:, budget:) class macro on Step::Base: mirrors the RubyLLM::Agent.thinking signature exactly. Stored as a { effort:, budget: } hash; the reader returns the hash; supports :default reset semantics and superclass inheritance like model/temperature. The convenience alias reasoning_effort(:low) is implemented as thinking(effort: :low), so there is a single normalized state, not a separate ivar.
- Adapter wiring for with_thinking: when thinking is set on the Step class, OR when reasoning_effort: is passed through context, OR when an attempt config in retry_policy escalate(...) carries reasoning_effort:, the RubyLLM adapter resolves the effective { effort:, budget: } hash and forwards it via chat.with_thinking(**), which is provider-agnostic (supports OpenAI reasoning_effort AND Anthropic extended-thinking budget). Precedence: per-attempt / context reasoning_effort overrides class-level thinking[:effort]; budget is taken from class-level thinking[:budget]. Behavioural change vs 0.7.x: reasoning_effort is now forwarded via with_thinking instead of with_params. Same wire-level OpenAI parameter; provider-agnostic Anthropic support is now automatic.
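The precedence rule in the adapter wiring entry can be sketched as a pure function. resolve_thinking is a hypothetical helper, not the adapter's code. The notes do not say how a per-attempt value ranks against a context value, so this sketch assumes per-attempt wins; treat that ordering as an assumption.

```ruby
# Hypothetical resolver for the effective { effort:, budget: } hash.
# Assumption: per-attempt effort outranks context effort (not stated in the notes).
def resolve_thinking(class_thinking: nil, context_effort: nil, attempt_effort: nil)
  class_thinking ||= {}

  # Per-attempt / context effort overrides the class-level effort.
  effort = attempt_effort || context_effort || class_thinking[:effort]
  # Budget always comes from the class-level declaration.
  budget = class_thinking[:budget]

  # Nothing configured anywhere: no thinking options to forward.
  return nil if effort.nil? && budget.nil?

  { effort: effort, budget: budget }
end

escalated = resolve_thinking(class_thinking: { effort: :low, budget: 1024 }, attempt_effort: :high)
from_context = resolve_thinking(context_effort: :medium)
unset = resolve_thinking
```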
Dependencies
- ruby_llm constraint bumped from ~> 1.0 to ~> 1.12: Chat#with_thinking is the canonical path for reasoning effort + extended thinking; it shipped in RubyLLM 1.12. Adopters on ruby_llm < 1.12 need to bump RubyLLM before upgrading this gem to 0.8.0.
Changed
- Tagline + README opening: repositioned around "Contracts + Evals for RubyLLM". New "Relation to RubyLLM::Agent" section explicitly frames Step as a sibling abstraction (same niche as Agent, wider contract), not an alternative or foundation. README does not claim "Step uses Agent under the hood"; the current call path is Step → Runner → Adapters::RubyLLM → RubyLLM.chat directly.
- TokenEstimator documented as heuristic: module docstring expanded with explicit "±30% accuracy" framing. Refusal messages from LimitChecker now include a (heuristic ±30%) suffix so adopters know the pre-flight number is estimated, not measured. RubyLLM 1.14 also has no pre-flight tokenizer; RubyLLM::Tokens is post-hoc only.
- CostCalculator repositioned in docs: module narrative reframed from "cost calculator" to "fine-tune pricing registry + lookup with fallback chain". Math methods (compute_cost, token_cost, etc.) were already private; this release makes the docs match. Public API surface unchanged: register_model, unregister_model, reset_custom_models!, calculate.
- output_schema reframed in docs: described as "wrapper around RubyLLM::Schema + client-side validation step", not a standalone feature. The schema language is identical to what RubyLLM::Agent.schema accepts; the difference is what wraps it.
- README retry framing: retry_policy escalate(...) (model escalation on validation failure) is the marketed default. retry_policy attempts: N (same-model retry) stays in the API for backward compat and niche cases (subjective criteria, multi-step pipelines, weaker models) but is no longer marketed as a recommended default. Empirical basis: four small experiments across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation found no useful lift for nano-class models on tasks with clear correctness criteria.
Documentation
- New disambiguation paragraphs in prompt_ast.md (Step.input_type vs RubyLLM::Agent.inputs; Prompt::Builder multi-role DSL vs Agent ERB single-string template loader), testing.md (Step.observe vs Chat#on_end_message / on_tool_call), output_schema.md (relation to Agent.schema), and optimizing_retry_policy.md (orthogonal model + thinking dimensions).
- getting_started.md refusal message example updated to include the new (heuristic ±30%) suffix.
Issues closed
- #11 (Optimizer is blind to same-model attempts): closed after empirical experiments. attempts: N retry stays in the API; not marketed as a default.
- #6 (Production cost reporting): already implemented in 0.7.x; close confirmed.
Not in this release (deferred)
- output_schema Proc form for runtime-input-aware schemas (parity with Agent.schema Proc form). Additive, low-risk; deferred to 0.9 to keep 0.8 scope tight.
- H4 (Step composing RubyLLM::Agent internally as config holder): verified feasible but ROI insufficient for current adopter base; trigger-based revisit, no calendar commitment.
0.7.3 — Adoption-friction release (docs + examples consolidation)
0.7.3 (2026-04-24)
Adoption-friction release. No runtime behavior changes — every delta is in docs/, examples/, or spec/integration/ (plus the version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.
Documentation
- New guide: docs/guide/why.md. Four production failure modes the gem exists for (schema-valid but logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
- New guide: docs/guide/rails_integration.md. Seven Rails-specific FAQs with runnable snippets: where step classes live (app/contracts/), initializer setup, background jobs, around_call observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
- README adoption-friction pass: added a short "Do I need this?" block after Install, a reading-order hint (README → why.md → getting_started.md), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
- TL;DR box at the top of every guide: single-sentence orientation for readers who land via search; a "Skip if" clause added where real confusion exists (eval_first.md, testing.md, migration.md).
- API coverage gaps closed: estimate_cost / estimate_eval_cost, max_cost on_unknown_pricing: :warn, run_eval(..., concurrency:), and around_call testing patterns are now documented in getting_started.md, eval_first.md, testing.md.
- Industry-standard terminology: temperature-locked → fixed-temperature, variance-induced → sampling variance, severity signals → severity keywords, takeaway drift → tone/takeaways mismatch.
- docs/architecture.md refresh: the diagram now reflects the current class layout. Added Step::RetryPolicy, Pipeline::Result, Eval::AggregatedReport, Eval::BaselineDiff, Eval::PromptDiffComparator, Eval::EvalHistory, Eval::RetryOptimizer, OptimizeRakeTask. Replaced the outdated Eval::TraitEvaluator entry with Eval::ExpectationEvaluator.
- Business framing added to guides: every guide opens with a concrete production scenario or "why it matters" hook before the API reference.
Examples — consolidated on SummarizeArticle, renumbered 00-06
The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's SummarizeArticle case.
| # | File | Answers |
|---|---|---|
| 00 | 00_basics.rb | How do I start? (seven incremental layers + real-LLM pointer) |
| 01 | 01_fallback_showcase.rb | Show me the gem in 30 seconds (zero API keys) |
| 02 | 02_real_llm_minimal.rb | How do I plug in a real LLM? (~30 lines) |
| 03 | 03_summarize_with_keywords.rb | How does the contract evolve? (growing prompt) |
| 04 | 04_summarize_and_translate.rb | Pipeline composition + pipeline-level run_eval |
| 05 | 05_eval_dataset.rb | How do I stop silent prompt regressions? |
| 06 | 06_retry_variants.rb | attempts: 3, reasoning_effort escalation, cross-provider (Ollama → Anthropic → OpenAI) |
Every file carries an "Expected output" block in its header so readers see the result without running the script. The docs/ideas/ directory is now fully untracked (already in .gitignore; one stray file removed from version control).
Examples — bug fixes carried along
- Schema pitfall fixed in 5 files: array :x do; string :y; ...; end silently produces items: string and drops every declaration after the first, matching the documented pitfall in spec/ruby_llm/contract/nested_schema_spec.rb:71. Every affected array block is now wrapped in object do...end.
- examples/05_eval_dataset.rb (pre-renumber: 09_eval_dataset.rb): result[:passed] → result.passed?. The previous code called [] on an Eval::CaseResult and raised NoMethodError at runtime.
Testing
- New spec/integration/pipeline_eval_spec.rb: three cases guaranteeing pipeline-level run_eval stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate validate rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts step_status == :validation_failed and the validate's label in details, so a regression that short-circuits on schema instead of validate would fail loudly.
Deleted (private-project cleanup)
- examples/01_classify_threads.rb, 02_generate_comment.rb, 03_target_audience.rb, 10_reddit_full_showcase.rb, spec/integration/reddit_pipeline_spec.rb: Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
- examples/02_output_schema.rb: fully covered by docs/guide/output_schema.md; deleting avoids duplication.
0.7.1 — Narrow run_once ArgumentError rescue
Behavioral change (follow-up to v0.7.0)
Closes the known limitation called out in the v0.7.0 CHANGELOG.
Before: Step::Base#run_once wrapped the entire Runner chain in rescue ArgumentError to convert DSL misconfiguration (e.g. prompt has not been set) into :input_error. Side effect: any ArgumentError raised from adapter code during Runner#call — wrong arity, bad config arg, any programmer bug — was silently coerced into :input_error and re-tried as if the user had supplied bad input.
After: the rescue is scoped to the Runner-construction phase only. DSL configuration errors still produce :input_error (the prompt has not been set case is regression-tested). ArgumentError raised during Runner#call propagates to the caller.
Input-type validation failures continue to produce :input_error via InputValidator's own scoped rescue (Dry::Types::CoercionError, TypeError, ArgumentError around the type-check boundary) — unchanged.
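The scoping change can be sketched as follows. This is a simplified illustration, not the gem's run_once: SketchRunner, the result hashes, and the two lambdas are hypothetical, and only the "rescue around construction, not around the call" structure reflects the text above.

```ruby
# Hypothetical runner; only the placement of the rescue matters here.
class SketchRunner
  def initialize(prompt:, adapter:)
    # DSL misconfiguration is detected while the runner is being built.
    raise ArgumentError, "prompt has not been set" if prompt.nil?

    @prompt = prompt
    @adapter = adapter
  end

  def call(input)
    @adapter.call(@prompt, input)
  end
end

def run_once(input, prompt:, adapter:)
  begin
    runner = SketchRunner.new(prompt: prompt, adapter: adapter)
  rescue ArgumentError => e
    # Construction-phase errors still map to :input_error.
    return { status: :input_error, message: e.message }
  end

  # Outside the rescue: an ArgumentError raised here propagates to the caller.
  { status: :ok, output: runner.call(input) }
end

buggy_adapter = ->(_prompt, _input) { raise ArgumentError, "wrong number of arguments" }
working_adapter = ->(prompt, input) { "#{prompt}: #{input}" }
```

Calling run_once with prompt: nil returns an :input_error result, while the buggy adapter's ArgumentError reaches the caller unchanged.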
Why it matters
v0.7.0's narrative was "programmer errors propagate, provider errors become :adapter_error". AdapterCaller already respected that (narrowed to RubyLLM::Error + Faraday::Error). But run_once's broader rescue ArgumentError was a backdoor that let adapter-code ArgumentError bugs slip back into :input_error and become retry targets.
This release closes that backdoor. Programmer bugs raised during an adapter call now surface loudly instead of being disguised as "user gave bad input".
Compatibility
Technically a behavioral change — callers previously relying on adapter-code ArgumentError to produce :input_error results will now see the exception propagate. If your adapter deliberately raises ArgumentError for expected validation flows, wrap that in RubyLLM::Error (becomes :adapter_error, respected by retry) or add explicit handling at the call site.
Test plan
- bundle exec rspec: 1341 examples, 0 failures, 8 pending (all pending are API-key-gated live LLM tests).
- New regression specs in retry_integration_spec.rb:
  - propagates ArgumentError from adapter code (programmer bug, not bad input): adapter raising ArgumentError now propagates.
  - still converts DSL misconfiguration ArgumentError to :input_error (prompt missing): prompt has not been set still becomes :input_error.
- Existing BUG 48 adversarial spec (step without prompt → :input_error) continues to pass.
0.7.0 — Remove :adapter_error default retry, narrow AdapterCaller rescue
Breaking changes
Both changes target redundancy between ruby_llm-contract and upstream ruby_llm 1.14.x.
1. :adapter_error removed from DEFAULT_RETRY_ON
New default: [:validation_failed, :parse_error].
ruby_llm's Faraday middleware already retries transport errors (RateLimitError, ServerError, ServiceUnavailableError, OverloadedError, timeouts) with backoff. Retrying on :adapter_error against the same model re-ran what transport had already retried — retry × retry with no change in context.
:adapter_error remains available as explicit opt-in. It is meaningful primarily paired with escalate "model_a", "model_b" — a different model/provider can bypass what transport retry could not.
2. AdapterCaller narrows rescue from StandardError to RubyLLM::Error + Faraday::Error
Provider errors (the RubyLLM::Error hierarchy) and transport errors that escape ruby_llm's Faraday retry middleware after exhaustion (Faraday::TimeoutError, Faraday::ConnectionFailed) still produce :adapter_error as before.
Programmer errors that are neither — NoMethodError, adapter-code bugs — now propagate instead of being silently converted to :adapter_error and retried. Bugs should be fixed, not retried.
Known limitation: adapter code raising ArgumentError is still coerced into :input_error by Step::Base#run_once (which rescues ArgumentError for input-type validation). Disambiguating adapter-ArgumentError vs input-validation-ArgumentError requires a run_once refactor; tracked as a follow-up.
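The narrowed rescue from change 2 can be sketched without loading either library. The two error classes below are stand-ins for RubyLLM::Error and Faraday::Error, and call_adapter is a hypothetical wrapper, not AdapterCaller itself.

```ruby
# Stand-ins so the sketch runs without ruby_llm or faraday installed.
class SketchProviderError < StandardError; end   # plays the role of RubyLLM::Error
class SketchTransportError < StandardError; end  # plays the role of Faraday::Error

# Only the listed error families become :adapter_error results.
# Anything else (NoMethodError, TypeError, ...) is a bug and propagates.
def call_adapter
  { status: :ok, output: yield }
rescue SketchProviderError, SketchTransportError => e
  { status: :adapter_error, message: e.message }
end

provider_failure = call_adapter { raise SketchProviderError, "rate limited" }
transport_failure = call_adapter { raise SketchTransportError, "connection failed" }

bug_propagated = begin
  call_adapter { nil.upcase } # programmer bug: NoMethodError
  false
rescue NoMethodError
  true
end
```

Listing the rescued classes explicitly is what keeps bugs out of the retry loop: an unlisted exception never produces a retryable status.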
Migration
Restore pre-0.7 behavior:
retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end

Preferred (pair with a model fallback chain):

retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end

Why the narrative matters
Post-0.7, DEFAULT_RETRY_ON = [:validation_failed, :parse_error] reads cleanly as the gem's core value proposition: retry in ruby_llm-contract is against LLM output variance (malformed JSON, business-rule violations), not against transport or infrastructure. Transport concerns live in ruby_llm/Faraday where they belong; programmer bugs propagate for quick detection.
v0.6.4 — production_mode: retry-aware cost
Highlights
- production_mode: { fallback: "..." } on compare_models / optimize_retry_policy: measures retry-aware, end-to-end cost per successful output. Each candidate runs with a runtime-injected [candidate, fallback] retry chain.
- New metrics: escalation_rate, single_shot_cost, effective_cost, single_shot_latency_ms, effective_latency_ms, latency_percentiles, on both Report and AggregatedReport (averaged across runs:).
- Extended ModelComparison#table: Chain / single-shot / escalation / effective cost / latency / score. Edge case candidate == fallback → em-dash (not 0%); retry injection is skipped so effective == single-shot by construction.
- context[:retry_policy_override]: new context key for transient per-call retry-policy overrides without mutating the step class.
Scope
- Single-fallback (2-tier) chains only.
- Step-only: raises ArgumentError if used on Pipeline::Base subclasses (pipeline-wide fallback semantics are a separate design question).