
Commit 411cda5

extension/llm/runner: Engine/Session C++ core + token-step primitives
Add the model-agnostic LLMEngine/LLMSession interfaces (llm_session.h) with SamplingConfig, DecodeResult and LLMServingCapacity; the TextLLMRunner token-step primitives the session layer is built on (seek, prefill_tokens, position, decode_one); and TextLLMEngine/TextLLMSession over a single loaded Program. decode_one() shares generate()'s logit processors via TextTokenGenerator::apply_logit_processors so the two decode paths cannot diverge. serving_capacity() reports a conservative single physical session (physical weight sharing is backend-dependent).

Also add utf8_complete_prefix_len and stop_safe_prefix_len (util.h): byte-level BPE tokenizers can emit a token that is only part of a multi-byte UTF-8 character, so a streaming consumer must forward only the complete-character prefix of accumulated pieces and hold the trailing bytes until the rest arrives; stop_safe_prefix_len additionally holds back the longest possible partial-stop tail so a stop string straddling pieces is still caught. The C++ workers built on this core use both to stream UTF-8-safe text with stop sequences.

Covered by gtests in test_text_llm_runner.cpp and test_util.cpp.

First of six stacked commits: C++ core -> server foundations -> worker-based HTTP server -> pi docs -> Qwen worker -> Qwen CUDA V2 (per-session state). Part of #20001

ghstack-source-id: 02463c3
ghstack-comment-id: 4617262593
Pull-Request: #19991
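The two streaming helpers described in the commit message are not shown in this part of the diff. The following is a minimal sketch of the idea only, assuming C++17; the `_sketch` names, the signatures, and the way the two checks are composed are hypothetical and may differ from what util.h actually declares.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Expected byte length of a UTF-8 sequence given its lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
inline std::size_t utf8_seq_len_sketch(unsigned char b) {
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 0;
}

// Length of the longest prefix of `s` that does not end inside a multi-byte
// character. Malformed input is forwarded unchanged (an assumption).
inline std::size_t utf8_complete_prefix_len_sketch(std::string_view s) {
  const std::size_t n = s.size();
  // A truncated character leaves at most 3 bytes at the tail.
  for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
    const auto b = static_cast<unsigned char>(s[n - back]);
    if ((b & 0xC0) == 0x80) continue;  // continuation byte: keep looking
    const std::size_t need = utf8_seq_len_sketch(b);
    return need > back ? n - back : n;  // hold back an incomplete character
  }
  return n;
}

// Length of the prefix that is safe to emit: holds back the longest tail that
// is a proper prefix of any stop string, then trims to a character boundary.
inline std::size_t stop_safe_prefix_len_sketch(
    std::string_view text, const std::vector<std::string>& stops) {
  std::size_t hold = 0;
  for (const std::string& stop : stops) {
    if (stop.empty()) continue;
    const std::string_view sv(stop);
    // Only proper prefixes: a full match is a completed stop, handled elsewhere.
    const std::size_t max_k = std::min(sv.size() - 1, text.size());
    for (std::size_t k = max_k; k > hold; --k) {
      if (text.substr(text.size() - k) == sv.substr(0, k)) {
        hold = k;
        break;
      }
    }
  }
  assert(hold <= text.size());
  return utf8_complete_prefix_len_sketch(text.substr(0, text.size() - hold));
}
```

A streaming consumer would append each decoded piece to a buffer, emit the first `stop_safe_prefix_len_sketch(buffer, stops)` bytes, and keep the remainder for the next piece.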
1 parent eeb0646 commit 411cda5

14 files changed

Lines changed: 1177 additions & 45 deletions

extension/llm/runner/llm_runner_helper.cpp

Lines changed: 212 additions & 0 deletions
@@ -15,6 +15,7 @@
 #include <executorch/extension/llm/runner/multimodal_runner.h>
 #include <executorch/extension/llm/runner/stats.h>
 #include <executorch/extension/llm/runner/text_llm_runner.h>
+#include <executorch/extension/llm/runner/text_llm_session.h>
 #include <executorch/extension/llm/runner/text_prefiller.h>
 #include <executorch/extension/llm/runner/text_token_generator.h>
 #include <executorch/extension/memory_allocator/cpu_caching_malloc_allocator.h>
@@ -29,8 +30,18 @@
 namespace executorch::extension::llm {

 using ::executorch::extension::Module;
+using ::executorch::extension::Program;
 using ::executorch::runtime::Error;

+// Assembles the per-Module components (decoder/prefiller/token generator/io
+// manager/stats) into a TextLLMRunner. Shared by the path-based and the
+// shared-Program (TextLLMEngine session) construction paths.
+static std::unique_ptr<TextLLMRunner> assemble_text_llm_runner(
+    std::unique_ptr<Module> module,
+    std::unique_ptr<::tokenizers::Tokenizer> tokenizer,
+    float temperature,
+    const std::string& method_name);
+
 std::unique_ptr<tokenizers::Tokenizer> load_tokenizer(
     const std::string& tokenizer_path,
     std::unique_ptr<std::vector<std::string>> special_tokens,
@@ -251,6 +262,15 @@ std::unique_ptr<TextLLMRunner> create_text_llm_runner(
         max_cached_memory_size_bytes_));
   }

+  return assemble_text_llm_runner(
+      std::move(module), std::move(tokenizer), temperature, method_name);
+}
+
+static std::unique_ptr<TextLLMRunner> assemble_text_llm_runner(
+    std::unique_ptr<Module> module,
+    std::unique_ptr<::tokenizers::Tokenizer> tokenizer,
+    float temperature,
+    const std::string& method_name) {
   // Get metadata from Module
   ET_LOG(Info, "Reading metadata from model");
   auto metadata_result = llm::get_llm_metadata(tokenizer.get(), module.get());
@@ -305,6 +325,198 @@ std::unique_ptr<TextLLMRunner> create_text_llm_runner(
       temperature);
 }

+// Builds a TextLLMRunner over an already-loaded Program: the runner's Module
+// reuses `program` while owning its own method state and KV cache. File-local:
+// the per-session construction path for TextLLMEngine (which keeps the backing
+// DataLoader alive for the runners' lifetime). External callers go through
+// LLMEngine -> LLMSession, not a raw shared-Program runner.
+static std::unique_ptr<TextLLMRunner> create_text_llm_runner_from_program(
+    std::shared_ptr<Program> program,
+    std::unique_ptr<::tokenizers::Tokenizer> tokenizer,
+    float temperature,
+    const std::string& method_name) {
+  if (!tokenizer || !tokenizer->is_loaded()) {
+    ET_LOG(Error, "Tokenizer is null or not loaded");
+    return nullptr;
+  }
+  if (!program) {
+    ET_LOG(Error, "Program is null");
+    return nullptr;
+  }
+  // A Module over the already-loaded Program: it reuses that Program rather
+  // than re-loading it, and its loaded method allocates its own planned (KV)
+  // memory. Whether packed weights are physically shared vs. re-materialized
+  // per method instance is backend-dependent (serving_capacity() is the
+  // authority).
+  constexpr uint32_t kMaxCachedMemoryBytes = 1024 * 1024 * 10; // 10MB
+  auto module = std::make_unique<Module>(
+      std::move(program),
+      nullptr, // memory allocator
+      std::make_unique<executorch::extension::CPUCachingAllocator>(
+          kMaxCachedMemoryBytes));
+  return assemble_text_llm_runner(
+      std::move(module), std::move(tokenizer), temperature, method_name);
+}
+
+namespace detail {
+// The TextLLM adapter: implements the model-agnostic LLMSession over a
+// TextLLMRunner. TextLLMRunner's token-step methods are private; this adapter
+// is their only (friended) caller, so the engine and server depend solely on
+// LLMSession.
+TextLLMSession::TextLLMSession(std::unique_ptr<TextLLMRunner> runner)
+    : runner_(std::move(runner)) {}
+
+Error TextLLMSession::prefill_tokens(
+    std::vector<uint64_t> tokens,
+    const SamplingConfig* initial_sampling) {
+  // The model samples the FIRST generated token during prefill, so apply the
+  // request's sampling here (not a stale default). Only temperature is
+  // plumbed; reject non-default top_p/top_k/seed for parity with decode_one().
+  float temperature = -1.0f;
+  if (initial_sampling != nullptr) {
+    if (initial_sampling->top_p != 1.0f || initial_sampling->top_k != 0 ||
+        initial_sampling->seed != 0) {
+      ET_LOG(
+          Error,
+          "TextLLMSession: only temperature is supported; top_p/top_k/seed "
+          "are not yet implemented");
+      return ::executorch::runtime::Error::NotSupported;
+    }
+    temperature = initial_sampling->temperature;
+  }
+  return runner_->prefill_tokens(std::move(tokens), temperature).error();
+}
+
+::executorch::runtime::Result<DecodeResult> TextLLMSession::decode_one(
+    const SamplingConfig& sampling) {
+  // Only temperature is plumbed today; top_p/top_k/seed need a per-session
+  // sampler (a follow-up). Reject non-default values rather than silently
+  // ignoring them, so callers can't assume constraints are applied.
+  if (sampling.top_p != 1.0f || sampling.top_k != 0 || sampling.seed != 0) {
+    ET_LOG(
+        Error,
+        "TextLLMSession: only temperature is supported; top_p/top_k/seed are "
+        "not yet implemented");
+    return ::executorch::runtime::Error::NotSupported;
+  }
+  return runner_->decode_one(sampling.temperature);
+}
+
+Error TextLLMSession::seek(int64_t pos) {
+  return runner_->seek(pos);
+}
+
+int64_t TextLLMSession::position() const {
+  return runner_->position();
+}
+
+Error TextLLMSession::reset() {
+  runner_->reset();
+  return Error::Ok;
+}
+
+void TextLLMSession::stop() {
+  runner_->stop();
+}
+
+std::unique_ptr<LLMSession> make_text_llm_session(
+    std::unique_ptr<TextLLMRunner> runner) {
+  return std::make_unique<TextLLMSession>(std::move(runner));
+}
+} // namespace detail
+
+TextLLMEngine::TextLLMEngine(
+    std::unique_ptr<Module> loader_module,
+    std::shared_ptr<Program> program,
+    std::string tokenizer_path,
+    float temperature,
+    std::string method_name,
+    std::unordered_map<std::string, int64_t> metadata)
+    : loader_module_(std::move(loader_module)),
+      program_(std::move(program)),
+      tokenizer_path_(std::move(tokenizer_path)),
+      temperature_(temperature),
+      method_name_(std::move(method_name)),
+      metadata_(std::move(metadata)) {}
+
+std::unique_ptr<TextLLMEngine> TextLLMEngine::create(
+    const std::string& model_path,
+    const std::string& tokenizer_path,
+    std::optional<const std::string> data_path,
+    float temperature,
+    const std::string& method_name,
+    Module::LoadMode load_mode) {
+  // External .ptd weights are not yet supported for shared sessions: each
+  // session Module built from the shared Program would also need the
+  // data_map_loader threaded into its load_method() to resolve external
+  // weights (see Module::load_method merged_data_map_). Fail loudly rather than
+  // silently produce sessions that error on first generate.
+  if (data_path.has_value()) {
+    ET_LOG(
+        Error,
+        "TextLLMEngine: external data_path (.ptd) is not yet supported for "
+        "shared sessions; use a self-contained .pte for now.");
+    return nullptr;
+  }
+  // Load the program ONCE; sessions reuse it (loaded a single time, per-session
+  // KV). Physical weight sharing across sessions is backend-dependent; see
+  // serving_capacity().
+  auto loader_module = std::make_unique<Module>(model_path, load_mode);
+  if (loader_module->load() != Error::Ok) {
+    ET_LOG(
+        Error,
+        "TextLLMEngine: failed to load program from %s",
+        model_path.c_str());
+    return nullptr;
+  }
+  auto program = loader_module->program();
+  if (!program) {
+    ET_LOG(Error, "TextLLMEngine: program is null after load");
+    return nullptr;
+  }
+  // Read model-level metadata once (shared by all sessions).
+  auto meta_tokenizer = load_tokenizer(tokenizer_path);
+  if (!meta_tokenizer) {
+    ET_LOG(
+        Error,
+        "TextLLMEngine: failed to load tokenizer from %s",
+        tokenizer_path.c_str());
+    return nullptr;
+  }
+  auto metadata_result =
+      get_llm_metadata(meta_tokenizer.get(), loader_module.get());
+  if (metadata_result.error() != Error::Ok) {
+    ET_LOG(Error, "TextLLMEngine: failed to read metadata");
+    return nullptr;
+  }
+  return std::unique_ptr<TextLLMEngine>(new TextLLMEngine(
+      std::move(loader_module),
+      std::move(program),
+      tokenizer_path,
+      temperature,
+      method_name,
+      metadata_result.get()));
+}
+
+::executorch::runtime::Result<std::unique_ptr<LLMSession>>
+TextLLMEngine::create_session() {
+  auto tokenizer = load_tokenizer(tokenizer_path_);
+  if (!tokenizer) {
+    ET_LOG(
+        Error,
+        "TextLLMEngine: failed to load tokenizer from %s",
+        tokenizer_path_.c_str());
+    return Error::InvalidState;
+  }
+  auto runner = create_text_llm_runner_from_program(
+      program_, std::move(tokenizer), temperature_, method_name_);
+  if (!runner) {
+    ET_LOG(Error, "TextLLMEngine: failed to build session runner");
+    return Error::InvalidState;
+  }
+  return detail::make_text_llm_session(std::move(runner));
+}
+
 std::unique_ptr<MultimodalRunner> create_multimodal_runner(
     const std::string& model_path,
     std::unique_ptr<::tokenizers::Tokenizer> tokenizer,
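Review note: prefill_tokens() and decode_one() above repeat the same "only temperature is supported" guard. A minimal sketch of that guard as one shared helper follows. `SamplingConfigSketch`, `StatusSketch`, and both function names are hypothetical stand-ins; the field types and the temperature default are assumptions, since llm_session.h is not part of this excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for SamplingConfig. Field names follow the diff;
// the types and the temperature default are assumptions.
struct SamplingConfigSketch {
  float temperature = 0.8f;
  float top_p = 1.0f;  // 1.0 means "no nucleus filtering requested"
  int32_t top_k = 0;   // 0 means "no top-k filtering requested"
  uint64_t seed = 0;   // 0 means "no explicit seed requested"
};

// Hypothetical stand-in for the runtime's Error codes.
enum class StatusSketch { Ok, NotSupported };

// True when every field the session cannot honor is left at its default.
inline bool only_temperature_requested(const SamplingConfigSketch& s) {
  return s.top_p == 1.0f && s.top_k == 0 && s.seed == 0;
}

// Shared guard: a null config means "use the engine default" and is accepted;
// any non-default unsupported field is rejected instead of silently ignored.
inline StatusSketch validate_sampling(const SamplingConfigSketch* s) {
  if (s != nullptr && !only_temperature_requested(*s)) {
    return StatusSketch::NotSupported;
  }
  return StatusSketch::Ok;
}
```

Routing both entry points through one helper would keep the two checks from drifting apart when top_p, top_k, or seed support is added later.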

extension/llm/runner/llm_runner_helper.h

Lines changed: 71 additions & 0 deletions
@@ -18,6 +18,7 @@
 #include <vector>

 #include <executorch/extension/llm/runner/constants.h>
+#include <executorch/extension/llm/runner/llm_session.h>
 #include <executorch/extension/module/module.h>
 #include <executorch/runtime/core/result.h>
 #include <executorch/runtime/platform/compiler.h>
@@ -141,6 +142,76 @@ ET_EXPERIMENTAL std::unique_ptr<TextLLMRunner> create_text_llm_runner(
     const std::string& method_name = "forward",
     Module::LoadMode load_mode = Module::LoadMode::MmapUseMlockIgnoreErrors);

+/**
+ * @brief Engine for multi-session text generation over one loaded Program.
+ *
+ * Loads the model's Program (weights/constants) once; create_session() builds a
+ * TextLLMRunner that reuses that Program but owns its own method/KV state. This
+ * is the correctness-first foundation for serving multiple conversations.
+ * Backend execution should be serialized by the caller until per-backend thread
+ * safety is proven (Module::execute is not assumed thread-safe). Whether extra
+ * sessions avoid duplicating packed weights is backend-dependent and reported
+ * by serving_capacity() (conservatively one).
+ */
+class ET_EXPERIMENTAL TextLLMEngine : public LLMEngine {
+ public:
+  static std::unique_ptr<TextLLMEngine> create(
+      const std::string& model_path,
+      const std::string& tokenizer_path,
+      std::optional<const std::string> data_path = std::nullopt,
+      float temperature = -1.0f,
+      const std::string& method_name = "forward",
+      Module::LoadMode load_mode = Module::LoadMode::MmapUseMlockIgnoreErrors);
+
+  // Returns a TextLLMSession (LLMSession) that reuses this engine's loaded
+  // Program (physical weight sharing is backend-dependent; see
+  // serving_capacity).
+  ::executorch::runtime::Result<std::unique_ptr<LLMSession>> create_session()
+      override;
+  // Conservative: a single physical session (no proven cross-session weight
+  // sharing). Raise on a backend proven to share packed weights.
+  LLMServingCapacity serving_capacity() const override {
+    return LLMServingCapacity{};
+  }
+  const std::unordered_map<std::string, int64_t>& metadata() const override {
+    return metadata_;
+  }
+
+  TextLLMEngine(const TextLLMEngine&) = delete;
+  TextLLMEngine& operator=(const TextLLMEngine&) = delete;
+
+ private:
+  TextLLMEngine(
+      std::unique_ptr<Module> loader_module,
+      std::shared_ptr<Program> program,
+      std::string tokenizer_path,
+      float temperature,
+      std::string method_name,
+      std::unordered_map<std::string, int64_t> metadata);
+
+  // Keeps the shared Program's DataLoader alive for the lifetime of sessions.
+  std::unique_ptr<Module> loader_module_;
+  std::shared_ptr<Program> program_;
+  std::string tokenizer_path_;
+  float temperature_;
+  std::string method_name_;
+  std::unordered_map<std::string, int64_t> metadata_;
+};
+
+namespace detail {
+// Implementation detail (not a public API): wraps a TextLLMRunner in an
+// LLMSession (the runner -> session seam). The supported entry point is
+// LLMEngine::create_session(); this exists only so TextLLMEngine can build its
+// sessions and so unit tests can drive the runner's token-step primitives
+// through the public LLMSession surface (the concrete adapter type is private).
+// Do not depend on wrapping arbitrary runners.
+//
+// @param runner A loaded TextLLMRunner; ownership transfers to the session.
+// @return std::unique_ptr<LLMSession> wrapping `runner`.
+std::unique_ptr<LLMSession> make_text_llm_session(
+    std::unique_ptr<TextLLMRunner> runner);
+} // namespace detail
+
 /**
  * @brief Creates a MultimodalRunner instance with dependency injection
  *
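Review note: the ownership model described in the TextLLMEngine doc comment (one Program loaded once, each session owning its own position and KV state) can be illustrated with a toy model. This sketch does not use ExecuTorch types: `ProgramSketch`, `SessionSketch`, and `EngineSketch` are hypothetical stand-ins, and the "KV cache" is just a vector of token ids.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a loaded, read-only Program.
struct ProgramSketch {
  std::vector<float> weights;
};

// Hypothetical session: shares the program, owns its own mutable state.
class SessionSketch {
 public:
  explicit SessionSketch(std::shared_ptr<const ProgramSketch> program)
      : program_(std::move(program)) {}

  void append(uint64_t token) { kv_cache_.push_back(token); }
  int64_t position() const { return static_cast<int64_t>(kv_cache_.size()); }
  void reset() { kv_cache_.clear(); }
  const ProgramSketch* program() const { return program_.get(); }

 private:
  std::shared_ptr<const ProgramSketch> program_;  // shared, never mutated
  std::vector<uint64_t> kv_cache_;                // private to this session
};

// Hypothetical engine: loads once, hands the same program to every session.
class EngineSketch {
 public:
  explicit EngineSketch(std::vector<float> weights)
      : program_(std::make_shared<ProgramSketch>(
            ProgramSketch{std::move(weights)})) {}

  std::unique_ptr<SessionSketch> create_session() const {
    return std::make_unique<SessionSketch>(program_);
  }

 private:
  std::shared_ptr<const ProgramSketch> program_;
};
```

In this toy model, two sessions from the same engine report the same `program()` pointer while their `position()` values advance independently. It says nothing about whether a real backend shares packed weights, which the commit leaves to serving_capacity().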
