
Commit cdcc915

Fix C++ -Werror regressions in llama runner (#19326)
Summary:

Fixes three `-Werror` diagnostics that broke the qualcomm llama runner build on `cfg:android-arm64-clang19-no-san` and disabled the following test infra targets:

- `xplat/executorch/examples/qualcomm/oss_scripts/llama:runner_lib`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:runner_lib_static`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:qnn_llama_runner`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:qnn_llama_runner_static`

Three diagnostics fixed:

1. `-Wreorder-ctor` in `runner.cpp`: `attention_sink_rope_module_` is declared as the 2nd field of `Runner<T>` (right after `module_`), but the constructor initializer list appended it last, after `tokenizer_`. Since C++ always constructs members in declaration order regardless of how the initializer list is written, the out-of-order list was misleading; moved the entry to the correct position so the list matches declaration order. Recent regression introduced in the attention-sink diff (#16574).

2. `-Woverloaded-virtual` in `lhd_token_generator.h` and `multimodal_lhd_token_generator.h`: the derived classes define a `prepare_io(std::vector<uint64_t>, std::vector<int32_t>)` overload that hides the base class virtual `prepare_io(uint64_t, int64_t)`. Added a `using TokenGenerator<T>::prepare_io;` declaration (and the equivalent for the multimodal hierarchy) so the base virtual stays in scope and the warning is silenced without changing behavior. Latent bug surfaced by the clang19 toolchain bump.

3. `-Wdelete-non-abstract-non-virtual-dtor` in `prompt_processor.h`: `PromptProcessor<T>` has virtual member functions but no virtual destructor, so deleting through `std::unique_ptr<PromptProcessor<T>>` in `Runner` was undefined behavior; the warning flags exactly this pattern. Added `virtual ~PromptProcessor() = default;`, mirroring the pattern already used in `TokenGenerator` (`token_generator.h`). This also transitively fixes `MultimodalPromptProcessor<T>`.

Reviewed By: rascani

Differential Revision: D103991803
1 parent 3ffaf27

4 files changed

Lines changed: 10 additions & 2 deletions

File tree

examples/qualcomm/oss_scripts/llama/runner/lhd_token_generator.h

Lines changed: 3 additions & 0 deletions
@@ -102,6 +102,9 @@ class LhdTokenGenerator : public TokenGenerator<T> {
       AttentionSinkRopeRunner* attention_sink_rope_runner) override;

  private:
+  // Bring base class's virtual prepare_io into scope so the overload below
+  // does not hide it (-Woverloaded-virtual).
+  using TokenGenerator<T>::prepare_io;
   /**
    * @brief Fill in I/O buffers with prompt token and position.
    * @param cur_token Current token.

examples/qualcomm/oss_scripts/llama/runner/multimodal_runner/multimodal_lhd_token_generator.h

Lines changed: 3 additions & 0 deletions
@@ -108,6 +108,9 @@ class MultimodalLhdTokenGenerator
       AttentionSinkRopeRunner* attention_sink_rope_runner) override;

  private:
+  // Bring base class's virtual prepare_io into scope so the overload below
+  // does not hide it (-Woverloaded-virtual).
+  using TokenGenerator<T>::prepare_io;
   /**
    * @brief Fill in I/O buffers with prompt token and position.
    * @param cur_token Current token.
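
For context, a minimal self-contained sketch of the name-hiding rule these `using`-declarations address (hypothetical `Base`/`Derived` names, not the real classes): an overload declared in a derived class hides every base-class function with the same name, including virtuals, unless they are explicitly re-imported. Clang's `-Woverloaded-virtual` fires on the hidden virtual; the `using`-declaration keeps it visible without changing dispatch.

```cpp
#include <cstdint>
#include <vector>

struct Base {
  virtual ~Base() = default;
  virtual void prepare_io(uint64_t token, int64_t pos) {}
};

struct Derived : Base {
  // Without this using-declaration, the overload below hides
  // Base::prepare_io(uint64_t, int64_t) and clang emits
  // -Woverloaded-virtual; with it, both overloads stay callable.
  using Base::prepare_io;
  void prepare_io(
      const std::vector<uint64_t>& tokens,
      const std::vector<int32_t>& positions) {}
};

int main() {
  Derived d;
  d.prepare_io(uint64_t{1}, int64_t{0});  // base overload, still visible
  d.prepare_io({1, 2, 3}, {0, 1, 2});     // derived batch overload
}
```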

examples/qualcomm/oss_scripts/llama/runner/prompt_processor.h

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,8 @@ class PromptProcessor {
       const std::string& method_name,
       Metadata metadata);

+  virtual ~PromptProcessor() = default;
+
   /**
    * @brief Initialize I/O tensor and allocate I/O data buffer.
    * @param buffer_manager Pointer to IMemAlloc instance; by default, it uses a
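
A minimal sketch of why the defaulted virtual destructor matters (hypothetical `Processor` names, not the real classes): deleting a derived object through a base-class pointer, which is exactly what `std::unique_ptr<PromptProcessor<T>>` does in `Runner`, is undefined behavior unless the base destructor is virtual.

```cpp
#include <memory>

struct Processor {
  // Without this virtual destructor, the delete inside ~unique_ptr below
  // would destroy only the Processor subobject: undefined behavior,
  // flagged by -Wdelete-non-abstract-non-virtual-dtor.
  virtual ~Processor() = default;
  virtual void init() {}
};

struct MultimodalProcessor : Processor {
  void init() override {}
};

int main() {
  // Correctly runs ~MultimodalProcessor, then ~Processor.
  std::unique_ptr<Processor> p = std::make_unique<MultimodalProcessor>();
}
```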

examples/qualcomm/oss_scripts/llama/runner/runner.cpp

Lines changed: 2 additions & 2 deletions
@@ -102,6 +102,7 @@ Runner<T>::Runner(
     std::unique_ptr<tokenizers::Tokenizer> tokenizer,
     std::unique_ptr<executorch::extension::Module> attention_sink_rope_module)
     : module_(std::move(module)),
+      attention_sink_rope_module_(std::move(attention_sink_rope_module)),
       ngram_(ngram),
       window_(window),
       gcap_(gcap),
@@ -111,8 +112,7 @@ Runner<T>::Runner(
       temperature_(temperature),
       eval_mode_(static_cast<EvalMode>(eval_mode)),
       shared_buffer_(shared_buffer),
-      tokenizer_(std::move(tokenizer)),
-      attention_sink_rope_module_(std::move(attention_sink_rope_module)) {
+      tokenizer_(std::move(tokenizer)) {
   stats_.reset();

   if (decoder_model_version == "llama2") {
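
A minimal sketch of the rule behind `-Wreorder-ctor` (a hypothetical `Runner` shape, not the real one): members are constructed in declaration order no matter how the initializer list is written, so a mismatched list is at best misleading and at worst hides a read of a not-yet-initialized member.

```cpp
#include <memory>
#include <utility>

struct Module {};

class Runner {
 public:
  Runner(std::unique_ptr<Module> module, std::unique_ptr<Module> rope_module)
      // Initializers must follow declaration order (module_, rope_module_,
      // count_); listing rope_module_ after count_ would trigger
      // -Wreorder-ctor because actual construction order would not match
      // the written list.
      : module_(std::move(module)),
        rope_module_(std::move(rope_module)),
        count_(0) {}

 private:
  std::unique_ptr<Module> module_;       // constructed first
  std::unique_ptr<Module> rope_module_;  // constructed second
  int count_;                            // constructed last
};

int main() {
  Runner r(std::make_unique<Module>(), std::make_unique<Module>());
}
```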
