
Modality conditional adapters #22184

Open

gabe-l-hart wants to merge 16 commits into ggml-org:master from gabe-l-hart:ModalityConditionalAdapters

Conversation

@gabe-l-hart
Collaborator

gabe-l-hart commented Apr 20, 2026

Overview

This PR introduces a new mechanism for automatic LoRA adapter toggling based on the presence of one or more modalities tied to the adapter. This is a required feature for serving modular models such as ibm-granite/granite-speech-3.3-2b and ibm-granite/granite-4.0-3b-vision, where the base LLM is preserved and modality support is added through the adapter. Without this, a modular model must be booted in either text mode or modality mode. With this change, the model can be booted once and used in either mode based on the presence of the modality.

Related Work

The existing PR (#22101) by @ReinforcedKnowledge for Granite Speech adds support for ibm-granite/granite-speech-3.3-2b. I'm not aware of any other already-supported models that use this pattern, so while that PR is still in review, there are no existing test models to verify this functionality with.

Testing

I have a temporary merge point between this branch and #22101 where I've tested the ability for granite-speech-3.3-2b to leverage its conditional adapter. With this combination (using the conversion steps in my comment here), I've tested the following scenarios:

Text Only Request

curl http://localhost:9696/chat/completions -d '{"model": "granite-speech-3.3-2B-BF16.gguf", "temperature": 0.0, "messages": [{"role": "user", "content": "Tell me a story about a developer and their dog"}]}' | jq -r ".choices[0].message.content"

ASR Request

# NOTE: Using inline python to get around curl request length limit
python -c 'import base64, requests, json
from pathlib import Path
audio = base64.b64encode(open(Path("~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav").expanduser(), "rb").read()).decode("utf-8")
print(requests.post("http://localhost:9696/chat/completions", json={
    "model": "granite-speech-3.3-2B-BF16.gguf",
    "temperature": 0.0,
    "messages": [{
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio,
                    "format": "wav"
                }
            },
            {"type": "text", "text": "can you transcribe the speech into a written format?"}
        ]
    }]
}).json()["choices"][0]["message"]["content"])'

Run without adapter -> Good Text / Bad ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696

Text Response

Once upon a time, in the bustling city of San Francisco, lived a brilliant developer named Alex. Alex was known for their exceptional skills in coding and creating innovative software solutions. However, what many didn't know was Alex's equally endearing companion, a golden retriever named Max.

Max was not just any dog; he was Alex's constant coding companion. While other developers might take breaks to socialize or exercise, Alex would often bring Max to their workspace. Max would sit patiently at Alex's feet, his tail wagging in anticipation as his owner typed away at their keyboard.

One day, Alex was working on a complex project involving machine learning algorithms. The task was daunting, and the deadlines were looming. As Alex delved deeper into the code, they found themselves stuck in a rut, unable to make progress.

Max, sensing his owner's frustration, nudged his paw against Alex's hand. With a sigh, Alex decided to take a break and spent the afternoon playing fetch with Max in the nearby park. The fresh air and exercise did wonders for Alex's mind.

Returning to the office, Alex felt refreshed and reinvigorated. They approached the problem with a new perspective, and after a few more hours of focused work, they finally cracked the code. The solution was elegant, efficient, and had never occurred to Alex when they were focused on the problem.

Word of Alex's innovative solution spread throughout the tech community. The project was a success, and Alex's reputation as a developer soared. But more importantly, Max became a symbol of the creative spark that could emerge from even the most mundane activities, like playing fetch in the park.

From then on, Max was no longer just a loyal canine companion but a co-creator, a source of inspiration, and a reminder that sometimes, the best ideas come from taking a break and allowing our minds to wander. And so, the developer and their dog continued their partnership, crafting innovative software and sharing countless adventures in the heart of San Francisco.

ASR Response

Sure, I'd be happy to transcribe the speech into written format. Here's the transcription:

---

For Timothy was a spoiled cat, and he allowed no one to interfere. Everybody waited upon him, moving their chairs even, for he was the monarch of the hearth.

The next night, Timothy's sister called him when he was still awake. "Sister," he said, "if you don't sleep, I beg you, wait until the day that will soon appear to continue the tale of the pecker."

---

This transcription maintains the original rhythm and tone of the text, preserving the poetic language and the sense of formality in the dialogue.

Run with unconditional adapter -> Bad Text / Good ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696 --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf

Text Response

(empty newline)



ASR Response

for timothy was a spoiled cat and he allowed no one to interfere everybody waited upon him moving their chairs even for he was monarch of the hearth dinarzade la nuit suivante appela sa soeur quand il en fut temps si vous ne dormez pas ma soeur lui dit-elle je vous prie en attendant le jour qui paraîtra bientôt de continuer le compte du pêcheur

Run with conditional adapter -> Good Text / Good ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696 --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf --lora-modality 0:audio

Text Response

Once upon a time, in the bustling city of San Francisco, lived a brilliant developer named Alex. Alex was known for their exceptional skills in coding and creating innovative software solutions. However, what many didn't know was Alex's equally endearing companion, a golden retriever named Max.

Max was not just any dog; he was Alex's constant coding companion. While other developers might take breaks to socialize or exercise, Alex would often bring Max to their workspace. Max would sit patiently at Alex's feet, his tail wagging in anticipation as his owner typed away at their keyboard.

One day, Alex was working on a complex project involving machine learning algorithms. The task was daunting, and the deadlines were looming. As Alex delved deeper into the code, they found themselves stuck in a rut, unable to make progress.

Max, sensing his owner's frustration, nudged his paw against Alex's hand. With a sigh, Alex decided to take a break and spent the afternoon playing fetch with Max in the nearby park. The fresh air and exercise did wonders for Alex's mind.

Returning to the office, Alex felt refreshed and reinvigorated. They approached the problem with a new perspective, and after a few more hours of focused work, they finally cracked the code. The solution was elegant, efficient, and had never occurred to Alex when they were focused on the problem.

Word of Alex's innovative solution spread throughout the tech community. The project was a success, and Alex's reputation as a developer soared. But more importantly, Max became a symbol of the creative spark that could emerge from even the most mundane activities, like playing fetch in the park.

From then on, Max was no longer just a loyal canine companion but a co-creator, a source of inspiration, and a reminder that sometimes, the best ideas come from taking a break and allowing our minds to wander. And so, the developer and their dog continued their partnership, crafting innovative software and sharing countless adventures in the heart of San Francisco.

ASR Response

for timothy was a spoiled cat and he allowed no one to interfere everybody waited upon him moving their chairs even for he was monarch of the hearth dinarzade la nuit suivante appela sa soeur quand il en fut temps si vous ne dormez pas ma soeur lui dit-elle je vous prie en attendant le jour qui paraîtra bientôt de continuer le compte du pêcheur

Additional information

Requirements

AI Usage Disclosure

For this work, I used a combination of IBM Bob and Open Code with qwen3.5:122b running in Ollama. Bob was used primarily for planning while OC+qwen3.5 was used primarily for implementation.

I annotated each commit with AI-usage: [full, draft, none] (<agent>) based on how I used my assistants (full -> unaltered agent output, draft -> edited agent output, none -> no agent usage). This is a convention I've been using to track my ownership. Every commit, regardless of agent generation, was fully reviewed and (if needed) edited before committing. I have a small tool git-ai-stats to track the breakdown of commits by agent, usage type, and lines of code.

git-ai-stats output
╔══════════════════════════════════════════════════════════╗
║           GIT AI USAGE ANALYSIS                          ║
╚══════════════════════════════════════════════════════════╝

📊 COMMITS BY AGENT

--- Aggregate ---
Commits                        |      Count
---------------------------------------------
none                           |          7
IBM Bob, OpenCode + qwen3.5:122b |          6
---------------------------------------------
TOTAL                          |         13

📊 COMMITS BY USAGE TYPE

--- Aggregate ---
Commits                        |      Count
---------------------------------------------
none                           |          7
draft                          |          3
full                           |          3
---------------------------------------------
TOTAL                          |         13

📈 LINES OF CODE BY AGENT

--- Aggregate ---
Agent                     |    Commits |  Additions |  Deletions
------------------------------------------------------------
none                      |          7 |        808 |          6
IBM Bob, OpenCode + qwen3.5:122b |          6 |        277 |          2
------------------------------------------------------------
TOTAL                     |         13 |       1085 |          8

📈 LINES OF CODE BY USAGE TYPE

--- Aggregate ---
Usage Type           |    Commits |  Additions |  Deletions
-------------------------------------------------------
none                 |          7 |        808 |          6
draft                |          3 |         93 |          2
full                 |          3 |        184 |          0
-------------------------------------------------------
TOTAL                |         13 |       1085 |          8

@gabe-l-hart
Collaborator Author

Looks like some platform-specific code in the tests. Will fix shortly.

@ggerganov
Member

Open Code with qwen3.5:122b running in Ollama

Why not use llama.cpp with OpenCode?

@gabe-l-hart
Collaborator Author

Why not use llama.cpp with OpenCode?

😁 I knew I was going to get in trouble for this! Truth be told, Ollama got baked into my fingers a long time ago before the multi-model serving ecosystem was working here. It hasn't broken yet. I am actively working on transitioning my scripting ecosystem over though (just ask @0cc4m. He's been on me about this since joining RH).

@gabe-l-hart
Collaborator Author

Looking over the discussion in #13693, it seems like what I have here is a subset of the automatic-switching solution originally proposed by @CISC that was eventually decided against. This leaves two things to consider:

  1. Is this sort of adapter swapping better handled explicitly in the user requests?
    • That seemed to be the consensus of the earlier PR
    • I would argue that for multimodality specifically, users would find it burdensome to have to enable the adapter if and only if they are presenting the modality, since this isn't a requirement of most multimodal models and would require model-specific client-side code changes.
  2. Assuming we do want this feature, I should probably be using the llama_adapter_lora::gguf_kv and common_adapter_lora_info::task_name values that exist rather than adding the new mmlora_modality_types field.
    • This would require restricting a given adapter to a single modality or extending the task_name field to be a vector

github-actions bot added the testing, examples, and server labels on Apr 20, 2026
@ngxson
Contributor

ngxson commented Apr 20, 2026

I have limited availability at the moment, so I'm just reading the discussion quickly and may have missed something. But here are my 2c:

I would argue that for multimodality specifically, users would find it burdensome to have to opt into the adapter IFF they are presenting the modality since this wouldn't be a requirement of most multimodal models and would require client-side code changes that are model-specific.

Not quite sure if I understand this correctly, but IMO we should offer a better UX by automatically loading the built-in lora (opt-in by default, as you mentioned).

The main problem is that most people are already familiar with using llama.cpp with a text model file plus an mmproj file. Imagine someone fairly new to llama.cpp wants to try your model: there is a good chance they will skip loading the lora (as they don't know what it is), get a bad result, and then assume the model is somehow broken.

Assuming we do want this feature, I should probably be using the llama_adapter_lora::gguf_kv and common_adapter_lora_info::task_name values that exist rather than adding the new mmlora_modality_types field.

Yes, it is better to reuse the existing task_name

This would require restricting a given adapter to a single modality or extending the task_name field to be a vector

Yes, I think it should be a std::set<std::string>. And even better, std::set<enum lora_task_type>, so that we can explicitly define which tasks we support in the code base (better documentation)

For the gguf field, task_name can be either a string (for backward compat) or an array of strings


API-design-wise, I think these points could be some good additions to the lora support in llama.cpp, ranging from easy to hard (a rough sketch of the first two follows the list):

  1. Add enum lora_task_type to lock-in the types of adapter
  2. Add a new API to the core library: llama_adapter_lora_get_task_type that returns the enum; this hides the raw string from end-user
  3. Upon creating mtmd_context, it searches for lora adapter(s) with the given type(s) and store the pointer-to-adapter inside the context
  4. Add mtmd_pre_decode() call to setup the lora, and mtmd_post_decode() to clean it up
  5. Extend libllama to store adapters and main model inside the same GGUF, maybe lora tensors prefixed by lora.{task_name}.*, then having an API like llama_model_get_adapter_lora(enum lora_task_type) to retrieve it (returns nullptr if it doesn't exist)
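
For illustration, here is a rough sketch of what points 1 and 2 above might look like; the enum names, values, and the exact signature are assumptions, not an agreed-upon API:

```cpp
// Point 1: lock in the supported adapter task types with an enum
// (names and values are illustrative only).
enum lora_task_type {
    LORA_TASK_TYPE_NONE = 0,
    LORA_TASK_TYPE_IMAGE,
    LORA_TASK_TYPE_AUDIO,
};

// Point 2: hypothetical core-library accessor that hides the raw task_name
// string from the caller. If an adapter can carry several task types (the
// std::set idea above), this would need a count/index or set-style variant.
enum lora_task_type llama_adapter_lora_get_task_type(
        const struct llama_adapter_lora * adapter);
```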

@gabe-l-hart
Collaborator Author

Thanks for the thoughts @ngxson! I'll get the basic refactoring to reuse task_name done soon and explore the use of an enum. My only push-back on that one would be that some models may define their own task types with their own activation sequences (eg an adapter to activate a new builtin tool or something), so using an enum to make it explicit would imply that task types are a static attribute of the codebase rather than an attribute of the model. I'd need to think a little more on this though. The existing aLoRA implementation would already support this kind of user-defined task adapter that only activates when needed.

@ngxson
Contributor

ngxson commented Apr 21, 2026

My only push-back on that one would be that some models may define their own task types with their own activation sequences (eg an adapter to activate a new builtin tool or something), so using an enum to make it explicit would imply that task types are a static attribute of the codebase rather than an attribute of the model.

I don't really see why an arbitrary task type attribute is needed in this case. The activation sequence is a separate mechanism from the task type; they can live side by side and one doesn't interfere with the other (but of course in practice, a given adapter should only use one of the two).

Can you give a concrete example of a case where this can be the problem?

@ngxson
Contributor

ngxson commented Apr 21, 2026

Also just to note that there might be an extra case where the model has a given task by default, and the task can be changed via a lora. For example, an embedding model can be TYPE_INDEX and an adapter can change it to TYPE_QUERY. But maybe another model will do the reverse (i.e. being QUERY by default, but lora is needed for INDEX)

@gabe-l-hart
Collaborator Author

Can you give a concrete example of a case where this can be the problem?

I think you're probably right and this probably can be an enum, but I've been too scatterbrained since yesterday to think it through deeply. I'm hoping to get some schedule space tomorrow or Thursday, so I'll report thoughts then.

@gabe-l-hart
Collaborator Author

@CISC Based on #13693, it looks like task_name and prompt_prefix are only used as metadata for the adapter that is presented back to the user in GET /lora-adapters. On this branch, I've added the ability for a given adapter to support multiple modalities, so it may make sense to extend these singular values to vectors, but that could be an API breaking change for the /lora-adapters response object. Options I could see working:

  1. For now, only support a single modality and don't extend task_name/prompt_prefix
  2. Extend to support multiple task_name values, but in the response format to /lora-adapters, keep the current format if (and only if) there is a single value. Return a list representation if multiple values are present (API preserving, but complex response semantics; see the sketch after this list)
  3. Extend to support multiple task_name values and always return a list view (API breaking, but simple response semantics)
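
For illustration, a minimal sketch of how option 2 might serialize the field in the /lora-adapters response, assuming the nlohmann `json` alias used elsewhere in the server code; the helper name is hypothetical:

```cpp
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

using json = nlohmann::ordered_json; // assumed alias, as in the server sources

// Hypothetical serialization for option 2: keep the current scalar form when
// the adapter has exactly one task name, fall back to an array otherwise.
static json serialize_task_name(const std::vector<std::string> & task_names) {
    if (task_names.size() == 1) {
        return json(task_names.front()); // "audio" -- matches the existing format
    }
    return json(task_names);             // ["image", "audio"] -- list representation
}
```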

@gabe-l-hart
Collaborator Author

Can you give a concrete example of a case where this can be the problem?

I think you're probably right and this probably can be an enum, but I've been too scatterbrained since yesterday to think it through deeply. I'm hoping to get some schedule space tomorrow or Thursday, so I'll report thoughts then.

@ngxson @CISC Ok, thinking a little more on this, my hypothetical here is using a LoRA to implement a domain-specific task (style_classification, web_search, my_personal_tool_call). This would be a property of the adapter that would trigger the model to output a specific format that the client application would parse on its own. I think this lines up with the original intent of task_name and prompt_prefix which would allow arbitrary adapter-specific task definitions and let the user know how to stimulate the logic with the adapter enabled.

For multimodal adapters, the space would be significantly less flexible. As currently implemented on this branch, the MM modalities are mapped to the internal enum in mtmd since that's a strict list. If I merge this logic into task_name/prompt_prefix, it would conflate the arbitrary adapter-specific task logic with the code-predefined tasks for multimodal processing. This could easily be handled by a has_modality_type helper that disregards task names that don't match a predefined constant string for each modality, so implementation-wise this wouldn't be too hard, but it would make for slightly muddy semantics around when auto-enablement is triggered.
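
To make that concrete, a rough sketch of how such a check could ignore non-modality task names; the helper name and field shape are not final, and the enum constants are assumed to come from tools/mtmd/mtmd.h:

```cpp
#include <string>
#include <vector>

// Sketch only: auto-enablement looks at the adapter's task names, but only the
// reserved modality names ("image", "audio") participate; arbitrary
// adapter-specific task names ("web_search", "style_classification", ...) are
// simply ignored here and keep their existing meaning.
static bool has_modality_type(const std::vector<std::string> & task_names,
                              enum mtmd_input_chunk_type wanted) {
    for (const auto & name : task_names) {
        if (name == "image" && wanted == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
            return true;
        }
        if (name == "audio" && wanted == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
            return true;
        }
    }
    return false;
}
```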

@CISC
Member

CISC commented Apr 22, 2026

TBH I don't know if anyone has ever used task_name in /lora-adapters for anything other than a visual indicator; changing it to occasionally (or even always) be an array may not be a big breaking change.

@gabe-l-hart
Collaborator Author

Also, I meant to upload this with the AI disclosure. This is the initial prompt and plan doc I created with Bob that was used to power the initial implementation.

modality-adapters-prompt.md

We're going to implement a new feature here in llama.cpp that allows lora adapters to be conditionally enabled based on the presence of multimodal data for multimodal models. The goal is for the user to serve a model with an adapter and indicate that the adapter is tied to the presence of data for a certain modality (eg audio or image).

The following is a list of key code pointers to get things started:

  • The core LLM code lives in ./src
  • All multimodal support code lives in ./tools/mtmd
  • mtmd is used in tools/cli and tools/server as the CLI and server executables, so this is where arguments would be given to indicate the adapter:modality mapping
    • All argument parsing is implemented in common/arg.*, so any new arguments should be added there, then exposed in cli and server
  • The implementation of lora adapter structs and methods lives in ./src/llama-adapter.*
  • Adapters are stored for a given model in the llama_model struct in the loras attribute (src/llama-model.h:562)
  • Individual adapters are mapped to the model in llama_context (src/llama-context.cpp:1117) based on the stored scale value
  • Lora updating is plumbed from the tool layer through to the src layer as follows:
    • tools/server/server-context.cpp:2703: Call common_set_adapter_lora with the lora for the given server slot
    • common/common.cpp:1408: In common_set_adapter_lora, call llama_set_adapters_lora (llama_* is the public API prefix for the core library code)
    • src/llama-context.cpp:3179: Call ctx->set_adapters_lora on the passed in llama_context with the passed in adapters and corresponding scales

There is a very similar feature already implemented for alora (activated LoRA) where adapters are toggled on/off based on the presence of an activation sequence. We could consider reusing that infrastructure to enable/disable the modality-specific adapters. The main difference would be how the activation sequence is detected (in this case, it would be the modality media marker).

Please investigate this problem and construct a robust plan for how to implement and test this feature!

modality-lora-implementation-plan.md

Modality-Based LoRA Adapter Activation - Implementation Plan

Overview

This document outlines the implementation plan for enabling conditional LoRA adapter activation based on the presence of multimodal data (image/audio) in requests to multimodal models. This feature is called MMLoRA (Multi-Modal LoRA).

Design Decisions

Core Semantics

  • Activation Scope: Adapter applies to entire request when modality is present (no batch splitting)
  • Cache Behavior: No cache clearing needed when toggling adapters between requests
  • Mutual Exclusivity: MMLoRA and aLoRA are mutually exclusive (error at load time if both present)
  • Scale Behavior: When modality absent, force scale = 0.0f; when present, respect user-configured scale
  • Enabled State: Implied by non-empty mmlora_modality_types vector (no separate boolean flag)

Configuration

  • Storage: Runtime configuration only (no GGUF metadata)
  • CLI Syntax: Index-based for explicit control: --lora-modality 0:image,audio
  • Multiple Modalities: Comma-separated list with OR logic (adapter activates if ANY modality present)
  • Error Handling: Error at startup if modality type not supported by model

Technical Details

  • Modality Type: Use existing enum mtmd_input_chunk_type from mtmd.h:56
  • String Mapping: Extract string↔enum conversion to mtmd-helper.* functions
  • Testing: Both synthetic test adapters and documentation for real adapters
  • Documentation: API reference + examples

Background

Current aLoRA Implementation

The existing activated LoRA (aLoRA) feature enables/disables adapters based on detecting specific token sequences:

  • Adapters store alora_invocation_tokens vector in GGUF metadata
  • During slot processing, system scans backwards through prompt tokens to find invocation sequence
  • If found, adapter remains enabled; otherwise disabled
  • Batch is split at invocation point - adapter disabled before, enabled after
  • Implementation in tools/server/server-context.cpp:1130-1175

Multimodal Architecture

  • Multimodal support lives in tools/mtmd/
  • mtmd_tokenize() processes text + media into chunks with type markers:
    • MTMD_INPUT_CHUNK_TYPE_TEXT (value: 0)
    • MTMD_INPUT_CHUNK_TYPE_IMAGE (value: 1)
    • MTMD_INPUT_CHUNK_TYPE_AUDIO (value: 2)
  • Server tracks media chunks via map_idx_to_media in server_tokens struct
  • Media chunks are encoded separately via mtmd_encode_chunk()

Implementation Plan

Phase 1: Helper Functions for Modality String Mapping

1.1 Add String↔Enum Conversion Functions

File: tools/mtmd/mtmd-helper.h

Add declarations:

// Convert modality type enum to string
const char * mtmd_modality_type_to_str(enum mtmd_input_chunk_type type);

// Convert string to modality type enum
// Returns MTMD_INPUT_CHUNK_TYPE_TEXT on invalid input
enum mtmd_input_chunk_type mtmd_modality_type_from_str(const char * str);

// Validate if string is a valid modality type (image or audio)
bool mtmd_is_valid_modality_str(const char * str);

File: tools/mtmd/mtmd-helper.cpp

Add implementations:

const char * mtmd_modality_type_to_str(enum mtmd_input_chunk_type type) {
    switch (type) {
        case MTMD_INPUT_CHUNK_TYPE_IMAGE: return "image";
        case MTMD_INPUT_CHUNK_TYPE_AUDIO: return "audio";
        case MTMD_INPUT_CHUNK_TYPE_TEXT:  return "text";
        default: return "unknown";
    }
}

enum mtmd_input_chunk_type mtmd_modality_type_from_str(const char * str) {
    if (strcmp(str, "image") == 0) {
        return MTMD_INPUT_CHUNK_TYPE_IMAGE;
    } else if (strcmp(str, "audio") == 0) {
        return MTMD_INPUT_CHUNK_TYPE_AUDIO;
    }
    return MTMD_INPUT_CHUNK_TYPE_TEXT; // default/invalid
}

bool mtmd_is_valid_modality_str(const char * str) {
    return strcmp(str, "image") == 0 || strcmp(str, "audio") == 0;
}

Phase 2: Data Structure Changes

2.1 Extend common_adapter_lora_info

File: common/common.h

struct common_adapter_lora_info {
    std::string path;
    float scale = 1.0f;
    llama_adapter_lora * ptr = nullptr;
    
    // Multi-Modal LoRA activation (MMLoRA)
    // Empty vector = not an MMLoRA adapter (always active)
    // Non-empty = MMLoRA adapter (activates if ANY modality present - OR logic)
    std::vector<enum mtmd_input_chunk_type> mmlora_modality_types;
    
    // Helper to check if this is an MMLoRA adapter
    bool is_mmlora() const { return !mmlora_modality_types.empty(); }
    
    // ... rest of struct ...
};

Phase 3: Argument Parsing

3.1 Add CLI Argument Handler

File: common/arg.cpp

Add new argument:

common_arg(
    {"--lora-modality"},
    "INDEX:MODALITY[,MODALITY...]",
    "MODALITY",
    "Bind LoRA adapter to modality type(s). Adapter activates when ANY specified modality is present.\n"
    "INDEX is the 0-based adapter index (order of --lora arguments).\n"
    "MODALITY can be: image, audio (comma-separated for multiple).\n"
    "Example: --lora-modality 0:image,audio",
    [](common_params & params, const std::string & value) {
        // Parse "INDEX:MODALITY[,MODALITY...]"
        size_t colon_pos = value.find(':');
        if (colon_pos == std::string::npos) {
            throw std::invalid_argument("Invalid format for --lora-modality. Expected INDEX:MODALITY");
        }
        
        std::string index_str = value.substr(0, colon_pos);
        std::string modalities_str = value.substr(colon_pos + 1);
        
        int adapter_idx = std::stoi(index_str);
        if (adapter_idx < 0 || adapter_idx >= (int)params.lora_adapters.size()) {
            throw std::invalid_argument("Invalid adapter index: " + index_str);
        }
        
        // Parse comma-separated modalities
        std::vector<std::string> modality_strs;
        size_t start = 0;
        size_t comma_pos;
        while ((comma_pos = modalities_str.find(',', start)) != std::string::npos) {
            modality_strs.push_back(modalities_str.substr(start, comma_pos - start));
            start = comma_pos + 1;
        }
        modality_strs.push_back(modalities_str.substr(start));
        
        // Validate and convert to enums
        auto & lora_info = params.lora_adapters[adapter_idx];
        lora_info.mmlora_modality_types.clear();
        
        for (const auto & mod_str : modality_strs) {
            // Validate the string
            if (!mtmd_is_valid_modality_str(mod_str.c_str())) {
                throw std::invalid_argument("Invalid modality type: " + mod_str + " (must be 'image' or 'audio')");
            }
            enum mtmd_input_chunk_type type = mtmd_modality_type_from_str(mod_str.c_str());
            lora_info.mmlora_modality_types.push_back(type);
        }
    }
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN})

Phase 4: Adapter Loading Validation

4.1 Validate MMLoRA/aLoRA Mutual Exclusivity

File: common/common.cpp in common_init_from_params()

After loading adapters (~line 1330):

// Validate MMLoRA configuration
for (size_t i = 0; i < params.lora_adapters.size(); ++i) {
    auto & lora_info = params.lora_adapters[i];
    
    if (lora_info.is_mmlora()) {
        // Check for aLoRA conflict
        const uint64_t n_alora_tokens = llama_adapter_get_alora_n_invocation_tokens(lora_info.ptr);
        if (n_alora_tokens > 0) {
            LOG_ERR("Adapter %zu cannot be both MMLoRA and aLoRA\n", i);
            return nullptr;
        }
        
        // Validate modality types are supported by model
        const bool supports_vision = mtmd_support_vision(ctx_vision.get());
        const bool supports_audio = mtmd_support_audio(ctx_vision.get());
        
        for (auto modality_type : lora_info.mmlora_modality_types) {
            if (modality_type == MTMD_INPUT_CHUNK_TYPE_IMAGE && !supports_vision) {
                LOG_ERR("Adapter %zu requires image modality but model does not support vision\n", i);
                return nullptr;
            }
            if (modality_type == MTMD_INPUT_CHUNK_TYPE_AUDIO && !supports_audio) {
                LOG_ERR("Adapter %zu requires audio modality but model does not support audio\n", i);
                return nullptr;
            }
        }
    }
}

Phase 5: Server Integration

5.1 Add Modality Detection Helper

File: tools/server/server-context.cpp

Add before process_token():

// Check if slot has media chunks of any of the specified types
static bool slot_has_any_modality(
        const server_slot & slot,
        const std::vector<enum mtmd_input_chunk_type> & modality_types) {
    if (!slot.task || !slot.task->tokens.has_mtmd) {
        return false;
    }
    
    for (const auto & [idx, chunk] : slot.task->tokens.map_idx_to_media) {
        enum mtmd_input_chunk_type chunk_type = mtmd_input_chunk_get_type(chunk.get());
        for (auto target_type : modality_types) {
            if (chunk_type == target_type) {
                return true;
            }
        }
    }
    return false;
}

5.2 Apply MMLoRA Activation Logic

File: tools/server/server-context.cpp in launch_slot_with_task()

After aLoRA detection logic (~line 1175), add:

// Handle Multi-Modal LoRA (MMLoRA) activation
// This must happen BEFORE aLoRA logic to ensure proper precedence
for (size_t i = 0; i < slot.lora.size(); ++i) {
    auto & lora_info = slot.lora[i];
    
    if (lora_info.is_mmlora()) {
        const bool has_modality = slot_has_any_modality(slot, lora_info.mmlora_modality_types);
        
        if (!has_modality) {
            SLT_DBG(slot, "MMLoRA %zu requires modality but not found, deactivating\n", i);
            lora_info.scale = 0.0f;
        } else {
            SLT_DBG(slot, "MMLoRA %zu activated (modality present)\n", i);
            // Leave scale as-is (user-configured or default)
        }
    }
}

5.3 Update Server Task Structures

File: tools/server/server-task.h

Add to lora info in task results:

struct lora {
    common_adapter_lora_info info;
    std::string alora_invocation_string;
    llama_tokens alora_invocation_tokens;
    std::vector<std::string> mmlora_modality_types; // ["image"], ["audio"], or ["image", "audio"]
    
    // Helper to check if this is an MMLoRA adapter
    bool is_mmlora() const { return !mmlora_modality_types.empty(); }
};

File: tools/server/server-task.cpp

Update JSON serialization (~line 1920):

if (lora.info.is_mmlora()) {
    entry["mmlora_enabled"] = true;
    json modalities = json::array();
    for (auto type : lora.info.mmlora_modality_types) {
        modalities.push_back(mtmd_modality_type_to_str(type));
    }
    entry["mmlora_modality_types"] = modalities;
}

Phase 6: Testing Strategy

6.1 Unit Tests

Create tests/test-mmlora.cpp:

// Test 1: Modality detection with single modality
// Test 2: Modality detection with multiple modalities (OR logic)
// Test 3: Adapter activation when modality present
// Test 4: Adapter deactivation when modality absent
// Test 5: Multiple adapters with different modalities
// Test 6: Multiple adapters with same modality
// Test 7: Error on MMLoRA + aLoRA conflict
// Test 8: Error on unsupported modality type
// Test 9: String to enum conversion
// Test 10: CLI argument parsing
// Test 11: is_mmlora() helper function

6.2 Integration Tests

Create tests/test-mmlora-integration.sh:

#!/bin/bash
set -e

# Test 1: Image-only adapter with image input (should activate)
echo "Test 1: Image adapter with image..."
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora image-adapter.gguf --lora-modality 0:image &
SERVER_PID=$!
sleep 2
# Send request with image, verify adapter active
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":[{"type":"image_url","image_url":{"url":"test.jpg"}},{"type":"text","text":"describe"}]}]}'
kill $SERVER_PID

# Test 2: Image-only adapter without image input (should not activate)
echo "Test 2: Image adapter without image..."
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora image-adapter.gguf --lora-modality 0:image &
SERVER_PID=$!
sleep 2
# Send text-only request, verify adapter inactive
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
kill $SERVER_PID

# Test 3: Multi-modality adapter (image,audio) with image
echo "Test 3: Multi-modality adapter with image..."
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora multi-adapter.gguf --lora-modality 0:image,audio &
SERVER_PID=$!
sleep 2
# Send request with image, verify adapter active
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":[{"type":"image_url","image_url":{"url":"test.jpg"}},{"type":"text","text":"describe"}]}]}'
kill $SERVER_PID

# Test 4: Multi-modality adapter (image,audio) with audio
echo "Test 4: Multi-modality adapter with audio..."
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora multi-adapter.gguf --lora-modality 0:image,audio &
SERVER_PID=$!
sleep 2
# Send request with audio, verify adapter active
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":[{"type":"audio_url","audio_url":{"url":"test.mp3"}},{"type":"text","text":"transcribe"}]}]}'
kill $SERVER_PID

# Test 5: Multiple adapters with different modalities
echo "Test 5: Multiple adapters..."
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora image-adapter.gguf --lora audio-adapter.gguf \
  --lora-modality 0:image --lora-modality 1:audio &
SERVER_PID=$!
sleep 2
# Send request with image, verify only image adapter active
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":[{"type":"image_url","image_url":{"url":"test.jpg"}},{"type":"text","text":"describe"}]}]}'
kill $SERVER_PID

# Test 6: Error on unsupported modality
echo "Test 6: Error on unsupported modality..."
llama-server --model text-only-model.gguf \
  --lora image-adapter.gguf --lora-modality 0:image 2>&1 | grep -q "does not support vision"

echo "All tests passed!"

6.3 Synthetic Test Adapters

Create minimal test adapters for unit testing:

# scripts/create-test-mmlora.py
import torch
from safetensors.torch import save_file

# Create minimal LoRA adapter for testing
# Just needs valid structure, doesn't need to be trained
lora_a = torch.randn(8, 4096)  # rank 8
lora_b = torch.randn(4096, 8)

tensors = {
    "base_model.model.layers.0.self_attn.q_proj.lora_A.weight": lora_a,
    "base_model.model.layers.0.self_attn.q_proj.lora_B.weight": lora_b,
}

save_file(tensors, "test-mmlora-adapter.safetensors")

# Convert to GGUF (no special metadata needed for MMLoRA)
# ./convert_lora_to_gguf.py test-mmlora-adapter.safetensors

6.4 Manual Testing Checklist

  • Load model with MMLoRA adapter via CLI
  • Send request with matching modality → verify adapter activates (check logs)
  • Send request without matching modality → verify adapter deactivates (check logs)
  • Test with multiple adapters (different modalities)
  • Test with multi-modality adapter (image,audio)
  • Test error on MMLoRA + aLoRA conflict
  • Test error on unsupported modality type
  • Verify server /v1/lora endpoint shows MMLoRA info
  • Test invalid CLI arguments (wrong index, invalid modality)
  • Test KV cache behavior across requests with/without modality

Phase 7: Documentation

7.1 User Documentation

Create docs/mmlora.md:

# Multi-Modal LoRA Adapter Activation (MMLoRA)

## Overview

Multi-Modal LoRA (MMLoRA) allows you to conditionally activate LoRA adapters based on the presence of specific input modalities (image or audio) in multimodal requests.

## Use Cases

- **Image-specific style adapters**: Apply artistic style only when processing images
- **Audio-specific adapters**: Apply speech recognition improvements only for audio
- **Modality-aware fine-tuning**: Different behavior for text-only vs multimodal inputs

## Usage

### CLI

```bash
# Single adapter, single modality
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora path/to/adapter.gguf --lora-modality 0:image

# Single adapter, multiple modalities (OR logic)
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora path/to/adapter.gguf --lora-modality 0:image,audio

# Multiple adapters with different modalities
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora path/to/img-adapter.gguf \
  --lora path/to/aud-adapter.gguf \
  --lora-modality 0:image --lora-modality 1:audio
```

## Behavior

- **Modality present**: Adapter activates for entire request
- **Modality absent**: Adapter deactivated (scale = 0.0)
- **Multiple modalities**: Adapter activates if ANY specified modality present (OR logic)
- **No cache clearing**: Toggling between requests does not clear KV cache

## Limitations

- MMLoRA and aLoRA are mutually exclusive (cannot use both on same adapter)
- Modality binding is set at server startup (not per-request)
- Adapter applies to entire request (not just modality tokens)

## Examples

See `examples/mmlora/` for complete examples.

## Troubleshooting

**Error: "Adapter requires image modality but model does not support vision"**

- Solution: Ensure model has vision support (`--mmproj`) when using image-based adapters

**Adapter not activating**

- Check server logs for "MMLoRA X activated" messages
- Verify modality type matches input (image vs audio)
- Ensure adapter index is correct (0-based)

#### 7.2 API Documentation
Update `tools/server/README.md`:

Add section:
### LoRA Adapters with Modality Activation

The server supports Multi-Modal LoRA activation (MMLoRA), where adapters are conditionally enabled based on input modality:

```bash
llama-server --model model.gguf --mmproj mmproj.gguf \
  --lora image-adapter.gguf --lora-modality 0:image
```

When a request contains an image, the adapter activates. For text-only requests, it remains disabled.

GET /v1/lora

Returns information about loaded adapters, including MMLoRA configuration:

{
  "loras": [
    {
      "id": 0,
      "path": "image-adapter.gguf",
      "scale": 1.0,
      "mmlora_enabled": true,
      "mmlora_modality_types": ["image"]
    }
  ]
}

## Implementation Order

1. **Phase 1**: Helper functions (1 hour)
2. **Phase 2**: Data structures (1 hour)
3. **Phase 3**: Argument parsing (2 hours)
4. **Phase 4**: Adapter validation (1 hour)
5. **Phase 5**: Server integration (2 hours)
6. **Phase 6**: Testing (4-6 hours)
7. **Phase 7**: Documentation (2 hours)

**Total Estimated Time**: 13-15 hours

## Compatibility Considerations

### Backward Compatibility
- Existing adapters without MMLoRA configuration work unchanged
- No breaking changes to existing APIs
- New CLI arguments are optional

### Forward Compatibility
- Design allows future modality types (video, 3D, etc.)
- Extensible to other activation conditions
- String-based modality names allow additions without enum changes

## Edge Cases & Error Handling

1. **Adapter without MMLoRA config**: Treated as regular LoRA (always active)
2. **Invalid modality type**: Error at startup with clear message
3. **MMLoRA + aLoRA on same adapter**: Error at load time (mutually exclusive)
4. **Modality not supported by model**: Error at startup
5. **Multiple adapters for same modality**: All activate (user responsibility)
6. **Invalid adapter index**: Error during argument parsing
7. **Empty modality list**: Error during argument parsing

## Performance Considerations

- Modality detection is O(n) where n = number of media chunks (typically 1-3)
- Detection happens once per request during slot launch
- No impact on token processing performance
- Minimal memory overhead (vector of enums per adapter)
- No batch splitting overhead (unlike aLoRA)

## Security Considerations

- Validate modality type strings during argument parsing
- Validate adapter indices to prevent out-of-bounds access
- Sanitize error messages to avoid information leakage
- No user-controlled GGUF metadata (runtime config only)

## Future Enhancements

1. **Per-request configuration**: Allow modality binding changes via API
2. **AND logic**: Require multiple modalities simultaneously
3. **NOT logic**: Activate only when modality absent
4. **Per-chunk activation**: Different adapters for different media chunks
5. **Modality metadata**: Pass resolution, duration, etc. to adapters
6. **Auto-detection**: Infer modality binding from adapter metadata

## References

- aLoRA implementation: `tools/server/server-context.cpp:1130-1175`
- Multimodal tokenization: `tools/mtmd/mtmd.cpp:927-936`
- Adapter structures: `src/llama-adapter.h`
- Server tokens: `tools/server/server-common.h:120-150`
- Modality enum: `tools/mtmd/mtmd.h:56`


@gabe-l-hart
Collaborator Author

TBH I don't know if anyone has ever used task_name in /lora-adapters for anything other than a visual indicator; changing it to occasionally (or even always) be an array may not be a big breaking change.

Makes sense. The cloud engineer in me is still skittish about anything that would cause unexpected format changes requiring client-side code changes, but I don't think we're trying to apply strict semantic versioning w.r.t. the server's API, so this may not be part of the project's API contract.

@ngxson
Contributor

ngxson commented Apr 22, 2026

Ok, thinking a little more on this, my hypothetical here is using a LoRA to implement a domain-specific task (style_classification, web_search, my_personal_tool_call). This would be a property of the adapter that would trigger the model to output a specific format that the client application would parse on its own. I think this lines up with the original intent of task_name and prompt_prefix which would allow arbitrary adapter-specific task definitions and let the user know how to stimulate the logic with the adapter enabled.

I could be wrong here, but I feel like the "domain-specific task" that you mentioned earlier is a subset of "activation sequence", i.e. the lora gets activated via a sequence of tokens, e.g. <|tool_name|>style_classification

IMO a 2-tier task type / subtype system could be a bit more flexible and easier to document. The higher level "type" specifies which code path should be used for switching the lora, and the subtype is used by that specific code path (sketched below).

For ex:

  • type MULTIMODAL, subtype audio / vision
  • type ALORA, subtype style_classification, web_search, my_personal_tool_call

(Similar idea to tokenizer model vs pre-tokenizer in llama.cpp code base)
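
For illustration, a minimal sketch of how such a 2-tier scheme could be represented; the names are hypothetical and not an existing llama.cpp structure:

```cpp
#include <string>

// Hypothetical 2-tier representation: "type" picks the code path used to
// switch the lora on/off, and the free-form "subtype" is interpreted by
// that code path.
enum lora_switch_type {
    LORA_SWITCH_TYPE_NONE = 0,   // always-on adapter, no auto-switching
    LORA_SWITCH_TYPE_MULTIMODAL, // subtype: "audio" / "vision"
    LORA_SWITCH_TYPE_ALORA,      // subtype: "style_classification", "web_search", ...
};

struct lora_task_info {
    enum lora_switch_type type = LORA_SWITCH_TYPE_NONE;
    std::string           subtype;
};
```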

In common, we store the modalities as strings which will be mapped to enums
once handled in mtmd.

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Since there can be invalid input from user requests, this will be the
default/invalid value for anything that can't be processed.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…y chunks

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This allows the lora to be toggled on/off without losing the value of scale
that may have been set by the user intentionally.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…e of modality tokens

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: temporary/GraniteVisionModular
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
All three of task_name, prompt_prefix, and mmlora_modality_types are now in
a common mapping from task name to prompt prefix (task_prompt_prefixes).
This mapping is held string:string to allow for arbitrary task definitions,
but individual task name strings that match modalities will be mapped to
the mtmd enums for auto enablement.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ared fields

The logic for auto-enabling/disabling modality adapters now relies on the
presence of a task name that maps to one of the well-known names for a
modality. Other task names will be ignored.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This helper was never used, so no need to keep it just to test it

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart force-pushed the ModalityConditionalAdapters branch from bd85365 to f6d44b3 on April 22, 2026 at 18:44
@gabe-l-hart
Collaborator Author

IMO a 2-tier task type / subtype system could be a bit more flexible and easier to document.

Interesting, this sounds like a good idea. I just pushed a WIP change I had that merges the task_name/prompt_prefix fields, so I haven't investigated this idea of a 2-tier system yet. I'll most likely look into it tomorrow.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart
Collaborator Author

One other point of discussion that I'd be curious to get input on: I think there may be some model-specific variability in the logic of when the adapter should be active vs not.

As a concrete example, the current transformers implementation for granite-speech-3.3-2b will toggle the lora on ANY time there is audio in the sequence, even if that audio has already been processed and cached. The result is that after the model has done a transcription, it will always use the adapter, even for follow-up questions unless explicitly disabled. That matches the implementation I have here. When testing this, it results in the model always responding with the transcribed text, regardless of the prompt. If I manually disable the adapter (kill the server, restart without the adapter), then continue the conversation, the model happily performs text-only turns.

This makes me think that a better toggle for this model would be to key off the presence of the modality in the current turn (the un-cached portion of the sequence) rather than the full sequence. This brings up a meta-question: is this kind of toggling logic something we expect to be model/adapter-specific (i.e., some combos would do best leaving the adapter on, while others would do best turning it off for non-modality turns)?
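
As a strawman for that turn-scoped behavior, here is a sketch of how the slot_has_any_modality() helper from the plan earlier in this thread could be restricted to the un-cached portion of the prompt; the n_cached parameter and the assumption that map_idx_to_media is keyed by token position are illustrative only, not how the server is actually structured:

```cpp
// Sketch only: same OR-logic check as slot_has_any_modality(), but media that
// falls entirely within the already-cached prefix is ignored, so the adapter
// is only auto-enabled for turns that actually introduce new modality input.
static bool slot_has_any_modality_in_new_tokens(
        const server_slot & slot,
        const std::vector<enum mtmd_input_chunk_type> & modality_types,
        size_t n_cached) {
    if (!slot.task || !slot.task->tokens.has_mtmd) {
        return false;
    }
    for (const auto & [idx, chunk] : slot.task->tokens.map_idx_to_media) {
        if (idx < n_cached) {
            continue; // media from a previous, already-processed turn
        }
        const enum mtmd_input_chunk_type chunk_type = mtmd_input_chunk_get_type(chunk.get());
        for (const auto target_type : modality_types) {
            if (chunk_type == target_type) {
                return true;
            }
        }
    }
    return false;
}
```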

@ngxson
Contributor

ngxson commented Apr 24, 2026

IMO it's more intuitive the way phi-4-multimodal does it: they apply the lora only when processing the vision/audio input, and deactivate the lora adapters when prefilling or generating text. We should target supporting only a simple mechanism like this.

More complex logic like turn-based activation requires patching different parts of the code base (i.e. the chat parser, server code, etc.), which I think will become a maintenance burden in the long term. Not to mention that there are quite a few trade-offs when using lora in more complex ways, like no KV cache reuse or no batching.

In my experience working with llama.cpp, it's generally better to only implement logic that is used by at least 2 models (from 2 different makers). This way, we make sure that the feature may be widely adopted / become mature, and the effort spent on long-term maintenance can be justified.

@gabe-l-hart
Collaborator Author

The more I'm getting into how these models were intended to be used, the more I'm starting to agree that this type of auto-switching may not be the right approach. I have a deep dive with the team on Wed this week, so I'll try to gain as much insight into the intended usage patterns as I can then. It may call for abandoning this PR if it would be a one-off for just this model.
