Modality conditional adapters #22184
Conversation
Looks like some platform-specific code in the tests. Will fix shortly.
Why not use llama.cpp with OpenCode?
😁 I knew I was going to get in trouble for this! Truth be told, Ollama got baked into my fingers a long time ago before the multi-model serving ecosystem was working here. It hasn't broken yet. I am actively working on transitioning my scripting ecosystem over though (just ask
Looking over the discussion in #13693, it seems like what I have here is a subset of the automatic-switching solution originally proposed by @CISC that was eventually decided against. It seems like this leaves two things to consider:
I'm a bit low on availability so I just read the discussion quickly; I may have missed something. But here are my 2c:
Not quite sure if I understand this correctly, but IMO we should offer a better UX by automatically loading the built-in lora (opt-in default, as you mentioned). The main problem is that most people are already familiar with using llama with a text model file plus a mmproj file. Imagine someone pretty new to llama.cpp wants to try your model: there is a good chance they will skip loading the lora (as they don't know what it is), get a bad result, then assume the model is broken somehow.
Yes, it is better to reuse the existing task_name
Yes, I think it should be a

For the gguf field, API-design-wise, I think these points could be some good additions to the lora support in llama.cpp, ranging from easy to hard:
Thanks for the thoughts @ngxson! I'll get the basic refactoring to reuse `task_name` done.
I don't really see why an arbitrary task type attribute is needed in this case. The activation sequence is a separate mechanism from the task type; they can live side-by-side and one doesn't interfere with the other (but ofc in reality, a given adapter should only use one of the two). Can you give a concrete example of a case where this would be a problem?
Also just to note that there might be an extra case where the model has a given task by default, and the task can be changed via a lora. For example, an embedding model can be TYPE_INDEX and an adapter can change it to TYPE_QUERY. But maybe another model will do the reverse (i.e. being QUERY by default, but a lora is needed for INDEX).
I think you're probably right and this probably can be an enum, but I've been too scatterbrained since yesterday to think it through deeply. I'm hoping to get some schedule space tomorrow or Thursday, so I'll report thoughts then.
@CISC Based on #13693, it looks like
@ngxson @CISC Ok, thinking a little more on this, my hypothetical here is using a LoRA to implement a domain-specific task (

For multimodal adapters, the space would be significantly less flexible. As currently implemented on this branch, the MM modalities are mapped to the internal enum in `mtmd`.
TBH I don't know if anyone ever used
Also, I meant to upload this with the AI disclosure. This is the initial prompt and plan doc I created with Bob that was used to power the initial implementation.

modality-adapters-prompt.md

We're going to implement a new feature here in `llama.cpp`.

The following is a list of key code pointers to get things started:
There is a very similar feature already implemented for aLoRA.

Please investigate this problem and construct a robust plan for how to implement and test this feature!

modality-lora-implementation-plan.md

Modality-Based LoRA Adapter Activation - Implementation Plan

Overview

This document outlines the implementation plan for enabling conditional LoRA adapter activation based on the presence of multimodal data (image/audio) in requests to multimodal models. This feature is called MMLoRA (Multi-Modal LoRA).

Design Decisions

Core Semantics
Configuration
Technical Details
Background

Current aLoRA Implementation

The existing activated LoRA (aLoRA) feature enables/disables adapters based on detecting specific token sequences:
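As a rough sketch of that detection idea (illustrative only; the function name and signature here are hypothetical, not the actual llama.cpp API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Find the position just past the adapter's invocation token sequence in the
// prompt; the adapter only becomes active from that position onward.
// Returns -1 if the sequence is not present (adapter stays inactive).
static ptrdiff_t find_alora_activation_pos(
        const std::vector<llama_token> & prompt,
        const std::vector<llama_token> & invocation_tokens) {
    auto it = std::search(prompt.begin(), prompt.end(),
                          invocation_tokens.begin(), invocation_tokens.end());
    if (it == prompt.end()) {
        return -1;
    }
    return (it - prompt.begin()) + (ptrdiff_t) invocation_tokens.size();
}
```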
Multimodal Architecture
Implementation Plan

Phase 1: Helper Functions for Modality String Mapping

1.1 Add String↔Enum Conversion Functions

File: `tools/mtmd/mtmd.h`

Add declarations:

```cpp
// Convert modality type enum to string
const char * mtmd_modality_type_to_str(enum mtmd_input_chunk_type type);
// Convert string to modality type enum
// Returns MTMD_INPUT_CHUNK_TYPE_TEXT on invalid input
enum mtmd_input_chunk_type mtmd_modality_type_from_str(const char * str);
// Validate if string is a valid modality type (image or audio)
bool mtmd_is_valid_modality_str(const char * str);
```

File: `tools/mtmd/mtmd.cpp`

Add implementations:

```cpp
const char * mtmd_modality_type_to_str(enum mtmd_input_chunk_type type) {
    switch (type) {
        case MTMD_INPUT_CHUNK_TYPE_IMAGE: return "image";
        case MTMD_INPUT_CHUNK_TYPE_AUDIO: return "audio";
        case MTMD_INPUT_CHUNK_TYPE_TEXT:  return "text";
        default: return "unknown";
    }
}

enum mtmd_input_chunk_type mtmd_modality_type_from_str(const char * str) {
    if (strcmp(str, "image") == 0) {
        return MTMD_INPUT_CHUNK_TYPE_IMAGE;
    } else if (strcmp(str, "audio") == 0) {
        return MTMD_INPUT_CHUNK_TYPE_AUDIO;
    }
    return MTMD_INPUT_CHUNK_TYPE_TEXT; // default/invalid
}

bool mtmd_is_valid_modality_str(const char * str) {
    return strcmp(str, "image") == 0 || strcmp(str, "audio") == 0;
}
```

Phase 2: Data Structure Changes

2.1 Extend
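To make the Phase 1 helpers concrete, here is a hypothetical sketch of a call site (the function below is illustrative only, not code from the plan or this branch):

```cpp
#include <set>
#include <string>
#include <vector>

// Assumes the mtmd_input_chunk_type enum and the Phase 1 helpers shown above.
static bool should_enable_adapter(
        const std::vector<std::string>                & adapter_modalities,  // e.g. {"audio"}
        const std::vector<enum mtmd_input_chunk_type> & request_chunk_types) {
    std::set<enum mtmd_input_chunk_type> wanted;
    for (const auto & name : adapter_modalities) {
        if (mtmd_is_valid_modality_str(name.c_str())) {
            wanted.insert(mtmd_modality_type_from_str(name.c_str()));
        }
    }
    for (auto type : request_chunk_types) {
        if (wanted.count(type) > 0) {
            return true; // the request contains a modality this adapter handles
        }
    }
    return false; // text-only request (or no matching modality): leave the adapter off
}
```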
Makes sense. The cloud engineer in me is still skittish about anything that would cause unexpected format changes requiring client-side code changes, but I don't think we're trying to apply strict semantic versioning w.r.t. the server's API, so this may not be part of the project's API contract.
I could be wrong here, but I feel like the "domain-specific task" that you mentioned earlier is a subset of "activation sequence", i.e. the lora gets activated via a sequence of tokens. IMO a 2-tier task type / subtype system could be a bit more flexible and easier to document: the higher-level "type" specifies which code path should be used for switching the lora, and the subtype is used by that specific code path (similar idea to tokenizer model vs pre-tokenizer in the llama.cpp code base). For example:
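A minimal sketch of what that two-tier scheme could look like (names and values here are hypothetical, not an agreed-upon design):

```cpp
#include <string>

// Top-level "type": selects the code path used to switch the lora on/off.
enum lora_activation_type {
    LORA_ACTIVATION_ALWAYS,    // plain lora, user-controlled scale only
    LORA_ACTIVATION_SEQUENCE,  // activated by a token sequence (aLoRA-style)
    LORA_ACTIVATION_MODALITY,  // activated by the presence of a modality
};

// "Subtype": interpreted by whichever code path `type` selects, e.g. the
// invocation token string for SEQUENCE, or "image"/"audio" for MODALITY.
struct lora_activation_info {
    lora_activation_type type = LORA_ACTIVATION_ALWAYS;
    std::string          subtype;
};
```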
In common, we store the modalities as strings which will be mapped to enums once handled in mtmd.
Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Since there can be invalid input from user requests, this will be the default/invalid value for anything that can't be processed.
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…y chunks
Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

This allows the lora to be toggled on/off without losing the value of scale that may have been set by the user intentionally.
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…e of modality tokens
Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: temporary/GraniteVisionModular
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
All three of task_name, prompt_prefix, and mmlora_modality_types are now in a common mapping from task name to prompt prefix (task_prompt_prefixes). This mapping is held string:string to allow for arbitrary task definitions, but individual task name strings that match modalities will be mapped to the mtmd enums for auto enablement.
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
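A rough sketch of the shape described in that commit message (variable names follow the message; the example entries are illustrative, not values from the actual adapters):

```cpp
#include <map>
#include <string>

// Task name -> prompt prefix. Arbitrary task names are allowed; only task
// names that exactly match a known modality ("image"/"audio") participate in
// automatic enable/disable, everything else is left to the user.
static const std::map<std::string, std::string> task_prompt_prefixes = {
    { "audio",          "<|audio|>"   },  // example value only
    { "my-custom-task", "My prefix: " },  // never auto-toggled
};
```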
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…ared fields
The logic for auto-enabling/disabling modality adapters now relies on the presence of a task name that maps to one of the well-known names for a modality. Other task names will be ignored.
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

This helper was never used, so no need to keep it just to test it.
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
(force-pushed from bd85365 to f6d44b3)
Interesting, this sounds like a good idea. I just pushed a WIP change I had to merge the
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
One other point of discussion that I'd be curious to get input on: I think there may be some model-specific variability in the logic of when the adapter should be active vs not. As a concrete example, the current transformers implementation for

This makes me think that a better toggle for this model would be to perform the toggle based on the presence of the modality in the turn rather than the full sequence (the un-cached portion of the sequence). This brings up the meta-questions: Is this kind of toggling logic something we expect to be model/adapter-specific (i.e. some combos would be best leaving the adapter on, others would do best turning it off for non-modality turns)?
IMO it's more intuitive the way phi-4-multimodal does it: they apply the lora only when processing the vision/audio input, and deactivate the lora adapters when prefilling or generating text. We should target supporting only simple mechanisms like this. More complex logic like turn-based toggling requires patching logic in different parts of the code base (i.e. chat parser, server code, etc.), which I think will become more of a maintenance burden in the long term. Also, I haven't mentioned that there are quite a lot of trade-offs when using lora in a more complex way, like no KV cache reuse or no batching.

In my experience working with llama.cpp, it's generally better to only implement logic that is used by at least 2 models (from 2 different makers). This way, we make sure that the feature may become widely adopted / mature and the effort spent on long-term maintenance can be justified.
The more I'm getting into how these models were intended to be used, the more I'm starting to agree that this type of auto-switching may not be the right approach. I have a deep dive with the team on Wed this week, so I'll try to gain as much insight into the intended usage patterns then. It may call for abandoning this PR if it would be a one-off for just this model.
Overview
This PR introduces a new mechanism for automatic LoRA adapter toggling based on the presence of one or more modalities that are tied to the adapter. This is a required feature for serving modular models such as ibm-granite/granite-speech-3.3-2b and ibm-granite/granite-4.0-3b-vision where the base LLM is preserved and the modality support is added through the adapter. Without this, a modular model must be booted in either text-mode or modality-mode. With this change, the model can be booted once and used in either mode based on the presence of the modality in the request.
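As an illustration of the toggling semantics (a hypothetical sketch; the field and function names below are not the PR's actual code), the adapter's user-chosen scale is preserved so the adapter can be turned off for text-only requests and restored when the modality is present:

```cpp
// Hypothetical per-adapter state for modality-conditional toggling.
struct modality_lora_state {
    float user_scale    = 1.0f; // scale configured by the user (never overwritten)
    float applied_scale = 0.0f; // scale actually applied for the current request
};

// Enable the adapter only when the request contains the tied modality,
// without forgetting the user's configured scale.
static void toggle_for_request(modality_lora_state & st, bool request_has_modality) {
    st.applied_scale = request_has_modality ? st.user_scale : 0.0f;
}
```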
Related Work
The existing PR (#22101) by @ReinforcedKnowledge for Granite Speech adds support for ibm-granite/granite-speech-3.3-2b. I'm not aware of any other models that use this pattern and are already supported, so while that PR is still in review, there are no existing test models to verify the functionality with.
Testing
I have a temporary merge point between this branch and #22101 where I've tested the ability for granite-speech-3.3-2b to leverage its conditional adapter. With this combination (using the conversion steps in my comment here), I've tested the following scenarios:

Text Only Request
ASR Request
Run without adapter -> Good Text / Bad ASR
Text Response
ASR Response
Run with unconditional adapter -> Bad Text / Good ASR
Text Response
(empty newline)
ASR Response
Run with conditional adapter -> Good Text / Good ASR
Text Response
ASR Response
Additional information
Requirements
AI Usage Disclosure
For this work, I used a combination of IBM Bob and OpenCode with qwen3.5:122b running in Ollama. Bob was used primarily for planning while OC+qwen3.5 was used primarily for implementation.

I annotated each commit with `AI-usage: [full, draft, none] (<agent>)` based on how I used my assistants (full -> unaltered agent output, draft -> edited agent output, none -> no agent usage). This is a convention I've been using to track my ownership. Every commit, regardless of agent generation, was fully reviewed and (if needed) edited before committing. I have a small tool git-ai-stats to track the breakdown of commits by agent, usage type, and lines of code.

git-ai-stats output