Add load-time dequantization for BLAS backend buffers #2
Draft
taronaeo wants to merge 1 commit into
Reviewer's Guide
Implements a custom BLAS backend buffer type that owns a CPU host buffer and optional pre-dequantized float copies of weight tensors, and updates BLAS matmul to reuse these precomputed float buffers, avoiding per-inference dequantization work.
Sequence diagram for BLAS buffer tensor initialization and pre-dequantization
sequenceDiagram
participant Caller
participant BLASBuffer as ggml_backend_blas_buffer
participant BLASBufCtx as ggml_backend_blas_buffer_context
participant HostBuffer as host_buffer
participant Tensor as ggml_tensor
participant Extra as ggml_backend_blas_tensor_extra
Caller->>BLASBuffer: ggml_backend_blas_buffer_init_tensor(buffer, tensor)
BLASBuffer->>BLASBufCtx: load context from buffer->context
alt tensor is view (tensor->view_src != NULL)
BLASBuffer->>Tensor: read view_src and view_offs
BLASBuffer->>Tensor: read view_src->extra as src_extra
alt src_extra has dequantized
BLASBuffer->>BLASBuffer: create Extra = new ggml_backend_blas_tensor_extra
BLASBuffer->>Extra: compute elem_offset from view_offs
BLASBuffer->>Extra: dequantized = src_extra->dequantized + elem_offset
BLASBuffer->>Extra: size_bytes = max(src_extra->size_bytes - byte_offset, 0)
BLASBuffer->>Extra: owns_data = false
BLASBuffer->>Tensor: tensor->extra = Extra
BLASBuffer->>BLASBufCtx: extras.push_back(Extra)
else src_extra missing or no dequantized
BLASBuffer->>Tensor: tensor->extra = tensor->view_src->extra
end
BLASBuffer-->>Caller: GGML_STATUS_SUCCESS
else tensor is not a view
BLASBuffer->>BLASBuffer: compute can_pre_dequantize
alt can_pre_dequantize
BLASBuffer->>BLASBuffer: create Extra = new ggml_backend_blas_tensor_extra
BLASBuffer->>Extra: size_bytes = nelements(tensor) * sizeof(float)
BLASBuffer->>Extra: dequantized = ggml_aligned_malloc(size_bytes)
alt allocation failed
BLASBuffer-->>Caller: GGML_STATUS_ALLOC_FAILED
else allocation ok
BLASBuffer->>Extra: owns_data = true
BLASBuffer->>Tensor: tensor->extra = Extra
end
else cannot_pre_dequantize
note over BLASBuffer: tensor->extra remains unchanged
end
alt host_buffer has init_tensor
BLASBuffer->>HostBuffer: host_buffer->iface.init_tensor(host_buffer, tensor)
HostBuffer-->>BLASBuffer: GGML_STATUS_SUCCESS
end
alt Extra was created
BLASBuffer->>BLASBufCtx: extras.push_back(Extra)
end
BLASBuffer-->>Caller: GGML_STATUS_SUCCESS
end
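The view branch of the diagram can be sketched in isolation. This is a minimal illustration rather than the PR's code: `blas_tensor_extra_sketch` and `make_view_extra` are hypothetical names, and the element offset is taken as a parameter because the diagram does not show how it is derived from `view_offs`. Since `size_t` is unsigned, the `max(..., 0)` step is written as a comparison before the subtraction so that it cannot wrap around.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for ggml_backend_blas_tensor_extra (fields from the class diagram).
struct blas_tensor_extra_sketch {
    float * dequantized = nullptr; // pre-dequantized float copy, or nullptr if none
    size_t  size_bytes  = 0;       // size of the float copy in bytes
    bool    owns_data   = false;   // true only for the extra that allocated the copy
};

// Build a non-owning extra for a view that starts elem_offset floats into the
// source tensor's float copy. Callers are assumed to pass an offset that lies
// within (or exactly at the end of) the source copy.
inline blas_tensor_extra_sketch make_view_extra(const blas_tensor_extra_sketch & src,
                                                size_t elem_offset) {
    assert(src.dequantized != nullptr);
    const size_t byte_offset = elem_offset * sizeof(float);

    blas_tensor_extra_sketch view;
    view.dequantized = src.dequantized + elem_offset;
    // Clamp to zero before subtracting: an unsigned subtraction would wrap.
    view.size_bytes  = src.size_bytes > byte_offset ? src.size_bytes - byte_offset : 0;
    view.owns_data   = false; // the source tensor's extra keeps ownership
    return view;
}
```

A view built this way aliases the source copy, so it must not outlive the extra that owns the allocation.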
Class diagram for BLAS buffer context and tensor extras
classDiagram
class ggml_backend_blas_tensor_extra {
float* dequantized
size_t size_bytes
bool owns_data
}
class ggml_backend_blas_buffer_context {
ggml_backend_buffer_t host_buffer
vector~unique_ptr~ggml_backend_blas_tensor_extra~~ extras
}
class ggml_backend_buffer {
ggml_backend_buffer_i iface
void* context
ggml_backend_buffer_type_t type
size_t size
}
class ggml_backend_buffer_i {
free_buffer(buffer)
get_base(buffer)
init_tensor(buffer, tensor)
memset_tensor(buffer, tensor, value, offset, size)
set_tensor(buffer, tensor, data, offset, size)
get_tensor(buffer, tensor, data, offset, size)
cpy_tensor(buffer, src, dst)
clear(buffer, value)
reset(buffer)
}
class ggml_backend_buffer_type {
ggml_backend_buffer_type_i iface
ggml_backend_dev_t device
void* context
}
class ggml_backend_buffer_type_i {
get_name(buft)
alloc_buffer(buft, size)
get_alignment(buft)
get_max_size(buft)
get_alloc_size(buft, size)
is_host(buft)
}
class ggml_tensor {
void* data
ggml_type type
ggml_tensor* view_src
size_t view_offs
void* extra
}
class ggml_backend_blas_context {
size_t work_size
unique_ptr_char_array work_data
}
ggml_backend_buffer --> ggml_backend_blas_buffer_context : context
ggml_backend_blas_buffer_context --> ggml_backend_buffer : host_buffer
ggml_tensor --> ggml_backend_blas_tensor_extra : extra
ggml_backend_buffer_type --> ggml_backend_buffer_type_i : iface
ggml_backend_buffer --> ggml_backend_buffer_i : iface
ggml_backend_blas_context --> ggml_tensor : uses in ggml_backend_blas_mul_mat
ggml_backend_blas_context --> ggml_backend_blas_tensor_extra : reads dequantized
ggml_backend_blas_buffer_type ..> ggml_backend_blas_buffer_context : alloc_buffer creates
ggml_backend_blas_buffer_type ..> ggml_backend_buffer : returns BLAS buffer
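The ownership relationship in the class diagram (a buffer context holding a vector of `unique_ptr` extras, each with an `owns_data` flag) might look roughly like the sketch below. All names here are hypothetical, `std::malloc` and `std::free` stand in for the aligned allocator mentioned in the sequence diagram, and freeing in the extra's destructor is an assumption about where cleanup happens, not something the diagrams confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical extra: frees its float copy only if it owns it.
struct tensor_extra_sketch {
    float * dequantized = nullptr;
    size_t  size_bytes  = 0;
    bool    owns_data   = false;

    tensor_extra_sketch() = default;
    tensor_extra_sketch(const tensor_extra_sketch &) = delete;            // avoid double free
    tensor_extra_sketch & operator=(const tensor_extra_sketch &) = delete;

    ~tensor_extra_sketch() {
        if (owns_data) {
            std::free(dequantized);
        }
    }
};

// Hypothetical buffer context: keeps every extra alive for the buffer's lifetime.
struct buffer_context_sketch {
    std::vector<std::unique_ptr<tensor_extra_sketch>> extras;

    // Allocate an owning float copy for n_elements (> 0) floats.
    // Returns nullptr on allocation failure and registers nothing in that case.
    tensor_extra_sketch * add_owned(size_t n_elements) {
        auto extra = std::make_unique<tensor_extra_sketch>();
        extra->size_bytes  = n_elements * sizeof(float);
        extra->dequantized = static_cast<float *>(std::malloc(extra->size_bytes));
        if (extra->dequantized == nullptr) {
            return nullptr;
        }
        extra->owns_data = true;
        extras.push_back(std::move(extra));
        return extras.back().get();
    }

    // Register a non-owning extra that aliases memory owned elsewhere (e.g. a view).
    tensor_extra_sketch * add_borrowed(float * data, size_t size_bytes) {
        assert(data != nullptr);
        auto extra = std::make_unique<tensor_extra_sketch>();
        extra->dequantized = data;
        extra->size_bytes  = size_bytes;
        extras.push_back(std::move(extra));
        return extras.back().get();
    }
};
```

When the context is destroyed, each owning extra releases its copy exactly once and borrowed extras release nothing, which is the property the `owns_data` flag exists to guarantee.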
File-Level Changes
Summary
Testing
Codex Task
Summary by Sourcery
Introduce a BLAS-specific backend buffer type that can hold host storage and optional pre-dequantized weight copies, and use these pre-dequantized weights in BLAS matmul to avoid per-inference dequantization and temporary work buffers.
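The matmul-side idea can be illustrated with a toy example: if a float copy already exists, use it directly; otherwise dequantize into a temporary work buffer. The sketch below is not ggml code. The int8-with-one-scale format and the names `toy_quantized_weights`, `toy_dequantize`, and `float_weights` are invented purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy quantized weights: int8 values sharing one scale (not a real ggml format).
struct toy_quantized_weights {
    std::vector<int8_t> q;
    float               scale;
    const float *       pre_dequantized; // float copy made at load time, or nullptr
};

// Expand the toy format to floats; dst must hold w.q.size() elements.
inline void toy_dequantize(const toy_quantized_weights & w, float * dst) {
    for (size_t i = 0; i < w.q.size(); ++i) {
        dst[i] = w.scale * static_cast<float>(w.q[i]);
    }
}

// Return float weights for a matmul call: reuse the load-time copy when one
// exists, otherwise fall back to dequantizing into the caller's work buffer.
inline const float * float_weights(const toy_quantized_weights & w, std::vector<float> & work) {
    if (w.pre_dequantized != nullptr) {
        return w.pre_dequantized; // fast path: no per-call dequantization, work untouched
    }
    work.resize(w.q.size());
    toy_dequantize(w, work.data());
    return work.data();
}
```

The trade-off in this arrangement is memory for time: the float copy is larger than the quantized data, but the fast path skips both the conversion and the work-buffer allocation on every call.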
New Features:
Enhancements: