Add load-time dequantization for BLAS backend buffers #2

Draft
taronaeo wants to merge 1 commit into master from codex/explore-dequantization-during-tensor-loading

taronaeo commented Dec 12, 2025

Owner

Summary

  • introduce a BLAS buffer type that allocates host storage and optional pre-dequantized float copies for weight tensors
  • use the pre-dequantized weights during BLAS matmul to avoid per-inference conversions; view tensors reuse the source tensor's float copy at the matching offset

Testing

  • Not run (not requested)

Codex Task

Summary by Sourcery

Introduce a BLAS-specific backend buffer type that can hold host storage and optional pre-dequantized weight copies, and use these pre-dequantized weights in BLAS matmul to avoid per-inference dequantization and temporary work buffers.

New Features:

  • Add a BLAS backend buffer type that wraps a host buffer and tracks optional pre-dequantized float storage for weight tensors.

Enhancements:

  • Update BLAS matmul to reuse pre-dequantized weight data when available, reducing runtime conversions and workspace usage.
  • Route BLAS device buffer type queries through the new BLAS buffer implementation instead of the generic CPU buffer type.

sourcery-ai Bot commented Dec 12, 2025


Reviewer's Guide

Implements a custom BLAS backend buffer type that owns a CPU host buffer and optional pre-dequantized float copies of weight tensors, and updates BLAS matmul to reuse these precomputed float buffers to avoid per-inference dequantization work.

Sequence diagram for BLAS buffer tensor initialization and pre-dequantization

sequenceDiagram
    participant Caller
    participant BLASBuffer as ggml_backend_blas_buffer
    participant BLASBufCtx as ggml_backend_blas_buffer_context
    participant HostBuffer as host_buffer
    participant Tensor as ggml_tensor
    participant Extra as ggml_backend_blas_tensor_extra

    Caller->>BLASBuffer: ggml_backend_blas_buffer_init_tensor(buffer, tensor)
    BLASBuffer->>BLASBufCtx: load context from buffer->context

    alt tensor is view (tensor->view_src != NULL)
        BLASBuffer->>Tensor: read view_src and view_offs
        BLASBuffer->>Tensor: read view_src->extra as src_extra
        alt src_extra has dequantized
            BLASBuffer->>BLASBuffer: create Extra = new ggml_backend_blas_tensor_extra
            BLASBuffer->>Extra: compute elem_offset from view_offs
            BLASBuffer->>Extra: dequantized = src_extra->dequantized + elem_offset
            BLASBuffer->>Extra: size_bytes = max(src_extra->size_bytes - byte_offset, 0)
            BLASBuffer->>Extra: owns_data = false
            BLASBuffer->>Tensor: tensor->extra = Extra
            BLASBuffer->>BLASBufCtx: extras.push_back(Extra)
        else src_extra missing or no dequantized
            BLASBuffer->>Tensor: tensor->extra = tensor->view_src->extra
        end
        BLASBuffer-->>Caller: GGML_STATUS_SUCCESS
    else tensor is not a view
        BLASBuffer->>BLASBuffer: compute can_pre_dequantize
        alt can_pre_dequantize
            BLASBuffer->>BLASBuffer: create Extra = new ggml_backend_blas_tensor_extra
            BLASBuffer->>Extra: size_bytes = nelements(tensor) * sizeof(float)
            BLASBuffer->>Extra: dequantized = ggml_aligned_malloc(size_bytes)
            alt allocation failed
                BLASBuffer-->>Caller: GGML_STATUS_ALLOC_FAILED
            else allocation ok
                BLASBuffer->>Extra: owns_data = true
                BLASBuffer->>Tensor: tensor->extra = Extra
            end
        else cannot_pre_dequantize
            note over BLASBuffer: tensor->extra remains unchanged
        end

        alt host_buffer has init_tensor
            BLASBuffer->>HostBuffer: host_buffer->iface.init_tensor(host_buffer, tensor)
            HostBuffer-->>BLASBuffer: GGML_STATUS_SUCCESS
        end

        alt Extra was created
            BLASBuffer->>BLASBufCtx: extras.push_back(Extra)
        end
        BLASBuffer-->>Caller: GGML_STATUS_SUCCESS
    end
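
The view branch of the diagram can be sketched in isolation. This is a minimal, standalone illustration, not the PR's code: `blas_tensor_extra`, `view_elem_offset`, and `make_view_extra` are hypothetical names that mirror the fields in the diagram, and the block-size parameters are assumed to come from the tensor's quantized type. Because `size_bytes` is a `size_t`, the sketch compares before subtracting; an expression of the form `max(size_bytes - byte_offset, 0)` on unsigned values would wrap around instead of clamping to zero.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of ggml_backend_blas_tensor_extra (fields from the class diagram).
struct blas_tensor_extra {
    float * dequantized = nullptr; // float copy of the weights, possibly borrowed
    size_t  size_bytes  = 0;       // bytes reachable through `dequantized`
    bool    owns_data   = false;   // true only for the extra that allocated the copy
};

// Convert a byte offset into quantized data to an element index.
// Assumption: the offset is a whole number of quantized blocks, where each
// block stores `blck_size` elements in `type_size` bytes.
inline size_t view_elem_offset(size_t view_offs, size_t type_size, size_t blck_size) {
    assert(type_size > 0 && view_offs % type_size == 0);
    return (view_offs / type_size) * blck_size;
}

// Build a non-owning extra that points `elem_offset` floats into the source copy.
inline blas_tensor_extra make_view_extra(const blas_tensor_extra & src, size_t elem_offset) {
    assert(src.dequantized != nullptr);
    const size_t byte_offset = elem_offset * sizeof(float);

    blas_tensor_extra view; // defaults: nullptr, 0 bytes, not owning
    if (byte_offset > src.size_bytes) {
        return view; // offset lies past the source copy: expose nothing
    }
    view.dequantized = src.dequantized + elem_offset;
    view.size_bytes  = src.size_bytes - byte_offset; // safe: checked above
    return view;
}
```

For example, with a block format that packs 32 elements into 18 bytes, a view starting 18 bytes into the quantized data maps to element 32 of the float copy.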

Class diagram for BLAS buffer context and tensor extras

classDiagram
    class ggml_backend_blas_tensor_extra {
        float* dequantized
        size_t size_bytes
        bool owns_data
    }

    class ggml_backend_blas_buffer_context {
        ggml_backend_buffer_t host_buffer
        vector~unique_ptr~ggml_backend_blas_tensor_extra~~ extras
    }

    class ggml_backend_buffer {
        ggml_backend_buffer_i iface
        void* context
        ggml_backend_buffer_type_t type
        size_t size
    }

    class ggml_backend_buffer_i {
        free_buffer(buffer)
        get_base(buffer)
        init_tensor(buffer, tensor)
        memset_tensor(buffer, tensor, value, offset, size)
        set_tensor(buffer, tensor, data, offset, size)
        get_tensor(buffer, tensor, data, offset, size)
        cpy_tensor(buffer, src, dst)
        clear(buffer, value)
        reset(buffer)
    }

    class ggml_backend_buffer_type {
        ggml_backend_buffer_type_i iface
        ggml_backend_dev_t device
        void* context
    }

    class ggml_backend_buffer_type_i {
        get_name(buft)
        alloc_buffer(buft, size)
        get_alignment(buft)
        get_max_size(buft)
        get_alloc_size(buft, size)
        is_host(buft)
    }

    class ggml_tensor {
        void* data
        ggml_type type
        ggml_tensor* view_src
        size_t view_offs
        void* extra
    }

    class ggml_backend_blas_context {
        size_t work_size
        unique_ptr_char_array work_data
    }

    ggml_backend_buffer --> ggml_backend_blas_buffer_context : context
    ggml_backend_blas_buffer_context --> ggml_backend_buffer : host_buffer
    ggml_tensor --> ggml_backend_blas_tensor_extra : extra
    ggml_backend_buffer_type --> ggml_backend_buffer_type_i : iface
    ggml_backend_buffer --> ggml_backend_buffer_i : iface
    ggml_backend_blas_context --> ggml_tensor : uses in ggml_backend_blas_mul_mat
    ggml_backend_blas_context --> ggml_backend_blas_tensor_extra : reads dequantized
    ggml_backend_blas_buffer_type ..> ggml_backend_blas_buffer_context : alloc_buffer creates
    ggml_backend_blas_buffer_type ..> ggml_backend_buffer : returns BLAS buffer

File-Level Changes

Change, details, and affected files:
Introduce BLAS backend buffer context and tensor-extra structures to manage optional pre-dequantized weight storage.
  • Add ggml_backend_blas_tensor_extra to track dequantized pointer, size, and ownership.
  • Add ggml_backend_blas_buffer_context to wrap an underlying host buffer and own tensor-extra lifetimes.
  • Ensure extras are freed correctly in buffer free and reset paths.
ggml/src/ggml-blas/ggml-blas.cpp
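
The ownership rules described above can be sketched without ggml. This is a rough illustration under stated assumptions: the struct and method names are hypothetical stand-ins for `ggml_backend_blas_tensor_extra` and `ggml_backend_blas_buffer_context`, and `std::malloc`/`std::free` replace the aligned allocator the PR uses. Only extras with `owns_data` set release memory, so view extras can share a pointer without a double free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for ggml_backend_blas_tensor_extra.
struct blas_tensor_extra {
    float * dequantized = nullptr;
    size_t  size_bytes  = 0;
    bool    owns_data   = false;

    blas_tensor_extra() = default;
    blas_tensor_extra(const blas_tensor_extra &) = delete;
    blas_tensor_extra & operator=(const blas_tensor_extra &) = delete;

    ~blas_tensor_extra() {
        if (owns_data) {
            std::free(dequantized); // borrowed (view) pointers are left alone
        }
    }
};

// Hypothetical stand-in for ggml_backend_blas_buffer_context (host buffer omitted).
struct blas_buffer_context {
    std::vector<std::unique_ptr<blas_tensor_extra>> extras;

    // Allocate an owning float copy for a weight tensor; nullptr on allocation failure.
    blas_tensor_extra * add_owned(size_t n_elements) {
        assert(n_elements > 0);
        auto extra = std::make_unique<blas_tensor_extra>();
        extra->size_bytes  = n_elements * sizeof(float);
        extra->dequantized = static_cast<float *>(std::malloc(extra->size_bytes));
        if (extra->dequantized == nullptr) {
            return nullptr;
        }
        extra->owns_data = true;
        extras.push_back(std::move(extra));
        return extras.back().get();
    }

    // Register a non-owning extra that points into memory owned elsewhere.
    blas_tensor_extra * add_borrowed(float * data, size_t size_bytes) {
        auto extra = std::make_unique<blas_tensor_extra>();
        extra->dequantized = data;
        extra->size_bytes  = size_bytes;
        extras.push_back(std::move(extra));
        return extras.back().get();
    }

    // Free and reset paths both reduce to dropping every extra.
    void reset() { extras.clear(); }
};
```

Keeping the extras in a vector of `unique_ptr` means each tensor's `extra` pointer stays stable while the vector grows, and a single `clear()` covers both the free and reset paths.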
Implement a BLAS-specific buffer interface that allocates CPU storage and optionally allocates and maintains dequantized float copies for weight tensors.
  • Implement buffer interface callbacks (free_buffer, get_base, init_tensor, memset_tensor, set_tensor, get_tensor, clear, reset) to delegate to host buffer and keep dequantized copies in sync.
  • In init_tensor, allocate dequantized float storage for suitable weight tensors and handle view tensors by creating offset views into the source dequantized buffer.
  • In memset_tensor and set_tensor, propagate writes to the dequantized buffer when present.
ggml/src/ggml-blas/ggml-blas.cpp
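
A small sketch of the write-propagation idea follows. It uses a toy 8-bit format with one scale per tensor, which is an assumption made to keep the example self-contained; it is not one of ggml's quantized types, and `toy_q8_tensor` and `toy_set_tensor` are hypothetical names. The point is only the shape of the callback: write to host storage first, then refresh the matching range of the float copy if one exists.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy tensor: quantized host storage plus an optional float copy.
struct toy_q8_tensor {
    std::vector<int8_t> quantized;   // what the host buffer would hold
    float               scale = 1.0f;
    std::vector<float>  dequantized; // empty when no float copy was allocated
};

// Write `size` quantized values at `offset`, then keep the float copy in sync.
inline void toy_set_tensor(toy_q8_tensor & t, const int8_t * data, size_t offset, size_t size) {
    assert(offset + size <= t.quantized.size());
    if (size == 0) {
        return;
    }
    std::memcpy(t.quantized.data() + offset, data, size);

    if (!t.dequantized.empty()) {
        assert(t.dequantized.size() == t.quantized.size());
        for (size_t i = offset; i < offset + size; ++i) {
            t.dequantized[i] = t.scale * static_cast<float>(t.quantized[i]);
        }
    }
}
```

With block-based formats, a partial write would presumably need to re-dequantize every block it touches rather than individual elements, since the scale is stored per block.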
Define and register a BLAS backend buffer type that wraps the CPU buffer type and associates it with the BLAS device.
  • Add ggml_backend_blas_buffer_type and its interface (get_name, alloc_buffer, get_alignment, is_host).
  • Allocate an underlying CPU buffer for the BLAS buffer context and initialize ggml_backend_buffer with the BLAS buffer interface.
  • Lazily bind the BLAS buffer type to the BLAS device via ggml_backend_blas_reg_get_device.
ggml/src/ggml-blas/ggml-blas.cpp
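
One way to bind a buffer type to its device on first use is a function-local static, sketched below. The `toy_*` types and functions are hypothetical and only stand in for the ggml buffer type, the device, and `ggml_backend_blas_reg_get_device`; the PR may instead assign the device field inside an explicit null check.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for the device and buffer type objects.
struct toy_device {
    const char * name;
};

struct toy_buffer_type {
    const char * name;
    toy_device * device;
};

// Stand-in for the registry lookup that returns the BLAS device.
inline toy_device * toy_get_device() {
    static toy_device dev = { "BLAS" };
    return &dev;
}

// The buffer type is constructed, and bound to the device, on the first call.
// Since C++11 the initialization of a function-local static is thread-safe.
inline toy_buffer_type * toy_blas_buffer_type() {
    static toy_buffer_type buft = { "BLAS", toy_get_device() };
    assert(buft.device != nullptr);
    return &buft;
}
```

Deferring the lookup to the first call avoids depending on the order in which the registry and the buffer type are initialized.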
Update BLAS matmul path to preferentially use pre-dequantized weights when available, avoiding temporary work buffers and conversions.
  • Read ggml_backend_blas_tensor_extra from src0->extra to detect presence of pre-dequantized data.
  • Skip work buffer allocation when pre-dequantized weights exist.
  • In the matmul loop, select x from the pre-dequantized buffer when available; otherwise fall back to on-the-fly dequantization into the work buffer.
ggml/src/ggml-blas/ggml-blas.cpp
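
The selection logic in the matmul path can be sketched as a single helper. This is an illustration under assumptions, not the PR's implementation: `select_matmul_input` is a hypothetical name, the toy scale-based dequantization stands in for ggml's per-type conversion, and the BLAS call itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of ggml_backend_blas_tensor_extra.
struct blas_tensor_extra {
    float * dequantized = nullptr;
    size_t  size_bytes  = 0;
    bool    owns_data   = false;
};

// Return the float matrix to hand to BLAS for `n` weight elements.
// Fast path: reuse the load-time float copy and leave the work buffer untouched.
// Fallback: dequantize into the work buffer (toy format: value = scale * q).
inline const float * select_matmul_input(const blas_tensor_extra * extra,
                                         const int8_t * quantized, float scale,
                                         size_t n, std::vector<float> & work) {
    if (extra != nullptr && extra->dequantized != nullptr) {
        assert(extra->size_bytes >= n * sizeof(float));
        return extra->dequantized; // no allocation, no conversion
    }
    work.resize(n); // the work buffer is only grown on the fallback path
    for (size_t i = 0; i < n; ++i) {
        work[i] = scale * static_cast<float>(quantized[i]);
    }
    return work.data();
}
```

The trade-off is memory for time: the fast path keeps a full float copy of each eligible weight tensor resident, in exchange for skipping the conversion on every matmul.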
Switch BLAS device buffer type from the generic CPU buffer type to the new BLAS buffer type.
  • Change ggml_backend_blas_device_get_buffer_type to return ggml_backend_blas_buffer_type instead of ggml_backend_cpu_buffer_type.
ggml/src/ggml-blas/ggml-blas.cpp


github-actions Bot added the ggml label Dec 12, 2025