Add load-time dequantization for BLAS backend buffers by taronaeo · Pull Request #2 · taronaeo/llama.cpp-s390x

taronaeo · 2025-12-12T15:19:26Z

Summary

Introduce a BLAS buffer type that allocates host storage and, optionally, pre-dequantized float copies of weight tensors.
Use the pre-dequantized weights during BLAS matmul to avoid per-inference conversions, and reuse view offsets safely.

Testing

Not run (not requested)

Summary by Sourcery

Introduce a BLAS-specific backend buffer type that can hold host storage and optional pre-dequantized weight copies, and use these pre-dequantized weights in BLAS matmul to avoid per-inference dequantization and temporary work buffers.

New Features:

Add a BLAS backend buffer type that wraps a host buffer and tracks optional pre-dequantized float storage for weight tensors.

Enhancements:

Update BLAS matmul to reuse pre-dequantized weight data when available, reducing runtime conversions and workspace usage.
Route BLAS device buffer type queries through the new BLAS buffer implementation instead of the generic CPU buffer type.

sourcery-ai · 2025-12-12T15:19:32Z

Reviewer's Guide

Implements a custom BLAS backend buffer type that owns a CPU host buffer and optional pre-dequantized float copies of weight tensors, and updates BLAS matmul to reuse these precomputed float buffers to avoid per-inference dequantization work.

Sequence diagram for BLAS buffer tensor initialization and pre-dequantization

sequenceDiagram
    participant Caller
    participant BLASBuffer as ggml_backend_blas_buffer
    participant BLASBufCtx as ggml_backend_blas_buffer_context
    participant HostBuffer as host_buffer
    participant Tensor as ggml_tensor
    participant Extra as ggml_backend_blas_tensor_extra

    Caller->>BLASBuffer: ggml_backend_blas_buffer_init_tensor(buffer, tensor)
    BLASBuffer->>BLASBufCtx: load context from buffer->context

    alt tensor is view (tensor->view_src != NULL)
        BLASBuffer->>Tensor: read view_src and view_offs
        BLASBuffer->>Tensor: read view_src->extra as src_extra
        alt src_extra has dequantized
            BLASBuffer->>BLASBuffer: create Extra = new ggml_backend_blas_tensor_extra
            BLASBuffer->>Extra: compute elem_offset from view_offs
            BLASBuffer->>Extra: dequantized = src_extra->dequantized + elem_offset
            BLASBuffer->>Extra: size_bytes = max(src_extra->size_bytes - byte_offset, 0)
            BLASBuffer->>Extra: owns_data = false
            BLASBuffer->>Tensor: tensor->extra = Extra
            BLASBuffer->>BLASBufCtx: extras.push_back(Extra)
        else src_extra missing or no dequantized
            BLASBuffer->>Tensor: tensor->extra = tensor->view_src->extra
        end
        BLASBuffer-->>Caller: GGML_STATUS_SUCCESS
    else tensor is not a view
        BLASBuffer->>BLASBuffer: compute can_pre_dequantize
        alt can_pre_dequantize
            BLASBuffer->>BLASBuffer: create Extra = new ggml_backend_blas_tensor_extra
            BLASBuffer->>Extra: size_bytes = nelements(tensor) * sizeof(float)
            BLASBuffer->>Extra: dequantized = ggml_aligned_malloc(size_bytes)
            alt allocation failed
                BLASBuffer-->>Caller: GGML_STATUS_ALLOC_FAILED
            else allocation ok
                BLASBuffer->>Extra: owns_data = true
                BLASBuffer->>Tensor: tensor->extra = Extra
            end
        else cannot_pre_dequantize
            note over BLASBuffer: tensor->extra remains unchanged
        end

        alt host_buffer has init_tensor
            BLASBuffer->>HostBuffer: host_buffer->iface.init_tensor(host_buffer, tensor)
            HostBuffer-->>BLASBuffer: GGML_STATUS_SUCCESS
        end

        alt Extra was created
            BLASBuffer->>BLASBufCtx: extras.push_back(Extra)
        end
        BLASBuffer-->>Caller: GGML_STATUS_SUCCESS
    end
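
The view branch above turns a byte offset into the quantized source (`view_offs`) into an element offset into the float copy. A minimal sketch of that mapping follows, assuming a block-quantized layout such as 32 elements in 18 bytes (as in ggml's Q4_0). The names `quant_layout`, `dequant_view`, and `make_dequant_view` are hypothetical and not taken from the PR. Note that if `size_bytes` and `byte_offset` are both `size_t`, `max(size_bytes - byte_offset, 0)` cannot clamp because the subtraction wraps around, so the sketch compares before subtracting.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration; these are not the ggml definitions.
struct quant_layout {
    std::size_t block_elems;  // elements per quantized block (e.g. 32)
    std::size_t block_bytes;  // bytes per quantized block (e.g. 18)
};

struct dequant_view {
    float *     data;        // pointer into the source's float copy, or nullptr
    std::size_t size_bytes;  // bytes from `data` to the end of the source copy
};

// Map a byte offset into quantized storage onto the matching position in the
// float copy. Returns {nullptr, 0} when there is no float copy, the offset is
// not block-aligned, or the offset lies past the end of the copy.
inline dequant_view make_dequant_view(float * src, std::size_t src_size_bytes,
                                      std::size_t view_offs, quant_layout q) {
    if (src == nullptr || q.block_bytes == 0 || view_offs % q.block_bytes != 0) {
        return {nullptr, 0};
    }
    const std::size_t elem_offset = (view_offs / q.block_bytes) * q.block_elems;
    const std::size_t byte_offset = elem_offset * sizeof(float);
    if (byte_offset > src_size_bytes) {
        return {nullptr, 0};  // compare first: unsigned subtraction would wrap
    }
    return {src + elem_offset, src_size_bytes - byte_offset};
}
```

With a 64-float copy and the 32-element, 18-byte layout, a `view_offs` of 18 lands on element 32 and leaves 32 floats of remaining size. A caller that gets `nullptr` back could fall back to sharing the source's `extra`, as the diagram's `else` branch does.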

Class diagram for BLAS buffer context and tensor extras

classDiagram
    class ggml_backend_blas_tensor_extra {
        float* dequantized
        size_t size_bytes
        bool owns_data
    }

    class ggml_backend_blas_buffer_context {
        ggml_backend_buffer_t host_buffer
        vector~unique_ptr_ggml_backend_blas_tensor_extra~~ extras
    }

    class ggml_backend_buffer {
        ggml_backend_buffer_i iface
        void* context
        ggml_backend_buffer_type_t type
        size_t size
    }

    class ggml_backend_buffer_i {
        free_buffer(buffer)
        get_base(buffer)
        init_tensor(buffer, tensor)
        memset_tensor(buffer, tensor, value, offset, size)
        set_tensor(buffer, tensor, data, offset, size)
        get_tensor(buffer, tensor, data, offset, size)
        cpy_tensor(buffer, src, dst)
        clear(buffer, value)
        reset(buffer)
    }

    class ggml_backend_buffer_type {
        ggml_backend_buffer_type_i iface
        ggml_backend_dev_t device
        void* context
    }

    class ggml_backend_buffer_type_i {
        get_name(buft)
        alloc_buffer(buft, size)
        get_alignment(buft)
        get_max_size(buft)
        get_alloc_size(buft, size)
        is_host(buft)
    }

    class ggml_tensor {
        void* data
        ggml_type type
        ggml_tensor* view_src
        size_t view_offs
        void* extra
    }

    class ggml_backend_blas_context {
        size_t work_size
        unique_ptr_char_array work_data
    }

    ggml_backend_buffer --> ggml_backend_blas_buffer_context : context
    ggml_backend_blas_buffer_context --> ggml_backend_buffer : host_buffer
    ggml_tensor --> ggml_backend_blas_tensor_extra : extra
    ggml_backend_buffer_type --> ggml_backend_buffer_type_i : iface
    ggml_backend_buffer --> ggml_backend_buffer_i : iface
    ggml_backend_blas_context --> ggml_tensor : uses in ggml_backend_blas_mul_mat
    ggml_backend_blas_context --> ggml_backend_blas_tensor_extra : reads dequantized
    ggml_backend_blas_buffer_type ..> ggml_backend_blas_buffer_context : alloc_buffer creates
    ggml_backend_blas_buffer_type ..> ggml_backend_buffer : returns BLAS buffer
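
The ownership relationship in the class diagram can be sketched in plain C++ as below. This is an illustrative mirror of the diagram's fields only, not the PR's code: the `_sketch` type names and the `add_owned`/`add_view` helpers are hypothetical, `void *` stands in for `ggml_backend_buffer_t`, and `std::malloc`/`std::free` stand in for the aligned allocation routines the real buffer would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Per-tensor record of an optional float copy. Only owning records free it.
struct blas_tensor_extra_sketch {
    float *     dequantized = nullptr;
    std::size_t size_bytes  = 0;
    bool        owns_data   = false;

    blas_tensor_extra_sketch() = default;
    blas_tensor_extra_sketch(const blas_tensor_extra_sketch &) = delete;
    blas_tensor_extra_sketch & operator=(const blas_tensor_extra_sketch &) = delete;

    ~blas_tensor_extra_sketch() {
        if (owns_data) {
            std::free(dequantized);  // views (owns_data == false) never free
        }
    }
};

// Buffer-level context: wraps the host buffer and owns every extra's lifetime.
struct blas_buffer_context_sketch {
    void * host_buffer = nullptr;  // placeholder for ggml_backend_buffer_t
    std::vector<std::unique_ptr<blas_tensor_extra_sketch>> extras;

    // Allocate an owning extra for n_elems floats; returns nullptr on failure.
    blas_tensor_extra_sketch * add_owned(std::size_t n_elems) {
        auto extra = std::make_unique<blas_tensor_extra_sketch>();
        extra->size_bytes  = n_elems * sizeof(float);
        extra->dequantized = static_cast<float *>(std::malloc(extra->size_bytes));
        if (extra->dequantized == nullptr) {
            return nullptr;
        }
        extra->owns_data = true;
        extras.push_back(std::move(extra));
        return extras.back().get();
    }

    // Register a non-owning extra that aliases part of another extra's storage.
    blas_tensor_extra_sketch * add_view(float * data, std::size_t size_bytes) {
        assert(data != nullptr);
        auto extra = std::make_unique<blas_tensor_extra_sketch>();
        extra->dequantized = data;
        extra->size_bytes  = size_bytes;
        extras.push_back(std::move(extra));
        return extras.back().get();
    }
};
```

Keeping the extras in a vector of `unique_ptr` means that clearing the vector, for instance in the buffer's free or reset path, releases every owned float copy exactly once, while view extras are destroyed without touching the aliased storage.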

File-Level Changes

Change	Details	Files
Introduce BLAS backend buffer context and tensor-extra structures to manage optional pre-dequantized weight storage.	Add ggml_backend_blas_tensor_extra to track dequantized pointer, size, and ownership. Add ggml_backend_blas_buffer_context to wrap an underlying host buffer and own tensor-extra lifetimes. Ensure extras are freed correctly in buffer free and reset paths.	`ggml/src/ggml-blas/ggml-blas.cpp`
Implement a BLAS-specific buffer interface that allocates CPU storage and optionally allocates and maintains dequantized float copies for weight tensors.	Implement buffer interface callbacks (free_buffer, get_base, init_tensor, memset_tensor, set_tensor, get_tensor, clear, reset) to delegate to host buffer and keep dequantized copies in sync. In init_tensor, allocate dequantized float storage for suitable weight tensors and handle view tensors by creating offset views into the source dequantized buffer. In memset_tensor and set_tensor, propagate writes to the dequantized buffer when present.	`ggml/src/ggml-blas/ggml-blas.cpp`
Define and register a BLAS backend buffer type that wraps the CPU buffer type and associates it with the BLAS device.	Add ggml_backend_blas_buffer_type and its interface (get_name, alloc_buffer, get_alignment, is_host). Allocate an underlying CPU buffer for the BLAS buffer context and initialize ggml_backend_buffer with the BLAS buffer interface. Lazily bind the BLAS buffer type to the BLAS device via ggml_backend_blas_reg_get_device.	`ggml/src/ggml-blas/ggml-blas.cpp`
Update BLAS matmul path to preferentially use pre-dequantized weights when available, avoiding temporary work buffers and conversions.	Read ggml_backend_blas_tensor_extra from src0->extra to detect presence of pre-dequantized data. Skip work buffer allocation when pre-dequantized weights exist. In the matmul loop, select x from the pre-dequantized buffer when available; otherwise fall back to on-the-fly dequantization into the work buffer.	`ggml/src/ggml-blas/ggml-blas.cpp`
Switch BLAS device buffer type from the generic CPU buffer type to the new BLAS buffer type.	Change ggml_backend_blas_device_get_buffer_type to return ggml_backend_blas_buffer_type instead of ggml_backend_cpu_buffer_type.	`ggml/src/ggml-blas/ggml-blas.cpp`
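
The matmul change in the table comes down to a pointer selection: use the load-time float copy when one exists, otherwise dequantize into the work buffer as before. The following is a minimal sketch of that decision under assumed types. `weight_sketch`, `to_float_fn`, and `select_weights` are hypothetical names, and a `std::vector<float>` stands in for the backend's work buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view of a weight tensor for illustration only.
struct weight_sketch {
    const float * pre_dequantized;  // nullptr when no load-time copy exists
    std::size_t   n_elems;          // number of weight elements
};

// Hypothetical converter signature: dequantize n elements from src into dst.
using to_float_fn = void (*)(const void * src, float * dst, std::size_t n);

// Return a float pointer for the weights: the load-time copy when available,
// otherwise dequantize into `work` (grown on demand) and return that.
inline const float * select_weights(const weight_sketch & w, const void * quantized,
                                    to_float_fn to_float, std::vector<float> & work) {
    if (w.pre_dequantized != nullptr) {
        return w.pre_dequantized;  // fast path: no conversion, no workspace
    }
    assert(to_float != nullptr);
    if (work.size() < w.n_elems) {
        work.resize(w.n_elems);
    }
    to_float(quantized, work.data(), w.n_elems);
    return work.data();
}
```

On the fast path the work buffer is never touched, which matches the table's note that workspace allocation is skipped when pre-dequantized weights exist. The trade-off is memory: a full float copy of each eligible weight tensor is held for the lifetime of the buffer.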


Add load-time dequantization for BLAS weights

cdf525a

taronaeo added the codex label Dec 12, 2025 — with ChatGPT Codex Connector

github-actions Bot added the ggml label Dec 12, 2025

taronaeo wants to merge 1 commit into master from codex/explore-dequantization-during-tensor-loading
