Update training configuration parameters #89

Merged: Eamon2009 merged 7 commits into master from optimization on Jun 15, 2026
Eamon2009 (Owner) commented Jun 15, 2026

Optimize Hyperparameters for Target 6M Parameter Model

Summary

This PR updates the model configuration and training hyperparameters. The previous settings produced an under-scaled model (under 0.5M parameters) with a context window of only 32 tokens. The new settings scale the architecture up to the ~6.2M-parameter target and adjust the training schedule for efficiency and stability.
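As a rough cross-check of the scaling described above, the sketch below counts weights under an assumed GPT-2-style layout (attention plus a 4x-wide MLP, learned positional table, untied `lm_head`). The function names are hypothetical, and the PR does not state the vocabulary size, so the `vocab_size` used here is only a placeholder.

```python
def block_params(n_embd: int, n_layer: int) -> int:
    """Weight-only count for GPT-2-style transformer blocks (assumed layout).

    Per layer: 4*d^2 for attention (Q, K, V, output projection) plus
    8*d^2 for a 4x-wide MLP. Biases and LayerNorm parameters are ignored.
    """
    return 12 * n_layer * n_embd * n_embd


def total_params(n_embd: int, n_layer: int, block_size: int,
                 vocab_size: int, tied: bool = False) -> int:
    """Blocks + positional table + token embedding (+ lm_head if untied)."""
    token_embedding = vocab_size * n_embd
    lm_head = 0 if tied else vocab_size * n_embd
    positional = block_size * n_embd  # assumes learned positions
    return block_params(n_embd, n_layer) + positional + token_embedding + lm_head


old_blocks = block_params(n_embd=64, n_layer=4)
new_blocks = block_params(n_embd=192, n_layer=6)

# vocab_size is NOT given in this PR; 8192 is a placeholder for illustration.
example_total = total_params(192, 6, block_size=256, vocab_size=8192)

print(f"block weights: {old_blocks:,} -> {new_blocks:,}")
print(f"example total with placeholder vocab: {example_total:,}")
```

The block weights alone grow by roughly 13.5x; the remainder of the parameter budget depends on the vocabulary size and on whether the output head shares weights with the embedding.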

| Parameter | Previous Value | New Value | Rationale |
|---|---|---|---|
| `n_embd` | 64 | 192 | Scales model capacity up to the 6M target. |
| `n_layer` | 4 | 6 | Deepens the network for better feature extraction. |
| `n_head` | 4 | 6 | Gives each head 32 dimensions (192 / 6), up from 16 (64 / 4). |
| `block_size` | 32 | 256 | Expands the context window so the model can learn longer dependencies. |
| `batch_size` | 16 | 64 | Improves GPU utilization and stabilizes gradient steps. |
| `learning_rate` | 1e-3 | 6e-4 | Lowered slightly to reduce the risk of divergence at the larger batch size. |
| `max_iters` | 20000 | 5000 | Reduced because each step now processes far more tokens (64 × 256 = 16,384 vs. 16 × 32 = 512), so fewer steps are needed. |
| `eval_interval` | 100 | 250 | Reduces evaluation overhead during training. |
| `eval_iters` | 20 | 200 | Averages validation loss over more batches for a less noisy estimate. |
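The arithmetic behind the table can be checked with a small sketch. The `TrainConfig` class and its property names are hypothetical and only mirror the parameter names listed above; they are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters in the table."""
    n_embd: int
    n_head: int
    block_size: int
    batch_size: int
    max_iters: int
    eval_interval: int
    eval_iters: int

    @property
    def head_dim(self) -> int:
        # Attention heads must split the embedding evenly.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head

    @property
    def tokens_per_step(self) -> int:
        return self.batch_size * self.block_size

    @property
    def total_tokens(self) -> int:
        return self.max_iters * self.tokens_per_step

    @property
    def tokens_per_eval(self) -> int:
        # Tokens averaged over in one evaluation pass on one split.
        return self.eval_iters * self.tokens_per_step


old = TrainConfig(n_embd=64, n_head=4, block_size=32, batch_size=16,
                  max_iters=20000, eval_interval=100, eval_iters=20)
new = TrainConfig(n_embd=192, n_head=6, block_size=256, batch_size=64,
                  max_iters=5000, eval_interval=250, eval_iters=200)

for name, cfg in (("old", old), ("new", new)):
    print(name, cfg.head_dim, cfg.tokens_per_step, cfg.total_tokens, cfg.tokens_per_eval)
```

With these values, each step covers 32x more tokens than before, so the run sees about 8x more tokens in total even though `max_iters` drops by a factor of four.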

Eamon2009 and others added 6 commits June 8, 2026 18:09
Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.
* Modify training configuration parameters (#80)

* merge branch master of codeaddict (#77)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
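To illustrate the idea behind `dtype_size` and `checked_mul`, here is a minimal Python sketch of overflow-checked allocation sizing. The actual C++ signatures are not shown in this PR, so the return shape, the `SIZE_MAX` constant, and the `allocation_bytes` helper are assumptions made for illustration.

```python
from enum import Enum


class DType(Enum):
    F32 = "f32"
    F16 = "f16"
    BF16 = "bf16"
    I32 = "i32"
    U8 = "u8"


# Element sizes in bytes for each type.
_DTYPE_SIZES = {DType.F32: 4, DType.F16: 2, DType.BF16: 2, DType.I32: 4, DType.U8: 1}

SIZE_MAX = 2**64 - 1  # assumes a 64-bit size_t on the target platform


def dtype_size(dtype: DType) -> int:
    return _DTYPE_SIZES[dtype]


def checked_mul(a: int, b: int) -> tuple[bool, int]:
    """Return (ok, product) instead of raising, like a non-throwing status.

    ok is False when an operand is negative or the product would not fit
    in SIZE_MAX.
    """
    if a < 0 or b < 0:
        return False, 0
    product = a * b
    if product > SIZE_MAX:
        return False, 0
    return True, product


def allocation_bytes(num_elements: int, dtype: DType) -> tuple[bool, int]:
    """Hypothetical helper: byte count for a buffer of num_elements."""
    return checked_mul(num_elements, dtype_size(dtype))


print(allocation_bytes(1024, DType.BF16))
print(checked_mul(2**63, 4))
```

Returning a status alongside the value lets callers reject an oversized request before any allocation is attempted.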

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
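The two variants can be sketched as a scalar reference in Python. This uses the standard erf-based GELU and the common tanh approximation; whether the CUDA kernels use exactly these formulas is an assumption, and the Python signatures here are illustrative rather than the declared C++ ones.

```python
import math
from enum import Enum


class GeluMode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


_K = math.sqrt(2.0 / math.pi)
_C = 0.044715


def gelu_forward(x: float, mode: GeluMode = GeluMode.APPROXIMATE) -> float:
    if mode is GeluMode.EXACT:
        # x * Phi(x), with Phi the standard normal CDF.
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    # tanh approximation of the same curve.
    return 0.5 * x * (1.0 + math.tanh(_K * (x + _C * x ** 3)))


def gelu_backward(x: float, grad_out: float,
                  mode: GeluMode = GeluMode.APPROXIMATE) -> float:
    """Gradient w.r.t. x given the upstream gradient grad_out."""
    if mode is GeluMode.EXACT:
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return grad_out * (cdf + x * pdf)
    t = math.tanh(_K * (x + _C * x ** 3))
    du_dx = _K * (1.0 + 3.0 * _C * x * x)
    return grad_out * (0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx)


for x in (-2.0, 0.0, 1.0):
    print(x, gelu_forward(x, GeluMode.EXACT), gelu_forward(x))
```

The approximate mode trades a small numerical difference for avoiding the `erf` evaluation.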

* feat(cuda): add gradient norm calculation and clipping interfaces
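For reference, global-norm gradient clipping can be sketched in a few lines of Python. The interfaces added in this commit are not shown in the PR, so the function names, the list-of-lists gradient layout, and the `eps` value below are assumptions.

```python
import math


def global_grad_norm(grads: list[list[float]]) -> float:
    """L2 norm over all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_grad_norm(grads: list[list[float]], max_norm: float,
                   eps: float = 1e-6) -> tuple[list[list[float]], float]:
    """Scale all gradients by one factor so the global norm is at most max_norm.

    Returns the (possibly scaled) gradients and the norm measured before
    clipping, which is the value usually logged.
    """
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + eps))
    return [[g * scale for g in tensor] for tensor in grads], norm


clipped, norm_before = clip_grad_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, global_grad_norm(clipped))
```

Using a single scale factor for every tensor preserves the direction of the overall update while bounding its size.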

---------

Co-authored-by: Max <eamon5174@gmail.com>

* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
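A per-step log line with `loss`, `ms`, and `tok/s` could be produced along these lines. The exact format used by the training loop is not shown in this PR, so the layout and the function names below are hypothetical.

```python
import time


def tokens_per_second(batch_size: int, block_size: int, step_seconds: float) -> float:
    """Throughput for one optimizer step over batch_size * block_size tokens."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return batch_size * block_size / step_seconds


def format_step(step: int, loss: float, step_seconds: float,
                batch_size: int, block_size: int) -> str:
    tok_s = tokens_per_second(batch_size, block_size, step_seconds)
    return (f"step {step:5d} | loss {loss:.4f} | "
            f"{step_seconds * 1e3:.1f} ms | {tok_s:.0f} tok/s")


# Usage: time a stand-in workload in place of forward/backward/update.
start = time.perf_counter()
sum(i * i for i in range(10_000))
elapsed = max(time.perf_counter() - start, 1e-9)
print(format_step(0, 4.0, elapsed, batch_size=64, block_size=256))  # loss is a placeholder
```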

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
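As a scalar reference for the update rule, the sketch below writes the AdamW moment updates as linear interpolations. It is not the CUDA kernel: stochastic rounding, 2D-grid batching, and master-weight handling are omitted, and all hyperparameter defaults other than the `6e-4` learning rate from this PR are illustrative assumptions.

```python
import math


def lerp(start: float, end: float, weight: float) -> float:
    """Linear interpolation: start when weight is 0, end when weight is 1."""
    return start + weight * (end - start)


def adamw_step(param: float, grad: float, m: float, v: float, step: int,
               lr: float = 6e-4, beta1: float = 0.9, beta2: float = 0.999,
               eps: float = 1e-8, weight_decay: float = 0.01):
    """One AdamW update for a single scalar parameter (step counts from 1)."""
    # lerp(grad, m, beta1) == beta1 * m + (1 - beta1) * grad
    m = lerp(grad, m, beta1)
    v = lerp(grad * grad, v, beta2)
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # Decoupled weight decay is applied to the parameter, not the gradient.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v


p, m1, v1 = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1)
print(p, m1, v1)
```

Expressing each moment update as a `lerp` keeps it to a single multiply-add per element.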

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* merge branch master  (#78)

* Add CUDA kernels, optimize CI, and update documentation (#74)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Optimize CI workflow and Docker configurations with refactoring (#72) (#76)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
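
A minimal sketch of how the per-step metrics named above (`loss`, `ms`, `tok/s`) could be derived. The function names and the assumption that one step consumes `batch_size * block_size` tokens are illustrative and not taken from the repository.

```python
def step_metrics(batch_size, block_size, elapsed_s, loss):
    """Build the per-step metrics: loss, step time in ms, tokens per second."""
    tokens = batch_size * block_size  # assumption: tokens consumed by one step
    ms = elapsed_s * 1000.0
    tok_per_s = tokens / elapsed_s if elapsed_s > 0 else float("inf")
    return {"loss": loss, "ms": ms, "tok/s": tok_per_s}


def format_step(step, metrics):
    """Render one log line per training step (hypothetical layout)."""
    return (
        f"step {step:5d} | loss {metrics['loss']:.4f} | "
        f"{metrics['ms']:.1f} ms | {metrics['tok/s']:.0f} tok/s"
    )
```

In a real loop, `elapsed_s` would come from a timer such as `time.perf_counter()` wrapped around the forward, backward, and optimizer calls.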

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

* docs: report [run_20260530_165216] (~791 tok/s) (#60)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#62)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



---------



* chore: clang-format configuration file based on LLVM (#63)



* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
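
The type helpers described above can be illustrated with a small Python analogue. This is only a sketch: the names `dtype_name`, `dtype_size`, and `checked_mul` come from the commit message, but their signatures, the `(ok, value)` return shape, and the 64-bit `SIZE_MAX` are assumptions, and the real code is C++.

```python
from enum import Enum

SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t on the target platform


class DType(Enum):
    F32 = "f32"
    F16 = "f16"
    BF16 = "bf16"
    I32 = "i32"
    U8 = "u8"


# Element width in bytes for each supported type.
_SIZES = {DType.F32: 4, DType.F16: 2, DType.BF16: 2, DType.I32: 4, DType.U8: 1}


def dtype_name(dt):
    return dt.value


def dtype_size(dt):
    return _SIZES[dt]


def checked_mul(a, b):
    """Return (ok, product); ok is False when a * b would not fit in size_t."""
    if a < 0 or b < 0:
        return (False, 0)
    product = a * b
    if product > SIZE_MAX:
        return (False, 0)
    return (True, product)


def allocation_bytes(num_elements, dt):
    """Hypothetical helper: byte count for a buffer, with overflow checking."""
    return checked_mul(num_elements, dtype_size(dt))
```

Returning a status alongside the value mirrors the non-throwing error propagation the commit describes: the caller checks `ok` before allocating.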

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
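
For reference, the two GeLU variants can be sketched on scalars as follows. The `Exact` mode uses the Gaussian CDF via `erf`, and the `Approximate` mode uses the common tanh formulation. The Python signatures are assumptions for illustration; the CUDA declarations also take a stream argument, which is omitted here.

```python
import math
from enum import Enum


class GeluMode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


_SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)
_COEFF = 0.044715  # cubic coefficient of the tanh approximation


def gelu_forward(x, mode=GeluMode.APPROXIMATE):
    if mode is GeluMode.EXACT:
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    inner = _SQRT_2_OVER_PI * (x + _COEFF * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_backward(x, grad_out, mode=GeluMode.APPROXIMATE):
    """Multiply the upstream gradient by d(gelu)/dx for the chosen mode."""
    if mode is GeluMode.EXACT:
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return grad_out * (cdf + x * pdf)
    inner = _SQRT_2_OVER_PI * (x + _COEFF * x**3)
    t = math.tanh(inner)
    d_inner = _SQRT_2_OVER_PI * (1.0 + 3.0 * _COEFF * x * x)
    return grad_out * (0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * d_inner)
```

The two modes agree to roughly three decimal places on typical activations, which is why the cheaper approximation is a reasonable default.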

* feat(cuda): add gradient norm calculation and clipping interfaces
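
A sketch of what global-norm gradient clipping typically computes, assuming an L2 norm taken over all parameter gradients. The function names and the list-of-lists gradient layout are hypothetical; the commit only adds the CUDA interfaces.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every gradient element across all tensors."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients down when their global norm exceeds max_norm.

    Returns the (possibly scaled) gradients and the norm before clipping.
    """
    norm = global_grad_norm(grads)
    if norm > max_norm:
        scale = max_norm / (norm + eps)
        grads = [[g * scale for g in tensor] for tensor in grads]
    return grads, norm
```

Returning the pre-clip norm is convenient for logging, since training reports in this project track gradient norms.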

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
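
The OS detection and checkpoint check described above might look roughly like this. It is a sketch under assumptions: the function names, the `npm.cmd` choice on Windows, and the virtualenv layout (`Scripts/python.exe` versus `bin/python`) follow common conventions and are not taken from the script itself.

```python
import os
import platform


def resolve_binaries(venv_dir, system=None):
    """Pick the npm and virtualenv python executables for the current OS."""
    system = system or platform.system()
    if system == "Windows":
        return {
            "npm": "npm.cmd",
            "python": os.path.join(venv_dir, "Scripts", "python.exe"),
        }
    return {"npm": "npm", "python": os.path.join(venv_dir, "bin", "python")}


def checkpoint_exists(path):
    """True when the model checkpoint file is present on disk."""
    return os.path.isfile(path)
```

Passing `system` explicitly keeps the selection logic testable on any platform.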

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
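
To illustrate the `lerp` formulation of the moment updates, here is a scalar AdamW step. It is a sketch, not the kernel: the default hyperparameters are assumptions, and stochastic rounding and multi-slice batching are not modeled.

```python
import math


def lerp(start, end, weight):
    """Linear interpolation: start when weight is 0, end when weight is 1."""
    return start + weight * (end - start)


def adamw_step(param, grad, m, v, step, lr=6e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a single scalar parameter (step counts from 1)."""
    # lerp(grad, m, beta1) equals beta1 * m + (1 - beta1) * grad.
    m = lerp(grad, m, beta1)
    v = lerp(grad * grad, v, beta2)
    m_hat = m / (1.0 - beta1**step)  # bias correction
    v_hat = v / (1.0 - beta2**step)
    # Decoupled weight decay is applied to the parameter directly.
    param -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```

Writing the moment update as a single `lerp` maps to one fused multiply-add per moment, which is presumably why the kernel uses it.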

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` wrappers with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation
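
MFU (model FLOPs utilization) is the ratio of achieved to peak floating-point throughput. A rough sketch follows, assuming the common estimate of about `6 * N` FLOPs per token for a forward and backward pass over `N` parameters (attention FLOPs ignored). The function name and the way peak throughput is supplied are hypothetical.

```python
def estimate_mfu(num_params, tokens_per_second, peak_flops_per_second):
    """Fraction of the device's peak FLOP/s that training actually achieves."""
    flops_per_token = 6 * num_params  # assumption: forward + backward estimate
    achieved = flops_per_token * tokens_per_second
    return achieved / peak_flops_per_second
```

For example, a 6.68M-parameter model at ~791 tok/s achieves about `6 * 6.68e6 * 791`, roughly 3.2e10 FLOP/s, which would then be divided by the peak figure looked up for the GPU in use.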

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

* Refactor core architecture and optimize CUDA features (#75)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

---------



* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` wrappers with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

* Add CUDA kernels, optimize CI, and update documentation (#74)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

---------



* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

---------



* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` wrappers with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Refactor .dockerignore for improved clarity

Removed unnecessary entries to streamline the build context.

* Enhance .gitignore with additional exclusions

Expanded .gitignore to include more file types and directories.

* Modify training configuration parameters (#80)

Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update main.py

* Delete quadtrix_training_report.png

* Delete docker-compose.yml

* Delete docker-compose.gpu.yml

* Delete docker-compose.dev.yml

* Delete benchmark_results.csv

* Delete SECURITY.md

* refactor: switch tokenizer from gpt2 to tiktoken o200k

* Delete contributing.md

* Delete CUDA/llmcpp directory

* Delete CUDA directory

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Optimization (#82)

* Modify training configuration parameters

Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.

* Refactor training configuration and optimize CI workflows (#81)

* Modify training configuration parameters (#80)

Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.

* merge branch master of codeaddict (#77)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
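
The idea behind `dtype_size` and `checked_mul` can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++ header: the byte sizes are the standard widths of the listed types, while the 64-bit limit and the `(ok, value)` return convention are assumptions standing in for `size_t` and the `Status` struct.

```python
# Standard byte widths for the dtypes named in the commit message.
DTYPE_SIZES = {"F32": 4, "F16": 2, "BF16": 2, "I32": 4, "U8": 1}

# Assumed stand-in for the maximum value of a 64-bit size_t.
SIZE_MAX = 2**64 - 1


def dtype_size(name):
    """Return the size in bytes of one element of the given dtype."""
    return DTYPE_SIZES[name]


def checked_mul(a, b, limit=SIZE_MAX):
    """Multiply two sizes, reporting overflow instead of wrapping.

    Returns (True, product) on success and (False, 0) if either operand
    is negative or the product would exceed `limit`.
    """
    if a < 0 or b < 0:
        return False, 0
    product = a * b
    if product > limit:
        return False, 0
    return True, product


def allocation_bytes(num_elements, dtype):
    """Compute a buffer size in bytes without silently overflowing."""
    return checked_mul(num_elements, dtype_size(dtype))
```

Returning a status alongside the value mirrors the non-throwing error propagation described above: the caller decides how to react to an oversized allocation request.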

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
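
For reference, the two GeLU variants can be written out in a few lines of Python. The formulas are the standard exact (erf-based) and tanh-approximate definitions; the Python enum and function below only mirror the names in the commit message and are not the CUDA entrypoints.

```python
import math
from enum import Enum


class GeluMode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


def gelu_forward(x, mode=GeluMode.APPROXIMATE):
    """Scalar GeLU; the default mode follows the commit's stated default."""
    if mode is GeluMode.EXACT:
        # x * Phi(x), with Phi the standard normal CDF.
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    # Tanh approximation of the same function.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

The two modes agree closely over typical activation ranges, so the approximate form is a common default when `tanh` is cheaper than `erf` on the target hardware.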

* feat(cuda): add gradient norm calculation and clipping interfaces
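
A minimal Python sketch of global-norm gradient clipping, the technique such interfaces usually implement. The function name, the flat-list representation, and the `eps` guard are assumptions for illustration; the actual CUDA signatures are not shown in this log.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their global L2 norm does not exceed max_norm.

    Returns (total_norm, clipped_grads); gradients are left unchanged
    when the norm is already within the limit.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return total_norm, list(grads)
    scale = max_norm / (total_norm + eps)
    return total_norm, [g * scale for g in grads]
```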

---------

Co-authored-by: Max <eamon5174@gmail.com>

* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
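
The per-step metrics listed above can be derived roughly as in the sketch below. It assumes each step processes `batch_size * block_size` tokens; the function names and the exact log layout are hypothetical and not taken from `main.py`.

```python
def tokens_per_second(batch_size, block_size, step_ms):
    """Throughput of one training step, given its wall time in milliseconds."""
    tokens = batch_size * block_size
    return tokens / (step_ms / 1000.0)


def format_step_log(step, loss, step_ms, batch_size, block_size):
    """Build one per-step log line with loss, ms, and tok/s."""
    tok_s = tokens_per_second(batch_size, block_size, step_ms)
    return f"step {step:5d} | loss {loss:.4f} | {step_ms:.1f} ms | {tok_s:.0f} tok/s"
```

With this PR's `batch_size` of 64 and `block_size` of 256, a step handles 16,384 tokens, so a step that takes one second corresponds to 16,384 tok/s.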

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.

* Update README.md with new banner for Quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
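
The OS detection step can be sketched as below. This is an assumption-laden illustration rather than the script itself: the `venv` directory name and the returned dictionary are hypothetical, while `npm.cmd` and the `Scripts` versus `bin` layout follow the usual Windows and POSIX conventions.

```python
import platform


def resolve_binaries(system=None):
    """Pick the npm and virtualenv python executables for the host OS.

    `system` defaults to platform.system(); it is a parameter so the
    selection logic can be exercised without changing machines.
    """
    system = system or platform.system()
    if system == "Windows":
        # npm ships as a .cmd shim on Windows; venvs use a Scripts directory.
        return {"npm": "npm.cmd", "python": "venv/Scripts/python.exe"}
    # Linux and macOS share the POSIX layout.
    return {"npm": "npm", "python": "venv/bin/python"}
```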

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* merge branch master  (#78)

* Add CUDA kernels, optimize CI, and update documentation (#74)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

---------

Co-authored-by: Max <eamon5174@gmail.com>

* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.

* Update README.md with new banner for Quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Optimize CI workflow and Docker configurations with refactoring (#72) (#76)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
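
As a rough illustration of the per-step metrics named above (`loss`, `ms`, `tok/s`), the Python sketch below derives throughput from a step's wall-clock time. The helper names `tokens_per_second` and `format_step_log` and the log layout are assumptions for illustration, not the project's actual output format.

```python
def tokens_per_second(batch_size: int, block_size: int, step_ms: float) -> float:
    """Tokens processed per second for one training step."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    tokens = batch_size * block_size  # tokens consumed by a single step
    return tokens / (step_ms / 1000.0)


def format_step_log(step: int, loss: float, batch_size: int,
                    block_size: int, step_ms: float) -> str:
    """Build one log line per training step (hypothetical layout)."""
    tps = tokens_per_second(batch_size, block_size, step_ms)
    return f"step {step:5d} | loss {loss:.4f} | {step_ms:.1f} ms | {tps:.0f} tok/s"


if __name__ == "__main__":
    print(format_step_log(1, 4.1319, batch_size=16, block_size=32, step_ms=500.0))
```

Because throughput is `batch_size * block_size` divided by step time, raising either value changes `tok/s` even if the per-step latency stays the same.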

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.

* Update README.md with new banner for Quadtrix.cpp

* docs: report [run_20260530_165216] (~791 tok/s) (#60)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#62)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



---------



* chore: clang-format configuration file based on LLVM (#63)



* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
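
The idea behind a non-throwing `Status` paired with `checked_mul` can be sketched in Python. This is only an emulation: the commit describes C++ code, Python integers never overflow, and the 64-bit `SIZE_MAX` bound, the field names, and the return shape below are all assumptions made for illustration.

```python
from dataclasses import dataclass

SIZE_MAX = 2**64 - 1  # assumed 64-bit size_t limit (hypothetical)


@dataclass(frozen=True)
class Status:
    """Error value returned instead of raising an exception."""
    ok: bool
    message: str = ""


def checked_mul(a: int, b: int, limit: int = SIZE_MAX):
    """Return (Status, product); product is 0 when a * b would exceed limit."""
    if a < 0 or b < 0:
        return Status(False, "negative operand"), 0
    # Compare against limit // a so the check itself cannot overflow
    # in a fixed-width language.
    if a != 0 and b > limit // a:
        return Status(False, "size overflow"), 0
    return Status(True), a * b


if __name__ == "__main__":
    status, nbytes = checked_mul(1024, 4)  # e.g. element count * element size
    print(status.ok, nbytes)
```

Callers inspect `status.ok` before using the size, which keeps allocation failures on the ordinary return path rather than in exception handling.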

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
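
For reference, the two GeLU variants differ only in how the Gaussian CDF is evaluated: the exact form uses `erf`, the approximate form uses the common tanh fit. The scalar Python sketch below reuses the names `GeluMode`, `gelu_forward`, and `gelu_backward` from the commit message, but the signatures are stand-ins; the real declarations are CUDA kernel entry points with stream arguments that are not reproduced here.

```python
import math
from enum import Enum


class GeluMode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


_K = math.sqrt(2.0 / math.pi)  # constant in the tanh approximation
_C = 0.044715                  # cubic coefficient in the tanh approximation


def gelu_forward(x: float, mode: GeluMode = GeluMode.APPROXIMATE) -> float:
    if mode is GeluMode.EXACT:
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    u = _K * (x + _C * x**3)
    return 0.5 * x * (1.0 + math.tanh(u))


def gelu_backward(x: float, grad_out: float,
                  mode: GeluMode = GeluMode.APPROXIMATE) -> float:
    """Gradient of the loss w.r.t. x, given the gradient w.r.t. the output."""
    if mode is GeluMode.EXACT:
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return grad_out * (cdf + x * pdf)
    u = _K * (x + _C * x**3)
    t = math.tanh(u)
    du_dx = _K * (1.0 + 3.0 * _C * x * x)
    return grad_out * (0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx)
```

The two modes agree closely for typical activation magnitudes, so the approximate default mainly trades a small numerical difference for a cheaper evaluation.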

* feat(cuda): add gradient norm calculation and clipping interfaces
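
Gradient norm clipping is usually done against the global L2 norm across all parameter tensors. A minimal Python sketch of that standard technique follows; the function names and the list-of-lists gradient layout are hypothetical and do not reflect the CUDA interfaces added by this commit.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every element of every gradient tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so the global norm is at most max_norm.

    Returns (clipped_grads, original_norm).
    """
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, norm


if __name__ == "__main__":
    clipped, norm = clip_grad_norm([[3.0], [4.0]], max_norm=1.0)
    print(norm, clipped)
```

Using a single scale factor for all tensors preserves the direction of the overall gradient, which per-tensor clipping would not.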

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
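
The OS detection and checkpoint check described in this commit could look roughly like the following Python sketch. The `venv` directory name, the function names, and the returned dictionary shape are assumptions for illustration; the actual script's layout is not shown in this log.

```python
import platform
from pathlib import Path
from typing import Optional


def resolve_binaries(project_root, os_name: Optional[str] = None) -> dict:
    """Pick the npm and virtualenv python executables for the current OS."""
    os_name = os_name or platform.system()
    is_windows = os_name == "Windows"
    venv = Path(project_root) / "venv"  # assumed virtualenv directory name
    python = venv / ("Scripts/python.exe" if is_windows else "bin/python")
    return {"npm": "npm.cmd" if is_windows else "npm", "python": python}


def checkpoint_exists(path) -> bool:
    """True when the model checkpoint file is present on disk."""
    return Path(path).is_file()


if __name__ == "__main__":
    print(resolve_binaries(".", "Linux"))
    print(checkpoint_exists("model.pt"))  # hypothetical checkpoint path
```

Passing `os_name` explicitly makes the selection testable without depending on the machine that runs the check.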

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
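
To show how `lerp` fits into an AdamW update, here is a scalar Python sketch of one optimizer step. It follows the standard AdamW formulation with decoupled weight decay; the function signature is hypothetical, and stochastic rounding, 2D-grid batching, and `init_from_master` are not modeled.

```python
import math


def lerp(start: float, end: float, weight: float) -> float:
    """Linear interpolation: start + weight * (end - start)."""
    return start + weight * (end - start)


def adamw_step(param, grad, m, v, step, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """One AdamW update for a single scalar parameter; step counts from 1."""
    # lerp(grad, m, beta1) equals beta1 * m + (1 - beta1) * grad.
    m = lerp(grad, m, beta1)
    v = lerp(grad * grad, v, beta2)
    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1.0 - beta1**step)
    v_hat = v / (1.0 - beta2**step)
    # Decoupled weight decay is applied to the parameter, not the gradient.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v


if __name__ == "__main__":
    print(adamw_step(1.0, 0.5, 0.0, 0.0, step=1, lr=0.1))
```

Writing the moment updates as interpolations makes each one a single fused multiply-add style expression, which is one reason the form is popular in kernels.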

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

* Refactor core architecture and optimize CUDA features (#75)


---------






* Add CUDA kernels, optimize CI, and update documentation (#74)

* CUDA header declarations for LayerNorm forward and backward (#66)

---------



* Add CUDA attention kernels, gradient norms, and CI improvements (#69)


---------







Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Refactor .dockerignore for improved clarity

Removed unnecessary entries to streamline the build context.

* Enhance .gitignore with additional exclusions

Expanded .gitignore to include more file types and directories.

* Modify training configuration parameters (#80)

Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update main.py

* Delete quadtrix_training_report.png

* Delete docker-compose.yml

* Delete docker-compose.gpu.yml

* Delete docker-compose.dev.yml

* Delete benchmark_results.csv

* Delete SECURITY.md

* refactor: switch tokenizer from gpt2 to tiktoken o200k

* Delete contributing.md

* Delete CUDA/llmcpp directory

* Delete CUDA directory

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: implement Head and MultiHeadAttention modules in C++

* feat: add core Tensor class with SIMD-accelerated math operations

* add inference sampler parameters and repetition penalty

* Linear layer with binary serialization

Added Linear struct mimicking torch.nn.Linear with conditional bias support.

* FeedForward network module

Added FeedForward struct mirroring the standard Transformer position-wise MLP block.

* LayerNorm module with serialization

Added LayerNorm struct mirroring torch.nn.LayerNorm(n_embd).

* gradient and activation tracking structures for backpropagation

* single Transformer Block with Pre-LN architecture

Added Block struct mirroring the standard Transformer layer composition.

* character-level DataLoader with batch sampling

Added DataLoader struct for character-level tokenization and text processing.

* Embedding layer with token and position mapping

Added Embedding struct mirroring torch.nn.Embedding.

* BPE tokenizer and training pipeline in DataLoader  Upgraded DataLoader fr…
* Optimization (#82)

* Modify training configuration parameters

Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.

* Refactor training configuration and optimize CI workflows (#81)

* Modify training configuration parameters (#80)

Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.

* merge branch master of codeaddict (#77)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
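
The size helpers above can be illustrated with a small Python sketch. This is not the project's C++ header: the enum values, the 64-bit `SIZE_MAX` limit, the `(ok, value)` return shape, and the `allocation_bytes` helper are all assumptions made for illustration. Only the element widths (4, 2, 2, 4, 1 bytes) are standard facts.

```python
from enum import Enum


class DType(Enum):
    # Names follow the commit message; the string values are hypothetical.
    F32 = "f32"
    F16 = "f16"
    BF16 = "bf16"
    I32 = "i32"
    U8 = "u8"


# Standard storage width in bytes for each element type.
_DTYPE_SIZE = {DType.F32: 4, DType.F16: 2, DType.BF16: 2, DType.I32: 4, DType.U8: 1}

# Assumption: sizes are held in an unsigned 64-bit integer.
SIZE_MAX = 2**64 - 1


def dtype_size(dtype: DType) -> int:
    """Return the size of one element in bytes."""
    return _DTYPE_SIZE[dtype]


def checked_mul(a: int, b: int) -> tuple[bool, int]:
    """Multiply two sizes, reporting failure instead of wrapping around."""
    if a < 0 or b < 0:
        return False, 0
    if a != 0 and b > SIZE_MAX // a:
        return False, 0  # product would not fit in 64 bits
    return True, a * b


def allocation_bytes(shape: tuple[int, ...], dtype: DType) -> tuple[bool, int]:
    """Hypothetical helper: byte count for a tensor of the given shape."""
    total = dtype_size(dtype)
    for dim in shape:
        ok, total = checked_mul(total, dim)
        if not ok:
            return False, 0
    return True, total
```

Returning a success flag next to the value mirrors the non-throwing `Status` style described in the commit, so an oversized request is rejected before any allocation is attempted.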

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
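
For reference, the two variants can be sketched in Python. This assumes `Exact` means the erf-based definition and `Approximate` means the common tanh approximation; the commit only names the modes, so the formulas and function names below are illustrative rather than the kernel's actual code.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU via the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_approximate(x: float) -> float:
    """GELU via the tanh approximation."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x**3)))
```

The two forms agree closely over typical activation ranges, which is one reason the cheaper tanh form is a plausible default.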

* feat(cuda): add gradient norm calculation and clipping interfaces

---------

Co-authored-by: Max <eamon5174@gmail.com>

* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
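
A minimal sketch of how the three per-step metrics could be derived, written in Python for brevity. It assumes each step processes `batch_size * block_size` tokens and that the step time is measured in seconds; the function name and dictionary layout are hypothetical, not taken from `main`.

```python
def step_metrics(loss: float, batch_size: int, block_size: int, elapsed_s: float) -> dict:
    """Build the per-step log record: loss, step time in ms, and throughput."""
    tokens = batch_size * block_size  # assumption: tokens consumed per step
    return {
        "loss": loss,
        "ms": elapsed_s * 1e3,
        "tok/s": tokens / elapsed_s,
    }


def format_step(step: int, metrics: dict) -> str:
    """Render one log line for the given step."""
    return (
        f"step {step:5d} | loss {metrics['loss']:.4f} | "
        f"{metrics['ms']:.1f} ms | {metrics['tok/s']:.0f} tok/s"
    )
```

In a training loop, `elapsed_s` would typically come from `time.perf_counter()` readings taken before and after the step.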

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.

* Update README.md with new banner for quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* merge branch master  (#78)

* Add CUDA kernels, optimize CI, and update documentation (#74)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

---------

Co-authored-by: Max <eamon5174@gmail.com>

* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.

* Update README.md with new banner for quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Optimize CI workflow and Docker configurations with refactoring (#72) (#76)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
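
The vectorized path with a scalar fallback described above might look roughly like the following. This is a sketch only: the free function `add_inplace` operating on raw pointers is an assumption about the shape of the API, not the `Tensor` code itself.

```cpp
#include <cstddef>
#if defined(__AVX__) || defined(__SSE__)
#include <immintrin.h>
#endif

// Hypothetical in-place add: dst[i] += src[i] for i in [0, n).
inline void add_inplace(float* dst, const float* src, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Process 8 floats per iteration with unaligned AVX loads and stores.
    for (; i + 8 <= n; i += 8) {
        const __m256 a = _mm256_loadu_ps(dst + i);
        const __m256 b = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(a, b));
    }
#elif defined(__SSE__)
    // Process 4 floats per iteration with unaligned SSE loads and stores.
    for (; i + 4 <= n; i += 4) {
        const __m128 a = _mm_loadu_ps(dst + i);
        const __m128 b = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(a, b));
    }
#endif
    // Scalar tail; also the full path on platforms without AVX/SSE.
    for (; i < n; ++i) {
        dst[i] += src[i];
    }
}
```

The trailing scalar loop handles lengths that are not a multiple of the vector width, which is what "non-vectorized bounds" appears to refer to.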

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
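
For the per-step metrics, throughput follows from the number of tokens in a step and the step's wall-clock time. A minimal sketch, assuming one step processes `batch_size * block_size` tokens; both function names and the log format are illustrative, not taken from the project:

```cpp
#include <cstdio>

// Tokens processed per second for one training step of `step_ms` milliseconds.
inline double tokens_per_second(int batch_size, int block_size, double step_ms) {
    return static_cast<double>(batch_size) * block_size / (step_ms / 1000.0);
}

// Hypothetical per-step log line with loss, step time, and throughput.
inline void log_step(int step, float loss, double step_ms,
                     int batch_size, int block_size) {
    std::printf("step %d | loss %.4f | %.2f ms | %.0f tok/s\n", step, loss, step_ms,
                tokens_per_second(batch_size, block_size, step_ms));
}
```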

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

* docs:report [run_20260530_165216](~791 tok/s) (#60)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#62)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



---------



* chore: clang-format configuration file based on LLVM (#63)



* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
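
As an illustration of the size helpers listed above, the sketch below shows one plausible shape for `dtype_size` and an overflow-checked multiply. The signatures are assumptions: the real `checked_mul` may report failure through the `Status` struct rather than a `bool`.

```cpp
#include <cstddef>
#include <limits>

enum class DType { F32, F16, BF16, I32, U8 };

// Size in bytes of one element of the given type.
constexpr std::size_t dtype_size(DType t) {
    switch (t) {
        case DType::F32:  return 4;
        case DType::F16:  return 2;
        case DType::BF16: return 2;
        case DType::I32:  return 4;
        case DType::U8:   return 1;
    }
    return 0;  // unreachable for valid enum values
}

// Hypothetical signature: writes a * b to *out and returns true,
// or returns false (leaving *out untouched) if the product would overflow.
inline bool checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return false;
    }
    *out = a * b;
    return true;
}
```

An allocation size can then be computed as `checked_mul(num_elements, dtype_size(dtype), &bytes)` and rejected before any allocation call when it returns `false`.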

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
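
The two variants correspond to the standard GeLU definitions: the exact form uses the error function, and the approximate form uses the tanh approximation. A host-side reference sketch follows; the function name `gelu_reference` is an assumption, and the actual CUDA entrypoints take device buffers and a stream.

```cpp
#include <cmath>

enum class GeluMode { Exact, Approximate };

// Scalar reference for one element.
//   Exact:       0.5 * x * (1 + erf(x / sqrt(2)))
//   Approximate: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float gelu_reference(float x, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        const float inv_sqrt2 = 0.70710678f;
        return 0.5f * x * (1.0f + std::erf(x * inv_sqrt2));
    }
    const float sqrt_2_over_pi = 0.79788456f;
    const float inner = sqrt_2_over_pi * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}
```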

* feat(cuda): add gradient norm calculation and clipping interfaces
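
Gradient clipping by global norm typically computes the L2 norm over all gradients and rescales them when it exceeds a threshold. A CPU sketch of that idea, with a hypothetical function name and a `std::vector` standing in for device memory:

```cpp
#include <cmath>
#include <vector>

// Computes the global L2 norm of `grads`; if it exceeds `max_norm`,
// rescales the gradients in place so the norm equals `max_norm`.
// Returns the norm measured before clipping.
inline float clip_grad_norm(std::vector<float>& grads, float max_norm) {
    double sum_sq = 0.0;  // accumulate in double to limit rounding error
    for (float g : grads) {
        sum_sq += static_cast<double>(g) * g;
    }
    const float norm = static_cast<float>(std::sqrt(sum_sq));
    if (norm > max_norm) {
        const float scale = max_norm / norm;
        for (float& g : grads) {
            g *= scale;
        }
    }
    return norm;
}
```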

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
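
The `lerp` formulation expresses each moment update as an interpolation between the new gradient term and the running average, which maps onto two fused multiply-adds. Below is a scalar CPU sketch of one AdamW step written that way. The names `lerp_weighted`, `AdamWState`, and `adamw_step` are assumptions, and stochastic rounding and master-weight handling are left out.

```cpp
#include <cmath>

// weight * end + (1 - weight) * start, computed with two FMAs.
inline float lerp_weighted(float start, float end, float weight) {
    return std::fma(weight, end, std::fma(-weight, start, start));
}

struct AdamWState {
    float m = 0.0f;  // first moment (running mean of gradients)
    float v = 0.0f;  // second moment (running mean of squared gradients)
};

// One AdamW update for a single parameter; `t` is the 1-based step count.
inline float adamw_step(float param, float grad, AdamWState& s, int t,
                        float lr, float beta1, float beta2,
                        float eps, float weight_decay) {
    s.m = lerp_weighted(grad, s.m, beta1);         // beta1 * m + (1 - beta1) * grad
    s.v = lerp_weighted(grad * grad, s.v, beta2);  // beta2 * v + (1 - beta2) * grad^2
    const float m_hat = s.m / (1.0f - std::pow(beta1, static_cast<float>(t)));
    const float v_hat = s.v / (1.0f - std::pow(beta2, static_cast<float>(t)));
    // Decoupled weight decay is applied alongside the adaptive update.
    return param - lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * param);
}
```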

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

* Refactor core architecture and optimize CUDA features (#75)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

---------



* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------






* Add CUDA kernels, optimize CI, and update documentation (#74)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

---------



* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

---------



* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat: add PyTorch-compatible Mersenne Twister random utilities

* README: Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` wrappers

- Report explicit crash details and project-specific troubleshooting hints on failure.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Refactor .dockerignore for improved clarity

Removed unnecessary entries to streamline the build context.

* Enhance .gitignore with additional exclusions

Expanded .gitignore to include more file types and directories.

* Modify training configuration parameters (#80)

Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update main.py

* Delete quadtrix_training_report.png

* Delete docker-compose.yml

* Delete docker-compose.gpu.yml

* Delete docker-compose.dev.yml

* Delete benchmark_results.csv

* Delete SECURITY.md

* refactor: switch tokenizer from gpt2 to tiktoken o200k

* Delete contributing.md

* Delete CUDA/llmcpp directory

* Delete CUDA directory

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: implement Head and MultiHeadAttention modules in C++

* feat: add core Tensor class with SIMD-accelerated math operations

* add inference sampler parameters and repetition penalty

* Linear layer with binary serialization

Added Linear struct mimicking torch.nn.Linear with conditional bias support.

* FeedForward network module

Added FeedForward struct mirroring the standard Transformer position-wise MLP block.

* LayerNorm module with serialization

Added LayerNorm struct mirroring torch.nn.LayerNorm(n_embd).

* gradient and activation tracking structures for backpropagation

* single Transformer Block with Pre-LN architecture

Added Block struct mirroring the standard Transformer layer composition.

* character-level DataLoader with batch sampling

Added DataLoader struct for character-level tokenization and text processing.

* Embedding layer with token and position mapping

Added Embedding struct mirroring torch.nn.Embedding.

* BPE tokenizer and training pipeline in DataLoader

Upgraded DataLoader fr…
@Eamon2009 Eamon2009 self-assigned this Jun 15, 2026
@Eamon2009

Owner Author

/run-checks

@github-actions

✅ All checks passed!

@Eamon2009 Eamon2009 linked an issue Jun 15, 2026 that may be closed by this pull request
@Eamon2009

Owner Author

/run-checks

@Eamon2009 Eamon2009 added the enhancement New feature or request label Jun 15, 2026
@github-actions

✅ All checks passed!

@github-actions

✅ All checks passed!

@Eamon2009 Eamon2009 merged commit 6703c1e into master Jun 15, 2026
6 checks passed

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Optimized Hyperparameters for ~6M Parameter Model

3 participants