Commit ed7cc40
Refactor core architecture and optimize CUDA features (#2)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor: optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor: Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor(ci): consolidate manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor(ui): migrate message bubble layout to inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactor(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for remainder elements and for platforms lacking these hardware extensions.
- Explicitly defined the rule-of-five special member functions (defaulted, with `noexcept` moves) in the `Tensor` struct.
- Reduced copies and reallocations during vector initialization in the core construction code via `std::move` and `std::vector::reserve`.
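
For illustration, the vectorize-then-fall-back pattern described in the bullets above might look roughly like the following. This is a minimal sketch, not the project's code: the names `add_f32` and `scale_f32` and their signatures are hypothetical, and only the `__AVX__`/`__SSE__` guards come from the commit description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
inline void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#if defined(__AVX__)
    // Process 8 floats per iteration with unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__SSE__)
    // Process 4 floats per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar path: handles the tail, or everything when no SIMD is available.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}

// Hypothetical helper: x[i] *= s for i in [0, n).
inline void scale_f32(float* x, float s, std::size_t n) {
    assert(n == 0 || x);
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 vs = _mm256_set1_ps(s);
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vs));
    }
#elif defined(__SSE__)
    const __m128 vs = _mm_set1_ps(s);
    for (; i + 4 <= n; i += 4) {
        _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), vs));
    }
#endif
    for (; i < n; ++i) x[i] *= s;
}
```

Passing the same pointer as `a` and `out` gives an in-place add, since each element is read before it is written.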
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation output with per-step logging of `loss`, `ms`, and `tok/s`.
- Moved the initial validation loss calculation out of the iteration cycle to establish a baseline before the first step.
- Restructured token streaming so that sample generation is triggered conditionally inside the training loop, after evaluation windows.
- Streamlined architecture parameter reporting and consolidated the printed command-line configuration.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Reduce the memory footprint by recomputing forward activations for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under memory pressure to prevent out-of-memory (OOM) crashes.
* Update README.md with new banner for Quadtrix.cpp
---------
Co-authored-by: Max <eamon5174@gmail.com>
* chore: clang-format configuration file based on LLVM (#63)
Co-authored-by: Eamon <eamon112009@gmail.com>
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000
Achieved best validation loss of 4.1319 at step 3900.
Co-authored-by: Max <eamon5174@gmail.com>
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
Co-authored-by: Max <eamon5174@gmail.com>
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
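
As a rough illustration of the non-throwing style described above, the following host-only sketch shows how an overflow-checked size computation could be combined with a dtype size lookup. The names `DType`, `dtype_size`, `Status`, and `checked_mul` come from the commit description, but every signature and field here is an assumption, not the project's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Element types listed in the commit description.
enum class DType { F32, F16, BF16, I32, U8 };

// Assumed signature: size in bytes of one element of the given type.
inline std::size_t dtype_size(DType t) {
    switch (t) {
        case DType::F32:  return 4;
        case DType::F16:  return 2;
        case DType::BF16: return 2;
        case DType::I32:  return 4;
        case DType::U8:   return 1;
    }
    return 0;  // unreachable for valid enum values
}

// Assumed layout: a plain result object instead of an exception.
struct Status {
    bool ok;
    const char* message;
};

// Assumed signature: writes a * b to *out only if the product fits in size_t.
inline Status checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    assert(out != nullptr);
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return {false, "allocation size overflows size_t"};
    }
    *out = a * b;
    return {true, ""};
}
```

A caller would compute `element_count * dtype_size(dtype)` through `checked_mul` and bail out on a failed `Status` before attempting any allocation.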
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
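
For reference, the two GeLU variants are commonly defined as the erf-based form and the tanh-based approximation. The sketch below is a scalar host-side reference under that assumption, not the declared CUDA kernels: `gelu_reference` is a hypothetical name, and only `GeluMode` and its default of `Approximate` come from the commit description.

```cpp
#include <cassert>
#include <cmath>

enum class GeluMode { Exact, Approximate };

// Hypothetical scalar reference; the real entrypoints operate on device buffers.
inline float gelu_reference(float x, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        // 0.5 * x * (1 + erf(x / sqrt(2)))
        return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
    }
    // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    const float k = 0.79788456f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}
```

The two modes agree closely for typical activations, so the tanh form is a common default when exact agreement with the erf form is not required.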
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin (CORS) setup by linking the frontend and backend ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
dependency-version: '9'
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
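
To make the stochastic rounding idea concrete, here is a small host-side sketch of one common way to round an FP32 value to BF16 stochastically: add random bits to the 16 mantissa bits that are about to be discarded, then truncate. This is an illustration under stated assumptions, not the CUDA kernel from this commit. The function names are hypothetical, and NaN, infinity, and overflow at the top of the range are deliberately not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the BF16 bit pattern for x.
// random_bits should come from a uniform RNG; only the low 16 bits are used.
inline std::uint16_t stochastic_round_bf16(float x, std::uint32_t random_bits) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    // A carry into the upper half happens with probability proportional to
    // the discarded fraction, so the rounding is unbiased in expectation.
    bits += random_bits & 0xFFFFu;
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: widens a BF16 bit pattern back to FP32.
inline float bf16_to_float(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Values that are already representable in BF16 pass through unchanged for any random input, because their low 16 bits are zero and no carry can occur.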
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat: add PyTorch-compatible Mersenne Twister random utilities
* README: enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
* Refactor core architecture and optimize CUDA features (#75)
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon Sippy <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add CUDA kernels, optimize CI, and update documentation (#74)
* CUDA header declarations for (LayerNorm) forward and backward (#66)
---------
Co-authored-by: Max <eamon5174@gmail.com>
* Add CUDA attention kernels, gradient norms, and CI improvements (#69)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.
* Update README.md with new banner for qudtrix.cpp
---------
Co-authored-by: Max <eamon5174@gmail.com>
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
Co-authored-by: Max <eamon5174@gmail.com>
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
Co-authored-by: Max <eamon5174@gmail.com>
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
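As a rough illustration of the `lerp` formulation mentioned above, the following is a minimal CPU sketch of one AdamW step, not the fused CUDA kernel itself. The names `AdamWConfig`, `lerp_f32`, and `adamw_step` are hypothetical, and stochastic rounding, 2D-grid batching, and `init_from_master` are left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical hyperparameter bundle; field names are illustrative only.
struct AdamWConfig {
    float lr = 1e-3f;
    float beta1 = 0.9f;
    float beta2 = 0.999f;
    float eps = 1e-8f;
    float weight_decay = 0.0f;
};

// start + weight * (end - start), written as two fused multiply-adds.
inline float lerp_f32(float start, float end, float weight) {
    return std::fma(weight, end, std::fma(-weight, start, start));
}

// One AdamW step on the CPU; `step` is the 1-based iteration count.
void adamw_step(std::vector<float>& params, const std::vector<float>& grads,
                std::vector<float>& m, std::vector<float>& v,
                int step, const AdamWConfig& cfg) {
    const float bias1 = 1.0f - std::pow(cfg.beta1, static_cast<float>(step));
    const float bias2 = 1.0f - std::pow(cfg.beta2, static_cast<float>(step));
    for (std::size_t i = 0; i < params.size(); ++i) {
        const float g = grads[i];
        // lerp(g, m, beta1) == beta1 * m + (1 - beta1) * g
        m[i] = lerp_f32(g, m[i], cfg.beta1);
        v[i] = lerp_f32(g * g, v[i], cfg.beta2);
        const float m_hat = m[i] / bias1;  // bias-corrected first moment
        const float v_hat = v[i] / bias2;  // bias-corrected second moment
        params[i] -= cfg.lr * (m_hat / (std::sqrt(v_hat) + cfg.eps) +
                               cfg.weight_decay * params[i]);
    }
}
```

Writing the moment updates as an interpolation between the gradient and the running average avoids computing `1 - beta` separately and maps onto fused multiply-add instructions.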
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Optimize CI workflow and Docker configurations with refactoring (#72) (#76)
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
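The vectorized path with a scalar fallback described above can be sketched roughly as follows. This is a minimal illustration on raw float arrays, not the project's `Tensor` code; the function name `add_f32` is hypothetical, and unaligned loads are assumed so that no alignment requirement is placed on the caller.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n). Hypothetical helper name.
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Process 8 floats per iteration with 256-bit registers.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__SSE__)
    // Process 4 floats per iteration with 128-bit registers.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail: leftover elements, or the whole array without SIMD.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The trailing scalar loop covers both cases named in the commit: lengths that are not a multiple of the vector width, and platforms where neither `__AVX__` nor `__SSE__` is defined.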
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.
* Update README.md with new banner for Quadtrix.cpp
* docs: report [run_20260530_165216] (~791 tok/s) (#60)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#62)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
---------
* chore: clang-format configuration file based on LLVM (#63)
* ci: add manual PR checks workflow with slash command support
* ci: add manual PR checks workflow with slash command support
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
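To make the non-throwing error path concrete, here is a minimal sketch of how `dtype_size`, `Status`, and `checked_mul` might fit together when computing an allocation size. Only the names come from the commit; the signatures and the `Status` fields shown here are assumptions, and the CUDA-specific macros are omitted.

```cpp
#include <cstddef>
#include <limits>
#include <string>

enum class DType { F32, F16, BF16, I32, U8 };

// Size in bytes of one element of the given type.
inline std::size_t dtype_size(DType t) {
    switch (t) {
        case DType::F32:  return 4;
        case DType::F16:  return 2;
        case DType::BF16: return 2;
        case DType::I32:  return 4;
        case DType::U8:   return 1;
    }
    return 0;  // unreachable for valid enum values
}

// Assumed layout: a success flag plus a human-readable message.
struct Status {
    bool ok;
    std::string message;
};

// Writes a * b to *out only if the product fits in std::size_t.
inline Status checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return Status{false, "size computation overflowed"};
    }
    *out = a * b;
    return Status{true, ""};
}
```

A caller would compute `checked_mul(element_count, dtype_size(dtype), &bytes)` and inspect `ok` before allocating, so an overflowing size is reported instead of silently wrapping around.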
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
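The two variants can be illustrated with a scalar CPU reference. This is a sketch of the standard formulas only: the exact form uses the error function and the approximate form uses the common tanh approximation. The function name `gelu_reference` is hypothetical, and the actual `gelu_forward` kernel signature (tensors, stream argument) is not reproduced here.

```cpp
#include <cmath>

enum class GeluMode { Exact, Approximate };

// Scalar GeLU; defaults to the approximate variant, as in the commit.
inline float gelu_reference(float x, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        // 0.5 * x * (1 + erf(x / sqrt(2)))
        const float inv_sqrt2 = 0.70710678f;
        return 0.5f * x * (1.0f + std::erf(x * inv_sqrt2));
    }
    // 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))
    const float sqrt_2_over_pi = 0.7978845608f;
    const float inner = sqrt_2_over_pi * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}
```

The two modes agree closely for typical activations, so a reference like this is mainly useful for checking kernel output within a tolerance.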
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
* Refactor core architecture and optimize CUDA features (#75)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.
* Update README.md with new banner for Quadtrix.cpp
---------
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
---------
* Add CUDA kernels, optimize CI, and update documentation (#74)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
* CUDA header declarations for (LayerNorm) forward and backward (#66)
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
---------
* Add CUDA attention kernels, gradient norms, and CI improvements (#69)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.
* Update README.md with new banner for Quadtrix.cpp
---------
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* ci: add un…
1 parent 2105377 commit ed7cc40