Commit a6af1fc
Optimize training configuration and CI workflows (#87)
* Optimization (#82)
* Modify training configuration parameters
Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.
* Refactor training configuration and optimize CI workflows (#81)
* Modify training configuration parameters (#80)
Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.
* merge branch master of codeaddict (#77)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
Co-authored-by: Max <eamon5174@gmail.com>
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
* CUDA header declarations for LayerNorm forward and backward (#66)
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
Co-authored-by: Max <eamon5174@gmail.com>
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
Co-authored-by: Max <eamon5174@gmail.com>
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
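A minimal sketch of how an overflow-checked size multiplication can pair with a non-throwing status value. The field names, the out-parameter signature, and the `allocation_bytes` helper are assumptions for illustration; the actual `Status` and `checked_mul` declarations are not shown in this log.

```cpp
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical non-throwing status type (layout assumed, not taken from the repo).
struct Status {
    bool ok;
    std::string message;
};

// Writes a * b to *out, or reports overflow instead of silently wrapping.
inline Status checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return {false, "size overflow in checked_mul"};
    }
    *out = a * b;
    return {true, ""};
}

// Hypothetical helper: byte count for `count` elements of `elem_size` bytes each.
inline Status allocation_bytes(std::size_t count, std::size_t elem_size,
                               std::size_t* bytes) {
    return checked_mul(count, elem_size, bytes);
}
```

Checking the product before allocating means an oversized shape surfaces as a failed `Status` rather than as an undersized buffer.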
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
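For reference, a host-side sketch of the two GeLU variants. It assumes `Approximate` means the common tanh approximation, which this log does not confirm; `gelu_reference` is a hypothetical name and is not the kernel entrypoint declared here.

```cpp
#include <cmath>

enum class GeluMode { Exact, Approximate };

// Scalar reference on the host; the CUDA kernels themselves are not shown in this log.
inline float gelu_reference(float x, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        // 0.5 * x * (1 + erf(x / sqrt(2)))
        return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
    }
    // Assumed tanh form: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}
```

The two variants agree closely near zero, so a reference like this is mainly useful for checking kernel output within a tolerance.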
* feat(cuda): add gradient norm calculation and clipping interfaces
---------
Co-authored-by: Max <eamon5174@gmail.com>
* Add CUDA attention kernels, gradient norms, and CI improvements (#69)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for tail elements that do not fill a full vector and for platforms lacking these extensions.
- Explicitly defined rule-of-five special member functions (`= default`, with `noexcept` moves) within the `Tensor` struct.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
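The vectorized path with a scalar fallback can be sketched roughly as follows. `add_f32` is a hypothetical free function written for illustration; the actual `Tensor` methods and their signatures are not shown in this log.

```cpp
#include <cstddef>
#if defined(__AVX__) || defined(__SSE__)
#include <immintrin.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n). Uses unaligned loads, so no alignment is assumed.
inline void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // 8 floats per iteration with 256-bit registers.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__SSE__)
    // 4 floats per iteration with 128-bit registers.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar path: handles the tail, or everything when no SIMD extension is enabled.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Because the scalar loop resumes from wherever the vector loop stopped, the same function stays correct for any length and on any target.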
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.
* Update README.md with new banner for Quadtrix.cpp
---------
Co-authored-by: Max <eamon5174@gmail.com>
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
dependency-version: '9'
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
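As a rough host-side illustration of the `lerp` formulation of the AdamW moment updates, under the assumption that the kernel uses the standard bias-corrected update with decoupled weight decay. `lerp_f32`, `AdamWConfig`, and `adamw_step` are hypothetical names; stochastic rounding, 2D-grid slicing, and `init_from_master` are not modeled here.

```cpp
#include <cmath>
#include <cstddef>

// a + t * (b - a); with t = beta this equals beta * b + (1 - beta) * a.
inline float lerp_f32(float a, float b, float t) { return a + t * (b - a); }

// Hypothetical hyperparameter bundle for this sketch.
struct AdamWConfig {
    float lr, beta1, beta2, eps, weight_decay;
};

// One AdamW step over n parameters; `step` is 1-based for bias correction.
inline void adamw_step(float* p, float* m, float* v, const float* g,
                       std::size_t n, int step, const AdamWConfig& c) {
    const float bc1 = 1.0f - std::pow(c.beta1, static_cast<float>(step));
    const float bc2 = 1.0f - std::pow(c.beta2, static_cast<float>(step));
    for (std::size_t i = 0; i < n; ++i) {
        m[i] = lerp_f32(g[i], m[i], c.beta1);         // first moment
        v[i] = lerp_f32(g[i] * g[i], v[i], c.beta2);  // second moment
        const float m_hat = m[i] / bc1;
        const float v_hat = v[i] / bc2;
        // Decoupled weight decay is applied to the parameter, not folded into the gradient.
        p[i] -= c.lr * (m_hat / (std::sqrt(v_hat) + c.eps) + c.weight_decay * p[i]);
    }
}
```

Writing each moment update as a single interpolation removes the separate `(1 - beta)` multiply, which is presumably the optimization the commit refers to.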
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* merge branch master (#78)
* Add CUDA kernels, optimize CI, and update documentation (#74)
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Optimize CI workflow and Docker configurations with refactoring (#72) (#76)
* docs: report [run_20260530_165216] (~791 tok/s) (#60)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs: report [run_20260530_165216] (~791 tok/s) (#62)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
---------
* chore: clang-format configuration file based on LLVM (#63)
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
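As a rough illustration of non-throwing error propagation combined with an overflow-checked size computation, the sketch below assumes a two-field `Status` and a pointer-out `checked_mul`. Both layouts are guesses made for illustration; the header's real definitions are not reproduced in this log.

```cpp
#include <cstddef>
#include <limits>
#include <string>

// Assumed shape of a non-throwing status value (hypothetical fields).
struct Status {
    bool ok;
    std::string message;
};

// Hypothetical signature: writes a * b to *out unless the product would
// overflow std::size_t, in which case *out is left untouched.
Status checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return {false, "size overflow while computing allocation size"};
    }
    *out = a * b;
    return {true, ""};
}
```

A caller would typically chain this as `count * dtype_size` before allocating, and return the failed `Status` upward instead of throwing.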
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
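For reference, the two variants that `GeluMode` selects between can be written on the CPU as below. This is a host-side sketch only: `gelu_ref` is a hypothetical name, and the declared `gelu_forward` kernel signature (including its stream argument) is not reproduced here.

```cpp
#include <cmath>

enum class GeluMode { Exact, Approximate };

// Hypothetical scalar reference, defaulting to the approximate variant
// as the commit describes for the kernel entrypoints.
float gelu_ref(float x, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        // 0.5 * x * (1 + erf(x / sqrt(2)))
        return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
    }
    // tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))
    const float k = 0.79788456f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}
```

The two variants agree closely for typical activations, so the approximate form is a common default when `tanh` is cheaper than `erf` on the target hardware.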
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
dependency-version: '9'
dependency-type: direct:production
update-type: version-update:semver-major
...
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
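The `lerp` formulation of the AdamW moment updates can be sketched on the CPU as follows. This is an illustrative host-side version under assumed names (`lerp_f`, `AdamWConfig`, `adamw_step` are all hypothetical); it omits the CUDA grid layout, stochastic rounding, and master-weight handling that the commit describes.

```cpp
#include <cmath>
#include <cstddef>

// start + weight * (end - start), written with fused multiply-adds.
inline float lerp_f(float start, float end, float weight) {
    return std::fma(weight, end, std::fma(-weight, start, start));
}

// Hypothetical hyperparameter bundle.
struct AdamWConfig {
    float lr;
    float beta1;
    float beta2;
    float eps;
    float weight_decay;
};

// One AdamW step over n parameters; t is the 1-based step count.
void adamw_step(float* p, float* m, float* v, const float* g,
                std::size_t n, int t, const AdamWConfig& c) {
    const float bc1 = 1.0f - std::pow(c.beta1, static_cast<float>(t));
    const float bc2 = 1.0f - std::pow(c.beta2, static_cast<float>(t));
    for (std::size_t i = 0; i < n; ++i) {
        // lerp(g, m, beta1) == beta1 * m + (1 - beta1) * g
        m[i] = lerp_f(g[i], m[i], c.beta1);
        v[i] = lerp_f(g[i] * g[i], v[i], c.beta2);
        const float m_hat = m[i] / bc1;  // bias-corrected first moment
        const float v_hat = v[i] / bc2;  // bias-corrected second moment
        // Decoupled weight decay is applied alongside the Adam update.
        p[i] -= c.lr * (m_hat / (std::sqrt(v_hat) + c.eps) + c.weight_decay * p[i]);
    }
}
```

Writing the exponential moving averages as a `lerp` lets each update compile to two fused multiply-adds, which is presumably the optimization the commit message refers to.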
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` wrappers with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
* Refactor core architecture and optimize CUDA features (#75)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy memory load to prevent out-of-memory (OOM) crashes.
* Update README.md with new banner for quadtrix.cpp
---------
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
dependency-version: '9'
dependency-type: direct:production
update-type: version-update:semver-major
...
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` wrappers with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
---------
* Add CUDA kernels, optimize CI, and update documentation (#74)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
* CUDA header declarations for (LayerNorm) forward and backward (#66)
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
---------
* Add CUDA attention kernels, gradient norms, and CI improvements (#69)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
* refactor(main): redesign training loop to log per-step and sample during evaluation
- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy memory load to prevent out-of-memory (OOM) crashes.
* Update README.md with new banner for quadtrix.cpp
---------
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)
---
updated-dependencies:
- dependency-name: actions/github-script
dependency-version: '9'
dependency-type: direct:production
update-type: version-update:semver-major
...
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
Updated README to include a header and badges for release, package, and CI workflows.
* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` wrappers with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
Removed image from README and adjusted formatting.
---------
---------
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Refactor .dockerignore for improved clarity
Removed unnecessary entries to streamline the build context.
* Enhance .gitignore with additional exclusions
Expanded .gitignore to include more file types and directories.
* Modify training configuration parameters (#80)
Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.
---------
* Update main.py
* Delete quadtrix_training_report.png
* Delete docker-compose.yml
* Delete docker-compose.gpu.yml
* Delete docker-compose.dev.yml
* Delete benchmark_results.csv
* Delete SECURITY.md
* refactor: switch tokenizer from gpt2 to tiktoken o200k
* Delete contributing.md
* Delete CUDA/llmcpp directory
* Delete CUDA directory
---------
* feat: implement Head and MultiHeadAttention modules in C++
* feat: add core Tensor class with SIMD-accelerated math operations
* add inference sampler parameters and repetition penalty
* Linear layer with binary serialization
Added `Linear` struct mimicking `torch.nn.Linear` with conditional bias support.
* FeedForward network module
Added `FeedForward` struct mirroring the standard Transformer position-wise MLP block.
* LayerNorm module with serialization
Added `LayerNorm` struct mirroring `torch.nn.LayerNorm(n_embd)`.
* gradient and activation tracking structures for backpropagation
* single Transformer Block with Pre-LN architecture
Added `Block` struct mirroring the standard Transformer layer composition.
* character-level DataLoader with batch sampling
Added `DataLoader` struct for character-level tokenization and text processing.
* Embedding layer with token and position mapping
Added `Embedding` struct mirroring `torch.nn.Embedding`.
* BPE tokenizer and training pipeline in DataLoader
Upgraded DataLoader fr…
1 parent 3b5973e commit a6af1fc
24 files changed
Lines changed: 4759 additions & 706 deletions
File tree
- .devops
- docs
- llm.cpp
- config
- include
- train_test
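
The commit message mentions a LayerNorm struct mirroring torch.nn.LayerNorm(n_embd) and a Block with Pre-LN architecture, but the diff body is not shown here. The following is only a minimal sketch of what such a composition might look like, not the code from this commit. All names (`Vec`, `Sublayer`, `pre_ln_block`) are hypothetical, the sketch works on a single embedding vector rather than a batched tensor, and the attention and feed-forward sublayers are passed in as opaque callables.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;  // hypothetical stand-in for the repo's Tensor

// Illustrative LayerNorm over one embedding vector (assumed layout, not the repo's).
struct LayerNorm {
    Vec gamma;          // per-feature scale, initialised to 1
    Vec beta;           // per-feature shift, initialised to 0
    float eps = 1e-5f;  // same default epsilon as torch.nn.LayerNorm

    explicit LayerNorm(std::size_t n_embd)
        : gamma(n_embd, 1.0f), beta(n_embd, 0.0f) {}

    Vec forward(const Vec& x) const {
        assert(!x.empty() && x.size() == gamma.size());
        const float n = static_cast<float>(x.size());

        float mean = 0.0f;
        for (float v : x) mean += v;
        mean /= n;

        float var = 0.0f;  // biased variance (divide by n), as torch.nn.LayerNorm does
        for (float v : x) var += (v - mean) * (v - mean);
        var /= n;

        const float inv_std = 1.0f / std::sqrt(var + eps);
        Vec y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
        return y;
    }
};

// A sublayer (attention or feed-forward) treated as an opaque function here.
using Sublayer = std::function<Vec(const Vec&)>;

// Pre-LN ordering: normalise the input of each sublayer, then add the
// sublayer output back onto the un-normalised residual stream.
Vec pre_ln_block(const Vec& x,
                 const LayerNorm& ln1, const Sublayer& attn,
                 const LayerNorm& ln2, const Sublayer& ffn) {
    Vec h = x;

    const Vec a = attn(ln1.forward(h));
    assert(a.size() == h.size());
    for (std::size_t i = 0; i < h.size(); ++i) h[i] += a[i];

    const Vec f = ffn(ln2.forward(h));
    assert(f.size() == h.size());
    for (std::size_t i = 0; i < h.size(); ++i) h[i] += f[i];

    return h;
}
```

In this ordering the residual path is never normalised, which is the property usually meant by "Pre-LN"; a Post-LN variant would instead apply the normalisation after each residual addition.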