
Commit 54ce566

Eamon2009, codeaddict-119, ethos-cmd, dependabot[bot]
authored
merge branch master (#78)
* Add CUDA kernels, optimize CI, and update documentation (#74) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. * CUDA header declarations for (LayerNorm) forward and backward (#66) * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. 
--------- Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. 
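The GeLU entry above distinguishes an `Exact` and an `Approximate` variant. As a rough illustration of what the two modes usually mean, here is a CPU reference sketch. The names `gelu_ref` and `gelu_forward_ref` are hypothetical, the enum is a stand-in for the one named in the commit message, and none of this is the project's actual CUDA kernel or signature (no stream argument is modelled).

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the GeluMode enum named in the commit message.
enum class GeluMode { Exact, Approximate };

// CPU reference for a single element (an assumption, not the project's kernel).
inline float gelu_ref(float x, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        // Exact form: x * Phi(x), where Phi is the standard normal CDF.
        return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); // 1 / sqrt(2)
    }
    // Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
    const float k = 0.79788456f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Applies gelu_ref over a contiguous buffer of n floats.
inline void gelu_forward_ref(float* out, const float* in, std::size_t n,
                             GeluMode mode = GeluMode::Approximate) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = gelu_ref(in[i], mode);
    }
}
```

The two variants agree closely for moderate inputs; the approximate form trades a small error for avoiding `erf`, which is a common reason to make it the default.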
* feat(cuda): add gradient norm calculation and clipping interfaces --------- Co-authored-by: Max <eamon5174@gmail.com> * Add CUDA attention kernels, gradient norms, and CI improvements (#69) * exp(#58) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * 
refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. 
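The tensor refactor above mentions AVX/SSE paths for element-wise `add` with a scalar fallback. A minimal sketch of that pattern might look like the following; `add_f32` is a hypothetical free function chosen for illustration, not the project's `Tensor` API.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n). Hypothetical helper, not the project's API.
inline void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // AVX path: 8 floats per iteration, unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__SSE__)
    // SSE path: 4 floats per iteration, unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail; also the whole loop on platforms without AVX or SSE.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Because the scalar loop picks up wherever the vector loop stopped, lengths that are not a multiple of the vector width need no special casing.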
* Update README.md with new banner for qudtrix.cpp --------- Co-authored-by: Max <eamon5174@gmail.com> * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. 
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces * feat(cuda): add LayerNorm forward and backward kernel declarations * refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. * Fix formatting and update CI workflow steps * Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. * feat(docker): add Dockerfile for frontend application * feat(docker): add Dockerfile for frontend application * refactor(ci): remove release job from GitHub actions * ci: add unified release and docker build workflow * ci: add unified release and docker build workflow * Refactor macOS build workflow for arm64 architecture * Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. * perf: update execution time benchmarks in csv Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-Authored-By: Eamon Sippy <eamon112009@gmail.com> * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * Remove frontend job from Docker Images workflow * Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. * feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. 
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. * chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * feat(cuda): introduce log_message utility and LogLevel enum * feat(cuda): add cuBLAS handle wrapper and matmul operations * feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions * refactor: untie embedding and lm_head weights and to quadtrix * feat(cuda): add NCCL communicator wrapper and all-reduce primitives * Update README.md with workflow badges Added badges for release, package, and CI workflows. * kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. 
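The AdamW entry above refers to a `lerp`-based formulation of the moment updates. As a sketch of the idea on the CPU, assuming the usual AdamW update rule: `lerp_f32`, `AdamWConfig`, and `adamw_step` are hypothetical names, and stochastic rounding, the 2D grid, and `init_from_master` are deliberately left out.

```cpp
#include <cmath>

// lerp(start, end, w) = start + w * (end - start), written as two fused multiply-adds.
inline float lerp_f32(float start, float end, float weight) {
    return std::fma(weight, end, std::fma(-weight, start, start));
}

// Hypothetical hyperparameter bundle for this sketch.
struct AdamWConfig {
    float lr;
    float beta1;
    float beta2;
    float eps;
    float weight_decay;
};

// One AdamW update for a single scalar parameter at step t (t starts at 1).
inline void adamw_step(float& param, float grad, float& m, float& v, int t,
                       const AdamWConfig& c) {
    // m = beta1 * m + (1 - beta1) * grad, expressed as a lerp from grad toward m.
    m = lerp_f32(grad, m, c.beta1);
    // v = beta2 * v + (1 - beta2) * grad^2, same trick.
    v = lerp_f32(grad * grad, v, c.beta2);
    // Bias correction for the zero-initialized moments.
    const float m_hat = m / (1.0f - std::pow(c.beta1, static_cast<float>(t)));
    const float v_hat = v / (1.0f - std::pow(c.beta2, static_cast<float>(t)));
    // Decoupled weight decay is applied alongside the adaptive step.
    param -= c.lr * (m_hat / (std::sqrt(v_hat) + c.eps) + c.weight_decay * param);
}
```

Writing the exponential moving average as a lerp replaces two multiplies and an add with two fused multiply-adds.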
* cudnn: implement cached SDPA forward graph using cuDNN frontend * feat(cuda): implement Packed128 memory vectorization utilities * feat: add distributed sharded DataLoader for binary token files * feat(multi-gpu): add foundational utilities for ZeRO sharding * feat(utils): add I/O and memory error-checking wrappers * feat : add PyTorch-compatible Mersenne Twister random utilities * README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. * utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. * mfu: add GPU specifications database and utilities for MFU estimation * Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. * Update README to remove image and clean up content Removed image from README and adjusted formatting. 
--------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Max <eamon5174@gmail.com> Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Eamon <eamon112009@gmail.com> Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Optimize CI workflow and Docker configurations with refactoring (#72) (#76) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend 
optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. 
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. * Update README.md with new banner for qudtrix.cpp * docs:report [run_20260530_165216](~791 tok/s) (#60) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs: report [run_20260530_165216] (~791 tok/s) (#62) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 --------- * chore: clang-format configuration file based on LLVM (#63) * ci: add manual PR checks workflow with slash command support * ci: add manual PR checks workflow with slash command support * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. 
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces * feat(cuda): add LayerNorm forward and backward kernel declarations * refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. * Fix formatting and update CI workflow steps * Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. 
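The core type definitions entry above names `dtype_size` and `checked_mul` for safe allocation size computation. One plausible shape for those helpers is sketched below. The signatures are assumptions: the commit message says the project propagates errors through a `Status` struct, whereas this sketch returns a plain `bool` for brevity, and `tensor_bytes` is a hypothetical helper added only to show the two used together.

```cpp
#include <cstddef>
#include <limits>

// Element types listed in the commit message.
enum class DType { F32, F16, BF16, I32, U8 };

// Bytes per element for each listed type.
constexpr std::size_t dtype_size(DType t) {
    switch (t) {
        case DType::F32:  return 4;
        case DType::F16:  return 2;
        case DType::BF16: return 2;
        case DType::I32:  return 4;
        case DType::U8:   return 1;
    }
    return 0; // unreachable for valid enumerators
}

// Stores a * b in out and returns true, or returns false (leaving out untouched)
// if the product would overflow std::size_t.
inline bool checked_mul(std::size_t a, std::size_t b, std::size_t& out) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return false;
    }
    out = a * b;
    return true;
}

// Hypothetical helper: byte size of a buffer holding numel elements of type t.
inline bool tensor_bytes(std::size_t numel, DType t, std::size_t& out) {
    return checked_mul(numel, dtype_size(t), out);
}
```

Checking before multiplying matters because an overflowed size silently wraps to a small value, and an allocation of that size would then succeed while being far too small.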
* feat(docker): add Dockerfile for frontend application * feat(docker): add Dockerfile for frontend application * refactor(ci): remove release job from GitHub actions * ci: add unified release and docker build workflow * ci: add unified release and docker build workflow * Refactor macOS build workflow for arm64 architecture * Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. * perf: update execution time benchmarks in csv * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * Remove frontend job from Docker Images workflow * Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. * feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. - Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. * chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... 
* feat(cuda): introduce log_message utility and LogLevel enum * feat(cuda): add cuBLAS handle wrapper and matmul operations * feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions * refactor: untie embedding and lm_head weights and to quadtrix * feat(cuda): add NCCL communicator wrapper and all-reduce primitives * Update README.md with workflow badges Added badges for release, package, and CI workflows. * kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. * cudnn: implement cached SDPA forward graph using cuDNN frontend * feat(cuda): implement Packed128 memory vectorization utilities * feat: add distributed sharded DataLoader for binary token files * feat(multi-gpu): add foundational utilities for ZeRO sharding * feat(utils): add I/O and memory error-checking wrappers * feat : add PyTorch-compatible Mersenne Twister random utilities * README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. * utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. * mfu: add GPU specifications database and utilities for MFU estimation * Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. * Update README to remove image and clean up content Removed image from README and adjusted formatting. 
* Refactor core architecture and optimize CUDA features (#75) * exp(#58) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration 
for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. 
* Update README.md with new banner for qudtrix.cpp --------- * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. 
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
  Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
  Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
  Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
  Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
  Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend.
  - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
  - Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
  - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
  - Handles cross-origin setups (CORS) linking ports interactively.
  - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
  - Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
  Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
  - [Release notes](https://github.com/actions/github-script/releases)
  - [Commits](actions/github-script@v7...v9)
  ---
  updated-dependencies:
  - dependency-name: actions/github-script
    dependency-version: '9'
    dependency-type: direct:production
    update-type: version-update:semver-major
  ...
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
  Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
  Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
  Updated README to include a header and badges for release, package, and CI workflows.
* utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
  - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
  Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
  Removed image from README and adjusted formatting.
---------
* Add CUDA kernels, optimize CI, and update documentation (#74)
* docs: report [run_20260530_165216] (~791 tok/s)
  Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
  Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000
  - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
  Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
  Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000
  - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
  Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning.
* CUDA header declarations for (LayerNorm) forward and backward (#66)
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
  Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
  Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000
  - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
  Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
  Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000
  - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
  Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
  - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
  - Implements `dtype_name` and `dtype_size` metadata helper functions.
  - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
  - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
  - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
  - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
  - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
---------
* Add CUDA attention kernels, gradient norms, and CI improvements (#69)
* exp(#58)
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
  - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
  - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
  - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
  - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
* refactor(main): redesign training loop to log per-step and sample during evaluation
  - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
  - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
  - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
  - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
  - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
  - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
  - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
  - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.
* Update README.md with new banner for qudtrix.cpp
---------
* ci: add manual PR checks workflow with slash command support
* feat(cuda): add attention forward backward kernel declarations (#64)
* docs: report [run_20260530_165216] (~791 tok/s)
  Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
  Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000
  - Achieved best validation loss of 4.1319 at step 3900
* docs:report [run_20260530_165216](~791 tok/s) (#61)
  Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
  Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000
  - Achieved best validation loss of 4.1319 at step 3900
* feat(cuda): add attention forward and backward kernel declarations
  Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning.
---------
* feat(cuda): add checkpoint metadata struct and stub functions
* feat(cuda): introduce core type definitions and error handling utilities
  - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
  - Implements `dtype_name` and `dtype_size` metadata helper functions.
  - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
  - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
* feat(cuda): add TokenBatchView struct and DataLoader stub class
* feat(cuda): add GeLU activation forward and backward declarations
  - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
  - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
  - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
* feat(cuda): add gradient norm calculation and clipping interfaces
* feat(cuda): add LayerNorm forward and backward kernel declarations
* refactor(ci): organize workflow into push-triggered QA and manual docker builds
  Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
* Fix formatting and update CI workflow steps
* Enhance CI with macOS binary build and release
  Added macOS binary build and release steps to CI workflow.
* feat(docker): add Dockerfile for frontend application
* feat(docker): add Dockerfile for frontend application
* refactor(ci): remove release job from GitHub actions
* ci: add unified release and docker build workflow
* ci: add unified release and docker build workflow
* Refactor macOS build workflow for arm64 architecture
* Update release workflow to remove macOS x64 build
  Removed dependency on build-macos-x64 for the release job.
* perf: update execution time benchmarks in csv
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* ci(docker): refactor image build workflow and add frontend job
* Remove frontend job from Docker Images workflow
* Update release workflow to remove s390x and add notes
  Removed s390x build configurations and added a step to write detailed release notes.
* feat: add local orchestration script for frontend and backend servers
  Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend.
  - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
  - Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
  - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
  - Handles cross-origin setups (CORS) linking ports interactively.
  - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
  - Automatically launches the frontend application in the system web browser.
* chore(deps): bump actions/github-script from 7 to 9 (#71)
  Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
  - [Release notes](https://github.com/actions/github-script/releases)
  - [Commits](actions/github-script@v7...v9)
  ---
  updated-dependencies:
  - dependency-name: actions/github-script
    dependency-version: '9'
    dependency-type: direct:production
    update-type: version-update:semver-major
  ...
* feat(cuda): introduce log_message utility and LogLevel enum
* feat(cuda): add cuBLAS handle wrapper and matmul operations
* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions
* refactor: untie embedding and lm_head weights and to quadtrix
* feat(cuda): add NCCL communicator wrapper and all-reduce primitives
* Update README.md with workflow badges
  Added badges for release, package, and CI workflows.
* kernels: add AdamW optimization kernel with stochastic rounding
  Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
* cudnn: implement cached SDPA forward graph using cuDNN frontend
* feat(cuda): implement Packed128 memory vectorization utilities
* feat: add distributed sharded DataLoader for binary token files
* feat(multi-gpu): add foundational utilities for ZeRO sharding
* feat(utils): add I/O and memory error-checking wrappers
* feat : add PyTorch-compatible Mersenne Twister random utilities
* README : Enhance README with header and workflow badges
  Updated README to include a header and badges for release, package, and CI workflows.
* utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
  - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
* mfu: add GPU specifications database and utilities for MFU estimation
* Modify project title in README.md
  Changed the project title to include 'llm.cpp' for clarity.
* Update README to remove image and clean up content
  Removed image from README and adjusted formatting.
---------
---------
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Refactor .dockerignore for improved clarity
  Removed unnecessary entries to streamline the build context.
* Enhance .gitignore with additional exclusions
  Expanded .gitignore to include more file types and directories.
* Modify training configuration parameters (#80)
  Updated training parameters including batch size, iterations, evaluation interval, learning rate, and dropout rate.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
1 parent 6837c2e commit 54ce566

3 files changed

Lines changed: 52 additions & 51 deletions


.dockerignore

Lines changed: 0 additions & 43 deletions
@@ -1,44 +1 @@
-
-.git
 .gitignore
-.github
-.venv
-__pycache__
-*.pyc
-*.pyo
-*.pyd
-*.egg-info
-.pytest_cache
-.ruff_cache
-dist/
-build/
-*.egg
-node_modules
-frontend/node_modules
-frontend/dist
-frontend/.vite
-*.npm-cache
-.npmignore
-*.o
-*.a
-*.so
-*.dylib
-quadtrix.exe
-quadtrix
-build/
-cmake-build-*/
-.vscode
-*.bin
-*.pt
-*.gguf
-*.safetensors
-engine/best_model.pt
-engine/logs/
-engine/fineweb_30mb.txt
-data/input.txt
-.DS_Store
-Thumbs.db
-*.swp
-*.swo
-.idea
-docker-compose.override.yml

.gitignore

Lines changed: 43 additions & 0 deletions
@@ -14,3 +14,46 @@ engine/fine-tune/input.txt
 *best_model.pt
 *.pt
 *exe
+.git
+.gitignore
+.github
+.venv
+__pycache__
+*.pyc
+*.pyo
+*.pyd
+*.egg-info
+.pytest_cache
+.ruff_cache
+dist/
+build/
+*.egg
+node_modules
+frontend/node_modules
+frontend/dist
+frontend/.vite
+*.npm-cache
+.npmignore
+*.o
+*.a
+*.so
+*.dylib
+quadtrix.exe
+quadtrix
+build/
+cmake-build-*/
+.vscode
+*.bin
+*.pt
+*.gguf
+*.safetensors
+engine/best_model.pt
+engine/logs/
+engine/fineweb_30mb.txt
+data/input.txt
+.DS_Store
+Thumbs.db
+*.swp
+*.swo
+.idea
+docker-compose.override.yml

config/config.h

Lines changed: 9 additions & 8 deletions
@@ -1,18 +1,19 @@
 #pragma once
 #include <string>
+
 static const std::string DEFAULT_CLEANED_PATH = "data/input.txt";
 static const std::string DATA_PATH_ENV_VAR = "GPT_DATA_PATH";
 static const unsigned int SEED = 1337;
-static const double TRAIN_SPLIT = 0.9; // 90 % train, 10 % val
-static const int BATCH_SIZE = 4;
-static const int BLOCK_SIZE = 64; // context length
-static const int MAX_ITERS = 10000;
-static const int EVAL_INTERVAL = 20;
-static const float LEARNING_RATE = 3e-4f;
-static const int EVAL_ITERS = 1;
+static const double TRAIN_SPLIT = 0.9; // 90% train, 10% val
+static const int BATCH_SIZE = 16;
+static const int BLOCK_SIZE = 64; // Context length
+static const int MAX_ITERS = 5000;
+static const int EVAL_INTERVAL = 250;
+static const float LEARNING_RATE = 5e-4f;
+static const int EVAL_ITERS = 100;
 static const int N_EMBD = 128;
 static const int N_HEAD = 4;
 static const int N_LAYER = 4;
-static const float DROPOUT = 0.2f; // applied during training only
+static const float DROPOUT = 0.05f;
 static const std::string BEST_MODEL_PATH = "best_model.bin";
 static const std::string MODEL_PATH_ENV_VAR = "GPT_MODEL_PATH";
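
To put the config/config.h change in perspective, the sketch below works out how many tokens each training step and the whole run would consume under the old and new values, and how much evaluation work each setting implies. It is only a back-of-the-envelope aid, not code from the repository. `TrainConfig`, `kOld`, `kNew`, and the helper functions are hypothetical names introduced here. The arithmetic assumes that one step consumes `BATCH_SIZE * BLOCK_SIZE` tokens and that an evaluation pass averaging `EVAL_ITERS` batches runs once every `EVAL_INTERVAL` steps; the training loop itself is not part of this diff, so both points are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper struct (not part of config.h): groups the constants
// touched by this commit so the old and new settings can be compared.
struct TrainConfig {
    int batch_size;     // BATCH_SIZE
    int block_size;     // BLOCK_SIZE (context length)
    int max_iters;      // MAX_ITERS
    int eval_interval;  // EVAL_INTERVAL
    int eval_iters;     // EVAL_ITERS
};

// Tokens per optimizer step, assuming one batch of block_size-token sequences.
constexpr std::int64_t tokens_per_step(const TrainConfig& c) {
    return static_cast<std::int64_t>(c.batch_size) * c.block_size;
}

// Tokens processed over the full run.
constexpr std::int64_t total_train_tokens(const TrainConfig& c) {
    return tokens_per_step(c) * c.max_iters;
}

// Evaluation passes, assuming one pass every eval_interval steps
// (any extra pass before step 0 is not counted).
constexpr int eval_passes(const TrainConfig& c) {
    return c.max_iters / c.eval_interval;
}

// Batches averaged across all evaluation passes, per data split.
constexpr std::int64_t total_eval_batches(const TrainConfig& c) {
    return static_cast<std::int64_t>(eval_passes(c)) * c.eval_iters;
}

// Runtime sanity check: every field must be positive (active in debug builds).
inline void validate(const TrainConfig& c) {
    assert(c.batch_size > 0 && c.block_size > 0);
    assert(c.max_iters > 0 && c.eval_interval > 0 && c.eval_iters > 0);
}

// Values taken from the removed (-) and added (+) lines of the diff.
constexpr TrainConfig kOld{4, 64, 10000, 20, 1};
constexpr TrainConfig kNew{16, 64, 5000, 250, 100};

static_assert(tokens_per_step(kOld) == 256, "old: 4 * 64");
static_assert(tokens_per_step(kNew) == 1024, "new: 16 * 64");
static_assert(total_train_tokens(kOld) == 2560000, "old run length in tokens");
static_assert(total_train_tokens(kNew) == 5120000, "new run length in tokens");
static_assert(eval_passes(kOld) == 500, "old: 10000 / 20");
static_assert(eval_passes(kNew) == 20, "new: 5000 / 250");
static_assert(total_eval_batches(kOld) == 500, "old: 500 * 1");
static_assert(total_eval_batches(kNew) == 2000, "new: 20 * 100");
```

Under these assumptions, the new settings halve the step count but double the tokens seen over the run, and they replace 500 single-batch evaluations with 20 evaluations that each average 100 batches, which should give a less noisy loss estimate per evaluation.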
