minor bugfixes

ngc92 · ngc92 · commit 5ce9c1749a4c · 2026-05-25T17:56:17.000+02:00
diff --git a/README.md b/README.md
@@ -103,7 +103,7 @@ and sets this in relation to the GPU's speed of light (SOL), i.e., the fastest p
 
 ### Inspecting the logs
 After 50 steps, the training will finish, and save the final model to `model.safetensors`. In addition, a log file will be created,
-which contains the training log in JSON format. We can visualize the log using the [plot_training_run.py](scripts/plot_training_run.py) utility script:`
+which contains the training log in JSON format. We can visualize the log using the [plot_training_run.py](scripts/plot_training_run.py) utility script:
 ```shell
 uv run scripts/plot_training_run.py log.json
 ```
@@ -281,14 +281,14 @@ For the optimizer state, this will slow down the optimizer step drastically (mem
 ## Python bindings
 While it is nice to demonstrate training in pure C++/Cuda, there are scenarios where it is desirable to use Python for training, e.g., when using an alternative learning-rate schedule.
 
-The Python bindings are provided in the `src/bindings` directory, and can be built using the
+The Python bindings are provided in the `src/binding` directory, and can be built using the
 `_pyllmq` target. The library can be built manually (`-DPYTHON_BINDING=ON`), or directly into a wheel file
 using `uv build --wheel`.
 The [demo.py](scripts/demo.py) script provides an example of how to use the bindings. Running it with `uv run pyllmq-demo` will trigger the wheel build automatically.
 
 Pre-built wheels are available from [GitHub Releases](https://github.com/IST-DASLab/llmq/releases) for convenience.
-Download the latest `.whl` file and install it with `uv pip install 'pyllmq-0.2.0-cp312-abi3-linux_x86_64.whl[scripts]'`,
-or run example scripts directly: `uv run --with 'pyllmq-0.2.0+cu128-cp312-abi3-linux_x86_64.whl[scripts]' pyllmq-demo`, replacing the file name as appropriate. The `[scripts]` extra installs additional packages that aren't strictly required for pyllmq, but are used in the utility scripts, such as `datasets` and `matplotlib`.
+Download the latest `.whl` file and install it with `uv pip install 'pyllmq-0.3.2-cp312-abi3-linux_x86_64.whl[scripts]'`,
+or run example scripts directly: `uv run --with 'pyllmq-0.3.2+cu128-cp312-abi3-linux_x86_64.whl[scripts]' pyllmq-demo`, replacing the file name as appropriate. The `[scripts]` extra installs additional packages that aren't strictly required for pyllmq, but are used in the utility scripts, such as `datasets` and `matplotlib`.
 The wheels are built against CUDA 12.8 and 13.0 and support compute capabilities 89, 90, 100f, and 120f.
 
 By design, the bindings expose only coarse-grained operations; that is, the minimum unit
diff --git a/doc.md b/doc.md
@@ -11,7 +11,7 @@ for (int step = latest_step; step < MaxSteps; ++step) {
     for (int j = 0; j < GradAccSteps; ++j) {
         train_loader.load_batch(inputs, targets);
         model.forward(inputs, comm, j);
-        model.backward(inputs, targets, comm, GradAccSteps, j);
+        model.backward(inputs, targets, comm, 0.0f, GradAccSteps, j);
     }
 
     float lr = schedule.get_lr(step);
diff --git a/pyproject.toml b/pyproject.toml
@@ -47,7 +47,6 @@ scripts = [
     "wandb>=0.21.1",
     "lm-eval>=0.4.9",
     "pandas>=2.3.1",
-    "plotly[express]>=6.3.0",
 ]
 
 [project.urls]
diff --git a/scripts/train.py b/scripts/train.py
@@ -265,7 +265,7 @@ def run_training_loop(config: pyllmq.TrainingConfig, trainer: pyllmq.LLMQTrainer
 
     # Final evaluation
     print("\nRunning final evaluation...")
-    final_loss, _ = run_evaluation(trainer, eval_loader, in_tokens, out_tokens, eval_loader.num_chunks)
+    final_loss, final_loss_1k, eval_ms = run_evaluation(trainer, eval_loader, in_tokens, out_tokens, eval_loader.num_chunks)
     print(f"Final validation loss: {final_loss:.4f}")
 
 
diff --git a/src/binding/binding.cpp b/src/binding/binding.cpp
@@ -122,7 +122,7 @@ NB_MODULE(_pyllmq, m) {
         .def_rw("num_hidden_layers", &TransformerConfig::NumLayers)
         .def_rw("max_position_embeddings", &TransformerConfig::MaxPositionEmbeddings)
         .def_rw("rope_theta", &TransformerConfig::RopeTheta)
-        .def_rw("rms_norm_eps", &TransformerConfig::RopeTheta)
+        .def_rw("rms_norm_eps", &TransformerConfig::RmsNormEps)
         .def_rw("tie_word_embeddings", &TransformerConfig::TiedWordEmbeddings)
         .def_rw("use_qkv_bias", &TransformerConfig::UseQKVBias)
         .def_prop_rw("dtype",
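
Review note on the `rms_norm_eps` hunk above: before this commit the property was bound to `TransformerConfig::RopeTheta`, so writing one attribute from Python would silently change the other. A minimal sketch of a regression check for this class of bug follows. It uses pure-Python stand-ins, not the real `_pyllmq` module; `BuggyConfig`, `FixedConfig`, and `find_aliased_fields` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


class BuggyConfig:
    """Hypothetical stand-in for the pre-fix binding: rms_norm_eps aliases rope_theta."""

    def __init__(self):
        self._rope_theta = 10000.0
        self._rms_norm_eps = 1e-6  # never reached through the property below

    @property
    def rope_theta(self):
        return self._rope_theta

    @rope_theta.setter
    def rope_theta(self, value):
        self._rope_theta = value

    @property
    def rms_norm_eps(self):
        return self._rope_theta  # wrong field, mirrors the old binding

    @rms_norm_eps.setter
    def rms_norm_eps(self, value):
        self._rope_theta = value  # wrong field, mirrors the old binding


@dataclass
class FixedConfig:
    """Hypothetical stand-in for the post-fix binding: each name has its own storage."""

    rope_theta: float = 10000.0
    rms_norm_eps: float = 1e-6


def find_aliased_fields(cfg, names):
    """Return (written, affected) pairs where writing one attribute changes another."""
    aliased = []
    for written in names:
        # Reset every attribute to a distinct baseline value.
        for index, name in enumerate(names):
            setattr(cfg, name, float(index + 1))
        before = {name: getattr(cfg, name) for name in names}
        setattr(cfg, written, -1.0)
        for other in names:
            if other != written and getattr(cfg, other) != before[other]:
                aliased.append((written, other))
    return aliased


FIELDS = ["rope_theta", "rms_norm_eps"]
```

Calling `find_aliased_fields(cfg, FIELDS)` on a config object returns an empty list when every attribute has independent storage. Whether the same check can be run unchanged against the real bound config class depends on how it is constructed, which this diff does not show.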
