Commit fe7a80e

Merge branch 'OpenNMT:master' into master
2 parents 41c658c + b4e155a commit fe7a80e

22 files changed

Lines changed: 765 additions & 230 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 7 deletions
@@ -170,7 +170,7 @@ jobs:
 CIBW_MANYLINUX_X86_64_IMAGE: manylinux2014
 CIBW_MANYLINUX_AARCH64_IMAGE: manylinux2014
 CIBW_ARCHS: ${{ matrix.arch }}
-CIBW_SKIP: pp* *-musllinux_*
+CIBW_SKIP: "*-musllinux_*"

 - name: Upload Python wheels
   uses: actions/upload-artifact@v4

@@ -195,10 +195,6 @@ jobs:
 artifact_pattern: python-wheels-Linux-aarch64
 wheel_pattern: "*cp310*manylinux*_aarch64.whl"

-#- os: windows-2022
-#  artifact_pattern: python-wheels-Windows-auto64
-#  wheel_pattern: "*cp310*win*.whl"
-
 - os: macos-15
   artifact_pattern: python-wheels-macOS-arm64
   wheel_pattern: "*cp310*macosx*arm64.whl"

@@ -226,8 +222,6 @@ jobs:
 - name: Install wheel
   shell: bash
   run: |
-    ls -l
-    find .
     pip install ${{ matrix.wheel_pattern }}

 - name: Test Python wheel
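The `CIBW_SKIP` change above narrows the skip list from `pp* *-musllinux_*` (PyPy and musl-based builds) to musl builds only, so PyPy wheels are no longer excluded. cibuildwheel matches these space-separated patterns against build identifiers with fnmatch-style globs; a quick sketch of the effect (the build identifiers are illustrative, not an exhaustive list):

```python
from fnmatch import fnmatch

# Hypothetical build identifiers of the kind cibuildwheel generates.
build_ids = [
    "cp310-manylinux_x86_64",
    "cp310-musllinux_x86_64",
    "pp310-manylinux_x86_64",
]

def skipped(ids, patterns):
    # A build is skipped when it matches any space-separated glob pattern.
    return [i for i in ids if any(fnmatch(i, p) for p in patterns.split())]

skipped(build_ids, "pp* *-musllinux_*")  # old value: skips PyPy and musl builds
skipped(build_ids, "*-musllinux_*")      # new value: skips musl builds only
```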

CMakeLists.txt

Lines changed: 2 additions & 2 deletions
@@ -547,8 +547,9 @@ if (WITH_CUDA)
   list(APPEND PRIVATE_INCLUDE_DIRECTORIES ${CUDNN_INCLUDE_DIR})
   list(APPEND LIBRARIES ${CUDNN_LIBRARIES})
   add_definitions(-DCT2_WITH_CUDNN)
+  list(APPEND SOURCES src/ops/conv1d_cudnn_gpu.cu)
 else()
-  message(WARNING "cuDNN library is not enabled: convolution layers will not be supported on GPU")
+  list(APPEND SOURCES src/ops/conv1d_gpu.cu)
 endif()

 if(CUDA_DYNAMIC_LOADING)

@@ -638,7 +639,6 @@ if (WITH_CUDA)
 src/ops/alibi_add_gpu.cu
 src/ops/bias_add_gpu.cu
 src/ops/concat_split_slide_gpu.cu
-src/ops/conv1d_gpu.cu
 src/ops/dequantize_gpu.cu
 src/ops/flash_attention_gpu.cu
 src/ops/gather_gpu.cu
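With this CMake change, a GPU build no longer requires cuDNN for convolution support: the cuDNN-backed `conv1d_cudnn_gpu.cu` is compiled when cuDNN is enabled, and the native `conv1d_gpu.cu` is compiled as a fallback otherwise (previously only a warning was emitted and GPU convolutions were unsupported). A sketch of the selection logic in Python, mirroring the CMake branch:

```python
def conv1d_gpu_source(with_cudnn: bool) -> str:
    # Mirrors the CMakeLists.txt branch above: the source list gets exactly
    # one Conv1D GPU implementation, chosen by cuDNN availability.
    if with_cudnn:
        return "src/ops/conv1d_cudnn_gpu.cu"
    return "src/ops/conv1d_gpu.cu"
```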

CONTRIBUTING.md

Lines changed: 16 additions & 3 deletions
@@ -23,6 +23,19 @@ Do you think a feature is missing or would be a great addition to the project? P
 * look for GitHub issues marked with the *help wanted* label: these are developments that we find particularly suited for community contributions.
 * If you are planning to make a large change to the existing code, consider asking first on [the forum](https://forum.opennmt.net/) to confirm that it is welcome.

+## Contribution rules
+
+CTranslate2 is a low-level, performance-critical codebase. A single misplaced pointer or inefficient memory allocation (which LLMs often get wrong) can take hours to debug.
+
+To maintain code integrity and manage maintainer workload, we apply the following policy:
+
+* Use of AI tools for brainstorming or minor assistance is acceptable, but contributors must explicitly disclose how AI was used and remain fully responsible for correctness, performance, and design. Submissions that appear generated without deep understanding will be declined: verifying AI output for correctness and performance is more time-consuming than writing the code manually.
+
+* Mandatory deep understanding: contributors must fully understand their code and be prepared to justify every part of it.
+
+* Please contribute within your area of expertise. If you are not familiar with the core codebase, consider contributing to documentation, examples, or Hugging Face integrations.
+
 ### Building the sources

 See [Install from sources](https://opennmt.net/CTranslate2/installation.html#install-from-sources).

@@ -85,7 +98,7 @@ The list is ordered on 5. from the largest to smallest time.

 #### `StorageView` class

-CTranslate2 uses [row-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order) storages, usually encapsulated in the `StorageView` class. This class acts like a tensor representation but without the mathematical semantics. It is convenience wrapper to view a buffer of data in a particular shape, and provides methods to resize, reshape, and copy data. The underlying storage has a type (e.g. `float`) and a location (e.g. GPU #1) which are both resolved at runtime.
+CTranslate2 uses [row-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order) storages, usually encapsulated in the `StorageView` class. This class acts like a tensor representation but without the mathematical semantics. It is a convenience wrapper to view a buffer of data in a particular shape, and provides methods to resize, reshape, and copy data. The underlying storage has a type (e.g. `float`) and a location (e.g. GPU #1) which are both resolved at runtime.

 To maximize performance, the implementation avoid new allocations when possible:

@@ -144,7 +157,7 @@ To limit the size of the packages pushed to PyPI, some libraries are not include

 One of the benefits of this dynamic loading is that multiple versions of cuBLAS and cuDNN are supported by the same binary. In particular, users can install any CUDA 12.x version as long as it provides `libcublas.so.12`.

-The Python library only support CUDA 12.x. C++ source code is always compatible with CUDA 11, possible to use CUDA 11 libraries during compilation to create CUDA 11.x support wheel.
+The Python library only supports CUDA 12.x. The C++ source code remains compatible with CUDA 11, so it is possible to compile against CUDA 11 libraries to build wheels with CUDA 11.x support.

 ### Updating other dependencies

@@ -161,7 +174,7 @@ If a dependency needs an update, it is particularly important that it is updated

 ### Managing PyPI project size limit

-Projects on PyPI have a size limit. The default limit is 10GB and [we already requested](https://github.com/pypi/support/issues/1480) an increase to 20GB in the past. Because increase requests can take several months to be accepted, we now try to work with this 20GB limit.
+Projects on PyPI have a size limit. The default limit is 10GB. Currently the CTranslate2 project has a [50GB storage limit](https://github.com/pypi/support/issues/8119).

 So older releases need to be regularly deleted on PyPI to make room for new releases. **However, make sure to keep the latest release of each major version.**
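The `StorageView` paragraph in the diff above mentions row-major storage: the last index varies fastest in the underlying buffer. The flat offset of a multi-dimensional index can be sketched as follows (a hypothetical helper for illustration, not part of the CTranslate2 API):

```python
def row_major_offset(shape, index):
    # Horner-style accumulation: offset = ((i0 * d1 + i1) * d2 + i2) ...
    # where d1, d2, ... are the trailing dimensions of the shape.
    offset = 0
    for dim, i in zip(shape, index):
        offset = offset * dim + i
    return offset

row_major_offset((2, 3, 4), (1, 2, 3))  # element [1][2][3] of a 2x3x4 buffer -> 23
```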

README.md

Lines changed: 10 additions & 0 deletions
@@ -119,6 +119,16 @@ Executed with 4 threads on a [*c5.2xlarge*](https://aws.amazon.com/ec2/instance-

 Executed with CUDA 11 on a [*g5.xlarge*](https://aws.amazon.com/ec2/instance-types/g5/) Amazon EC2 instance equipped with a NVIDIA A10G GPU (driver version: 510.47.03).

+## Contributing
+
+CTranslate2 is a community-driven project. We welcome contributions of all kinds:
+* **New Model Support:** Help us implement more Transformer architectures.
+* **Performance:** Propose optimizations for CPU or GPU kernels.
+* **Bug Reports:** Open an issue if you find something not working as expected.
+* **Documentation:** Improve our guides or add new examples.
+
+Check out our [Contributing Guide](CONTRIBUTING.md) to learn how to set up your development environment.
+
 ## Additional resources

 * [Documentation](https://opennmt.net/CTranslate2)

include/ctranslate2/batch_reader.h

Lines changed: 8 additions & 1 deletion
@@ -56,7 +56,8 @@ namespace ctranslate2 {

     std::vector<Example>
     get_next(const size_t max_batch_size,
-             const BatchType batch_type = BatchType::Examples);
+             const BatchType batch_type = BatchType::Examples,
+             const bool consider_padding = false);

     // Consumes and returns the next example.
     virtual Example get_next_example() = 0;

@@ -67,6 +68,12 @@ namespace ctranslate2 {
     }

   private:
+    std::vector<Example> fill_batch_with_fixed_increment(const size_t max_batch_size,
+                                                         const BatchType batch_type);
+
+    std::vector<Example> fill_batch_with_variable_increment(const size_t max_batch_size,
+                                                            const BatchType batch_type);
+
     bool _initialized = false;
     Example _next;
   };
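The new `consider_padding` flag suggests two ways of costing a batch when `max_batch_size` counts tokens: either the raw token count, or the padded count where every example pays for the longest sequence in the batch. A hedged sketch of that distinction (an illustrative helper, not the actual C++ implementation):

```python
def batch_cost_in_tokens(lengths, consider_padding=False):
    # Without padding: total real tokens in the batch.
    # With padding: the batch is padded to its longest sequence,
    # so every example costs max(lengths) tokens.
    if not lengths:
        return 0
    if consider_padding:
        return max(lengths) * len(lengths)
    return sum(lengths)
```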

python/cpp/generator.cc

Lines changed: 4 additions & 4 deletions
@@ -234,10 +234,10 @@ namespace ctranslate2 {
     Arguments:
       start_tokens: Batch of start tokens. If the decoder starts from a special
         start token like ``<s>``, this token should be added to this input.
-      max_batch_size: The maximum batch size. If the number of inputs is greater than
-        :obj:`max_batch_size`, the inputs are sorted by length and split by chunks of
-        :obj:`max_batch_size` examples so that the number of padding positions is
-        minimized.
+      max_batch_size: The maximum batch size. If the number of inputs is greater than :obj:`max_batch_size`,
+        the inputs are sorted by length and split by chunks of :obj:`max_batch_size` examples
+        (or tokens when :obj:`batch_type`="tokens") so that the number of padding positions
+        is minimized.
       batch_type: Whether :obj:`max_batch_size` is the number of "examples" or "tokens".
       asynchronous: Run the generation asynchronously.
       beam_size: Beam size (1 for greedy search).
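The updated docstring describes the batching strategy: sort the inputs by length, then split them into chunks of at most `max_batch_size` examples or tokens. A minimal sketch of that strategy, assuming each input is a token list (illustrative, not the library's internal code):

```python
def make_batches(examples, max_batch_size, batch_type="examples"):
    # Length-sorted chunks group sequences of similar length together,
    # which minimizes the number of padding positions per batch.
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]))
    batches, current, current_tokens = [], [], 0
    for i in order:
        tokens = len(examples[i])
        if batch_type == "examples":
            cost = len(current) + 1
        else:  # batch_type == "tokens"
            cost = current_tokens + tokens
        if current and cost > max_batch_size:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches  # batches of indices into the original inputs
```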

python/cpp/translator.cc

Lines changed: 4 additions & 4 deletions
@@ -372,10 +372,10 @@ namespace ctranslate2 {
     Arguments:
       source: Batch of source tokens.
       target_prefix: Optional batch of target prefix tokens.
-      max_batch_size: The maximum batch size. If the number of inputs is greater than
-        :obj:`max_batch_size`, the inputs are sorted by length and split by chunks of
-        :obj:`max_batch_size` examples so that the number of padding positions is
-        minimized.
+      max_batch_size: The maximum batch size. If the number of inputs is greater than :obj:`max_batch_size`,
+        the inputs are sorted by length and split by chunks of :obj:`max_batch_size` examples
+        (or tokens when :obj:`batch_type`="tokens") so that the number of padding positions
+        is minimized.
       batch_type: Whether :obj:`max_batch_size` is the number of "examples" or "tokens".
       asynchronous: Run the translation asynchronously.
       beam_size: Beam size (1 for greedy search).

python/ctranslate2/converters/fairseq.py

Lines changed: 3 additions & 1 deletion
@@ -146,7 +146,9 @@ def _load(self):
         import_user_module(argparse.Namespace(user_dir=self._user_dir))

         with torch.no_grad():
-            checkpoint = checkpoint_utils.load_checkpoint_to_cpu(self._model_path)
+            checkpoint = torch.load(
+                self._model_path, map_location=torch.device("cpu"), weights_only=False
+            )
             args = checkpoint["args"] or checkpoint["cfg"]["model"]

             args.data = self._data_dir
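Both converter changes in this commit pass `weights_only=False` to `torch.load`. Recent PyTorch versions (2.6 and later) default `weights_only` to `True`, which restricts unpickling to tensor data; fairseq and OpenNMT-py checkpoints also pickle plain Python objects such as the `argparse.Namespace` holding the model config, so the full unpickler is required, and the checkpoint must therefore come from a trusted source. A minimal round-trip sketch (the helper name is illustrative):

```python
import argparse
import os
import tempfile

import torch

def load_checkpoint_cpu(path):
    # weights_only=False restores arbitrary pickled objects (e.g. an
    # argparse.Namespace with the model config), not just tensors.
    # Only use it on checkpoints you trust.
    return torch.load(path, map_location=torch.device("cpu"), weights_only=False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pt")
    torch.save({"args": argparse.Namespace(data="corpus")}, path)
    checkpoint = load_checkpoint_cpu(path)
```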

python/ctranslate2/converters/opennmt_py.py

Lines changed: 3 additions & 1 deletion
@@ -174,7 +174,9 @@ def __init__(self, model_path: str):
     def _load(self):
         import torch

-        checkpoint = torch.load(self._model_path, map_location="cpu")
+        checkpoint = torch.load(
+            self._model_path, map_location="cpu", weights_only=False
+        )

         src_vocabs, tgt_vocabs = get_vocabs(checkpoint["vocab"])
