Merged
Changes from 47 commits
Commits
69 commits
64f38a5
feat: add HF dependencies (as a group)
meilame-tayebjee Oct 22, 2025
3703f48
feat: add WordPiece tokenize
meilame-tayebjee Oct 22, 2025
1266287
chore: rename file to ngram
meilame-tayebjee Oct 27, 2025
d2563ea
feat: improve base tokenizer, add HF abstract
meilame-tayebjee Oct 27, 2025
ae045ab
feat: change inheritance to HFTokenizer
meilame-tayebjee Oct 27, 2025
c6eac58
feat(dataset): init
meilame-tayebjee Oct 27, 2025
c25eb36
fix: add update of vocab size in post training
meilame-tayebjee Oct 27, 2025
d897bef
fix: categorical tensors set to None instead of empty tensors when no…
meilame-tayebjee Oct 27, 2025
51be1d1
feat: add ruff and datasets dep
meilame-tayebjee Oct 31, 2025
b53a10d
feat: first working example for model/module
meilame-tayebjee Oct 31, 2025
6f3417c
chore: fix signature
meilame-tayebjee Oct 31, 2025
c600f18
chore: default value for batch_idx in predict
meilame-tayebjee Nov 3, 2025
85cb8b8
feat!: violently modularize and simplify forward+checking
meilame-tayebjee Nov 3, 2025
dc863ff
chore: remove tokenizer (now it is ngram tokenizer)
meilame-tayebjee Nov 3, 2025
064b73f
feat!(components): first working example with full modularity
meilame-tayebjee Nov 4, 2025
164cccf
fix: avoid bugs with numpy arrays in boolean contexts
meilame-tayebjee Nov 5, 2025
c5b9673
feat: add smooth imports for HF and output_dim field
meilame-tayebjee Nov 5, 2025
a0fe18c
feat!(wrapper class): finalize orchestration tokenizer, dataset, mode…
meilame-tayebjee Nov 5, 2025
ddd7cec
fix: return only optimizer when scheduler is none
meilame-tayebjee Nov 5, 2025
32e6805
feat(test): clean tests (wip)
meilame-tayebjee Nov 5, 2025
8fdaf0c
chore: clean
meilame-tayebjee Nov 5, 2025
a7f71d3
feat: enable to choose context size in tokenizer
meilame-tayebjee Nov 5, 2025
0a9eda5
chore: pin_memory to default False (avoid warning on CPU run)
meilame-tayebjee Nov 7, 2025
6d951fe
feat: add __repr__ for all components
meilame-tayebjee Nov 7, 2025
c31ad43
chore: format
meilame-tayebjee Nov 7, 2025
956b7a3
feat!(HF): enable load from pretrained
meilame-tayebjee Nov 7, 2025
a497697
chore: update description
meilame-tayebjee Nov 7, 2025
2fda9c2
feat: __call__ for tokenizers is tokenize
meilame-tayebjee Nov 7, 2025
13b9de4
feat(tokenizers): clean __call__ and __repr__, add offset return for e…
meilame-tayebjee Nov 7, 2025
f55452b
feat!(explainability): finalize explainability feature at word and ch…
meilame-tayebjee Nov 7, 2025
0262109
chore: remove useless file
meilame-tayebjee Nov 7, 2025
6bdb750
fix: typo in trainer_params max_epochs
meilame-tayebjee Nov 10, 2025
830a45c
feat!(tokenizer): ensure output is consistent across all tokenizers
meilame-tayebjee Nov 10, 2025
c7307f5
fix: move hf-dep to optional dependencies
meilame-tayebjee Nov 10, 2025
a5b3e4d
Merge branch 'main' into hf_tokenizer
meilame-tayebjee Nov 10, 2025
934b041
feat!(attention): enable attention logic
meilame-tayebjee Nov 10, 2025
5e150b2
fix: check if categorical var are present before checking their arrays
meilame-tayebjee Nov 10, 2025
162e296
fix: no persistent_workers if num_workers=0
meilame-tayebjee Nov 10, 2025
1591bd9
fix: closing parenthesis
meilame-tayebjee Nov 12, 2025
1af9e53
fix: truncation=True is needed
meilame-tayebjee Nov 12, 2025
4ca1807
add ipywidgets
meilame-tayebjee Nov 12, 2025
927a5e7
fix: check_Y problem of indexes
meilame-tayebjee Nov 12, 2025
7fdb4e3
fix: truncation=True is needed
meilame-tayebjee Nov 12, 2025
4e36940
remove unnecessary print
meilame-tayebjee Nov 12, 2025
d44d051
progress on doc
meilame-tayebjee Nov 12, 2025
a179c37
fix: load model on cpu to avoid pb after training
meilame-tayebjee Nov 12, 2025
ea26799
progress on docs
meilame-tayebjee Nov 12, 2025
269c76a
fix!(explainability): remove nan words and fix plotting
meilame-tayebjee Nov 12, 2025
89cc8fe
examples : fix basic_classification after refactor
micedre Nov 12, 2025
1b62eee
Fix check for categorical variable
micedre Nov 12, 2025
704fe14
Adapt examples to new package architecture
micedre Nov 13, 2025
be28866
Merge branch 'hf_tokenizer' of https://github.com/InseeFrLab/torchTex…
meilame-tayebjee Nov 13, 2025
be4acf2
chore: first draft of example notebook. WIP
meilame-tayebjee Nov 13, 2025
0f9b4b4
refactor: replace cpu_run with accelerator in TrainingConfig
meilame-tayebjee Nov 17, 2025
5102f82
feat!(tokenizer-ngram): add very fast ngram tokenizer
meilame-tayebjee Nov 18, 2025
ab58e26
doc: clean example notebook
meilame-tayebjee Nov 19, 2025
45ace28
fix: better handling of truncation to avoid warning
meilame-tayebjee Nov 19, 2025
b2e797b
doc: fix readme
meilame-tayebjee Nov 19, 2025
84b118b
fix: allow tokenizer not to have train attribute
meilame-tayebjee Nov 20, 2025
3c0a85a
feat(ngram): add return offsets and word_ids + fix output_dim
meilame-tayebjee Nov 20, 2025
ab70485
fix: update vocab_size after training
meilame-tayebjee Nov 20, 2025
27a11bb
fix: add a flag for return_word_ids
meilame-tayebjee Nov 20, 2025
823467b
fix: add a flag for return_word_ids
meilame-tayebjee Nov 20, 2025
93a6e80
Merge branch 'hf_tokenizer' of https://github.com/InseeFrLab/torchTex…
meilame-tayebjee Nov 20, 2025
4e2ffa5
fix: replace _build_vocab by train
meilame-tayebjee Nov 20, 2025
519a32d
feat(test): add test of all pipeline with different tokenizers
meilame-tayebjee Nov 20, 2025
6017a22
chore: remove old file
meilame-tayebjee Nov 20, 2025
aa70919
fix: right command to install HF dependencies in warning
meilame-tayebjee Nov 20, 2025
41a15f0
chore: change HF opt. dep. group name to huggingface
meilame-tayebjee Nov 20, 2025
2,227 changes: 807 additions & 1,420 deletions notebooks/example.ipynb

Large diffs are not rendered by default.

20 changes: 12 additions & 8 deletions pyproject.toml
@@ -1,11 +1,9 @@
[project]
name = "torchtextclassifiers"
description = "An implementation of the https://github.com/facebookresearch/fastText supervised learning algorithm for text classification using Pytorch."
description = "A text classification toolkit to easily build, train and evaluate deep learning text classifiers using PyTorch."
authors = [
{ name = "Tom Seimandi", email = "tom.seimandi@gmail.com" },
{ name = "Julien Pramil", email = "julien.pramil@insee.fr" },
{ name = "Meilame Tayebjee", email = "meilame.tayebjee@insee.fr" },
{ name = "Cédric Couralet", email = "cedric.couralet@insee.fr" },
{ name = "Meilame Tayebjee", email = "meilame.tayebjee@insee.fr" },
]
readme = "README.md"
repository = "https://github.com/InseeFrLab/torchTextClassifiers"
@@ -31,7 +29,10 @@ dev = [
"nltk",
"unidecode",
"captum",
"pyarrow"
"pyarrow",
"pre-commit>=4.3.0",
"ruff>=0.14.3",
"ipywidgets>=8.1.8",
]
docs = [
"sphinx>=5.0.0",
@@ -46,6 +47,12 @@ docs = [
[project.optional-dependencies]
explainability = ["unidecode", "nltk", "captum"]
preprocess = ["unidecode", "nltk"]
hf-dep = [
"tokenizers>=0.22.1",
"transformers>=4.57.1",
"datasets>=4.3.0",
]


[build-system]
requires = ["uv_build>=0.9.3,<0.10.0"]
@@ -58,6 +65,3 @@
[tool.uv.build-backend]
module-name="torchTextClassifiers"
module-root = ""
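
The `hf-dep` extra above gates the HuggingFace stack behind an optional install. A minimal sketch of the guarded-import pattern the commit log suggests ("feat: add smooth imports for HF", "fix: right command to install HF dependencies in warning") — `HAS_HF` and `require_hf` are illustrative names, not the package's actual API:

```python
# Optional-dependency guard: import lazily, fail with an install hint.
# (Pattern assumed from the commit messages; not the package's real code.)
try:
    import transformers  # noqa: F401

    HAS_HF = True
except ImportError:
    HAS_HF = False


def require_hf() -> None:
    """Raise a helpful error when the HuggingFace extra is missing."""
    if not HAS_HF:
        raise ImportError(
            "HuggingFace dependencies are missing; install them with "
            '`pip install "torchtextclassifiers[hf-dep]"`.'
        )
```

Callers invoke `require_hf()` at the top of any HF-backed code path, so a plain install of the package works until an HF feature is actually used.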



71 changes: 33 additions & 38 deletions tests/conftest.py
@@ -1,19 +1,22 @@
import pytest
from unittest.mock import Mock

import numpy as np
from unittest.mock import Mock, MagicMock
import pytest


@pytest.fixture
def sample_text_data():
"""Sample text data for testing."""
return np.array([
"This is a positive example",
"This is a negative example",
"Another positive case",
"Another negative case",
"Good example here",
"Bad example here"
])
return np.array(
[
"This is a positive example",
"This is a negative example",
"Another positive case",
"Another negative case",
"Good example here",
"Bad example here",
]
)


@pytest.fixture
@@ -25,14 +28,7 @@ def sample_labels():
@pytest.fixture
def sample_categorical_data():
"""Sample categorical data for testing."""
return np.array([
[1, 2],
[2, 1],
[1, 3],
[3, 1],
[2, 2],
[3, 3]
])
return np.array([[1, 2], [2, 1], [1, 3], [3, 1], [2, 2], [3, 3]])


@pytest.fixture
@@ -48,33 +44,32 @@


@pytest.fixture
def fasttext_config():
"""Mock FastText configuration."""
from torchTextClassifiers.classifiers.fasttext.core import FastTextConfig
config = FastTextConfig(
def model_config():
"""Mock model configuration."""
from torchTextClassifiers import ModelConfig

config = ModelConfig(
embedding_dim=10,
sparse=False,
num_tokens=1000,
min_count=1,
min_n=3,
max_n=6,
len_word_ngrams=2,
num_classes=2
categorical_vocabulary_sizes=[4, 5],
categorical_embedding_dims=[3, 4],
num_classes=10,
)
return config


@pytest.fixture
def mock_tokenizer():
"""Mock NGramTokenizer for testing."""
"""Mock BaseTokenizer for testing."""
tokenizer = Mock()
tokenizer.min_count = 1
tokenizer.min_n = 3
tokenizer.max_n = 6
tokenizer.num_tokens = 1000
tokenizer.word_ngrams = 2
tokenizer.padding_index = 999
tokenizer.vocab_size = 1000
tokenizer.padding_idx = 1
tokenizer.tokenize = Mock(
return_value={
"input_ids": np.array([[1, 2, 3], [4, 5, 6]]),
"attention_mask": np.array([[1, 1, 1], [1, 1, 1]]),
}
)
tokenizer.output_dim = 50
return tokenizer


@@ -108,4 +103,4 @@ def mock_dataset():
@pytest.fixture
def mock_dataloader():
"""Mock dataloader for testing."""
return Mock()
return Mock()
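
A test consuming the reworked `mock_tokenizer` fixture might look like the following sketch. The attribute names (`vocab_size`, `padding_idx`, `output_dim`) and the `tokenize` return shape are taken from the diff above; the stand-in is rebuilt inline so the snippet is self-contained, and the assertions are illustrative, not part of the PR:

```python
from unittest.mock import Mock

import numpy as np

# Rebuild a tokenizer stand-in shaped like the `mock_tokenizer` fixture.
tokenizer = Mock()
tokenizer.vocab_size = 1000
tokenizer.padding_idx = 1
tokenizer.output_dim = 50
tokenizer.tokenize = Mock(
    return_value={
        "input_ids": np.array([[1, 2, 3], [4, 5, 6]]),
        "attention_mask": np.array([[1, 1, 1], [1, 1, 1]]),
    }
)

batch = tokenizer.tokenize(["Good example here", "Bad example here"])
# Token ids and attention mask share the same (batch, seq_len) shape.
assert batch["input_ids"].shape == batch["attention_mask"].shape == (2, 3)
tokenizer.tokenize.assert_called_once()
```

Because `Mock` records calls, tests can verify both the tokenized output consumed downstream and that the pipeline invoked `tokenize` the expected number of times.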