model2vec/distill/distillation.py
+10 -21 (10 additions & 21 deletions)
@@ -39,9 +39,9 @@ def distill_from_model(
     pca_dims: PCADimType = 256,
     apply_zipf: bool | None = None,
     sif_coefficient: float | None = 1e-4,
-    use_subword: bool = True,
     token_remove_pattern: str | None = r"\[unused\d+\]",
     quantize_to: DType | str = DType.Float16,
+    use_subword: bool | None = None,
 ) -> StaticModel:
     """
     Distill a staticmodel from a sentence transformer.
@@ -63,18 +63,20 @@ def distill_from_model(
         Zipf weighting is now controlled by the sif_coefficient parameter. If this is set to None, no weighting is applied.
     :param sif_coefficient: The SIF coefficient to use. If this is None, no weighting is applied.
         Should be a value > 0 and < 1.0. A value of 1e-4 is a good default.
-    :param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
     :param token_remove_pattern: If this is set to a string, we compile this into a regex. Any tokens that conform to this regex pattern will be removed from the vocabulary.
         If the pattern is so general that it removes all tokens, we throw an error. If the pattern can't be compiled into a valid regex, we also throw an error.
     :param quantize_to: The data type to quantize to. Can be any of the DType enum members or their string equivalents.
+    :param use_subword: DEPRECATED: If this is not set to None, we show a warning. It doesn't do anything.
     :return: A StaticModel
 
     """
+    if use_subword is not None:
+        logger.warning(
+            "The `use_subword` parameter is deprecated and will be removed in the next release. It doesn't do anything."
     Validate the parameters passed to the distillation function.
 
-    :param vocabulary: The vocabulary to use.
     :param apply_zipf: DEPRECATED: This parameter used to control whether Zipf is applied.
         Zipf weighting is now controlled by the sif_coefficient parameter. If this is set to None, no weighting is applied.
     :param sif_coefficient: The SIF coefficient to use. If this is None, no weighting is applied.
         Should be a value >= 0 and < 1.0. A value of 1e-4 is a good default.
-    :param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
     :param token_remove_pattern: If this is set to a string, we compile this into a regex. Any tokens that conform to this regex pattern will be removed from the vocabulary.
     :return: The SIF coefficient to use.
-    :raises: ValueError if the PCA dimension is larger than the number of dimensions in the embeddings.
-    :raises: ValueError if the vocabulary contains duplicate tokens.
     :raises: ValueError if the regex can't be compiled.
-    :raises: ValueError if the vocabulary is empty after token removal.
 
     """
     if apply_zipf is not None:
@@ -191,11 +185,6 @@ def _validate_parameters(
         if not 0 < sif_coefficient < 1.0:
             raise ValueError("SIF coefficient must be a value > 0 and < 1.0.")
 
-    if not use_subword and vocabulary is None:
-        raise ValueError(
-            "You must pass a vocabulary if you don't use subword tokens. Either pass a vocabulary, or set use_subword to True."
-        )
-
     token_remove_regex: re.Pattern | None = None
     if token_remove_pattern is not None:
         try:
@@ -213,10 +202,10 @@ def distill(
     pca_dims: PCADimType = 256,
     apply_zipf: bool | None = None,
     sif_coefficient: float | None = 1e-4,
-    use_subword: bool = True,
     token_remove_pattern: str | None = r"\[unused\d+\]",
     trust_remote_code: bool = False,
     quantize_to: DType | str = DType.Float16,
+    use_subword: bool | None = None,
 ) -> StaticModel:
     """
     Distill a staticmodel from a sentence transformer.
@@ -237,10 +226,10 @@ def distill(
         Zipf weighting is now controlled by the sif_coefficient parameter. If this is set to None, no weighting is applied.
     :param sif_coefficient: The SIF coefficient to use. If this is None, no weighting is applied.
         Should be a value >= 0 and < 1.0. A value of 1e-4 is a good default.
-    :param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
     :param token_remove_pattern: If this is set to a string, we compile this into a regex. Any tokens that conform to this regex pattern will be removed from the vocabulary.
     :param trust_remote_code: Whether to trust the remote code. If this is False, we will only load components coming from `transformers`. If this is True, we will load all components.
     :param quantize_to: The data type to quantize to. Can be any of the DType enum members or their string equivalents.
+    :param use_subword: DEPRECATED: If this is not set to None, we show a warning. It doesn't do anything.
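
The deprecation pattern this diff applies to `use_subword` can be sketched in isolation. This is a minimal illustration and not the model2vec code: `distill_sketch` and its return value are hypothetical. Only the `use_subword: bool | None = None` signature, the `is not None` check, and the warning text are taken from the diff.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def distill_sketch(pca_dims: int = 256, use_subword: bool | None = None) -> dict[str, int]:
    """Hypothetical stand-in for a function that carries a deprecated parameter."""
    # None means the caller did not pass the parameter. Any explicit value,
    # including False, triggers the warning.
    if use_subword is not None:
        logger.warning(
            "The `use_subword` parameter is deprecated and will be removed in the next release. It doesn't do anything."
        )
    # The deprecated value is ignored from here on.
    return {"pca_dims": pca_dims}
```

Changing the default from `True` to `None` is what lets the function tell "not passed" apart from an explicit `True` or `False`. Moving the parameter to the end of the signature, as the diff does, would also change the position of any argument passed positionally after it, assuming callers pass these parameters positionally at all.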