Commit b936329

fix: separate loading configuration handling for jieba and lindera (#3932)
This is a pull request for #3891. It makes the following changes:

- Changed the Lindera configuration file format to YAML (see: https://github.com/lindera/lindera?tab=readme-ov-file#configuration-file).
- Replaced all lindera `config.json` files with `config.yml`.
- Updated the documentation to reflect the new YAML-based configuration.
- Introduced the `LINDERA_CONFIG_PATH` environment variable, which allows specifying a custom config file.

Please let me know if backward compatibility needs to be preserved.
1 parent f830d8b commit b936329

18 files changed

Lines changed: 226 additions & 141 deletions

docs/tokenizer.rst

Lines changed: 9 additions & 10 deletions
@@ -70,18 +70,17 @@ Using the Model
 User Dictionaries
 ~~~~~~~~~~~~~~~~~
 
-Create a file named config.json in the root directory of the current model.
+Create a file named config.yml in the root directory of your model, or specify a custom YAML file using the `LINDERA_CONFIG_PATH` environment variable.
+If both are provided, the config.yml in the root directory will be used.
+For more detailed configuration methods, see the lindera documentation at https://github.com/lindera/lindera/.
 
-.. code-block::json
-    {
-        "main": "main",
-        "users": "path/to/user/dict.bin",
-        "user_kind": "ipadic|ko-dic|unidic"
-    }
+.. code-block:: yaml
 
-- The "main" field is optional. If not filled, the default is the "main" directory.
-- "user" is the path of the user dictionary. The user dictionary can be passed as a CSV file or as a binary file compiled by lindera-cli.
-- The "user_kind" field can be left blank if the user dictionary is in binary format. If it's in CSV format, you need to specify the type of the language model.
+    segmenter:
+      mode: "normal"
+      dictionary:
+        # Note: in lance, the `kind` field is not supported. You need to specify the model path using the `path` field instead.
+        path: /path/to/lindera/ipadic/main
 
 
 Create your own language model
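
The lookup order stated in the updated documentation (a `config.yml` in the model root is used even when `LINDERA_CONFIG_PATH` is also set) can be sketched as follows. This is a minimal illustration of that precedence, not the code from this commit; `resolve_lindera_config` and its parameters are hypothetical names, and returning `None` when neither file exists is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_lindera_config(
    model_root: str,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the config file to load, or None if neither source exists.

    Hypothetical helper: it mirrors only the documented precedence.
    """
    env = os.environ if env is None else env

    # 1. A config.yml in the model's root directory takes precedence.
    local = Path(model_root) / "config.yml"
    if local.is_file():
        return local

    # 2. Otherwise fall back to the file named by LINDERA_CONFIG_PATH.
    custom = env.get("LINDERA_CONFIG_PATH")
    if custom and Path(custom).is_file():
        return Path(custom)

    # 3. Nothing configured (assumed behaviour for this sketch).
    return None
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.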

python/python/tests/models/lindera/invalid_dict/config.json

Lines changed: 0 additions & 4 deletions
This file was deleted.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+segmenter:
+  mode: "normal"
+  dictionary:
+    path: "./python/tests/models/lindera/ipadic/main"
+  user_dictionary:
+    path: "./python/tests/models/lindera/invalid_dict/invalid.bin"
+    kind: "ipadic"

python/python/tests/models/lindera/invalid_dict2/config.json

Lines changed: 0 additions & 4 deletions
This file was deleted.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+segmenter:
+  mode: "normal"
+  dictionary:
+    path: "./python/tests/models/lindera/ipadic/main"
+  user_dictionary:
+    path: "./python/tests/models/lindera/invalid_dict2/ipadic_simple_userdic.bin"
+    kind: "ipadic"
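
The two test configurations above share one shape: a `segmenter` with a `mode`, a `dictionary` path, and a `user_dictionary` with a path and kind. As a rough sketch, the text of such a file could be generated with the standard library alone. `render_lindera_config` is a hypothetical helper, and the nesting of `user_dictionary` under `segmenter` is assumed from the layout shown in the Lindera README, since indentation is not recoverable from this page.

```python
from typing import Optional


def render_lindera_config(
    dictionary_path: str,
    user_dictionary_path: Optional[str] = None,
    user_dictionary_kind: str = "ipadic",
    mode: str = "normal",
) -> str:
    """Build the text of a config.yml (hypothetical helper, stdlib only).

    Values are wrapped in double quotes without escaping, so this sketch
    assumes paths that contain no quotes or backslashes.
    """
    lines = [
        "segmenter:",
        f'  mode: "{mode}"',
        "  dictionary:",
        # Per the updated docs, the model is given by `path`, not `kind`.
        f'    path: "{dictionary_path}"',
    ]
    if user_dictionary_path is not None:
        lines += [
            "  user_dictionary:",
            f'    path: "{user_dictionary_path}"',
            f'    kind: "{user_dictionary_kind}"',
        ]
    return "\n".join(lines) + "\n"
```

Omitting `user_dictionary_path` yields the four-line form used by the smaller test configurations in this commit.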
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+segmenter:
+  mode: "normal"
+  dictionary:
+    path: "./python/tests/models/lindera/ipadic/main"
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+segmenter:
+  mode: "normal"
+  dictionary:
+    path: "./python/tests/models/lindera/ipadic/main"
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+segmenter:
+  mode: "normal"
+  dictionary:
+    path: "./python/tests/models/lindera/ipadic/main"
+
+character_filters:
+  - kind: mapping
+    args:
+      mapping:
+        成田: ほげほげ
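
The `mapping` character filter in this test configuration rewrites the input text before segmentation, here replacing 成田 with ほげほげ. The sketch below shows the general idea of such a substitution pass in plain Python. It is not Lindera's implementation: `apply_mapping` is a hypothetical function, and trying longer keys first is an assumption made for the sketch.

```python
from typing import Mapping


def apply_mapping(text: str, mapping: Mapping[str, str]) -> str:
    """Replace each mapping key found in `text` with its value.

    Scans left to right and, at each position, tries longer keys first
    (an assumption of this sketch, not a documented Lindera guarantee).
    """
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No key matches at this position: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(apply_mapping("成田国際空港", {"成田": "ほげほげ"}))  # ほげほげ国際空港
```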
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+segmenter:
+  mode: "normal"
+  dictionary:
+    path: "./python/tests/models/lindera/ipadic/main"

python/python/tests/models/lindera/user_dict/config.json

Lines changed: 0 additions & 5 deletions
This file was deleted.