@@ -19,7 +19,7 @@ Some important parameters are listed as follows.
     learning_rate: 0.001
     network_config:
       embed_dropout: 0.4
-      encoder_dropout: 0.4
+      post_encoder_dropout: 0.4
       rnn_dim: 512
       rnn_layers: 1
 
@@ -30,7 +30,7 @@ The training command is:
     python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml
 
 After training for 50 epochs, the checkpoint with the best validation performance is stored for testing. The
-average P@1 score on the test data set is 81.40%.
+average P@1 score on the test data set is 80.36%.
 
 Next, the ``learning_rate`` is changed to 0.003 while other parameters are kept the same.
 
@@ -39,12 +39,12 @@ Next, the ``learning_rate`` is changed to 0.003 while other parameters are kept
     learning_rate: 0.003
     network_config:
       embed_dropout: 0.4
-      encoder_dropout: 0.4
+      post_encoder_dropout: 0.4
       rnn_dim: 512
       rnn_layers: 1
 
-By the same training command, the P@1 score of the second parameter set is about 78.14%, which is
-4% lower than the first one. This demonstrates the importance of parameter selection.
+By the same training command, the P@1 score of the second parameter set is about 78.65%, which is
+2% lower than the first one. This demonstrates the importance of parameter selection.
 
 For more striking examples on the importance of parameter selection, you can see `this paper <https://www.csie.ntu.edu.tw/~cjlin/papers/parameter_selection/acl2021_parameter_selection.pdf>`_.
 
@@ -61,7 +61,7 @@ In the configuration file, we specify a grid search on the following parameters.
     learning_rate: ['grid_search', [0.003, 0.001, 0.0003]]
     network_config:
       embed_dropout: ['grid_search', [0, 0.2, 0.4, 0.6, 0.8]]
-      encoder_dropout: ['grid_search', [0, 0.2, 0.4]]
+      post_encoder_dropout: ['grid_search', [0, 0.2, 0.4]]
       rnn_dim: ['grid_search', [256, 512, 1024]]
       rnn_layers: 1
     embed_cache_dir: .vector_cache
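Each ``['grid_search', [...]]`` entry lists candidate values, and the search covers their Cartesian product. As a rough sketch of how large this grid is (plain Python for illustration; this is not code from ``search_params.py``):

```python
from itertools import product

# Candidate values copied from the grid above; names follow the configuration file.
search_space = {
    "learning_rate": [0.003, 0.001, 0.0003],
    "embed_dropout": [0, 0.2, 0.4, 0.6, 0.8],
    "post_encoder_dropout": [0, 0.2, 0.4],
    "rnn_dim": [256, 512, 1024],
}

# Grid search tries every combination of the listed values.
trials = [dict(zip(search_space, values)) for values in product(*search_space.values())]
print(len(trials))  # → 135 (3 * 5 * 3 * 3)
```

Each of the 135 combinations requires a full training run, which is why the grid search below takes hours rather than minutes.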
@@ -75,10 +75,13 @@ Then the training command is:
 
     python3 search_params.py --config example_config/EUR-Lex/bigru_lwan_tune.yml
 
-The process finds the best parameter set of ``learning_rate=0.0003``, ``embed_dropout=0.4``, ``encoder_dropout=0.4``, and ``rnn_dim=512``.
+The process finds the best parameter set of ``learning_rate=0.0003``, ``embed_dropout=0.6``, ``post_encoder_dropout=0.2``, and ``rnn_dim=256``.
 
 After the search process, the program applies the best parameters to obtain the final model by adding
-the validation set for training. The average P@1 score is 83.65% on the test set.
+the validation set for training. The average P@1 score is 81.99% on the test set, better
+than the result without a hyper-parameter search. Note that after obtaining the best
+hyper-parameters, we combine the training and validation sets to train a final model for testing.
+For more details about re-training, please refer to the `Re-train or not`_ section.
 
 Early Stopping of the Parameter Search
 --------------------------------------
@@ -101,10 +104,10 @@ First, uncomment the following lines in the
       brackets: 1
 
 Under the same computing environment and the same command, the best parameter set of ``learning_rate=0.001``,
-``embed_dropout=0.4``, ``encoder_dropout=0.2``, and ``rnn_dim=512`` is found in 47% of the time compared to the
-grid search, while the average test P@1 score = 82.90% is similar to the result without early stopping.
+``embed_dropout=0.4``, ``post_encoder_dropout=0.2``, and ``rnn_dim=512`` is found in 26% of the time compared to the
+grid search, while the average test P@1 score is similar to the result without early stopping.
 
-A summary of results is in the following table. Four Nvidia Tesla V100 GPUs were used in this experiment.
+A summary of results is in the following table. Eight Nvidia Tesla V100 GPUs were used in this experiment.
 
 
 .. list-table::
@@ -119,20 +122,149 @@ A summary of results is in the following table. Four Nvidia Tesla V100 GPUs were
      - Training Time (GPU)
 
    * - wo/ parameter selection
-     - 20.48
-     - 51.56
-     - 78.13
-     - 52.16
-     - 27.8 minutes
+     - 20.79
+     - 54.91
+     - 80.36
+     - 53.89
+     - 42.5 minutes
    * - w/ parameter selection (grid search)
-     - 23.65
-     - 59.41
-     - 83.65
-     - 58.72
-     - 24.6 hours
+     - 24.43
+     - 57.99
+     - 81.99
+     - 57.57
+     - 23.0 hours
    * - w/ parameter selection (ASHA)
-     - 22.70
-     - 57.42
-     - 82.90
-     - 56.38
-     - 11.6 hours
+     - 23.07
+     - 58.03
+     - 82.33
+     - 57.07
+     - 5.89 hours
+
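ASHA saves time by stopping unpromising trials early instead of training every configuration to completion. The core idea it builds on, successive halving, can be sketched in a few lines. This is not the scheduler used by ``search_params.py``; it is a minimal illustration with a made-up, deterministic scoring function standing in for real validation results:

```python
def successive_halving(configs, evaluate, min_epochs=1, reduction_factor=3):
    """Evaluate all configs on a small budget, keep the best
    1/reduction_factor of them, and give the survivors a larger budget."""
    epochs = min_epochs
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, epochs), reverse=True)
        configs = scored[: max(1, len(configs) // reduction_factor)]
        epochs *= reduction_factor
    return configs[0]

# Toy usage: the "validation score" here is deterministic, so the config with
# the largest score survives every rung.
candidates = [{"lr": lr, "dropout": d}
              for lr in (0.003, 0.001, 0.0003) for d in (0.0, 0.2, 0.4)]
best = successive_halving(candidates, evaluate=lambda c, e: c["lr"] * (1 - c["dropout"]))
print(best)  # → {'lr': 0.003, 'dropout': 0.0}
```

ASHA extends this idea by promoting trials asynchronously, so many trials run in parallel without waiting for an entire rung to finish.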
+Re-train or not
+--------------------------------------
+
+In the `Grid Search over Parameters`_ section, we split the available data into training
+and validation sets for the hyper-parameter search. Methods like SVM usually train the
+final model with the best hyper-parameters on the combined training and validation sets.
+This approach maximizes the information available for model learning, and we refer to
+it as the "re-train" strategy.
+
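The strategy itself is simple enough to sketch in a few lines. The ``fit`` and ``score`` functions below are toy stand-ins for illustration only, not LibMultiLabel or SVM code:

```python
def select_and_retrain(candidates, train, val, fit, score):
    """Pick the hyper-parameter with the best validation score, then
    re-train on the union of training and validation data."""
    best = max(candidates, key=lambda c: score(fit(c, train), val))
    return fit(best, train + val)  # the final model sees all labeled data

# Toy stand-ins: a "model" is just the data mean scaled by hyper-parameter c,
# and the score is the negative absolute error against the validation mean.
fit = lambda c, data: c * sum(data) / len(data)
score = lambda model, data: -abs(model - sum(data) / len(data))

train, val = [1.0, 2.0, 3.0], [3.0, 5.0]
final_model = select_and_retrain([0.5, 1.0, 2.0], train, val, fit, score)
print(final_model)  # → 5.6
```

The key point is the last step: the winning hyper-parameter is fixed, and the model is refit on all labeled data rather than reusing the model trained on the training split alone.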
+.. However, when applied in deep learning, merging the validation set into the training
+.. set means that the optimization process, which previously relied on the validation set
+.. for termination, no longer works. While there is no definitively proven best termination
+.. criterion, a typical approach is to determine the optimal epoch during the
+.. hyper-parameter search based on the number of training steps that led to the best
+.. validation performance. This optimal epoch serves as a stopping criterion
+.. when training the model with all available data. This strategy has been shown
+.. to provide stable improvements while mitigating the risk of overfitting.
+
+Since re-training is usually beneficial, we have incorporated the strategy into ``search_params.py``.
+When the hyper-parameter search is done, the re-training process is automatically
+executed by default, as in the `Grid Search over Parameters`_ section.
+
+Though not recommended, you can use the argument ``--no_retrain`` to disable the
+re-training process.
+
+.. code-block:: bash
+
+    python search_params.py --config example_config/EUR-Lex/bigru_lwan.yml --no_retrain
+
+By doing so, the model achieving the best validation performance during the parameter
+search is returned instead.
+In this case, re-training improves P@1 by approximately 2%
+compared to the performance without re-training. The following test results illustrate
+the advantage of re-training.
+
+.. list-table::
+   :widths: 50 25 25 25 25
+   :header-rows: 1
+
+   * - Methods
+     - Macro-F1
+     - Micro-F1
+     - P@1
+     - P@5
+
+   * - wo/ re-training after hyper-parameter search
+     - 22.95
+     - 56.37
+     - 80.08
+     - 56.24
+
+   * - w/ re-training after hyper-parameter search
+     - 24.43
+     - 57.99
+     - 81.99
+     - 57.57
+
+In a different scenario, you may want to skip the parameter search but still
+re-train the model with your chosen hyper-parameters. Here is an example of how
+to do this.
+
+Let's train a BiGRU model using the configuration file from the `Direct Trying Some Parameters`_
+section, where the learning rate is set to 0.001. Note that because the validation set
+is not specified in the configuration file, the training dataset is partitioned into
+training and validation subsets to assess the performance at each epoch.
+
+.. code-block:: bash
+
+    python main.py --config example_config/EUR-Lex/bigru_lwan.yml
+
+Using the model obtained at the epoch with the best validation P@5,
+the test performance is:
+
+.. list-table::
+   :widths: 25 25 25 25
+   :header-rows: 1
+
+   * - Macro-F1
+     - Micro-F1
+     - P@1
+     - P@5
+
+   * - 20.79
+     - 54.91
+     - 80.36
+     - 53.89
+
+To get the epoch with the best validation performance, the following code snippet reads
+the log, extracts the performance metric for each epoch, and identifies the optimal epoch:
+
+.. code-block:: python
+
+    import json
+    import numpy as np
+
+    # The log file, which records the configuration and the validation performance
+    # of each epoch, is saved in the 'runs' directory by default.
+    with open('your_log_path_for_the_first_step.json', 'r') as r:
+        log = json.load(r)
+    log_metric = np.array([l[log["config"]["val_metric"]] for l in log["val"]])
+    optimal_idx = log_metric.argmax()  # if the validation metric is a loss, use argmin() instead
+    best_epoch = optimal_idx.item() + 1
+    print(best_epoch)
+
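You can sanity-check the snippet on a small hand-written log. The structure below is inferred from the snippet above (a metric name under ``config`` and one record per epoch under ``val``); it is not copied from an actual run:

```python
import numpy as np

# A made-up, minimal log with the assumed structure of the files under 'runs'.
log = {
    "config": {"val_metric": "P@5"},
    "val": [{"P@5": 0.48}, {"P@5": 0.55}, {"P@5": 0.53}],
}
log_metric = np.array([l[log["config"]["val_metric"]] for l in log["val"]])
best_epoch = log_metric.argmax().item() + 1  # epochs are 1-indexed
print(best_epoch)  # → 2
```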
+In this case, the optimal epoch should be 42.
+We then specify ``--merge_train_val`` to include the validation set for training and
+specify the number of epochs by ``--epochs``. Note that options given on the command
+line override those in the configuration file. Because there is no validation set,
+the model at the last epoch is returned.
+
+.. code-block:: bash
+
+    python main.py --config example_config/EUR-Lex/bigru_lwan.yml --epochs 42 --merge_train_val
+
+Similar to the previous case, the test performance improves after re-training:
+
+.. list-table::
+   :widths: 25 25 25 25
+   :header-rows: 1
+
+   * - Macro-F1
+     - Micro-F1
+     - P@1
+     - P@5
+
+   * - 22.65
+     - 57.06
+     - 83.10
+     - 56.34