TODO: [ ] fine-tuning to find the best parameters [ ] tensorboardX [ ] Prediction GUI
===================================================================== Jul 10, 2023
- With the recent version of Pandas, `read_csv` should be called with the parameter `keep_default_na=False` to prevent reading `None` as `NaN`, since 'None' is a word in normal English.
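A minimal demonstration of the difference (the `token`/`label` column names are illustrative):

```python
import io
import pandas as pd

# A token column where "None" is a legitimate English word, not a missing value.
csv_text = "token,label\nNone,other\nhola,es\n"

# Default behaviour: the string "None" is parsed as NaN.
df_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps "None" as the literal string.
df_fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```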
===================================================================== Oct 12, 2022
- Make reproducible output by fixing seed for torch and numpy
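A typical seeding preamble for this (the value 42 is an arbitrary placeholder; any fixed constant works):

```python
import random

import numpy as np
import torch

SEED = 42  # arbitrary fixed value

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # no-op on CPU-only machines
```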
===================================================================== July 28, 2022
- Adding labels for the Confusion Matrix axes
- Revision of the confusion matrix's labels
===================================================================== July 26, 2022
- `ignore_index=0` for `CrossEntropyLoss` to ignore the padding index. This option specifies a target value that is ignored and does not contribute to the input gradient, resulting in lower computation cost and a more accurate `loss` and `f1-score`; without `ignore_index`, all the paddings are also included in the criterion.
- Revision of the `flatten` function, applying the sentences' lengths in order to compute the loss more accurately.
- With these two modifications, the issue with padding seems solved.
- Now, the prediction of the first token is almost correct.
- `target_size` is now 3 instead of 4 (no longer counting the padding token); accordingly, the loss function needs to be fed `ignore_index=target_size`.
- Classification Report
- Confusion Matrix
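The `ignore_index` setup can be sketched as follows (tensor shapes and the `PAD_IDX` name are illustrative; here the padding id equals `target_size`, as noted above):

```python
import torch
import torch.nn as nn

TARGET_SIZE = 3          # number of real classes, as noted above
PAD_IDX = TARGET_SIZE    # padding label id sits just past the real classes

logits = torch.randn(2, 5, TARGET_SIZE)        # (batch, seq_len, classes)
targets = torch.tensor([[0, 1, 2, PAD_IDX, PAD_IDX],
                        [1, 0, PAD_IDX, PAD_IDX, PAD_IDX]])

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
# CrossEntropyLoss expects (N, C), so flatten the batch and time dimensions
loss = criterion(logits.view(-1, TARGET_SIZE), targets.view(-1))
```

The mean loss is then taken over non-padding positions only, which is why the reported loss and `f1-score` become more accurate.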
===================================================================== July 25, 2022
- The `list` in `list(zip(*batch))` of the `collate_fn` function was not necessary and just increased the running time.
- Moved `.to(device)` inside the `collate_fn` to get rid of device migration inside the training phase.
- Some retouches in char2vec
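A sketch of what the resulting `collate_fn` might look like (padding with `pad_sequence` and padding value 0 are my assumptions; the actual implementation may differ):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_fn(batch):
    # batch is a list of (sentence_tensor, label_tensor) pairs;
    # zip(*batch) already yields two tuples, no wrapping list(...) needed
    sents, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sents])
    sents = pad_sequence(sents, batch_first=True, padding_value=0)
    labels = pad_sequence(labels, batch_first=True, padding_value=0)
    # moving to the device here removes per-step migration from the training loop
    return sents.to(device), labels.to(device), lengths
```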
===================================================================== July 20, 2022
- Solved an issue with prediction. Using `set` to remove the redundant characters resulted in a new order in each run. To get rid of this issue, the `sorted` function guarantees that we have a unique order. Another solution to this problem is to save the `chr2id` dictionary with the model and reload it during prediction.
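The `sorted`-based fix can be sketched as (the `build_chr2id` helper name is mine; reserving id 0 for padding matches the `padding_idx=0` note):

```python
def build_chr2id(corpus):
    # sorted() gives a deterministic character order; a bare set() would
    # reshuffle the ids on every run, breaking a saved model at prediction time
    chars = sorted(set("".join(corpus)))
    return {c: i + 1 for i, c in enumerate(chars)}  # id 0 reserved for padding
```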
===================================================================== July 19, 2022
- `padding_idx=0` for the `nn.Embedding` layer.
- GPU support; run and tested on Google Colab
- Increasing dropout, out_ch1, and out_ch2 to 0.3, 37, and 35, respectively, doesn't help much (an `f1-score` of 0.926 in comparison to 0.924!), so I reverted them to the smaller sizes.
===================================================================== July 18, 2022
- `predict.py` for ordinary use: `python predict.py --text [sample text] --model [pretrained model]`
- Since I've used F1 score with micro averaging, I should mention here that micro-F1 = micro-precision = micro-recall = accuracy, i.e. in all cases I've actually reported `accuracy`, not `f1-score`. From now on, I use the `macro` average for `f1-score`; the results should therefore be more realistic.
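The micro-F1 = accuracy identity is easy to verify with scikit-learn (toy labels, for illustration only):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy multiclass labels: micro-F1 collapses to plain accuracy,
# while macro-F1 penalizes the poorly handled minority classes.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 1]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)
```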
- Saving the best model for prediction
- multiple-width filter bank in the second layer of the Char2Vec --> better result and less overfitting.
BiLSTMtagger(
(word_embeddings): Char2Vec(
(embeds): Embedding(298, 9)
(conv1): Sequential(
(0): Conv1d(9, 12, kernel_size=(3,), stride=(1,))
(1): ReLU()
(2): Dropout(p=0.25, inplace=False)
)
(convs2): ModuleList(
(0): Sequential(
(0): Conv1d(12, 5, kernel_size=(3,), stride=(1,))
(1): ReLU()
)
(1): Sequential(
(0): Conv1d(12, 5, kernel_size=(4,), stride=(1,))
(1): ReLU()
)
(2): Sequential(
(0): Conv1d(12, 5, kernel_size=(5,), stride=(1,))
(1): ReLU()
)
)
(linear): Sequential(
(0): Linear(in_features=15, out_features=15, bias=True)
(1): ReLU()
)
)
(lstm): LSTM(15, 128, num_layers=2, batch_first=True, dropout=0.25, bidirectional=True)
(hidden2tag): Linear(in_features=256, out_features=4, bias=True)
)
===================================================================== July 17, 2022
- F1 score with `weighted` average instead of `micro`.
- Char2Vec class
- Removing repetitions of more than 4 characters in a token and truncating any word to a length of at most 20 characters ==> a slightly better result
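One plausible reading of the repetition rule, as a sketch (the `normalize_token` helper and the exact regex are my assumptions):

```python
import re

def normalize_token(token, max_repeat=4, max_len=20):
    # collapse any character repeated more than max_repeat times in a row,
    # then truncate the token to at most max_len characters
    token = re.sub(r"(.)\1{%d,}" % max_repeat, r"\1" * max_repeat, token)
    return token[:max_len]
```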
- Char2Vec+BiLSTM finished, with f1=0.9549, val_f1=0.9443; another slight improvement in the model
BiLSTMtagger(
(word_embeddings): Char2Vec(
(embeds): Embedding(298, 9)
(conv1): Sequential(
(0): Conv1d(9, 12, kernel_size=(3,), stride=(1,))
(1): ReLU()
(2): Dropout(p=0.1, inplace=False)
)
(conv2): Sequential(
(0): Conv1d(12, 15, kernel_size=(3,), stride=(1,))
(1): ReLU()
)
(linear): Sequential(
(0): Linear(in_features=15, out_features=15, bias=True)
(1): ReLU()
)
)
(lstm): LSTM(15, 128, num_layers=2, batch_first=True, dropout=0.25, bidirectional=True)
(hidden2tag): Linear(in_features=256, out_features=4, bias=True)
)
===================================================================== July 16, 2022
- Decipher the text/label from the output of the network
- Tokens should be considered in context, not as a collection of single tokens: in the following, `audio` is a Spanish token, not an English one ("si , prometo hacer un audio :)" means "yes, I promise to make an audio :)").
@andres_romero17 si , prometo hacer un audio :)
other es other es es es es other
- loss/f1_score plot
- data analysis around tweets and their tokens/chars
- code sanitization
===================================================================== July 15, 2022
- Printing the loss for train/val set on the screen
- computation of `f1_score` for both the training and validation sets shows the network convergence
- SGD, lr=0.1, hidden_dim=64
Epoch 1/40, loss=0.9072, val_loss=0.8901 ,train_f1=0.5998, val_f1=0.5462
Epoch 2/40, loss=0.6987, val_loss=0.7863 ,train_f1=0.7165, val_f1=0.6602
Epoch 3/40, loss=0.5788, val_loss=0.7573 ,train_f1=0.7714, val_f1=0.7342
Epoch 4/40, loss=0.4912, val_loss=0.7454 ,train_f1=0.8088, val_f1=0.7589
Epoch 5/40, loss=0.4221, val_loss=0.7322 ,train_f1=0.8367, val_f1=0.7747
Epoch 10/40, loss=0.2226, val_loss=0.6976 ,train_f1=0.9123, val_f1=0.7897
Epoch 15/40, loss=0.1427, val_loss=0.7406 ,train_f1=0.9431, val_f1=0.8072
Epoch 20/40, loss=0.1083, val_loss=0.6276 ,train_f1=0.9577, val_f1=0.8133
Epoch 25/40, loss=0.0925, val_loss=0.6425 ,train_f1=0.9648, val_f1=0.8163
Epoch 30/40, loss=0.0842, val_loss=0.6611 ,train_f1=0.9683, val_f1=0.8171
Epoch 35/40, loss=0.0792, val_loss=0.6735 ,train_f1=0.9701, val_f1=0.8178
Epoch 40/40, loss=0.0763, val_loss=0.6753 ,train_f1=0.9711, val_f1=0.8180
- Adam+ReduceLROnPlateau, lr=1e-3, wd=1e-5, hidden_dim=128
Epoch 1/7, loss=0.5991, val_loss=0.5572 ,train_f1=0.7311, val_f1=0.7483
Epoch 2/7, loss=0.2947, val_loss=0.4787 ,train_f1=0.9005, val_f1=0.8266
Epoch 3/7, loss=0.1783, val_loss=0.4336 ,train_f1=0.9485, val_f1=0.8379
Epoch 4/7, loss=0.1256, val_loss=0.4124 ,train_f1=0.9653, val_f1=0.8494
Epoch 5/7, loss=0.1049, val_loss=0.3998 ,train_f1=0.9698, val_f1=0.8512
Epoch 6/7, loss=0.0977, val_loss=0.3884 ,train_f1=0.9714, val_f1=0.8512
Epoch 7/7, loss=0.0940, val_loss=0.3817 ,train_f1=0.9725, val_f1=0.8529
- Minibatches made a great leap: train_f1=0.97, val_f1=0.94
Epoch 1/40, loss=0.5998, val_loss=0.4027 ,train_f1=0.7502, val_f1=0.7768
Epoch 2/40, loss=0.3764, val_loss=0.3790 ,train_f1=0.8179, val_f1=0.7971
Epoch 3/40, loss=0.3242, val_loss=0.3561 ,train_f1=0.8501, val_f1=0.8307
Epoch 4/40, loss=0.2618, val_loss=0.2922 ,train_f1=0.8861, val_f1=0.8741
Epoch 5/40, loss=0.2209, val_loss=0.2553 ,train_f1=0.9065, val_f1=0.8931
Epoch 10/40, loss=0.1291, val_loss=0.1723 ,train_f1=0.9460, val_f1=0.9291
Epoch 15/40, loss=0.0892, val_loss=0.1429 ,train_f1=0.9616, val_f1=0.9419
Epoch 20/40, loss=0.0665, val_loss=0.1471 ,train_f1=0.9675, val_f1=0.9409
Epoch 25/40, loss=0.0510, val_loss=0.1481 ,train_f1=0.9715, val_f1=0.9397
Epoch 30/40, loss=0.0420, val_loss=0.1676 ,train_f1=0.9742, val_f1=0.9397
Epoch 35/40, loss=0.0359, val_loss=0.1756 ,train_f1=0.9755, val_f1=0.9386
Epoch 40/40, loss=0.0323, val_loss=0.1934 ,train_f1=0.9765, val_f1=0.9403
- A BiLSTM with 2 layers and dropout helps prevent overfitting
===================================================================== July 14, 2022
- Data Class improvement
- several dictionaries to convert token, label, and char to id and vice versa
- making the coded sentences and their counterpart labels
- LSTM class
- `CodeSwitchDataset` as well as a customized DataLoader
===================================================================== July 13, 2022
- github repo initialization
- reading the paper
- starting the code with Data Class
- an issue with quoting in reading `tsv` files
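A common workaround, assuming pandas is used for reading: disable quote handling with `csv.QUOTE_NONE` (the sample `tsv_text` is illustrative):

```python
import csv
import io

import pandas as pd

# A stray double quote inside a token breaks the default quote handling.
tsv_text = 'token\tlabel\n"lol\tother\n:)\tother\n'

# QUOTE_NONE treats quotation marks as ordinary characters.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", quoting=csv.QUOTE_NONE)
```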