Guys,
I realized that there is some very low-hanging fruit that could easily make our WERs state of the art, which is neural LM rescoring. An advantage of our framework (possibly the key advantage) is that the decoding part is very easy, so we can easily rescore large N-best lists with neural LMs.
In addition, it's quite easy to manipulate variable-length sequences, so things like training and using LMs should be a little easier than they otherwise would be.
Here's what I propose: as a relatively easy baseline that can be extended later, we can train a word-piece neural LM (I recommend word-pieces because the vocab size could otherwise be quite large, making the embedding matrices difficult to train). So we'll need:
(i) some mechanism to split up words into word-pieces,
(ii) data preparation for the LM training, which in the Librispeech case would, I assume, include the additional text training data that Librispeech comes with,
(iii) a script to train the actual LM. I assume this would be quite similar to our conformer self-attention model, with a xent output (no forward-backward needed), except we'd use a different type of masking, i.e. a mask of size (B, T, T), because we need to limit attention to left-context only.
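To make the (B, T, T) mask in (iii) concrete, here is a minimal sketch in plain Python. The function name `left_context_mask` and the convention that True means "may attend" are assumptions for illustration only; a real training script would build the same thing as a tensor and might use the opposite convention.

```python
def left_context_mask(lengths):
    """Build a (B, T, T) nested list of booleans.

    mask[b][i][j] is True when query position i of sequence b may attend
    to key position j, i.e. j is at or to the left of i and j is not
    padding. (True = allowed is an assumed convention.)
    """
    T = max(lengths)  # pad every sequence to the longest one in the batch
    return [
        [[j <= i and j < n for j in range(T)] for i in range(T)]
        for n in lengths
    ]


# Batch of two sequences with 3 and 2 real tokens, padded to T = 3.
mask = left_context_mask([3, 2])
for row in mask[1]:
    print(row)
```

Padded query rows still see the real tokens to their left, so no row is fully masked; their outputs would simply be ignored in the loss.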
For decoding with the LM, we'd first do a decoding with our n-gram LM, get word sequences using our randomized N-best approach, score them with the neural LM, then compute the final scores with the n-gram LM and neural LM scores interpolated 50-50 or something like that. [Note: converting the word sequences into word-piece sequences is very easy: we can just do it by indexing a ragged tensor and then removing an axis.]
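A rough sketch of those two steps, using plain Python lists in place of ragged tensors. Everything named here (`words_to_pieces`, `rescore`, `nn_weight`, the toy IDs and scores) is hypothetical. It also assumes that the scores are log-probabilities, that "interpolated" means a weighted sum in the log domain, and that the acoustic score is added unscaled; none of that is settled above.

```python
def words_to_pieces(word_ids, pieces_of_word):
    """Map a word-ID sequence to a flat word-piece-ID sequence."""
    # Indexing gives a ragged [word][piece] structure...
    ragged = [pieces_of_word[w] for w in word_ids]
    # ...and removing the word axis flattens it to a single piece sequence.
    return [p for pieces in ragged for p in pieces]


def rescore(nbest, nn_weight=0.5):
    """Pick the best hypothesis from an N-best list.

    nbest: list of (words, am_score, ngram_logprob, nn_logprob) tuples.
    nn_weight: share given to the neural LM; 0.5 is the 50-50 case.
    """
    def total(hyp):
        _, am, ngram, nn = hyp
        return am + (1.0 - nn_weight) * ngram + nn_weight * nn

    return max(nbest, key=total)


# Toy word-piece table: word ID -> list of piece IDs.
pieces_of_word = {0: [5], 1: [7, 8], 2: [9, 10, 11]}
pieces = words_to_pieces([1, 0, 2], pieces_of_word)

# Two toy hypotheses with made-up scores.
nbest = [
    (["a", "b"], -10.0, -4.0, -6.0),
    (["a", "c"], -10.5, -5.0, -3.0),
]
best = rescore(nbest)
print(pieces, best[0])
```

The weight would in practice be tuned on a dev set rather than fixed at 0.5.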
Dan