This repository was archived by the owner on Oct 13, 2022. It is now read-only.

Low hanging fruit: neural language model #132

@danpovey

Guys,

I realized that there is some very low-hanging fruit that could easily make our WERs state of the art: neural LM rescoring. An advantage of our framework-- possibly the key advantage-- is that the decoding part is very easy, so we can easily rescore large N-best lists with neural LMs.
In addition, it's quite easy to manipulate variable-length sequences, so things like training and using LMs should be a little easier than they otherwise would be.

Here's what I propose: as a relatively easy baseline that can be extended later, we can train a word-piece neural LM (I recommend word-pieces because the vocab size could otherwise be quite large, making the embedding matrices difficult to train). So we'll need:
(i) some mechanism to split up words into word-pieces,
(ii) data preparation for the LM training, which in the Librispeech case would, I assume, include the additional text training data that Librispeech comes with,
(iii) a script to train the actual LM. I assume this would be quite similar to our conformer self-attention model, with a cross-entropy (xent) output (no forward-backward needed), except we'd use a different type of masking, i.e. a mask of size (B, T, T), because we need to limit attention to left-context only.
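
To make the (B, T, T) masking in (iii) concrete, here is a minimal sketch in plain Python. The function name `causal_mask` and the convention that `True` means "may attend" are assumptions for illustration only. Some attention implementations use the inverse convention, so the mask may need to be negated before use.

```python
from typing import List


def causal_mask(lengths: List[int]) -> List[List[List[bool]]]:
    """Build a (B, T, T) left-context mask for a padded batch.

    mask[b][i][j] is True when position i of sequence b may attend to
    position j, i.e. when j is not in the future (j <= i) and j is a real
    token rather than padding (j < lengths[b]).
    Assumption: True means "allowed"; invert it if your attention layer
    expects True to mean "masked out".
    """
    T = max(lengths)
    return [
        [[j <= i and j < n for j in range(T)] for i in range(T)]
        for n in lengths
    ]


# Batch of two sequences with lengths 3 and 2, padded to T = 3.
mask = causal_mask([3, 2])
```

Rows that correspond to padding positions still see the real tokens of their sequence, so no row is fully masked. The outputs at those positions would be ignored by the loss anyway.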

For decoding with the LM, we'd first do a decoding with our n-gram LM, get word sequences using our randomized N-best approach, score those sequences with the neural LM, then compute the final scores with the n-gram LM and neural LM scores interpolated 50-50 or something like that. [Note: converting the word sequences into word-piece sequences is very easy, we can just do it by indexing a ragged tensor and then removing an axis.]
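
The rescoring step above can be sketched roughly as follows, using plain Python lists in place of ragged tensors. Everything named here (`WORD_TO_PIECES`, `words_to_pieces`, `rescore`, `toy_neural_lm`) is hypothetical, and the scores are made-up numbers. The sketch also assumes that each hypothesis carries separate acoustic and n-gram log-scores, and that "interpolated 50-50" means a weighted sum of log-scores; interpolating probabilities would be a different choice.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical lexicon: row w holds the word-piece ids of word w.
# Rows have different lengths, which is what makes the table "ragged".
WORD_TO_PIECES: List[List[int]] = [
    [0],        # word 0
    [1, 2],     # word 1
    [3],        # word 2
    [4, 5, 6],  # word 3
]

# (word ids, acoustic log-score, n-gram LM log-score)
Hypothesis = Tuple[List[int], float, float]


def words_to_pieces(word_ids: Sequence[int], table: List[List[int]]) -> List[int]:
    # Indexing the ragged table gives one list of pieces per word ...
    per_word = [table[w] for w in word_ids]
    # ... and removing that axis leaves a flat word-piece sequence.
    return [p for pieces in per_word for p in pieces]


def rescore(
    nbest: Sequence[Hypothesis],
    neural_lm_score: Callable[[List[int]], float],
    table: List[List[int]],
    neural_weight: float = 0.5,
) -> List[Tuple[float, List[int]]]:
    """Return (total score, word ids) pairs, best hypothesis first."""
    rescored = []
    for word_ids, am_score, ngram_score in nbest:
        pieces = words_to_pieces(word_ids, table)
        # Assumption: log-linear interpolation of the two LM scores.
        lm_score = ((1.0 - neural_weight) * ngram_score
                    + neural_weight * neural_lm_score(pieces))
        rescored.append((am_score + lm_score, word_ids))
    return sorted(rescored, key=lambda item: item[0], reverse=True)


def toy_neural_lm(pieces: List[int]) -> float:
    # Stand-in for a real neural LM: one unit of cost per word-piece.
    return -float(len(pieces))


nbest: List[Hypothesis] = [
    ([0, 1], -10.0, -4.0),
    ([2, 3], -10.5, -3.0),
]
best_score, best_words = rescore(nbest, toy_neural_lm, WORD_TO_PIECES)[0]
```

In a real setup the toy scorer would be replaced by a batched forward pass of the trained LM over all N-best entries, and `neural_weight` would be tuned on a dev set.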

Dan
