Commit c14058e

update README.md
1 parent 5ae0131 commit c14058e

1 file changed

Lines changed: 28 additions & 9 deletions

README.md

@@ -5,10 +5,6 @@ The original paper about GE2E loss could be found here: [Generalized End-to-End
 
 ## Usage
 
-You can download the pretrained models from: [Wiki - Pretrained Models](https://github.com/yistLin/dvector/wiki/Pretrained-Models).
-
-Since the models are compiled with TorchScript, you can simply load and use a pretrained d-vector anywhere.
-
 ```python
 import torch
 import torchaudio
@@ -17,10 +13,28 @@ wav2mel = torch.jit.load("wav2mel.pt")
 dvector = torch.jit.load("dvector.pt").eval()
 
 wav_tensor, sample_rate = torchaudio.load("example.wav")
-mel_tensor = wav2mel(wav_tensor, sample_rate)
-emb_tensor = dvector.embed_utterance(mel_tensor)
+mel_tensor = wav2mel(wav_tensor, sample_rate)  # shape: (frames, mel_dim)
+emb_tensor = dvector.embed_utterance(mel_tensor)  # shape: (emb_dim)
+```
+
+You can also embed multiple utterances of a speaker at once:
+
+```python
+emb_tensor = dvector.embed_utterances([mel_tensor_1, mel_tensor_2])  # shape: (emb_dim)
 ```
 
+There are 2 modules in this example:
+- `wav2mel.pt` is the preprocessing module, which is composed of 2 modules:
+    - `sox_effects.pt` is used to normalize volume, remove silence, resample audio to 16 kHz, 16 bits, and remix all channels to a single channel
+    - `log_melspectrogram.pt` is used to transform waveforms to log mel spectrograms
+- `dvector.pt` is the speaker encoder
+
+Since all the modules are compiled with [TorchScript](https://pytorch.org/docs/stable/jit.html), you can simply load them and use them anywhere **without any dependencies**.
+
+### Pretrained models & preprocessing modules
+
+You can download them from the [*Releases*](https://github.com/yistLin/dvector/releases) page.
+
 ## Train from scratch
 
 ### Preprocess training data
@@ -38,12 +52,12 @@ python preprocess.py VoxCeleb1/dev LibriSpeech/train-clean-360 -o preprocessed
 ```
 
 If you need to modify some audio preprocessing hyperparameters, directly modify `data/wav2mel.py`.
-After preprocessing, 3 modules will be saved in the output directory:
+After preprocessing, 3 preprocessing modules will be saved in the output directory:
 1. `wav2mel.pt`
 2. `sox_effects.pt`
 3. `log_melspectrogram.pt`
 
-> The first module `wav2mel.pt` is actually composed of the second and the third modules.
+> The first module `wav2mel.pt` is composed of the second and the third modules.
 > These modules were compiled with TorchScript and can be used anywhere to preprocess audio data.
 
 ### Train a model
@@ -57,7 +71,12 @@ python train.py preprocessed <model_dir>
 During training, logs will be put under `<model_dir>/logs` and checkpoints will be placed under `<model_dir>/checkpoints`.
 For more details, check the usage with `python train.py -h`.
 
-### Visualize speaker embeddings
+### Use different speaker encoders
+
+By default I'm using a 3-layer LSTM with attentive pooling as the speaker encoder, but you can use a speaker encoder with a different architecture.
+For more information, please take a look at `modules/dvector.py`.
+
+## Visualize speaker embeddings
 
 You can visualize speaker embeddings using a trained d-vector.
 Note that you have to structure speakers' directories in the same way as for preprocessing.
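
A note on the README text in this diff: `wav2mel.pt` is described as being composed of `sox_effects.pt` and `log_melspectrogram.pt`. The sketch below illustrates only that two-stage composition, using plain Python stand-ins rather than the real TorchScript modules. The class name `Wav2Mel`, both toy functions, and their call signatures are assumptions made for illustration and are not taken from the repository.

```python
from typing import Callable, List

Waveform = List[float]
Frames = List[List[float]]


class Wav2Mel:
    """Hypothetical stand-in for wav2mel.pt: runs two stages in sequence."""

    def __init__(
        self,
        sox_effects: Callable[[Waveform, int], Waveform],
        log_melspectrogram: Callable[[Waveform], Frames],
    ) -> None:
        self.sox_effects = sox_effects
        self.log_melspectrogram = log_melspectrogram

    def __call__(self, wav: Waveform, sample_rate: int) -> Frames:
        # Stage 1: clean up the waveform; stage 2: turn it into frames.
        cleaned = self.sox_effects(wav, sample_rate)
        return self.log_melspectrogram(cleaned)


def toy_sox_effects(wav: Waveform, sample_rate: int) -> Waveform:
    # Toy stage: peak-normalizes the volume only. The real module is also
    # described as removing silence, resampling, and remixing channels.
    peak = max((abs(x) for x in wav), default=0.0)
    return [x / peak for x in wav] if peak > 0 else list(wav)


def toy_log_melspectrogram(wav: Waveform, frame_size: int = 4) -> Frames:
    # Toy stage: chops the waveform into fixed-size frames. This is NOT a
    # real log mel spectrogram; it only mimics the (frames, dim) layout.
    last_start = len(wav) - frame_size + 1
    return [wav[i:i + frame_size] for i in range(0, last_start, frame_size)]


wav2mel = Wav2Mel(toy_sox_effects, toy_log_melspectrogram)
frames = wav2mel([0.5, -1.0, 0.25, 0.0, 2.0, 1.0, -0.5, 0.0], 16000)
```

Keeping the two stages as separate callables mirrors the reason given in the README for saving three files: either stage can be reused on its own, while the combined module covers the common case.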
