Commit c74f33d

Author: Patrick Emami
Touching up documentation
1 parent ad8958b commit c74f33d

3 files changed

Lines changed: 21 additions & 18 deletions

docs/getting_started/Trying_out_EvoProtGrad.md

Lines changed: 6 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -7,9 +7,9 @@ import evo_prot_grad
 Create a `ProtBERT` expert from a pretrained 🤗 HuggingFace protein language model (PLM) using `evo_prot_grad.get_expert`:

 ```python
-prot_bert_expert = evo_prot_grad.get_expert('bert', temperature = 1.0, device = 'cuda')
+prot_bert_expert = evo_prot_grad.get_expert('bert', scoring_strategy = 'mutant_marginal', temperature = 1.0, device = 'cuda')
 ```
-The default BERT-style PLM in `EvoProtGrad` is `Rostlab/prot_bert`. Normally, we would need to also specify the model and tokenizer. When using a default PLM expert, we automatically pull these from the HuggingFace Hub. The temperature parameter rescales the expert scores and can be used to trade off the importance of different experts. For masked language models like `prot_bert`, we score variant sequences with the sum of amino acid log probabilities by default.
+The default BERT-style PLM in `EvoProtGrad` is `Rostlab/prot_bert`. Normally, we would need to also specify the model and tokenizer. When using a default PLM expert, we automatically pull these from the HuggingFace Hub. The temperature parameter rescales the expert scores and can be used to trade off the importance of different experts. For protein language models like `prot_bert`, we have implemented two scoring strategies: `pseudolikelihood_ratio` and `mutant_marginal`. The `pseudolikelihood_ratio` strategy computes the ratio of the "pseudo" log-likelihood (this isn't the exact log-likelihood when the protein language model is a *masked* language model) of the wild type and mutant sequence.

 Then, we create an instance of `DirectedEvolution` and run the search, returning a list of the best variant per Markov chain (as measured by the `prot_bert` expert):
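
Note on the `pseudolikelihood_ratio` strategy described in the hunk above: the following is a minimal sketch of the idea, not EvoProtGrad's implementation. It assumes the per-position log-probabilities have already been obtained from a language model and are passed in as plain tables; the function names and the sign convention (mutant minus wild type, i.e. the log of the ratio) are illustrative assumptions.

```python
import math


def pseudo_log_likelihood(sequence, position_log_probs):
    """Sum the log-probability assigned to the observed residue at each position."""
    if len(sequence) != len(position_log_probs):
        raise ValueError("sequence and log-prob table must have the same length")
    return sum(table[residue] for residue, table in zip(sequence, position_log_probs))


def pseudolikelihood_ratio_score(wild_type, mutant, wt_log_probs, mutant_log_probs):
    """Log of the pseudo-likelihood ratio (assumed convention: mutant minus wild type)."""
    return (pseudo_log_likelihood(mutant, mutant_log_probs)
            - pseudo_log_likelihood(wild_type, wt_log_probs))


# Toy tables over a two-letter alphabet; a real PLM would supply these values.
uniform = {"A": math.log(0.5), "G": math.log(0.5)}
prefers_g = {"A": math.log(0.2), "G": math.log(0.8)}
wt_table = [uniform, prefers_g, uniform]
mutant_table = [uniform, prefers_g, uniform]

score = pseudolikelihood_ratio_score("AAA", "AGA", wt_table, mutant_table)
```

In this toy case the two sequences differ only at the second position, so the score reduces to `log(0.8 / 0.2)`. With a real masked language model the tables for the wild type and the mutant would generally differ, because each position's distribution depends on the surrounding sequence.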

@@ -29,7 +29,7 @@ This class implements PPDE, the gradient-based discrete MCMC sampler introduced

 ### Specifying the model and tokenizer

-To load a HuggingFace expert with a specific model and tokenizer, provide them as arguments to `get_expert`:
+To load a HuggingFace expert with a specific model and tokenizer, provide them as arguments to [`evo_prot_grad.get_expert`](https://nrel.github.io/EvoProtGrad/api/evo_prot_grad/#get_expert):

 ```python
 from transformers import AutoTokenizer, EsmForMaskedLM
@@ -38,6 +38,7 @@ esm2_expert = evo_prot_grad.get_expert(
     'esm',
     model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D"),
     tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D"),
+    scoring_strategy = 'mutant_marginal',
     temperature = 1.0,
     device = 'cuda')
 ```
@@ -51,13 +52,14 @@ You can compose multiple experts by passing multiple experts to `DirectedEvoluti
 import evo_prot_grad
 from transformers import AutoModel

-prot_bert_expert = evo_prot_grad.get_expert('bert', temperature = 1.0, device = 'cuda')
+prot_bert_expert = evo_prot_grad.get_expert('bert', scoring_strategy = 'mutant_marginal', temperature = 1.0, device = 'cuda')

 # onehot_downstream_regression are experts that predict a downstream scalar property
 # from a one-hot encoding of the protein sequence
 fluorescence_expert = evo_prot_grad.get_expert(
     'onehot_downstream_regression',
     temperature = 1.0,
+    scoring_strategy = 'attribute_value',
     model = AutoModel.from_pretrained('NREL/avGFP-fluorescence-onehot-cnn',
         trust_remote_code=True),
     device = 'cuda')

docs/getting_started/experts.md

Lines changed: 13 additions & 14 deletions
@@ -33,33 +33,33 @@ $$
 \log P(X) = \log F(X) + \lambda \log G(X) - \log Z.
 $$

-In `EvoProtGrad`, the "score" of each expert corresponds to either $\log F(X)$ or $\log G(X)$ here. In most cases, we interpret the scalar output of a neural network as the score.
+In `EvoProtGrad`, the **score**"** of each expert corresponds to either $\log F(X)$ or $\log G(X)$ here. In most cases, we interpret the scalar output of a neural network as the score.
 The magic of the Product of Experts formulation is that it enables us to compose arbitrary numbers of experts, essentially allowing us to "plug and play" with different experts to guide the search.

 In actuality, instead of just searching for a protein variant that maximizes $P(X)$, `EvoProtGrad` uses gradient-based discrete MCMC to *sample* from $P(X)$.
 MCMC is necessary for sampling from $P(X)$ because it is impractical to compute the partition function $Z$ exactly.
 Uniquely to `EvoProtGrad`, as long *all* experts are *differentiable*, our sampler can use the gradient of $\log F(X) + \lambda \log G(X)$ with respect to the one-hot protein $X$ to identify the most promising mutation to apply to $X$, which vastly speeds up MCMC convergence.

-## 🤗 HuggingFace Transformers
+## 🤗 HuggingFace Protein Language Models (PLMs)

 `EvoProtGrad` provides a convenient interface for defining and using experts from the HuggingFace Hub.
-To use pretrained PLMs from the HuggingFace Hub with gradient-based discrete MCMC, we swap out each transformer's token embedding layer for a custom [one-hot token embedding layer](https://nrel.github.io/EvoProtGrad/api/common/embeddings/#onehotembedding). This enables us to compute and access gradients with respect to one-hot input protein sequences.
+In detail, we modify pretrained PLMs from the HuggingFace Hub to use with gradient-based discrete MCMC by hot-swapping the Transformer's token embedding layer for a custom [one-hot embedding layer](https://nrel.github.io/EvoProtGrad/api/common/embeddings/#onehotembedding). This enables us to compute and access gradients with respect to one-hot protein sequences.

-We provide a baseclass `evo_prot_grad.experts.base_experts.ProteinLMExpert` which is subclassed to support various types of HuggingFace PLMs. Currently, we provide three subclasses for
+We provide a baseclass `evo_prot_grad.experts.base_experts.ProteinLMExpert` which can be subclassed to support various types of HuggingFace PLMs. Currently, we provide three subclasses for

 - BERT-style PLMs (`evo_prot_grad.experts.bert_expert.BertExpert`)
 - CausalLM-style PLMs (`evo_prot_grad.experts.causallm_expert.CausalLMExpert`)
 - ESM-style PLMs (`evo_prot_grad.experts.esm_expert.EsmExpert`)

-Each HuggingFace PLM expert has to specify the model and tokenizer to use. Defaults for each type of PLM are provided.
+To instantiate EvoProtGrad ProteinLMExperts, we provide a simple function [`evo_prot_grad.get_expert`](https://nrel.github.io/EvoProtGrad/api/evo_prot_grad/#get_expert). The name of the expert, the variant scoring strategy, and the temperature for scaling the expert score must be provided. We provide defaults for the other arguments to `get_expert`.

-For example, an ESM2 expert can be instantiated with `evo_prot_grad.get_expert` with only:
+For example, an ESM2 expert can be instantiated with:

 ```python
-esm2_expert = evo_prot_grad.get_expert('esm', temperature = 1.0, device = 'cuda')
+esm2_expert = evo_prot_grad.get_expert('esm', scoring_strategy = 'mutant_marginal', temperature = 1.0, device = 'cuda')
 ```

-using the default model `EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")` and tokenizer `AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")`.
+which uses the default ESM2 model `EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")` and tokenizer `AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")`.

 To load the ESM2 expert with a specific model and tokenizer, provide them as arguments to `get_expert`:
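
Note on the Product of Experts objective quoted at the top of this hunk: the sketch below is a toy illustration of $\log P(X) = \log F(X) + \lambda \log G(X) - \log Z$, not code from the library. The two expert functions, the two-letter alphabet, and all names are hypothetical. It computes $Z$ by brute-force enumeration, which works only because the toy space has 8 sequences; with 20 amino acids the space has $20^L$ sequences for length $L$, which is why the docs call the exact computation impractical.

```python
import itertools
import math

ALPHABET = "AG"  # toy alphabet; real proteins use 20 amino acids


def log_f(seq):
    """Hypothetical unsupervised expert: rewards each 'G' in the sequence."""
    return 0.5 * seq.count("G")


def log_g(seq):
    """Hypothetical supervised expert: rewards an 'A' at the first position."""
    return 1.0 if seq[0] == "A" else 0.0


def unnormalized_log_p(seq, lam=1.0):
    """log F(X) + lambda * log G(X), i.e. log P(X) without the -log Z term."""
    return log_f(seq) + lam * log_g(seq)


def log_partition(length, lam=1.0):
    """Exact log Z by enumeration, using log-sum-exp for numerical stability."""
    scores = [unnormalized_log_p("".join(s), lam)
              for s in itertools.product(ALPHABET, repeat=length)]
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


length = 3
log_z = log_partition(length)
probs = {"".join(s): math.exp(unnormalized_log_p("".join(s)) - log_z)
         for s in itertools.product(ALPHABET, repeat=length)}
total = sum(probs.values())
best = max(probs, key=probs.get)
```

Here `total` sums to 1 and `best` is the sequence that satisfies both experts at once. A sampler only ever needs differences of `unnormalized_log_p`, so the `log Z` term cancels and never has to be computed.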

@@ -70,6 +70,7 @@ esm2_expert = evo_prot_grad.get_expert(
     'esm',
     model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D"),
     tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D"),
+    scoring_strategy = 'mutant_marginal',
     temperature = 1.0,
     device = 'cuda')
 ```
@@ -94,6 +95,7 @@ evcouplings_model = EVCouplings(
 evcouplings_expert = get_expert(
     'evcouplings',
     temperature = 1.0,
+    scoring_strategy = 'attribute_value',
     model = evcouplings_model)
 ```

@@ -110,6 +112,7 @@ onehotcnn_model = AutoModel.from_pretrained(
 regression_expert = get_expert(
     'onehot_downstream_regression',
     temperature = 1.0,
+    scoring_strategy = 'attribute_value',
     model = onehotcnn_model)
 ```

@@ -125,16 +128,12 @@ onehotcnn_model.load_state_dict(torch.load('onehotcnn.pt'))
 regression_expert = get_expert(
     'onehot_downstream_regression',
     temperature = 1.0,
+    scoring_strategy = 'attribute_value',
     model = onehotcnn_model)
 ```

 ## Choosing the Expert Temperature

 The expert temperature $\lambda$ controls the relative importance of the expert in the Product of Experts. By default it is set to 1.
-
-!!! note
-
-By default, the expert's score for a variant is normalized by the wild type score, i.e., we subtract the wild type score from the variant score. This is to ensure that each expert in the Product of Experts is centered around 0. If you want to use the raw expert score, set `use_without_wildtype = True` when instantiating the expert.
-
-If using wild type centering, then we recommend first trying temperatures of 1.0 for each $\lambda$.
+We recommend first trying temperatures of 1.0 for each $\lambda$, and checking whether each expert score is within the same order of magnitude as the other experts. If one expert's scores are much larger or smaller than the others, you may need to adjust the temperature to balance the experts.
 We describe a simple heuristic for selecting $\lambda$ via a grid search using a small dataset of labeled variants in Section 5.2 of our [paper](https://doi.org/10.1088/2632-2153/accacd).
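
Note on the order-of-magnitude check recommended in the added line above: one possible way to run it is sketched below. This is an illustrative helper, not part of EvoProtGrad. The function names, the use of the median absolute score as the scale summary, and the toy score values are all assumptions.

```python
import math
import statistics


def typical_magnitude(scores):
    """Median absolute score, a robust summary of an expert's scale."""
    return statistics.median([abs(s) for s in scores])


def log10_gaps(expert_scores, reference):
    """Return log10(scale of each expert / scale of the reference expert)."""
    ref_mag = typical_magnitude(expert_scores[reference])
    gaps = {}
    for name, scores in expert_scores.items():
        mag = typical_magnitude(scores)
        if mag == 0 or ref_mag == 0:
            raise ValueError("cannot compare experts with an all-zero score scale")
        gaps[name] = math.log10(mag / ref_mag)
    return gaps


# Made-up scores for a handful of variants, collected with every temperature at 1.0.
toy_scores = {
    "prot_bert": [-2.1, -0.4, 1.3],
    "fluorescence": [35.0, 41.5, 28.0],
}
gaps = log10_gaps(toy_scores, reference="prot_bert")

# Experts at least one order of magnitude away from the reference.
needs_attention = [name for name, gap in gaps.items() if abs(gap) >= 1.0]
```

An expert that lands in `needs_attention` is a candidate for a different temperature. The sketch does not prescribe how to choose the new value; the grid search from the paper cited above remains the documented way to do that.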

evo_prot_grad/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ def get_expert(expert_name: str,
     expert_name = 'esm',
     model = EsmForMaskedLM.from_pretrained("facebook/esm2_t36_3B_UR50D"),
     tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t36_3B_UR50D"),
+    scoring_strategy = 'mutant_marginal',
+    temperature = 1.0,
     device = 'cuda'
 )
 ```
