See the [List of Supported Models](#list-of-supported-models) section for all available models.

## Usage
> More examples can be found in the `examples` and `tests` folders.

`torch-molecule` supports applications across broad domains, from chemistry and biology to materials science. To get started, you can load prepared datasets from `torch_molecule.dataset` (available after v0.1.3):

| Dataset | Description | Function |
|---------|-------------|----------|
| qm9 | Quantum chemical properties (DFT level) | `load_qm9` |
| chembl2k | Bioactive molecules with drug-like properties | `load_chembl2k` |
| broad6k | Bioactive molecules with drug-like properties | `load_broad6k` |
| toxcast | Toxicity of chemical compounds | `load_toxcast` |
| admet | Chemical absorption, distribution, metabolism, excretion, and toxicity | `load_admet` |
| gasperm | Six gas permeability properties for polymeric materials | `load_gasperm` |

```python
from torch_molecule.dataset import load_qm9

# local_dir is the local path where the dataset will be saved (the path shown is illustrative)
# The return values below are assumed from how they are used later in this section
smiles_list, property_np_array = load_qm9(local_dir="data/qm9")
```

(We welcome suggestions and contributions of new datasets!)

### Fit a Model
After preparing the dataset, we can fit a model much as we would in scikit-learn (in fact with less code, since scikit-learn would still require feature engineering to convert molecule SMILES into vectors):

```python
from torch_molecule import GREAMolecularPredictor

split = int(0.8 * len(smiles_list))
num_task = property_np_array.shape[1]  # number of prediction tasks: one per property column

grea = GREAMolecularPredictor(
    num_task=num_task,
    task_type="regression",
    evaluate_higher_better=False,
    verbose=True
)

# Fit with automatic hyperparameter tuning (10 trials); alternatively,
# call .fit() with default or manual hyperparameters
grea.autofit(
    X_train=smiles_list[:split],
    y_train=property_np_array[:split],
    X_val=smiles_list[split:],
    y_val=property_np_array[split:],
    n_trials=10,
)
```
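
If you prefer to skip the hyperparameter search, here is a minimal sketch of the manual route, assuming the sklearn-style `fit`/`predict` interface described above (the exact `predict` return format is an assumption; check the package documentation):

```python
# Minimal sketch, assuming the sklearn-style fit/predict interface described above
grea = GREAMolecularPredictor(num_task=num_task, task_type="regression")
grea.fit(smiles_list[:split], property_np_array[:split])

# Predict properties for held-out SMILES strings
predictions = grea.predict(smiles_list[split:])  # return format may vary; see the docs
```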
### Checkpoints
`torch-molecule` provides checkpoint functions that interoperate with Hugging Face:
```python
from torch_molecule import GREAMolecularPredictor

# Define the repository ID for Hugging Face
repo_id = "user/repo_id"

# Initialize the GREAMolecularPredictor model
model = GREAMolecularPredictor()

# Train the model using autofit
model.autofit(
    X_train=smiles_list[:split],        # list of SMILES strings for training
    y_train=property_np_array[:split],  # numpy array [n_samples, n_tasks] of training labels
    X_val=smiles_list[split:],          # list of SMILES strings for validation
    y_val=property_np_array[split:],    # numpy array [n_samples, n_tasks] of validation labels
)
```
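
The trained checkpoint can then be pushed to and restored from the Hub. Below is a hedged sketch: the `save_to_hf` and `load_from_hf` method names are assumptions inferred from the description above, not a confirmed API, so consult the package documentation for the exact calls:

```python
# Hedged sketch: save_to_hf / load_from_hf are assumed method names
# Push the trained checkpoint to the Hugging Face Hub
model.save_to_hf(repo_id=repo_id, commit_message="Upload GREA model")

# Restore the checkpoint from the Hub into a fresh predictor
model = GREAMolecularPredictor()
model.load_from_hf(repo_id=repo_id, local_cache="GREA_model.pt")
```
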
| Model | Reference |
|-------|-----------|
| EdgePred | [Strategies for Pre-training Graph Neural Networks. ICLR 2020](https://arxiv.org/abs/1905.12265) |
| InfoGraph | [InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. ICLR 2020](https://arxiv.org/abs/1908.01000) |
| Supervised | Supervised pretraining |
| Pretrained | [GPT2-ZINC-87M](https://huggingface.co/entropy/gpt2_zinc_87m): GPT-2 based model (87M parameters) pretrained on the ZINC dataset with ~480M SMILES strings. <br> [RoBERTa-ZINC-480M](https://huggingface.co/entropy/roberta_zinc_480m): RoBERTa based model (102M parameters) pretrained on the ZINC dataset with ~480M SMILES strings. <br> [UniKi/bert-base-smiles](https://huggingface.co/unikei/bert-base-smiles): BERT model pretrained on SMILES strings. <br> [ChemBERTa-zinc-base-v1](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1): RoBERTa model pretrained on the ZINC dataset with ~100k SMILES strings. <br> ChemBERTa series, available in multiple sizes and training objectives (MLM/MTR): [ChemBERTa-5M-MLM](https://huggingface.co/DeepChem/ChemBERTa-5M-MLM), [ChemBERTa-5M-MTR](https://huggingface.co/DeepChem/ChemBERTa-5M-MTR), [ChemBERTa-10M-MLM](https://huggingface.co/DeepChem/ChemBERTa-10M-MLM), [ChemBERTa-10M-MTR](https://huggingface.co/DeepChem/ChemBERTa-10M-MTR), [ChemBERTa-77M-MLM](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM), [ChemBERTa-77M-MTR](https://huggingface.co/DeepChem/ChemBERTa-77M-MTR). <br> ChemGPT series: GPT-Neo based models pretrained on the PubChem10M dataset with SELFIES strings: [ChemGPT-1.2B](https://huggingface.co/ncfrey/ChemGPT-1.2B), [ChemGPT-4.7M](https://huggingface.co/ncfrey/ChemGPT-4.7M), [ChemGPT-19M](https://huggingface.co/ncfrey/ChemGPT-19M). |
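
For orientation, these pretrained checkpoints can also be loaded directly with the `transformers` library, independently of `torch-molecule`; a minimal sketch (the checkpoint choice is just an example):

```python
from transformers import AutoModel, AutoTokenizer

# Load one of the pretrained checkpoints listed above (example choice)
tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")
model = AutoModel.from_pretrained("entropy/roberta_zinc_480m")

# Embed a SMILES string (benzene)
inputs = tokenizer("C1=CC=CC=C1", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```
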
<!--## Project Structure
See the structure of `torch_molecule` with the command `tree -L 2 torch_molecule -I '__pycache__|*.pyc|*.pyo|.git|old*'`-->
<!--## Plan
1. **Predictive Models**: Done: GREA, SGIR, IRM, GIN/GCN w/ virtual, DIR. SMILES-based LSTM/Transformers. TODO more