Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks (Gururangan et al., ACL 2020)
DaptDataset expects largely the same dataset format as that used by
ClinicalNlpDataset. The main restriction is that there
should be a text column; datasets with text_a and text_b columns
will not be accepted.
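The column restriction can be checked before launching a pretraining run. The sketch below is not part of cnlpt: `check_dapt_columns` and `read_header` are hypothetical helpers, and the example assumes a tab-separated file with a header row.

```python
import csv
import io


def check_dapt_columns(header):
    """Raise ValueError unless the header suits single-sequence pretraining.

    Hypothetical helper, not part of cnlpt. It only encodes the rule stated
    above: a `text` column is required, and `text_a`/`text_b` are rejected.
    """
    columns = set(header)
    if {"text_a", "text_b"} & columns:
        raise ValueError("text_a/text_b columns are not accepted")
    if "text" not in columns:
        raise ValueError("a text column is required")


def read_header(handle, delimiter="\t"):
    """Return the first row of a delimited file as a list of column names."""
    return next(csv.reader(handle, delimiter=delimiter))


# Example with an in-memory TSV; the file content is illustrative only.
sample = io.StringIO("text\nPatient denies chest pain.\n")
check_dapt_columns(read_header(sample))  # passes silently
```

A sentence-pair file (with a `text_a`/`text_b` header) would raise `ValueError` here, mirroring the restriction described above.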
Use cnlpt.dapt for domain-adaptive pretraining on an existing encoder.
$ python -m cnlpt.dapt --help
usage: dapt.py [-h] [--encoder_name ENCODER_NAME]
               [--config_name CONFIG_NAME]
               [--tokenizer_name TOKENIZER_NAME]
               [--output_dir OUTPUT_DIR]
               [--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]]
               [--data_dir DATA_DIR] [--cache_dir CACHE_DIR]
               [--chunk_size CHUNK_SIZE]
               [--mlm_probability MLM_PROBABILITY]
               [--test_size TEST_SIZE] [--seed SEED]
               [--no_eval [NO_EVAL]]

optional arguments:
  -h, --help            show this help message and exit
  --encoder_name ENCODER_NAME
                        Path to pretrained model or model
                        identifier from huggingface.co/models
                        (default: roberta-base)
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the
                        same as model_name (default: None)
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not
                        the same as model_name (default: None)
  --output_dir OUTPUT_DIR
                        Directory path to write trained model to.
                        (default: None)
  --overwrite_output_dir [OVERWRITE_OUTPUT_DIR]
                        Overwrite the content of the output
                        directory. Use this to continue training if
                        output_dir points to a checkpoint
                        directory. (default: False)
  --data_dir DATA_DIR   The data dir for domain-adaptive
                        pretraining. (default: None)
  --cache_dir CACHE_DIR
                        Where do you want to store the pretrained
                        models downloaded from s3 (default: None)
  --chunk_size CHUNK_SIZE
                        The chunk size for domain-adaptive
                        pretraining. (default: 128)
  --mlm_probability MLM_PROBABILITY
                        The token masking probability for domain-
                        adaptive pretraining. (default: 0.15)
  --test_size TEST_SIZE
                        The test split proportion for domain-
                        adaptive pretraining. (default: 0.2)
  --seed SEED           The random seed to use for a train/test
                        split for domain-adaptive pretraining
                        (requires --dapt-encoder). (default: 42)
  --no_eval [NO_EVAL]   Don't split into train and test; just
                        pretrain. (default: False)

Running cnlpt.dapt saves the adapted encoder to disk at --output_dir;
that path can then be passed to train_system as --encoder_name.
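For intuition about what --chunk_size, --mlm_probability, --test_size, and --seed control, here is a minimal pure-Python sketch of typical masked-language-modeling preprocessing. It illustrates the general technique and is not cnlpt's actual code: the function names are hypothetical, the trailing partial chunk is dropped (a common choice that cnlpt may or may not make), and real MLM collators usually also leave some selected tokens unchanged or swap in random tokens.

```python
import random


def chunk_tokens(token_ids, chunk_size=128):
    """Split one long token stream into fixed-size chunks, dropping the remainder."""
    usable = len(token_ids) - len(token_ids) % chunk_size
    return [token_ids[i:i + chunk_size] for i in range(0, usable, chunk_size)]


def split_chunks(chunks, test_size=0.2, seed=42):
    """Shuffle chunks with a fixed seed and hold out a test_size fraction."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def mask_chunk(chunk, mask_id, mlm_probability=0.15, rng=None):
    """Replace each token with mask_id with probability mlm_probability.

    Returns (inputs, labels). Labels hold the original token at masked
    positions and -100 elsewhere, the value that PyTorch's cross-entropy
    loss ignores by default.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for token in chunk:
        if rng.random() < mlm_probability:
            inputs.append(mask_id)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(-100)
    return inputs, labels


tokens = list(range(1000, 2000))      # stand-in for 1000 token ids
chunks = chunk_tokens(tokens)         # 7 chunks of 128 tokens; 104 tokens dropped
train, test = split_chunks(chunks)    # 6 training chunks, 1 held-out chunk
inputs, labels = mask_chunk(train[0], mask_id=0, rng=random.Random(42))
```

Fixing the seed makes the train/test split reproducible across runs, which matters when comparing adapted encoders trained with different settings.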
The common idiom is to run cnlpt.dapt on unlabeled data from your task
dataset, then run train_system on a labeled dataset of in-domain data.
To evaluate how much this idiom helps, you can pretrain on an
artificially unlabeled dataset (labeled data with the labels removed)
and then evaluate the fine-tuned classifier produced by train_system
on the labeled portion of your task dataset.
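The two-step idiom can be sketched as a pair of command lines. The snippet below only builds and prints the commands, it does not run them. It assumes train_system is invoked as `python -m cnlpt.train_system` (by analogy with cnlpt.dapt), and every path is a hypothetical placeholder. Only flags documented above are used; the task and data arguments that train_system itself needs are omitted and should be taken from its own help text.

```python
import shlex

# Hypothetical paths; replace with your own directories.
UNLABELED_DIR = "data/unlabeled"
ADAPTED_ENCODER_DIR = "models/roberta-dapt"

# Step 1: domain-adaptive pretraining, using flags from the help text above.
dapt_cmd = [
    "python", "-m", "cnlpt.dapt",
    "--encoder_name", "roberta-base",
    "--data_dir", UNLABELED_DIR,
    "--output_dir", ADAPTED_ENCODER_DIR,
]

# Step 2: fine-tuning that starts from the adapted encoder written by step 1.
# The module path is an assumption; task and data arguments are left out.
train_cmd = [
    "python", "-m", "cnlpt.train_system",
    "--encoder_name", ADAPTED_ENCODER_DIR,
]

print(shlex.join(dapt_cmd))
print(shlex.join(train_cmd))
```

The key point is that the --output_dir of the first command and the --encoder_name of the second are the same path.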