Commit d92203c

Merge branch 'master' into s3-model-fix
2 parents 26fdf99 + 9d7b3b9 commit d92203c

84 files changed

Lines changed: 407 additions & 344 deletions

README.md

Lines changed: 20 additions & 15 deletions
@@ -296,14 +296,15 @@ textattack attack --model bert-base-uncased-sst2 --recipe textfooler --num-examp
296296
### Augmenting Text: `textattack augment`
297297

298298
Many of the components of TextAttack are useful for data augmentation. The `textattack.Augmenter` class
299-
uses a transformation and a list of constraints to augment data. We also offer five built-in recipes
299+
uses a transformation and a list of constraints to augment data. We also offer built-in recipes
300300
for data augmentation:
301-
- `textattack.WordNetAugmenter` augments text by replacing words with WordNet synonyms
302-
- `textattack.EmbeddingAugmenter` augments text by replacing words with neighbors in the counter-fitted embedding space, with a constraint to ensure their cosine similarity is at least 0.8
303-
- `textattack.CharSwapAugmenter` augments text by substituting, deleting, inserting, and swapping adjacent characters
304-
- `textattack.EasyDataAugmenter` augments text with a combination of word insertions, substitutions and deletions.
305-
- `textattack.CheckListAugmenter` augments text by contraction/extension and by substituting names, locations, numbers.
306-
- `textattack.CLAREAugmenter` augments text by replacing, inserting, and merging with a pre-trained masked language model.
301+
- `wordnet` augments text by replacing words with WordNet synonyms
302+
- `embedding` augments text by replacing words with neighbors in the counter-fitted embedding space, with a constraint to ensure their cosine similarity is at least 0.8
303+
- `charswap` augments text by substituting, deleting, inserting, and swapping adjacent characters
304+
- `eda` augments text with a combination of word insertions, substitutions and deletions.
305+
- `checklist` augments text by contraction/extension and by substituting names, locations, numbers.
306+
- `clare` augments text by replacing, inserting, and merging with a pre-trained masked language model.
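The hunk above swaps the class names in the recipe list for short recipe names. As a reading aid, the pairing can be written down as a small lookup table. This is a minimal sketch built only from the removed and added lines; `RECIPE_TO_CLASS` and `class_name_for` are hypothetical names for illustration and are not part of TextAttack.

```python
# Pairing read off the removed (-) and added (+) README lines above.
# Hypothetical helper for illustration only; not a TextAttack API.
RECIPE_TO_CLASS = {
    "wordnet": "WordNetAugmenter",
    "embedding": "EmbeddingAugmenter",
    "charswap": "CharSwapAugmenter",
    "eda": "EasyDataAugmenter",
    "checklist": "CheckListAugmenter",
    "clare": "CLAREAugmenter",
}


def class_name_for(recipe: str) -> str:
    """Return the class name the old README listed for a short recipe name."""
    try:
        return RECIPE_TO_CLASS[recipe]
    except KeyError:
        known = ", ".join(sorted(RECIPE_TO_CLASS))
        raise ValueError(f"unknown recipe {recipe!r}; expected one of: {known}") from None


for recipe, cls in RECIPE_TO_CLASS.items():
    print(f"{recipe:10s} -> {cls}")
```

The table has six entries, which matches the README dropping the word "five" from the sentence that introduces the list.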
307+
307308

308309
#### Augmentation Command-Line Interface
309310
The easiest way to use our data augmentation tools is with `textattack augment <args>`. `textattack augment`
@@ -380,24 +381,23 @@ automatically loaded using the `datasets` package.
380381
#### Training Examples
381382
*Train our default LSTM for 50 epochs on the Yelp Polarity dataset:*
382383
```bash
383-
textattack train --model lstm --dataset yelp_polarity --batch-size 64 --epochs 50 --learning-rate 1e-5
384+
textattack train --model-name-or-path lstm --dataset yelp_polarity --epochs 50 --learning-rate 1e-5
384385
```
385386

386-
The training process has data augmentation built-in:
387-
```bash
388-
textattack train --model lstm --dataset rotten_tomatoes --augment eda --pct-words-to-swap .1 --transformations-per-example 4
389-
```
390-
This uses the `EasyDataAugmenter` recipe to augment the `rotten_tomatoes` dataset before training.
391387

392388
*Fine-tune `bert-base` on the `CoLA` dataset for 5 epochs:*
393389
```bash
394-
textattack train --model bert-base-uncased --dataset glue^cola --batch-size 32 --epochs 5
390+
textattack train --model-name-or-path bert-base-uncased --dataset glue^cola --per-device-train-batch-size 8 --epochs 5
395391
```
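For scripted runs, the two updated commands can be assembled as argument lists instead of hand-written strings. The sketch below uses only the flags that appear in the added lines of this hunk (`--model-name-or-path`, `--dataset`, `--per-device-train-batch-size`, `--epochs`, `--learning-rate`). The function `build_train_command` is a hypothetical helper, and nothing here launches TextAttack; it only builds and prints the command.

```python
import shlex
from typing import List, Optional


def build_train_command(
    model: str,
    dataset: str,
    epochs: int,
    learning_rate: Optional[str] = None,
    per_device_train_batch_size: Optional[int] = None,
) -> List[str]:
    """Build a `textattack train` argv using only flags shown in the diff."""
    args = ["textattack", "train", "--model-name-or-path", model, "--dataset", dataset]
    if per_device_train_batch_size is not None:
        args += ["--per-device-train-batch-size", str(per_device_train_batch_size)]
    args += ["--epochs", str(epochs)]
    if learning_rate is not None:
        args += ["--learning-rate", learning_rate]
    return args


lstm_cmd = build_train_command("lstm", "yelp_polarity", 50, learning_rate="1e-5")
bert_cmd = build_train_command(
    "bert-base-uncased", "glue^cola", 5, per_device_train_batch_size=8
)

# shlex.join adds shell quoting where needed (for example around `glue^cola`).
print(shlex.join(lstm_cmd))
print(shlex.join(bert_cmd))
```

Passing a list like `lstm_cmd` to `subprocess.run` avoids shell quoting issues with dataset names such as `glue^cola`.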
396392

397393

398394
### To check datasets: `textattack peek-dataset`
399395

400-
To take a closer look at a dataset, use `textattack peek-dataset`. TextAttack will print some cursory statistics about the inputs and outputs from the dataset. For example, `textattack peek-dataset --dataset-from-huggingface snli` will show information about the SNLI dataset from the NLP package.
396+
To take a closer look at a dataset, use `textattack peek-dataset`. TextAttack will print some cursory statistics about the inputs and outputs from the dataset. For example,
397+
```bash
398+
textattack peek-dataset --dataset-from-huggingface snli
399+
```
400+
will show information about the SNLI dataset from the Hugging Face `datasets` package (formerly `nlp`).
401401

402402

403403
### To list functional components: `textattack list`
@@ -547,6 +547,11 @@ A `SearchMethod` takes as input an initial `GoalFunctionResult` and returns a fi
547547

548548
## Multi-lingual Support
549549

550+
551+
- See the example code at [https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py](https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py) for using our framework to attack French-BERT.
552+
553+
- See the tutorial notebook at [https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html) for using our framework to attack French-BERT.
554+
550555
- See [README_ZH.md](https://github.com/QData/TextAttack/blob/master/README_ZH.md) for our README in Chinese
551556

552557

README_ZH.md

Lines changed: 10 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -292,11 +292,14 @@ textattack attack --model bert-base-uncased-sst2 --recipe textfooler --num-examp
292292
### Augmenting Text Data: `textattack augment`
293293

294294
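Many of TextAttack's components are easy-to-use data augmentation tools. The `textattack.Augmenter` class uses a *transformation* and a list of *constraints* to augment data. We provide 5 built-in data augmentation recipes:
295-
- `textattack.WordNetAugmenter` augments text by replacing words with WordNet synonyms
296-
- `textattack.EmbeddingAugmenter` augments text by replacing words with neighbors in the counter-fitted embedding space, with the constraint that their cosine similarity is at least 0.8
297-
- `textattack.CharSwapAugmenter` augments text by inserting, deleting, and substituting characters, and by swapping adjacent characters
298-
- `textattack.EasyDataAugmenter` augments text by inserting, deleting, and substituting words
299-
- `textattack.CheckListAugmenter` augments text by contraction, extension, and substitution of entities, locations, and numbers
295+
- `wordnet` augments text by replacing words with WordNet synonyms
296+
- `embedding` augments text by replacing words with neighbors in the counter-fitted embedding space, with the constraint that their cosine similarity is at least 0.8
297+
- `charswap` augments text by inserting, deleting, and substituting characters, and by swapping adjacent characters
298+
- `eda` augments text by inserting, deleting, and substituting words
299+
- `checklist` augments text by contraction, extension, and substitution of entities, locations, and numbers
300+
- `clare` uses a pre-trained masked language model to augment text by inserting, deleting, and substituting words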
301+
302+
300303

301304
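#### Augmentation Command-Line Interface
302305
The quickest way to augment data with textattack is the `textattack augment <args>` command-line interface. `textattack augment` takes a CSV file as input, with arguments that specify the text column to augment, the percentage of each example that may be changed, and the number of augmented examples to generate per input example. The output is saved as a CSV file in the same format as the input and contains the augmented examples generated for the specified text column.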
@@ -362,18 +365,13 @@ it's a enigma how the filmmaking wo be publicized in this condition .,0
362365
#### Training Examples
363366
*Train TextAttack's default LSTM model for 50 epochs on the Yelp classification dataset:*
364367
```bash
365-
textattack train --model lstm --dataset yelp_polarity --batch-size 64 --epochs 50 --learning-rate 1e-5
368+
textattack train --model-name-or-path lstm --dataset yelp_polarity --epochs 50 --learning-rate 1e-5
366369
```
367370

368-
The training interface also has data augmentation built in:
369-
```bash
370-
textattack train --model lstm --dataset rotten_tomatoes --augment eda --pct-words-to-swap .1 --transformations-per-example 4
371-
```
372-
The example above uses the `EasyDataAugmenter` recipe to augment the `rotten_tomatoes` dataset before training.
373371

374372
*Fine-tune the `bert-base` model on the `CoLA` dataset for 5 epochs:*
375373
```bash
376-
textattack train --model bert-base-uncased --dataset glue^cola --batch-size 32 --epochs 5
374+
textattack train --model-name-or-path bert-base-uncased --dataset glue^cola --per-device-train-batch-size 8 --epochs 5
377375
```
378376

379377

docs/1start/FAQ.md

Lines changed: 10 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -42,22 +42,25 @@ conda activate textattackenv
4242
conda env list
4343
```
4444

45+
If you want to use the most up-to-date version of TextAttack (which normally includes newer bug fixes), you can install it from source:
46+
```bash
47+
git clone https://github.com/QData/TextAttack.git
48+
cd TextAttack
49+
pip install .[dev]
50+
```
51+
52+
4553
### 1. How to Train
4654

4755
For example, you can *train our default LSTM for 50 epochs on the Yelp Polarity dataset:*
4856
```bash
49-
textattack train --model lstm --dataset yelp_polarity --batch-size 64 --epochs 50 --learning-rate 1e-5
57+
textattack train --model-name-or-path lstm --dataset yelp_polarity --epochs 50 --learning-rate 1e-5
5058
```
5159

52-
The training process has data augmentation built-in:
53-
```bash
54-
textattack train --model lstm --dataset rotten_tomatoes --augment eda --pct-words-to-swap .1 --transformations-per-example 4
55-
```
56-
This uses the `EasyDataAugmenter` recipe to augment the `rotten_tomatoes` dataset before training.
5760

5861
*Fine-tune `bert-base` on the `CoLA` dataset for 5 epochs:*
5962
```bash
60-
textattack train --model bert-base-uncased --dataset glue^cola --batch-size 32 --epochs 5
63+
textattack train --model-name-or-path bert-base-uncased --dataset glue^cola --per-device-train-batch-size 8 --epochs 5
6164
```
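Across this FAQ hunk, the old examples use `--model` and `--batch-size`, while the new ones use `--model-name-or-path` and `--per-device-train-batch-size`. The sketch below applies that renaming to an old-style argument list. The mapping is an assumption inferred from the before and after lines, not a documented migration path. Note that the example batch size also changes (32 to 8), the LSTM example drops the batch-size flag entirely, and the `--augment` example is removed rather than renamed. `migrate_train_args` is a hypothetical helper.

```python
from typing import List, Tuple

# Assumed renames, inferred from the removed (-) and added (+) example lines.
RENAMED_FLAGS = {
    "--model": "--model-name-or-path",
    "--batch-size": "--per-device-train-batch-size",
}

# Flags that only appear in the example this commit deletes; they are left
# untouched and reported so a human can decide what to do with them.
NO_LONGER_SHOWN = {"--augment", "--pct-words-to-swap", "--transformations-per-example"}


def migrate_train_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Rename old-style flags and list flags the updated docs no longer show."""
    migrated = [RENAMED_FLAGS.get(arg, arg) for arg in argv]
    flagged = [arg for arg in argv if arg in NO_LONGER_SHOWN]
    return migrated, flagged


old = ["textattack", "train", "--model", "bert-base-uncased",
       "--dataset", "glue^cola", "--batch-size", "32", "--epochs", "5"]
new, flagged = migrate_train_args(old)
print(new)
print("needs manual review:", flagged)
```

The helper only renames flags and keeps their values, so the batch size still needs a manual check against the per-device setting you actually want.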
6265

6366

docs/1start/api-design-tips.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,24 @@ Lessons learned in designing TextAttack
77

88
TextAttack is an open-source Python toolkit for adversarial attacks, adversarial training, and data augmentation in NLP. TextAttack unites 15+ papers from the NLP adversarial attack literature into a single shared framework, with many components reused across attacks. This framework allows both researchers and developers to test and study the weaknesses of their NLP models.
99

10+
11+
## Presentations on TextAttack
12+
13+
### 2020: Jack Morris' summary tutorial talk on TextAttack
14+
15+
- On Jul 31, 2020, Jack Morris gave an invited talk at the Weights & Biases research salon on "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP"
16+
17+
- [YouTube talk link](https://www.youtube.com/watch?v=22Q3f7Fb110)
18+
19+
20+
### 2021: Dr. Qi's summary tutorial talk on TextAttack
21+
22+
- On April 14, 2021, Prof. Qi gave an invited talk at the UVA Human and Machine Intelligence Seminar on "Generalizing Adversarial Examples to Natural Language Processing"
23+
24+
- [Talk slides](https://qdata.github.io/qdata-page/pic/20210414-HMI-textAttack.pdf)
25+
26+
27+
1028
## Challenges in Design
1129

1230

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
TextAttack Extended Functions (Multilingual)
2+
============================================
3+
4+
5+
6+
## Multilingual Support
7+
8+
- See the example code at [https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py](https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py) for using our framework to attack French-BERT.
9+
10+
- See the tutorial notebook at [https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html) for using our framework to attack French-BERT.
11+
12+
13+
14+
## We have built a new WebDemo for visualizing TextAttack-generated examples
15+
16+
- [TextAttack-WebDemo GitHub](https://github.com/QData/TextAttack-WebDemo)

docs/1start/talks-visualization.md

Lines changed: 0 additions & 22 deletions
This file was deleted.
