Commit 881aad9

committed
init
1 parent 6760727 commit 881aad9

11 files changed

Lines changed: 7172 additions & 2 deletions

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
MIT License
22

3-
Copyright (c) 2019 Malte
3+
Copyright (c) 2019 Malte Ostendorff
44

55
Permission is hereby granted, free of charge, to any person obtaining a copy
66
of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 126 additions & 1 deletion
@@ -1,2 +1,127 @@
1-
# pytorch-bert-document-classification
1+
# PyTorch BERT Document Classification
22
Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)
3+
4+
Content:
5+
- CLI script
6+
- author embeddings
7+
- projector files
8+
- create file
9+
- data preparation
10+
- requirements
11+
- trained model weights as release zips
12+
13+
## Installation
14+
15+
Requirements:
16+
- Python 3.6
17+
- CUDA GPU
18+
- Jupyter Notebook
19+
20+
Install dependencies:
21+
```
22+
pip install -r requirements.txt
23+
```
24+
25+
## Prepare data
26+
27+
### GermEval data
28+
29+
- Download from shared-task website: [here](https://competitions.codalab.org/competitions/20139)
30+
- Run all steps in Jupyter Notebook: [germeval-data.ipynb](#)
31+
32+
### Author Embeddings
33+
34+
- [Download pre-trained Wikidata embedding (30GB): Facebook PyTorch-BigGraph](https://github.com/facebookresearch/PyTorch-BigGraph#pre-trained-embeddings)
35+
- [Download WikiMapper index files (de+en)](https://github.com/jcklie/wikimapper#precomputed-indices)
36+
37+
```
38+
python wikidata_for_authors.py run ~/datasets/wikidata/index_enwiki-20190420.db \
39+
~/datasets/wikidata/index_dewiki-20190420.db \
40+
~/datasets/wikidata/torchbiggraph/wikidata_translation_v1.tsv.gz \
41+
~/notebooks/bert-text-classification/authors.pickle \
42+
~/notebooks/bert-text-classification/author2embedding.pickle
43+
44+
# OPTIONAL: Projector format
45+
python wikidata_for_authors.py convert_for_projector \
46+
~/notebooks/bert-text-classification/author2embedding.pickle \
47+
extras/author2embedding.projector.tsv \
48+
extras/author2embedding.projector_meta.tsv
49+
50+
```
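The optional projector step above produces two TSV files. As a rough illustration of what such files usually look like for the TensorFlow Embedding Projector, here is a minimal sketch. It is not the code in `wikidata_for_authors.py`; it assumes that `author2embedding.pickle` holds a dict mapping an author name to a sequence of floats, and the function name `write_projector_files` is hypothetical.

```python
import csv
import pickle


def write_projector_files(author2embedding, vectors_path, meta_path):
    """Write one tab-separated vector per line and one label per line.

    Assumption: `author2embedding` is a dict of author name -> sequence of floats.
    """
    with open(vectors_path, "w", newline="", encoding="utf-8") as vec_f, \
            open(meta_path, "w", newline="", encoding="utf-8") as meta_f:
        vec_writer = csv.writer(vec_f, delimiter="\t", lineterminator="\n")
        for author, vector in author2embedding.items():
            # Row i of the vectors file corresponds to line i of the metadata file.
            vec_writer.writerow([float(x) for x in vector])
            meta_f.write(f"{author}\n")


def load_author_embeddings(pickle_path):
    """Load the pickled mapping (assumed to be a plain dict)."""
    with open(pickle_path, "rb") as f:
        return pickle.load(f)
```

A call such as `write_projector_files(load_author_embeddings("author2embedding.pickle"), "vectors.tsv", "meta.tsv")` would then write both files; the file names here are placeholders.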
51+
52+
53+
## Reproduce paper results
54+
55+
56+
Download pre-trained models: [GitHub releases](https://github.com/malteos/pytorch-bert-document-classification/releases)
57+
58+
59+
### Available experiment settings
60+
61+
Detailed settings for each experiment can be found in `cli.py`.
62+
63+
```
64+
task-a__bert-german_full
65+
task-a__bert-german_manual_no-embedding
66+
task-a__bert-german_no-manual_embedding
67+
task-a__bert-german_text-only
68+
task-a__author-only
69+
task-a__bert-multilingual_text-only
70+
71+
task-b__bert-german_full
72+
task-b__bert-german_manual_no-embedding
73+
task-b__bert-german_no-manual_embedding
74+
task-b__bert-german_text-only
75+
task-b__author-only
76+
task-b__bert-multilingual_text-only
77+
```
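The names listed above appear to follow a `<task>__<setting>` pattern, with a double underscore separating the shared-task subtask from the model configuration. Assuming that reading is right (it is inferred from the list, not from `cli.py`), a small helper like the hypothetical `split_experiment_name` below can split a name into its two parts, for example to group result files by task.

```python
def split_experiment_name(name):
    """Split an experiment name at the first double underscore.

    Assumption: names have the form '<task>__<setting>', as in the list above.
    """
    task, separator, setting = name.partition("__")
    if not separator or not task or not setting:
        raise ValueError(f"Unexpected experiment name: {name!r}")
    return task, setting


if __name__ == "__main__":
    for experiment in ["task-a__bert-german_full", "task-b__author-only"]:
        print(split_experiment_name(experiment))
```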
78+
79+
### Environment variables
80+
81+
- `TRAIN_DF_PATH`: Path to Pandas Dataframe (pickle)
82+
- `GPU_ID`: Run experiments on this GPU (used for `CUDA_VISIBLE_DEVICES`)
83+
- `OUTPUT_DIR`: Directory to store experiment output
84+
- `EXTRAS_DIR`: Directory where author embeddings and [gender data](https://data.world/howarder/gender-by-name) are located
85+
- `BERT_MODELS_DIR`: Directory where pre-trained BERT models are located
86+
87+
### Validation set
88+
89+
```
90+
python cli.py run_on_val <name> $GPU_ID $EXTRAS_DIR $TRAIN_DF_PATH $VAL_DF_PATH $OUTPUT_DIR --epochs 5
91+
```
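The command above relies on several shell variables, and an unset one silently becomes an empty argument. The following sketch assembles the same argument list in Python and fails early if a variable is missing. The helper name `build_val_command` is hypothetical, and note that `VAL_DF_PATH` is used by the command but is not described in the variable list above, so its meaning (a pickled validation DataFrame) is an assumption.

```python
import os
import shlex

# Positional arguments in the order used by the `run_on_val` command above.
REQUIRED_VARS = ["GPU_ID", "EXTRAS_DIR", "TRAIN_DF_PATH", "VAL_DF_PATH", "OUTPUT_DIR"]


def build_val_command(name, env=None, epochs=5):
    """Return the argument list for `python cli.py run_on_val ...`."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_VARS if not env.get(key)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return (
        ["python", "cli.py", "run_on_val", name]
        + [env[key] for key in REQUIRED_VARS]
        + ["--epochs", str(epochs)]
    )


def as_shell_string(command):
    """Quote each argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(part) for part in command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root, or printed with `as_shell_string` for inspection first.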
92+
93+
### Test set
94+
95+
```
96+
python cli.py run_on_test <name> $GPU_ID $EXTRAS_DIR $FULL_DF_PATH $TEST_DF_PATH $OUTPUT_DIR --epochs 5
97+
```
98+
99+
### Evaluation
100+
101+
The scores from the result table can be reproduced with the `evaluation.ipynb` notebook.
102+
103+
## How to cite
104+
105+
If you are using our code, please cite our paper:
106+
```
107+
@article{,
108+
title={},
109+
author={},
110+
journal={arXiv preprint arXiv:},
111+
year={2019}
112+
}
113+
114+
```
115+
116+
## References
117+
118+
- [Google BERT Tensorflow](https://github.com/google-research/bert)
119+
- [Huggingface PyTorch Transformer](https://github.com/huggingface/pytorch-transformers)
120+
- [Deepset AI - BERT-german](https://deepset.ai/german-bert)
121+
- [Facebook PyTorch BigGraph](https://github.com/facebookresearch/PyTorch-BigGraph)
122+
123+
## License
124+
125+
MIT
126+
127+
