
Commit 90201ed

docs: mention how to add new pdf parsers
1 parent 9156d1b commit 90201ed

1 file changed

Lines changed: 5 additions & 0 deletions

README.md

@@ -248,6 +248,11 @@ WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summa
 * **What should I do if my PDFs are encrypted?**
     * If you're on Linux, you can try running `qpdf --decrypt input.pdf output.pdf`
     * I made a quick and dirty batch script for this [in this repo](https://github.com/thiswillbeyourgithub/PDF_batch_decryptor)
+* **How can I add my own PDF parser?**
+    * Write a Python class and register it like this: `WDoc.utils.loaders.pdf_loaders['parser_name']=parser_object`, then call WDoc with `--pdf_parsers=parser_name`.
+    * The class must take a `path` argument in `__init__` and have a `load` method that takes
+    no arguments and returns a `List[Document]`. Take a look at the `OpenparseDocumentParser`
+    class for an example.

 ## Notes
 * Before summarizing, if the estimated cost is above $5, the app will abort to be safe, just in case you drop a few bibles in there. (Note: tokens to embed are counted with the OpenAI tokenizer, which is not universal.)
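
The parser contract described in the added README lines (a `path` argument in `__init__`, a no-argument `load` returning a `List[Document]`) can be sketched roughly as follows. This is a minimal illustration, not WDoc's own code. The `Document` dataclass is a stand-in for whatever `Document` type WDoc expects (assumed here to carry `page_content` and `metadata`). `FormFeedTextParser` and the key `formfeed_text` are hypothetical names. The local `pdf_loaders` dict only mimics `WDoc.utils.loaders.pdf_loaders`. The sketch also assumes the registry stores the class itself rather than an instance, which the README does not spell out.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Union


@dataclass
class Document:
    """Stand-in for the Document type named in the README (assumed shape)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class FormFeedTextParser:
    """Hypothetical parser: reads a text file whose pages are separated by form feeds."""

    def __init__(self, path: Union[str, Path]) -> None:
        # The README requires a `path` argument in __init__.
        self.path = Path(path)

    def load(self) -> List[Document]:
        # The README requires `load` to take no argument and return List[Document].
        text = self.path.read_text(encoding="utf-8", errors="replace")
        pages = [page.strip() for page in text.split("\f")]
        return [
            Document(page_content=page, metadata={"source": str(self.path), "page": i})
            for i, page in enumerate(pages)
            if page  # skip empty pages
        ]


# Stand-in for WDoc.utils.loaders.pdf_loaders; with WDoc installed, the real
# registry would be used instead of this local dict.
pdf_loaders: Dict[str, type] = {}
pdf_loaders["formfeed_text"] = FormFeedTextParser
```

With the real registry in place of the local dict, the parser would then be selected with `--pdf_parsers=formfeed_text`, following the pattern the README gives.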
