Commit 9156d1b

update the number of pdf parser in the readme
1 parent ba0ec47 commit 9156d1b

1 file changed

Lines changed: 2 additions & 2 deletions

README.md

@@ -63,7 +63,7 @@ WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summa
 * **Lazy imports**: Faster statup time thanks to lazy_import
 * **LLM (and embeddings) caching**: speed things up, as well as index storing and loading (handy for large collections).
 * **Sophisticated faiss saver**: [faiss](https://github.com/facebookresearch/faiss/wiki) is used to quickly find the documents that match an embedding. But instead of storing as a single file, WDoc splits the index into 1 document long index identified by deterministic hashes. When creating a new index, any overlapping document will be automatically reloaded instead of recomputed.
-* **Good PDF parsing** PDF parsers are notoriously unreliable, so 10 (!) different loaders are used, and the best according to a parsing scorer is kept. Including table support via [openparse](https://github.com/Filimoa/open-parse/) (no GPU needed by default)
+* **Good PDF parsing** PDF parsers are notoriously unreliable, so 15 (!) different loaders are used, and the best according to a parsing scorer is kept. Including table support via [openparse](https://github.com/Filimoa/open-parse/) (no GPU needed by default) or via [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/).
 * **Document filtering**: based on regex for document content or metadata.
 * **Fast**: Parallel document loading, parsing, embeddings, querying, etc.
 * **Shell autocompletion** using [python-fire](https://github.com/google/python-fire/blob/master/docs/using-cli.md#completion-flag)
@@ -91,7 +91,7 @@ WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summa
 * **auto**: default, guess the filetype for you
 * **url**: try many ways to load a webpage, with heuristics to find the better parsed one
 * **youtube**: text is then either from the yt subtitles / translation or even better: using whisper / deepgram
-* **pdf**: About 10 loaders are implemented, heuristics are used to keep the best one and stop early. Table support via [openparse](https://github.com/Filimoa/open-parse/)
+* **pdf**: 15 default loaders are implemented, heuristics are used to keep the best one and stop early. Table support via [openparse](https://github.com/Filimoa/open-parse/) or [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/). Easy to add more.
 * **online_pdf**: via URL then treated at **local_pdf**
 * **anki**: any subset of an [anki](https://github.com/ankitects/anki) collection db. `alt` and `title` of images can be shown to the LLM, meaning that if you used [the ankiOCR addon](https://github.com/cfculhane/AnkiOCR) this information will help contextualize the note for the LLM.
 * **string**: the cli prompts you for a text so you can easily paste something, handy for paywalled articles!
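
The changed **pdf** lines describe trying several loaders, keeping the best result according to a parsing scorer, and stopping early. A minimal sketch of that general pattern follows. It is not WDoc's actual code: `score_text`, `best_parse`, the `good_enough` threshold and the three toy loaders are all hypothetical names chosen for illustration, and the scorer is a deliberately naive stand-in.

```python
from typing import Callable, Iterable, Optional, Tuple

# A loader takes a file path and returns the extracted text (hypothetical signature).
Loader = Callable[[str], str]


def score_text(text: str) -> float:
    """Toy scorer (an assumption): share of characters that are alphanumeric or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def best_parse(
    path: str, loaders: Iterable[Loader], good_enough: float = 0.95
) -> Tuple[Optional[str], float]:
    """Run loaders in order, keep the highest-scoring text, stop once one is good enough."""
    best_text: Optional[str] = None
    best_score = -1.0
    for loader in loaders:
        try:
            text = loader(path)
        except Exception:
            # A failing loader is skipped so the remaining ones still get a chance.
            continue
        score = score_text(text)
        if score > best_score:
            best_text, best_score = text, score
        if best_score >= good_enough:
            break  # early stop: no need to run the remaining loaders
    return best_text, best_score


# Toy loaders standing in for real PDF parsers.
def broken_loader(path: str) -> str:
    raise RuntimeError("cannot parse")


def noisy_loader(path: str) -> str:
    return "t#e$x%t"


def clean_loader(path: str) -> str:
    return "clean text"


text, score = best_parse("doc.pdf", [broken_loader, noisy_loader, clean_loader])
print(text, score)
```

Ordering the loaders from cheapest to most expensive makes the early stop more useful, since slow parsers only run when the fast ones score poorly.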
