Update the number of PDF parsers in the README

thiswillbeyourgithub · thiswillbeyourgithub · commit 9156d1bf614e · 2024-09-26T15:49:20.000+02:00
diff --git a/README.md b/README.md
@@ -63,7 +63,7 @@ WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summa
 * **Lazy imports**: Faster statup time thanks to lazy_import
 * **LLM (and embeddings) caching**: speed things up, as well as index storing and loading (handy for large collections).
 * **Sophisticated faiss saver**: [faiss](https://github.com/facebookresearch/faiss/wiki) is used to quickly find the documents that match an embedding. But instead of storing as a single file, WDoc splits the index into 1 document long index identified by deterministic hashes. When creating a new index, any overlapping document will be automatically reloaded instead of recomputed.
-* **Good PDF parsing** PDF parsers are notoriously unreliable, so 10 (!) different loaders are used, and the best according to a parsing scorer is kept. Including table support via [openparse](https://github.com/Filimoa/open-parse/) (no GPU needed by default)
+* **Good PDF parsing**: PDF parsers are notoriously unreliable, so 15 (!) different loaders are used, and the best one according to a parsing scorer is kept. Table support is included via [openparse](https://github.com/Filimoa/open-parse/) (no GPU needed by default) or via [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/).
 * **Document filtering**: based on regex for document content or metadata.
 * **Fast**: Parallel document loading, parsing, embeddings, querying, etc.
 * **Shell autocompletion** using [python-fire](https://github.com/google/python-fire/blob/master/docs/using-cli.md#completion-flag)
@@ -91,7 +91,7 @@ WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summa
 * **auto**: default, guess the filetype for you
 * **url**: try many ways to load a webpage, with heuristics to find the better parsed one
 * **youtube**: text is then either from the yt subtitles / translation or even better: using whisper / deepgram
-* **pdf**: About 10 loaders are implemented, heuristics are used to keep the best one and stop early. Table support via [openparse](https://github.com/Filimoa/open-parse/)
+* **pdf**: 15 default loaders are implemented; heuristics are used to keep the best one and stop early. Table support via [openparse](https://github.com/Filimoa/open-parse/) or [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/). It is easy to add more.
 * **online_pdf**: via URL then treated at **local_pdf**
 * **anki**: any subset of an [anki](https://github.com/ankitects/anki) collection db. `alt` and `title` of images can be shown to the LLM, meaning that if you used [the ankiOCR addon](https://github.com/cfculhane/AnkiOCR) this information will help contextualize the note for the LLM.
 * **string**: the cli prompts you for a text so you can easily paste something, handy for paywalled articles!
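
The "try several loaders, score each parse, keep the best one and stop early" idea described in the README lines above can be sketched roughly as follows. This is only an illustration and not WDoc's actual implementation: `score_text`, `pick_best_parse`, the `good_enough` threshold, and the three toy loaders are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical loader signature: takes a PDF path, returns the extracted text.
Loader = Callable[[str], str]


def score_text(text: str) -> float:
    """Toy parsing score: share of characters that are alphanumeric or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def pick_best_parse(
    path: str,
    loaders: List[Tuple[str, Loader]],
    scorer: Callable[[str], float] = score_text,
    good_enough: float = 0.95,  # assumed threshold, not taken from WDoc
) -> Tuple[Optional[str], str, float]:
    """Run loaders in order, keep the best-scoring parse, stop early if good enough."""
    best_name, best_text, best_score = None, "", -1.0
    for name, loader in loaders:
        try:
            text = loader(path)
        except Exception:
            # A failing parser is simply skipped; the others still get a chance.
            continue
        score = scorer(text)
        if score > best_score:
            best_name, best_text, best_score = name, text, score
        if best_score >= good_enough:
            break  # early stop: no need to run the remaining loaders
    return best_name, best_text, best_score


# Stand-in loaders that only mimic typical parser behaviour.
def broken_loader(path: str) -> str:
    raise RuntimeError("parser crashed")


def noisy_loader(path: str) -> str:
    return "H#e@l!l%o"


def clean_loader(path: str) -> str:
    return "Hello world"


if __name__ == "__main__":
    demo_loaders = [
        ("broken", broken_loader),
        ("noisy", noisy_loader),
        ("clean", clean_loader),
    ]
    print(pick_best_parse("example.pdf", demo_loaders))
```

Ordering the loaders from cheapest to most expensive would let the early stop skip the slow parsers whenever a fast one already produces a clean result; whether WDoc orders its loaders this way is not stated in the diff.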