README.md (2 additions & 2 deletions)
@@ -63,7 +63,7 @@ WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summa
 * **Lazy imports**: Faster startup time thanks to lazy_import
 * **LLM (and embeddings) caching**: speed things up, as well as index storing and loading (handy for large collections).
 * **Sophisticated faiss saver**: [faiss](https://github.com/facebookresearch/faiss/wiki) is used to quickly find the documents that match an embedding. But instead of storing as a single file, WDoc splits the index into 1 document long index identified by deterministic hashes. When creating a new index, any overlapping document will be automatically reloaded instead of recomputed.
-* **Good PDF parsing** PDF parsers are notoriously unreliable, so 10 (!) different loaders are used, and the best according to a parsing scorer is kept. Including table support via [openparse](https://github.com/Filimoa/open-parse/) (no GPU needed by default)
+* **Good PDF parsing** PDF parsers are notoriously unreliable, so 15 (!) different loaders are used, and the best according to a parsing scorer is kept. Including table support via [openparse](https://github.com/Filimoa/open-parse/) (no GPU needed by default) or via [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/).
 * **Document filtering**: based on regex for document content or metadata.
 * **Fast**: Parallel document loading, parsing, embeddings, querying, etc.
 * **Shell autocompletion** using [python-fire](https://github.com/google/python-fire/blob/master/docs/using-cli.md#completion-flag)
@@ -91,7 +91,7 @@ WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summa
 * **auto**: default, guess the filetype for you
 * **url**: try many ways to load a webpage, with heuristics to find the better parsed one
 * **youtube**: text is then either from the yt subtitles / translation or even better: using whisper / deepgram
-* **pdf**: About 10 loaders are implemented, heuristics are used to keep the best one and stop early. Table support via [openparse](https://github.com/Filimoa/open-parse/)
+* **pdf**: 15 default loaders are implemented, heuristics are used to keep the best one and stop early. Table support via [openparse](https://github.com/Filimoa/open-parse/) or [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/). Easy to add more.
 * **online_pdf**: via URL then treated as **local_pdf**
 * **anki**: any subset of an [anki](https://github.com/ankitects/anki) collection db. `alt` and `title` of images can be shown to the LLM, meaning that if you used [the ankiOCR addon](https://github.com/cfculhane/AnkiOCR) this information will help contextualize the note for the LLM.
 * **string**: the cli prompts you for a text so you can easily paste something, handy for paywalled articles!
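
The changed README lines describe trying several PDF loaders, scoring each result, keeping the best one and stopping early. The sketch below illustrates that general idea only; it is not WDoc's code. The names `load_best` and `score_text`, the `good_enough` threshold, and the toy scoring rule are all hypothetical and chosen for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a loader takes a file path and returns the extracted text.
Loader = Callable[[str], str]


def score_text(text: str) -> float:
    """Toy parsing score in [0, 1]: share of characters that are
    alphanumeric or whitespace. A real scorer would be more elaborate."""
    if not text:
        return 0.0
    readable = sum(ch.isalnum() or ch.isspace() for ch in text)
    return readable / len(text)


def load_best(
    path: str,
    loaders: List[Tuple[str, Loader]],
    good_enough: float = 0.95,  # assumed early-stop threshold
) -> Tuple[str, str]:
    """Try each loader in order, keep the highest-scoring text,
    and stop as soon as one result reaches `good_enough`."""
    best_name, best_text, best_score = None, "", -1.0
    for name, loader in loaders:
        try:
            text = loader(path)
        except Exception:
            # A failing loader is skipped; the next one gets a chance.
            continue
        score = score_text(text)
        if score > best_score:
            best_name, best_text, best_score = name, text, score
        if best_score >= good_enough:
            break  # early stop: no need to run the remaining loaders
    if best_name is None:
        raise ValueError(f"no loader could parse {path!r}")
    return best_name, best_text


if __name__ == "__main__":
    # Stand-in loaders that ignore the path and return canned text.
    demo_loaders = [
        ("noisy", lambda path: "T#x@t w!th %% n0ise"),
        ("clean", lambda path: "Text without noise"),
    ]
    print(load_best("example.pdf", demo_loaders)[0])
```

Under these assumptions, adding a loader means appending one `(name, callable)` pair to the list, which is in the spirit of the README's "Easy to add more".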