thiswillbeyourgithub
diff --git a/README.md b/README.md
Lines changed: 39 additions & 22 deletions
diff --git a/WDoc/WDoc.py b/WDoc/WDoc.py
Lines changed: 23 additions & 22 deletions
diff --git a/WDoc/utils/loaders.py b/WDoc/utils/loaders.py
Lines changed: 8 additions & 3 deletions
@@ -5,27 +5,40 @@
 
 # WDoc
 
-* **Goal and project specifications** use [LangChain](https://python.langchain.com/) to summarize, search or query documents. I'm a medical student so I need to be able to query from **tens of thousands** of documents, of different types ([Supported filetypes](#Supported-filetypes)). I also have little free time so I needed a tailor made summary feature to keep up with the news.
-* **Current status**: **still under development**.
-    * I use it almost daily and have been for months now.
-    * Expect some breakage but they can be fixed usually in a few minutes if you open an issue here.
-    * The main branch is usually fine but the dev branch has more features.
-    * I accept feature requests and pull requests.
-    * Issues are extremely appreciated for any reason including typos etc.
-    * Prefer asking me before making a PR because I have many improvements in the pipeline but do this on my spare time. Do tell me if you have specific needs!
+WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summarize, search, and query documents across various file types. It's particularly useful for handling large volumes of diverse document types, making it ideal for researchers, students, and professionals dealing with extensive information sources.
+
+* **Goal and project specifications**: WDoc uses [LangChain](https://python.langchain.com/) to process and analyze documents. It's capable of querying **tens of thousands** of documents across [various file types](#Supported-filetypes). The project also includes a tailored summary feature to help users efficiently keep up with large amounts of information.
+
+* **Current status**: **Under active development**
+    * Used daily by the developer for several months
+    * May have some instabilities, but issues can usually be resolved quickly
+    * The main branch is stable, while the dev branch offers more features
+    * Open to feature requests and pull requests
+    * All feedback, including reports of typos, is highly appreciated
+    * Please consult the developer before making a PR, as there may be ongoing improvements in the pipeline
+
+* **Key Features**:
+    * Supports multiple file types for comprehensive document analysis
+    * Utilizes both strong and query evaluation LLMs for accurate results
+    * Customizable summarization capabilities
+    * Efficient handling of large document corpora
 
 ### Table of contents
 - [Features](#features)
   - [Planned Features](#planned-features)
   - [Supported filetypes](#supported-filetypes)
   - [Supported tasks](#supported-tasks)
   - [Walkthrough and examples](#walkthrough-and-examples)
+- [Scripts made with WDoc](#scripts-made-with-wdoc)
 - [Getting started](#getting-started)
 - [FAQ](#faq)
 - [Notes](#notes)
   - [Known issues](#known-issues)
 
 ## Features
+* **15+ filetypes**: filetypes can also be combined to load recursively or to define a complex heterogeneous corpus such as a list of files, a list of links, regex matches, youtube playlists, etc. See [Supported filetypes](#Supported-filetypes). All filetypes can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs. It supports removing silence from audio files and youtube videos too!
+* **100+ LLMs**: OpenAI, Mistral, Claude, Ollama, Openrouter, etc. Thanks to [litellm](https://docs.litellm.ai/).
+* **Local and Private LLM**: takes measures to make sure no data leaves your computer and goes to an LLM provider: no API keys are used, all `api_base` are user set, caches are isolated from the rest, outgoing connections are censored by overloading sockets, etc.
 * **Advanced RAG to query lots of diverse documents**:
     1. The documents are retrieved using embedding
     2. Then a weak LLM ("Evaluator") is used to tell which of those documents are not relevant
@@ -37,26 +50,21 @@
     * Instead of unusable "high level takeaway" points, the reasoning, arguments, thought process, etc. of the author are compressed into an easy-to-skim markdown file.
     * The summaries are then checked again n times for correct logical indentation etc.
     * The summary can be in the same language as the documents or directly translated.
-* **Multiple LLM providers**: OpenAI, Mistral, Claude, Ollama, Openrouter, etc. Thanks to [litellm](https://docs.litellm.ai/).
-* **Private LLM**: take some measures to make sure no data leaves your computer and goes to an LLM provider: no API keys are used, all `api_base` are user set, cache are isolated from the rest, outgoing connections are censored by overloading sockets, etc.
 * **Many tasks**: See [Supported tasks](#Supported-tasks).
-* **Many filetypes**: also supports combination to load recursively or define complex heterogenous corpus like a list of files, list of links, using regex, youtube playlists etc. See [Supported filestypes](#Supported-filetypes). All filetype can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs). It supports removing silence from audio files and youtube videos too!
+* **Markdown formatted answers and summaries**: using [rich](https://github.com/Textualize/rich).
 * **Sane embeddings**: By default use sophisticated embeddings like HyDE, parent retriever etc. Customizable.
-* **Documented** Lots of docstrings, lots of in code comments, detailed `--help` etc. The full usage can be found in the file [USAGE.md](./WDoc/docs/USAGE.md) or via `python -m WDoc --help`.
+* **Fully documented**: Lots of docstrings, lots of in-code comments, detailed `--help`, etc. The full usage can be found in the file [USAGE.md](./WDoc/docs/USAGE.md) or via `python -m WDoc --help`. I work hard to maintain exhaustive documentation.
+* **Scriptable / Extensible**: You can use WDoc in other Python projects using `--import_mode`. Take a look at the examples [below](#scripts-made-with-wdoc).
+* **Statically typed**: Runtime type checking. Opt out with an environment flag: `WDOC_TYPECHECKING="disabled / warn / crash" WDoc` (by default: `warn`). Thanks to [beartype](https://beartype.readthedocs.io/en/latest/) it shouldn't even slow down the code!
 * **Lazy imports**: Faster startup time thanks to lazy_import
 * **LLM (and embeddings) caching**: speed things up, as well as index storing and loading (handy for large collections).
 * **Sophisticated faiss saver**: [faiss](https://github.com/facebookresearch/faiss/wiki) is used to quickly find the documents that match an embedding. But instead of storing the index as a single file, WDoc splits it into one-document-long indexes identified by deterministic hashes. When creating a new index, any overlapping document will be automatically reloaded instead of recomputed.
-* **Easy model testing** Include an LLM name matcher that fuzzy finds the most appropriate model given an name.
 * **Good PDF parsing** PDF parsers are notoriously unreliable, so 10 (!) different loaders are used, and the best according to a parsing scorer is kept.
-* **Markdown formatted answers and summaries**: using [rich](https://github.com/Textualize/rich).
 * **Document filtering**: based on regex for document content or metadata.
-* **Fast**: Parallel document parsing and embedding.
+* **Fast**: Parallel document loading, parsing, embeddings, querying, etc.
 * **Shell autocompletion** using [python-fire](https://github.com/google/python-fire/blob/master/docs/using-cli.md#completion-flag)
-* **Static typed**: Runtime type checking. Opt out with an environment flag: `WDOC_TYPECHECKING="disabled / warn / crash" WDoc` (by default: `warn`). Thanks to [beartype](https://beartype.readthedocs.io/en/latest/) it shouldn't even slow down the code!
-* **Scriptable**: You can use WDoc in other python project using `--import_mode`
 * **Notification callback**: Can be used for example to get summaries on your phone using [ntfy.sh](ntfy.sh).
-* **Fully documented**: I work hard to maintain an exhaustive documentation at `wdoc --help`
-* Very customizable, with a friendly dev! Just open an issue if you have a feature request or anything else.
+* **Hacker mindset**: I'm a friendly dev! Just open an issue if you have a feature request or anything else.
 
 ### Planned features
 *(These don't include improvements, bugfixes, refactoring etc.)*
@@ -66,15 +74,13 @@
     * More configurable HyDE.
     * Web search retriever, online information lookup via jina.ai reader and search.
     * LLM powered synonym expansion for embeddings search.
-* Investigate switching to Milvus Lite instead of handling split faiss indexes.
 * A way to specify at indexing time how much you trust a given set of documents.
 * A way to open the documents automatically, based on the platform used. For example, if okular is installed, open PDFs directly at the appropriate page.
 * Improve the scriptability of WDoc. Add examples for how you use it with Logseq.
     * Include a server example, that mimics the OpenAI's API to make your RAG directly accessible to other apps.
     * Add a gradio GUI.
 * Include the possible whisper/deepgram extra expenses when counting costs.
 * Add support for user defined loaders.
-* Add support for custom user prompt.
 * Automatically caption document images using an LLM, especially nice for anki cards.
 
 ### Supported filetypes
@@ -118,6 +124,7 @@
 7. There is a specific recursive filetype I should mention: `--filetype="link_file"`. The file designated by `--path` should contain one url per line (`#comments` and empty lines are ignored), and each url will be parsed by WDoc. I made this so that I can quickly use the "share" button on android from my browser to a text file (so it just appends the url to the file). This file is synced via [syncthing](https://github.com/syncthing/syncthing) to my browser, and WDoc automatically summarizes the urls and adds the summaries to my [Logseq](https://github.com/logseq/logseq/). Note that the url is parsed in each line, so formatting is ignored: for example, it works even in a markdown bullet point list.
 8. If you want to make sure your data remains private here's an example with ollama: `wdoc --private --llms_api_bases='{"model": "http://localhost:11434", "query_eval_model": "http://localhost:11434"}' --modelname="ollama_chat/gemma:2b" --query_eval_modelname="ollama_chat/gemma:2b" --embed_model="BAAI/bge-m3" my_task`
 9. Now say you just want to summarize [Tim Urban's TED talk on procrastination](https://www.youtube.com/watch?v=arj7oStGLkU): `wdoc summary --path 'https://www.youtube.com/watch?v=arj7oStGLkU' --youtube_language="english" --disable_md_printing`:
+
 > # Summary
 > ## https://www.youtube.com/watch?v=arj7oStGLkU
 > - The speaker, Tim Urban, was a government major in college who had to write many papers
@@ -161,9 +168,13 @@
 >     - Stay aware of the Instant Gratification Monkey
 >     - Start addressing procrastination soon
 > - *Humorously* suggests not starting today, but 'sometime soon'
+> 
 > Tokens used for https://www.youtube.com/watch?v=arj7oStGLkU: '4365' ($0.00060)
+> 
 > Total cost of those summaries: '4365' ($0.00060, estimate was $0.00028)
+> 
 > Total time saved by those summaries: 8.4 minutes
+> 
 > Done summarizing.
 
 
@@ -179,7 +190,12 @@
 * To ask questions about a document: `wdoc query --path="PATH/TO/YOUR/FILE" --filetype="auto"`
 * If you want to reduce the startup time, you can use --saveas="some/path" to save the loaded embeddings from last time and --loadfrom "some/path" on every subsequent call. (In any case, the embeddings are always cached)
 * For more: read the documentation at `wdoc --help`
-* For shell autocompletion: `eval $(cat completion.cli.zsh)` and `eval $(cat completion.m.zsh)`. You can generate your own with `eval "$(wdoc -- --completion)"` and `eval "$(python -m WDoc -- --completion)"`.
+* For shell autocompletion: if you're using zsh: `eval $(cat completion.cli.zsh)` and `eval $(cat completion.m.zsh)`. You can generate your own with `eval "$(wdoc -- --completion)"` and `eval "$(python -m WDoc -- --completion)"`.
+
+## Scripts made with WDoc
+* *More to come in [the examples folder](./examples/)*
+* [Ntfy Summarizer](examples/NtfySummarizer): automatically summarize a document from your android phone using [ntfy.sh](ntfy.sh)
+* [TheFiche](examples/TheFiche): create summaries for specific notions directly as a [logseq](https://github.com/logseq/logseq) page.
 
 ## FAQ
 * **Who is this for?**
@@ -205,6 +221,7 @@
     * I use it to query my personal documents using the `--private` argument.
     * I sometimes use it to summarize a document, then go straight to asking questions about it, all in the same command.
     * I use it to ask questions about entire youtube playlists.
+    * Other use cases are the reason I made the [scripts made with WDoc example section](#scripts-made-with-wdoc).
 * **What's up with the name?** One of my favorite characters (and somewhat of a role model) is [Winston Wolf](https://www.youtube.com/watch?v=UeoMuK536C8), and after much hesitation I decided `WolfDoc` would be too confusing and `WinstonDoc` sounds like something micro$oft would do. Also `wd` and `wdoc` were free, whereas `doctools` was already taken. The initial name of the project was `DocToolsLLM`, a play on words between 'doctor' and 'tool'.
 
 ## Notes
 
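The `link_file` behaviour described in item 7 of the README hunk above (one url per line, `#comments` and empty lines ignored, surrounding formatting ignored) can be sketched as follows. This is a minimal illustration only: the function name `parse_link_file` and the regex are assumptions, not WDoc's actual loader code.

```python
import re

# Hypothetical, simplified pattern: a url runs until whitespace or a
# closing bracket, parenthesis or quote. WDoc's real parsing may differ.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def parse_link_file(text: str) -> list[str]:
    """Return the first url found on each meaningful line of a link file."""
    urls = []
    for line in text.splitlines():
        stripped = line.strip()
        # Empty lines and '#' comments are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        match = URL_PATTERN.search(stripped)
        if match:
            urls.append(match.group(0))
    return urls


example = """
# articles to summarize
- [TED talk](https://www.youtube.com/watch?v=arj7oStGLkU)
https://docs.litellm.ai/

* plain bullet https://python.langchain.com/
"""
print(parse_link_file(example))
```

Searching for the url inside each line, rather than treating the whole line as a url, is what lets markdown bullets and link syntax pass through untouched.
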
@@ -34,7 +34,7 @@
     set_func_signature
 )
 from .utils.prompts import prompts
-from .utils.tasks.query import format_chat_history, refilter_docs, check_intermediate_answer, parse_eval_output, query_eval_cache, pbar_chain, pbar_closer
+from .utils.tasks.query import format_chat_history, refilter_docs, check_intermediate_answer, parse_eval_output, query_eval_cache, pbar_chain, pbar_closer, collate_intermediate_answers
 
 from .utils.errors import NoDocumentsRetrieved
 from .utils.errors import NoDocumentsAfterLLMEvalFiltering
@@ -796,8 +796,6 @@ def summarize_documents(
             self.ntfy(
                 f"Total time saved by those summaries: {results['doc_reading_length']:.1f} minutes")
 
-        assert len(
-            self.llm.callbacks) == 1, "Unexpected number of callbacks for llm"
         llmcallback = self.llm.callbacks[0]
         total_cost = self.llm_price[0] * llmcallback.prompt_tokens + \
             self.llm_price[1] * llmcallback.completion_tokens
@@ -1518,33 +1516,27 @@ def retrieve_documents(inputs):
             intermediate_answers = output["intermediate_answers"]
 
             # next step is to combine the intermediate answers into a single answer
-            combine_answers = (
-                prompts.combine
-                | self.llm
-                | StrOutputParser()
-            )
             final_answer_chain = RunnablePassthrough.assign(
                 final_answer=RunnablePassthrough.assign(
                     question=lambda inputs: inputs["question_to_answer"],
-                    # remove answers deemed irrelevant
-                    intermediate_answers=lambda inputs: "\n".join(
-                        [
-                            inp
-                            for inp in inputs["intermediate_answers"]
-                            if check_intermediate_answer(inp)
-                        ]
-                    )
+                    intermediate_answers=lambda inputs: collate_intermediate_answers(inputs["intermediate_answers"]),
                 )
-                | combine_answers,
+                | prompts.combine
+                | self.llm
+                | StrOutputParser()
             )
 
             if len(intermediate_answers) > 1:
+                llmcallback = self.llm.callbacks[0]
+                cost_before_combine = self.llm_price[0] * llmcallback.prompt_tokens + \
+                    self.llm_price[1] * llmcallback.completion_tokens
                 all_intermediate_answers = [intermediate_answers]
                 # group the intermediate answers by batch, then do a batch reduce mapping
                 # each batch is at least 2 intermediate answers and maxes at
                 # batch_tkn_size tokens to avoid losing anything because of
                 # the context
                 batch_tkn_size = 1000
+                max_batch_size = 10
                 pbar = tqdm(
                     desc="Combining answers",
                     unit="answer",
@@ -1559,6 +1551,9 @@ def retrieve_documents(inputs):
                         if len(batches[-1]) < 2:
                             # make sure there's at least 2 per batch
                             batches[-1].append(ia)
+                        elif len(batches[-1]) >= max_batch_size:
+                            # make sure there's not too many intermediate answers
+                            batches.append([ia])
                         elif sum([get_tkn_length(b) for b in batches[-1]]) >= batch_tkn_size:
                             # cap batch size to the max tkn size
                             batches.append([ia])
@@ -1627,19 +1622,25 @@ def retrieve_documents(inputs):
                 f"Number of documents after query eval filter: {len(output['filtered_docs'])}")
             red(
                 f"Number of documents found relevant by eval llm: {len(output['relevant_filtered_docs'])}")
-            red(f"Number of steps to combine intermediate answers: {len(all_intermediate_answers) - 1}")
+            if len(all_intermediate_answers) > 1:
+                extra = '->'.join(
+                    [str(len(ia)) for ia in all_intermediate_answers]
+                )
+                extra = f"({extra})"
+            else:
+                extra = ""
+            red(f"Number of steps to combine intermediate answers: {len(all_intermediate_answers) - 1} {extra}")
             red(f"Time taken by the chain: {chain_time:.2f}s")
 
-            assert len(
-                self.llm.callbacks) == 1, "Unexpected number of callbacks for llm"
             llmcallback = self.llm.callbacks[0]
             total_cost = self.llm_price[0] * llmcallback.prompt_tokens + \
                 self.llm_price[1] * llmcallback.completion_tokens
             yel(
                 f"Tokens used by strong model: '{llmcallback.total_tokens}' (${total_cost:.5f})")
+            if "cost_before_combine" in locals():
+                combine_cost = total_cost - cost_before_combine
+                yel(f"Cost of combining the intermediate answers with the strong model: ${combine_cost:.5f}")
 
-            assert len(
-                self.eval_llm.callbacks) == 1, "Unexpected number of callbacks for eval_llm"
             evalllmcallback = self.eval_llm.callbacks[0]
             wtotal_cost = self.query_evalllm_price[0] * evalllmcallback.prompt_tokens + \
                 self.query_evalllm_price[1] * evalllmcallback.completion_tokens
 
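The batching rule touched by the hunks above (at least 2 intermediate answers per batch, a cap on the number of answers, and a cap on tokens) can be sketched in isolation. This is an illustration under stated assumptions, not the code from `WDoc.py`: the function name `batch_answers`, the whitespace-based token counter, the use of `>=` for the count check, and the final merge of a lone trailing answer are all hypothetical.

```python
from typing import Callable, List


def whitespace_tkn_length(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace separated words."""
    return len(text.split())


def batch_answers(
    answers: List[str],
    batch_tkn_size: int = 1000,
    max_batch_size: int = 10,
    get_tkn_length: Callable[[str], int] = whitespace_tkn_length,
) -> List[List[str]]:
    """Group answers into batches of at least 2, capped by count and tokens."""
    if not answers:
        return []
    batches: List[List[str]] = [[]]
    for ia in answers:
        current = batches[-1]
        if len(current) < 2:
            # make sure there's at least 2 per batch
            current.append(ia)
        elif len(current) >= max_batch_size:
            # make sure there's not too many intermediate answers
            batches.append([ia])
        elif sum(get_tkn_length(b) for b in current) >= batch_tkn_size:
            # cap batch size to the max tkn size
            batches.append([ia])
        else:
            current.append(ia)
    # Assumed behaviour: a trailing batch holding a single answer is merged
    # into the previous batch, which may then exceed max_batch_size by one.
    if len(batches) > 1 and len(batches[-1]) == 1:
        batches[-2].extend(batches.pop())
    return batches


sizes = [len(b) for b in batch_answers([f"answer {i}" for i in range(25)])]
print("->".join(str(s) for s in sizes))
```

Each batch is then combined into one answer, and the process repeats on the combined answers until a single one remains, which is what the `'->'.join(...)` summary in the log line reports.
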
@@ -238,6 +238,10 @@ def load_one_doc(
             debug=debug,
             text_splitter=text_splitter,
             file_hash=file_hash,
+            doccheck_min_lang_prob=doccheck_min_lang_prob,
+            doccheck_min_token=doccheck_min_token,
+            doccheck_max_token=doccheck_max_token,
+            doccheck_max_lines=doccheck_max_lines,
             **kwargs,
         )
 
@@ -770,9 +774,10 @@ def placeholder_replacer(row: pd.Series) -> str:
             strict=False,
         )[0]
     )
-    notes = notes[~notes["text"].str.contains("\[IMAGE_")]
-    notes = notes[~notes["text"].str.contains("\[SOUND_")]
-    notes = notes[~notes["text"].str.contains("\[LINK_")]
+    # (disabled) remove notes that contain an image, sound or link
+    # notes = notes[~notes["text"].str.contains("\[IMAGE_")]
+    # notes = notes[~notes["text"].str.contains("\[SOUND_")]
+    # notes = notes[~notes["text"].str.contains("\[LINK_")]
     notes["text"] = notes["text"].apply(lambda x: x.strip())
     notes = notes[notes["text"].ne('')]  # remove empty text
     notes.drop_duplicates(subset="text", inplace=True)