Commit 0969979

Merge branch 'dev'
2 parents b2e33b1 + dfef021 commit 0969979

11 files changed

Lines changed: 524 additions & 47 deletions

README.md

Lines changed: 39 additions & 22 deletions
@@ -5,27 +5,40 @@
55
66
# WDoc
77

8-
* **Goal and project specifications** use [LangChain](https://python.langchain.com/) to summarize, search or query documents. I'm a medical student so I need to be able to query from **tens of thousands** of documents, of different types ([Supported filetypes](#Supported-filetypes)). I also have little free time so I needed a tailor made summary feature to keep up with the news.
9-
* **Current status**: **still under development**.
10-
* I use it almost daily and have been for months now.
11-
* Expect some breakage but they can be fixed usually in a few minutes if you open an issue here.
12-
* The main branch is usually fine but the dev branch has more features.
13-
* I accept feature requests and pull requests.
14-
* Issues are extremely appreciated for any reason including typos etc.
15-
* Prefer asking me before making a PR because I have many improvements in the pipeline but do this on my spare time. Do tell me if you have specific needs!
8+
WDoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summarize, search, and query documents across various file types. It's particularly useful for handling large volumes of diverse document types, making it ideal for researchers, students, and professionals dealing with extensive information sources.
9+
10+
* **Goal and project specifications**: WDoc uses [LangChain](https://python.langchain.com/) to process and analyze documents. It's capable of querying **tens of thousands** of documents across [various file types](#Supported-filetypes). The project also includes a tailored summary feature to help users efficiently keep up with large amounts of information.
11+
12+
* **Current status**: **Under active development**
13+
* Used daily by the developer for several months
14+
* May have some instabilities, but issues can usually be resolved quickly
15+
* The main branch is stable, while the dev branch offers more features
16+
* Open to feature requests and pull requests
17+
* All feedback, including reports of typos, is highly appreciated
18+
* Please consult the developer before making a PR, as there may be ongoing improvements in the pipeline
19+
20+
* **Key Features**:
21+
* Supports multiple file types for comprehensive document analysis
22+
* Utilizes both strong and query evaluation LLMs for accurate results
23+
* Customizable summarization capabilities
24+
* Efficient handling of large document corpora
1625

1726
### Table of contents
1827
- [Features](#features)
1928
- [Planned Features](#planned-features)
2029
- [Supported filetypes](#supported-filetypes)
2130
- [Supported tasks](#supported-tasks)
2231
- [Walkthrough and examples](#walkthrough-and-examples)
32+
- [Scripts made with WDoc](#scripts-made-with-wdoc)
2333
- [Getting started](#getting-started)
2434
- [FAQ](#faq)
2535
- [Notes](#notes)
2636
- [Known issues](#known-issues)
2737

2838
## Features
39+
* **15+ filetypes**: also supports combination to load recursively or define complex heterogeneous corpus like a list of files, list of links, using regex, youtube playlists etc. See [Supported filetypes](#Supported-filetypes). All filetypes can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs. It supports removing silence from audio files and youtube videos too!
40+
* **100+ LLMs**: OpenAI, Mistral, Claude, Ollama, Openrouter, etc. Thanks to [litellm](https://docs.litellm.ai/).
41+
* **Local and Private LLM**: takes measures to make sure no data leaves your computer and goes to an LLM provider: no API keys are used, all `api_base` are user set, caches are isolated from the rest, outgoing connections are censored by overloading sockets, etc.
2942
* **Advanced RAG to query lots of diverse documents**:
3043
1. The documents are retrieved using embedding
3144
2. Then a weak LLM model ("Evaluator") is used to tell which of those documents are not relevant
@@ -37,26 +50,21 @@
3750
* Instead of unusable "high level takeaway" points, compress the reasoning, arguments, thought process etc of the author into an easy to skim markdown file.
3851
* The summaries are then checked again n times for correct logical indentation etc.
3952
* The summary can be in the same language as the documents or directly translated.
40-
* **Multiple LLM providers**: OpenAI, Mistral, Claude, Ollama, Openrouter, etc. Thanks to [litellm](https://docs.litellm.ai/).
41-
* **Private LLM**: take some measures to make sure no data leaves your computer and goes to an LLM provider: no API keys are used, all `api_base` are user set, cache are isolated from the rest, outgoing connections are censored by overloading sockets, etc.
4253
* **Many tasks**: See [Supported tasks](#Supported-tasks).
43-
* **Many filetypes**: also supports combination to load recursively or define complex heterogenous corpus like a list of files, list of links, using regex, youtube playlists etc. See [Supported filestypes](#Supported-filetypes). All filetype can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs). It supports removing silence from audio files and youtube videos too!
54+
* **Markdown formatted answers and summaries**: using [rich](https://github.com/Textualize/rich).
4455
* **Sane embeddings**: By default use sophisticated embeddings like HyDE, parent retriever etc. Customizable.
45-
* **Documented** Lots of docstrings, lots of in code comments, detailed `--help` etc. The full usage can be found in the file [USAGE.md](./WDoc/docs/USAGE.md) or via `python -m WDoc --help`.
56+
* **Fully documented**: Lots of docstrings, lots of in-code comments, detailed `--help` etc. The full usage can be found in the file [USAGE.md](./WDoc/docs/USAGE.md) or via `python -m WDoc --help`. I work hard to maintain exhaustive documentation.
57+
* **Scriptable / Extensible**: You can use WDoc in other Python projects using `--import_mode`. Take a look at the examples [below](#scripts-made-with-wdoc).
58+
* **Statically typed**: Runtime type checking. Opt out with an environment flag: `WDOC_TYPECHECKING="disabled / warn / crash" WDoc` (by default: `warn`). Thanks to [beartype](https://beartype.readthedocs.io/en/latest/) it shouldn't even slow down the code!
4659
* **Lazy imports**: Faster startup time thanks to lazy_import
4760
* **LLM (and embeddings) caching**: speed things up, as well as index storing and loading (handy for large collections).
4861
* **Sophisticated faiss saver**: [faiss](https://github.com/facebookresearch/faiss/wiki) is used to quickly find the documents that match an embedding. But instead of storing as a single file, WDoc splits the index into one-document-long indexes identified by deterministic hashes. When creating a new index, any overlapping document will be automatically reloaded instead of recomputed.
49-
* **Easy model testing** Include an LLM name matcher that fuzzy finds the most appropriate model given an name.
5062
* **Good PDF parsing**: PDF parsers are notoriously unreliable, so 10 (!) different loaders are used, and the best according to a parsing scorer is kept.
51-
* **Markdown formatted answers and summaries**: using [rich](https://github.com/Textualize/rich).
5263
* **Document filtering**: based on regex for document content or metadata.
53-
* **Fast**: Parallel document parsing and embedding.
64+
* **Fast**: Parallel document loading, parsing, embeddings, querying, etc.
5465
* **Shell autocompletion** using [python-fire](https://github.com/google/python-fire/blob/master/docs/using-cli.md#completion-flag)
55-
* **Static typed**: Runtime type checking. Opt out with an environment flag: `WDOC_TYPECHECKING="disabled / warn / crash" WDoc` (by default: `warn`). Thanks to [beartype](https://beartype.readthedocs.io/en/latest/) it shouldn't even slow down the code!
56-
* **Scriptable**: You can use WDoc in other python project using `--import_mode`
5766
* **Notification callback**: Can be used for example to get summaries on your phone using [ntfy.sh](https://ntfy.sh).
58-
* **Fully documented**: I work hard to maintain an exhaustive documentation at `wdoc --help`
59-
* Very customizable, with a friendly dev! Just open an issue if you have a feature request or anything else.
67+
* **Hacker mindset**: I'm a friendly dev! Just open an issue if you have a feature request or anything else.
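
The "Sophisticated faiss saver" feature above stores one entry per document, keyed by a deterministic hash, so that documents already seen are reloaded rather than recomputed. The idea can be sketched with the standard library only. This is a minimal illustration under stated assumptions, not WDoc's implementation: `PerDocumentStore`, `doc_key`, `toy_embed`, and the JSON-on-disk layout are all hypothetical, and no faiss index is involved.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def doc_key(text: str) -> str:
    """Deterministic key: the same text always maps to the same file name."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class PerDocumentStore:
    """Hypothetical store keeping one small file per document embedding."""

    def __init__(self, root: Path, embed_fn: Callable[[str], List[float]]):
        self.root = root
        self.embed_fn = embed_fn
        self.computed = 0  # number of embeddings actually computed
        self.root.mkdir(parents=True, exist_ok=True)

    def get(self, text: str) -> List[float]:
        path = self.root / f"{doc_key(text)}.json"
        if path.exists():
            # Overlapping document: reload instead of recomputing.
            return json.loads(path.read_text())
        vector = self.embed_fn(text)
        self.computed += 1
        path.write_text(json.dumps(vector))
        return vector


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model (assumption for the demo)."""
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as tmp:
    store = PerDocumentStore(Path(tmp), toy_embed)
    first = [store.get(d) for d in ["alpha beta", "gamma"]]
    # "gamma" overlaps with the first corpus, so only "delta epsilon" is new.
    second = [store.get(d) for d in ["gamma", "delta epsilon"]]
    computed_total = store.computed
```

Because the key depends only on the document content, two corpora that share documents also share the cached entries, whatever order they are indexed in.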
6068

6169
### Planned features
6270
*(These don't include improvements, bugfixes, refactoring etc.)*
@@ -66,15 +74,13 @@
6674
* More configurable HyDE.
6775
* Web search retriever, online information lookup via jina.ai reader and search.
6876
* LLM powered synonym expansion for embeddings search.
69-
* Investigate switching to Milvus Lite instead of handling split faiss indexes.
7077
* A way to specify at indexing time how trusting you are of a given set of documents.
7178
* A way to open the documents automatically, based on the platform used. For ex if okular is installed, open pdfs directly at the appropriate page.
7279
* Improve the scriptability of WDoc. Add examples for how you use it with Logseq.
7380
* Include a server example, that mimics the OpenAI's API to make your RAG directly accessible to other apps.
7481
* Add a gradio GUI.
7582
* Include the possible whisper/deepgram extra expenses when counting costs.
7683
* Add support for user defined loaders.
77-
* Add support for custom user prompt.
7884
* Automatically caption document images using an LLM, especially nice for anki cards.
7985

8086
### Supported filetypes
@@ -118,6 +124,7 @@
118124
7. There is a specific recursive filetype I should mention: `--filetype="link_file"`. The file designated by `--path` should contain one url per line (`#comments` and empty lines are ignored), and each url will be parsed by WDoc. I made this so that I can quickly use the "share" button on android from my browser to a text file (so it just appends the url to the file); this file is synced via [syncthing](https://github.com/syncthing/syncthing) to my browser, and WDoc automatically summarizes the urls and adds them to my [Logseq](https://github.com/logseq/logseq/). Note that the url is parsed in each line, so formatting is ignored: for example it works even in a markdown bullet point list.
119125
8. If you want to make sure your data remains private here's an example with ollama: `wdoc --private --llms_api_bases='{"model": "http://localhost:11434", "query_eval_model": "http://localhost:11434"}' --modelname="ollama_chat/gemma:2b" --query_eval_modelname="ollama_chat/gemma:2b" --embed_model="BAAI/bge-m3" my_task`
120126
9. Now say you just want to summarize [Tim Urban's TED talk on procrastination](https://www.youtube.com/watch?v=arj7oStGLkU): `wdoc summary --path 'https://www.youtube.com/watch?v=arj7oStGLkU' --youtube_language="english" --disable_md_printing`:
127+
121128
> # Summary
122129
> ## https://www.youtube.com/watch?v=arj7oStGLkU
123130
> - The speaker, Tim Urban, was a government major in college who had to write many papers
@@ -161,9 +168,13 @@
161168
> - Stay aware of the Instant Gratification Monkey
162169
> - Start addressing procrastination soon
163170
> - *Humorously* suggests not starting today, but 'sometime soon'
171+
>
164172
> Tokens used for https://www.youtube.com/watch?v=arj7oStGLkU: '4365' ($0.00060)
173+
>
165174
> Total cost of those summaries: '4365' ($0.00060, estimate was $0.00028)
175+
>
166176
> Total time saved by those summaries: 8.4 minutes
177+
>
167178
> Done summarizing.
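
The `link_file` behavior described in item 7 (one url per line, `#comments` and empty lines ignored, surrounding formatting ignored) can be sketched as follows. This is a minimal sketch, not WDoc's loader: `parse_link_file` and `URL_PATTERN` are hypothetical names, and the regex is a deliberately simple assumption about what counts as a url.

```python
import re
from typing import List

# Simplistic url matcher (assumption): stops at whitespace or at a closing
# bracket so that markdown links like [text](url) still yield a clean url.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def parse_link_file(content: str) -> List[str]:
    """Return the first url found on each meaningful line."""
    urls = []
    for line in content.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip empty lines and comments
        match = URL_PATTERN.search(stripped)
        if match:
            urls.append(match.group(0))
    return urls


sample = """
# reading list
- https://example.com/article
* [a talk](https://www.youtube.com/watch?v=arj7oStGLkU)

https://example.org/paper.pdf
"""
urls = parse_link_file(sample)
```

Searching for the url inside the line, rather than treating the whole line as the url, is what lets bullet markers and markdown link syntax be ignored.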
168179
169180

@@ -179,7 +190,12 @@
179190
* To ask questions about a document: `wdoc query --path="PATH/TO/YOUR/FILE" --filetype="auto"`
180191
* If you want to reduce the startup time, you can use --saveas="some/path" to save the loaded embeddings from last time and --loadfrom "some/path" on every subsequent call. (In any case, the embeddings are always cached)
181192
* For more: read the documentation at `wdoc --help`
182-
* For shell autocompletion: `eval $(cat completion.cli.zsh)` and `eval $(cat completion.m.zsh)`. You can generate your own with `eval "$(wdoc -- --completion)"` and `eval "$(python -m WDoc -- --completion)"`.
193+
* For shell autocompletion: if you're using zsh: `eval $(cat completion.cli.zsh)` and `eval $(cat completion.m.zsh)`. You can generate your own with `eval "$(wdoc -- --completion)"` and `eval "$(python -m WDoc -- --completion)"`.
194+
195+
## Scripts made with WDoc
196+
* *More to come in [the examples folder](./examples/)*
197+
* [Ntfy Summarizer](examples/NtfySummarizer): automatically summarize a document from your android phone using [ntfy.sh](https://ntfy.sh)
198+
* [TheFiche](examples/TheFiche): create summaries for specific notions directly as a [logseq](https://github.com/logseq/logseq) page.
183199

184200
## FAQ
185201
* **Who is this for?**
@@ -205,6 +221,7 @@
205221
* I use it to query my personal documents using the `--private` argument.
206222
* I sometimes use it to summarize a document then go straight to asking questions about it, all in the same command.
207223
* I use it to ask questions about entire youtube playlists.
224+
* Other use cases are the reason I made the [scripts made with WDoc examples section](#scripts-made-with-wdoc).
208225
* **What's up with the name?** One of my favorite characters (and somewhat of a role model) is [Winston Wolf](https://www.youtube.com/watch?v=UeoMuK536C8), and after much hesitation I decided `WolfDoc` would be too confusing and `WinstonDoc` sounds like something micro$oft would do. Also `wd` and `wdoc` were free, whereas `doctools` was already taken. The initial name of the project was `DocToolsLLM`, a play on words between 'doctor' and 'tool'.
209226

210227
## Notes

WDoc/WDoc.py

Lines changed: 23 additions & 22 deletions
@@ -34,7 +34,7 @@
3434
set_func_signature
3535
)
3636
from .utils.prompts import prompts
37-
from .utils.tasks.query import format_chat_history, refilter_docs, check_intermediate_answer, parse_eval_output, query_eval_cache, pbar_chain, pbar_closer
37+
from .utils.tasks.query import format_chat_history, refilter_docs, check_intermediate_answer, parse_eval_output, query_eval_cache, pbar_chain, pbar_closer, collate_intermediate_answers
3838

3939
from .utils.errors import NoDocumentsRetrieved
4040
from .utils.errors import NoDocumentsAfterLLMEvalFiltering
@@ -796,8 +796,6 @@ def summarize_documents(
796796
self.ntfy(
797797
f"Total time saved by those summaries: {results['doc_reading_length']:.1f} minutes")
798798

799-
assert len(
800-
self.llm.callbacks) == 1, "Unexpected number of callbacks for llm"
801799
llmcallback = self.llm.callbacks[0]
802800
total_cost = self.llm_price[0] * llmcallback.prompt_tokens + \
803801
self.llm_price[1] * llmcallback.completion_tokens
@@ -1518,33 +1516,27 @@ def retrieve_documents(inputs):
15181516
intermediate_answers = output["intermediate_answers"]
15191517

15201518
# next step is to combine the intermediate answers into a single answer
1521-
combine_answers = (
1522-
prompts.combine
1523-
| self.llm
1524-
| StrOutputParser()
1525-
)
15261519
final_answer_chain = RunnablePassthrough.assign(
15271520
final_answer=RunnablePassthrough.assign(
15281521
question=lambda inputs: inputs["question_to_answer"],
1529-
# remove answers deemed irrelevant
1530-
intermediate_answers=lambda inputs: "\n".join(
1531-
[
1532-
inp
1533-
for inp in inputs["intermediate_answers"]
1534-
if check_intermediate_answer(inp)
1535-
]
1536-
)
1522+
intermediate_answers=lambda inputs: collate_intermediate_answers(inputs["intermediate_answers"]),
15371523
)
1538-
| combine_answers,
1524+
| prompts.combine
1525+
| self.llm
1526+
| StrOutputParser()
15391527
)
15401528

15411529
if len(intermediate_answers) > 1:
1530+
llmcallback = self.llm.callbacks[0]
1531+
cost_before_combine = self.llm_price[0] * llmcallback.prompt_tokens + \
1532+
self.llm_price[1] * llmcallback.completion_tokens
15421533
all_intermediate_answers = [intermediate_answers]
15431534
# group the intermediate answers by batch, then do a batch reduce mapping
15441535
# each batch is at least 2 intermediate answers and maxes at
15451536
# batch_tkn_size tokens to avoid losing anything because of
15461537
# the context
15471538
batch_tkn_size = 1000
1539+
max_batch_size = 10
15481540
pbar = tqdm(
15491541
desc="Combining answers",
15501542
unit="answer",
@@ -1559,6 +1551,9 @@ def retrieve_documents(inputs):
15591551
if len(batches[-1]) < 2:
15601552
# make sure there's at least 2 per batch
15611553
batches[-1].append(ia)
1554+
elif len(batches[-1]) > max_batch_size:
1555+
# make sure there's not too many intermediate answers
1556+
batches.append([ia])
15621557
elif sum([get_tkn_length(b) for b in batches[-1]]) >= batch_tkn_size:
15631558
# cap batch size to the max tkn size
15641559
batches.append([ia])
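
The batching rules in this hunk (at least two answers per batch, a cap on the number of answers, a cap on tokens) can be sketched in isolation. This is a minimal sketch under stated assumptions, not the code from `WDoc.py`: `make_batches` and the whitespace-based `count_tokens` are hypothetical stand-ins for the surrounding loop and for `get_tkn_length`. The sketch also uses `>=` for the size cap so that a batch never holds more than `max_batch_size` answers, whereas the `>` comparison in the hunk appears to let a batch reach `max_batch_size + 1`. A trailing batch holding a single answer is left as-is here.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace word."""
    return len(text.split())


def make_batches(
    answers: List[str],
    tkn_len: Callable[[str], int] = count_tokens,
    batch_tkn_size: int = 1000,
    max_batch_size: int = 10,
) -> List[List[str]]:
    """Group answers so each batch can be combined in one LLM call."""
    batches: List[List[str]] = [[]]
    for ia in answers:
        current = batches[-1]
        if len(current) < 2:
            # make sure there are at least 2 answers per batch
            current.append(ia)
        elif len(current) >= max_batch_size:
            # cap the number of answers per batch
            batches.append([ia])
        elif sum(tkn_len(b) for b in current) >= batch_tkn_size:
            # cap the batch at the max token size
            batches.append([ia])
        else:
            current.append(ia)
    return batches


# 25 short answers: only the size cap applies.
short_sizes = [len(b) for b in make_batches([f"answer {i}" for i in range(25)])]

# 5 long answers of 400 words each: the token cap applies.
long_answer = " ".join(["word"] * 400)
long_sizes = [len(b) for b in make_batches([long_answer] * 5)]
```

Checking the minimum first means a batch can overshoot the token cap when its first two answers are long, which matches the "at least 2" comment in the hunk.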
@@ -1627,19 +1622,25 @@ def retrieve_documents(inputs):
16271622
f"Number of documents after query eval filter: {len(output['filtered_docs'])}")
16281623
red(
16291624
f"Number of documents found relevant by eval llm: {len(output['relevant_filtered_docs'])}")
1630-
red(f"Number of steps to combine intermediate answers: {len(all_intermediate_answers) - 1}")
1625+
if len(all_intermediate_answers) > 1:
1626+
extra = '->'.join(
1627+
[str(len(ia)) for ia in all_intermediate_answers]
1628+
)
1629+
extra = f"({extra})"
1630+
else:
1631+
extra = ""
1632+
red(f"Number of steps to combine intermediate answers: {len(all_intermediate_answers) - 1} {extra}")
16311633
red(f"Time taken by the chain: {chain_time:.2f}s")
16321634

1633-
assert len(
1634-
self.llm.callbacks) == 1, "Unexpected number of callbacks for llm"
16351635
llmcallback = self.llm.callbacks[0]
16361636
total_cost = self.llm_price[0] * llmcallback.prompt_tokens + \
16371637
self.llm_price[1] * llmcallback.completion_tokens
16381638
yel(
16391639
f"Tokens used by strong model: '{llmcallback.total_tokens}' (${total_cost:.5f})")
1640+
if "cost_before_combine" in locals():
1641+
combine_cost = total_cost - cost_before_combine
1642+
yel(f"Cost of combining the intermediate answers with the strong model: ${combine_cost:.5f}")
16401643

1641-
assert len(
1642-
self.eval_llm.callbacks) == 1, "Unexpected number of callbacks for eval_llm"
16431644
evalllmcallback = self.eval_llm.callbacks[0]
16441645
wtotal_cost = self.query_evalllm_price[0] * evalllmcallback.prompt_tokens + \
16451646
self.query_evalllm_price[1] * evalllmcallback.completion_tokens
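
The combine-cost figure introduced in this file is the difference between two cost snapshots taken from the same token-counting callback: one before the combine step and one at the end. A minimal sketch of that bookkeeping follows. `TokenCounter` is a hypothetical stand-in for the LLM callback, and the prices and token counts are made-up values for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TokenCounter:
    """Hypothetical stand-in for a callback that accumulates token usage."""

    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, prompt: int, completion: int) -> None:
        self.prompt_tokens += prompt
        self.completion_tokens += completion


def cost_so_far(counter: TokenCounter, price: Tuple[float, float]) -> float:
    """price is (cost per prompt token, cost per completion token)."""
    return price[0] * counter.prompt_tokens + price[1] * counter.completion_tokens


price = (1e-6, 2e-6)  # made-up prices
counter = TokenCounter()

# Tokens spent answering each document (made-up numbers).
counter.record(prompt=3000, completion=500)
cost_before_combine = cost_so_far(counter, price)

# Tokens spent combining the intermediate answers.
counter.record(prompt=1000, completion=200)
total_cost = cost_so_far(counter, price)

combine_cost = total_cost - cost_before_combine
report = f"Cost of the combine step: ${combine_cost:.5f}"
```

Since the counter only ever grows, the snapshot has to be taken before the combine calls start; taking it afterwards would always yield a difference of zero.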

WDoc/utils/loaders.py

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -238,6 +238,10 @@ def load_one_doc(
238238
debug=debug,
239239
text_splitter=text_splitter,
240240
file_hash=file_hash,
241+
doccheck_min_lang_prob=doccheck_min_lang_prob,
242+
doccheck_min_token=doccheck_min_token,
243+
doccheck_max_token=doccheck_max_token,
244+
doccheck_max_lines=doccheck_max_lines,
241245
**kwargs,
242246
)
243247

@@ -770,9 +774,10 @@ def placeholder_replacer(row: pd.Series) -> str:
770774
strict=False,
771775
)[0]
772776
)
773-
notes = notes[~notes["text"].str.contains("\[IMAGE_")]
774-
notes = notes[~notes["text"].str.contains("\[SOUND_")]
775-
notes = notes[~notes["text"].str.contains("\[LINK_")]
777+
# remove notes that contain an image, sound or link
778+
# notes = notes[~notes["text"].str.contains("\[IMAGE_")]
779+
# notes = notes[~notes["text"].str.contains("\[SOUND_")]
780+
# notes = notes[~notes["text"].str.contains("\[LINK_")]
776781
notes["text"] = notes["text"].apply(lambda x: x.strip())
777782
notes = notes[notes["text"].ne('')] # remove empty text
778783
notes.drop_duplicates(subset="text", inplace=True)
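
The filter that this hunk comments out dropped every note whose text contained an `[IMAGE_`, `[SOUND_` or `[LINK_` placeholder, while the strip, empty-text, and duplicate cleanup is kept. The two behaviors can be compared on plain lists. This is a sketch under stated assumptions, not the loader's pandas code: `clean_notes` and `drop_notes_with_media` are hypothetical names, and the exact placeholder shape (`[IMAGE_1]`) is assumed from the patterns in the diff. Using a raw string for the pattern also avoids the invalid-escape warning that `"\["` triggers in a regular Python string.

```python
import re
from typing import List

# Raw string so that \[ reaches the regex engine unchanged.
PLACEHOLDER = re.compile(r"\[(?:IMAGE|SOUND|LINK)_")


def clean_notes(texts: List[str]) -> List[str]:
    """Cleanup kept by the diff: strip, drop empty text, drop duplicates."""
    seen = set()
    cleaned = []
    for text in texts:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def drop_notes_with_media(texts: List[str]) -> List[str]:
    """Filter disabled by the diff: remove notes containing a placeholder."""
    return [t for t in texts if not PLACEHOLDER.search(t)]


notes = [
    "  What is RAG?  ",
    "",
    "See [IMAGE_1] for the diagram",
    "What is RAG?",
    "Listen: [SOUND_3]",
]
kept_now = clean_notes(notes)
kept_before = drop_notes_with_media(kept_now)
```

With the filter disabled, notes that reference media stay in the corpus alongside text-only notes.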
