Commit b2e33b1

Merge branch 'dev'
2 parents 188a1b1 + 4aad00a commit b2e33b1

8 files changed

Lines changed: 130 additions & 55 deletions

README.md

Lines changed: 52 additions & 2 deletions
@@ -1,5 +1,8 @@
 <p align="center"><img src="https://github.com/thiswillbeyourgithub/WDoc/blob/main/images/icon.png?raw=true" width="256"></p>
 
+> *I'm WDoc. I solve RAG problems.*
+> - WDoc, imitating Winston "The Wolf" Wolf
+
 # WDoc
 
 * **Goal and project specifications** use [LangChain](https://python.langchain.com/) to summarize, search or query documents. I'm a medical student so I need to be able to query from **tens of thousands** of documents, of different types ([Supported filetypes](#Supported-filetypes)). I also have little free time so I needed a tailor made summary feature to keep up with the news.
@@ -114,9 +117,56 @@
 6. To know more about each argument supported by each filetype, `wdoc --help`
 7. There is a specific recursive filetype I should mention: `--filetype="link_file"`. Basically the file designated by `--path` should contain in each line (`#comments` and empty lines are ignored) one url, that will be parsed by WDoc. I made this so that I can quickly use the "share" button on android from my browser to a text file (so it just appends the url to the file), this file is synced via [syncthing](https://github.com/syncthing/syncthing) to my browser and WDoc automatically summarizes them and adds them to my [Logseq](https://github.com/logseq/logseq/). Note that the url is parsed in each line, so formatting is ignored, for example it works even in a markdown bullet point list.
 8. If you want to make sure your data remains private here's an example with ollama: `wdoc --private --llms_api_bases='{"model": "http://localhost:11434", "query_eval_model": "http://localhost:11434"}' --modelname="ollama_chat/gemma:2b" --query_eval_modelname="ollama_chat/gemma:2b" --embed_model="BAAI/bge-m3" my_task`
-9. Now say you just want to summarize a webpage: `wdoc summary --path="https://arstechnica.com/science/2024/06/to-pee-or-not-to-pee-that-is-a-question-for-the-bladder-and-the-brain/"`.
+9. Now say you just want to summarize [Tim Urban's TED talk on procrastination](https://www.youtube.com/watch?v=arj7oStGLkU): `wdoc summary --path 'https://www.youtube.com/watch?v=arj7oStGLkU' --youtube_language="english" --disable_md_printing`:
+> # Summary
+> ## https://www.youtube.com/watch?v=arj7oStGLkU
+> - The speaker, Tim Urban, was a government major in college who had to write many papers
+> - *He claims* his typical work pattern for papers was:
+> - Planning to spread work evenly
+> - Actually procrastinating until the last minute
+> - For his 90-page senior thesis:
+> - Planned to work steadily over a year
+> - *Actually* ended up writing 90 pages in 72 hours before the deadline
+> - Pulled two all-nighters
+> - Resulted in a 'very, very bad thesis'
+> - Urban is now a writer-blogger for 'Wait But Why'
+> - He wrote about procrastination to explain it to non-procrastinators
+> - *Humorously claims* to have done brain scans comparing procrastinator and non-procrastinator brains
+> - Introduces concept of 'Instant Gratification Monkey' in procrastinator's brain
+> - Monkey takes control from the Rational Decision-Maker
+> - Leads to unproductive activities like reading Wikipedia, checking fridge, YouTube spirals
+> - Monkey characteristics:
+> - Lives in the present moment
+> - No memory of past or knowledge of future
+> - Only cares about 'easy and fun'
+> - Rational Decision-Maker:
+> - Allows long-term planning and big picture thinking
+> - Wants to do what makes sense in the moment
+> - 'Dark Playground': where procrastinators spend time on leisure activities when they shouldn't
+> - Filled with guilt, dread, anxiety, self-hatred
+> - 'Panic Monster': procrastinator's guardian angel
+> - Wakes up when deadlines are close or there's danger of embarrassment
+> - Only thing the Monkey fears
+> - Urban relates his own experience procrastinating on preparing this TED talk
+> - *Claims* thousands of people emailed him about having the same procrastination problem
+> - Two types of procrastination:
+> - 1. Short-term with deadlines (contained by Panic Monster)
+> - 2. Long-term without deadlines (more damaging)
+> - Affects self-starter careers, personal life, health, relationships
+> - Can lead to long-term unhappiness and regrets
+> - *Urban believes* all people are procrastinators to some degree
+> - Presents 'Life Calendar': visual representation of weeks in a 90-year life
+> - Encourages audience to:
+> - Think about what they're procrastinating on
+> - Stay aware of the Instant Gratification Monkey
+> - Start addressing procrastination soon
+> - *Humorously* suggests not starting today, but 'sometime soon'
+> Tokens used for https://www.youtube.com/watch?v=arj7oStGLkU: '4365' ($0.00060)
+> Total cost of those summaries: '4365' ($0.00060, estimate was $0.00028)
+> Total time saved by those summaries: 8.4 minutes
+> Done summarizing.
+
 
-<p align="center"><img src="https://github.com/thiswillbeyourgithub/WDoc/blob/main/images/summary.png?raw=true" width="256"></p>
 
 ## Getting started
 *Tested on python 3.10 and 3.11.7*
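
Review note on README item 7: the `link_file` format (one url per line, `#comments` and empty lines ignored, url extracted regardless of surrounding formatting) could be parsed roughly as sketched below. This is only an illustration under stated assumptions: the helper name `parse_link_file` and the regular expression are hypothetical and are not taken from WDoc's loader.

```python
import re

# Hypothetical URL pattern: stops at whitespace and common closing brackets,
# so markdown links such as [title](https://...) still yield a clean url.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def parse_link_file(text: str) -> list[str]:
    """Return the first url of each non-empty, non-comment line (illustrative)."""
    urls = []
    for line in text.splitlines():
        stripped = line.strip()
        # Empty lines and '#comments' are ignored, as the README describes.
        if not stripped or stripped.startswith("#"):
            continue
        match = URL_RE.search(stripped)
        if match:
            urls.append(match.group(0))
    return urls
```

Because the url is searched for inside each line rather than requiring the line to be a bare url, bullet markers and markdown link syntax do not get in the way.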

WDoc/WDoc.py

Lines changed: 38 additions & 17 deletions
@@ -79,7 +79,7 @@
 class WDoc:
     "This docstring is dynamically replaced by the content of WDoc/docs/USAGE.md"
 
-    VERSION: str = "1.1.9"
+    VERSION: str = "1.1.10"
     allowed_extra_args = extra_args_keys
     md_printer = md_printer
 
@@ -116,7 +116,7 @@ def __init__(
         query_eval_check_number: int = 4,
         query_relevancy: float = 0.1,
 
-        summary_n_recursion: int = 1,
+        summary_n_recursion: int = 0,
         summary_language: str = "the same language as the document",
 
         llm_verbosity: Union[bool, int] = False,
@@ -131,18 +131,36 @@ def __init__(
         DIY_rolling_window_embedding: Union[bool, int] = False,
         import_mode: Union[bool, int] = False,
         disable_md_printing: bool = False,
+        silent: bool = False,
+        version: bool = False,
 
         **cli_kwargs,
     ) -> None:
         "This docstring is dynamically replaced by the content of WDoc/docs/USAGE.md"
+        if version:
+            print(self.VERSION)
+            return
+        if notification_callback is not None:
+            @optional_typecheck
+            def ntfy(text: str) -> str:
+                out = notification_callback(text)
+                assert out == text, "The notification callback must return the same string"
+                return out
+            ntfy("Starting WDoc")
+        else:
+            @optional_typecheck
+            def ntfy(text: str) -> str:
+                return text
+        self.ntfy = ntfy
+
         if debug:
             def handle_exception(exc_type, exc_value, exc_traceback):
                 if not issubclass(exc_type, KeyboardInterrupt):
                     @optional_typecheck
                     def p(message: str) -> None:
                         "print error, in red if possible"
                         try:
-                            red(message)
+                            red(self.ntfy(message))
                         except Exception as err:
                             print(message)
                     p("\n--verbose was used so opening debug console at the "
@@ -158,13 +176,26 @@ def p(message: str) -> None:
             sys.excepthook = handle_exception
             faulthandler.enable()
 
+        elif notification_callback:
+            def print_exception(exc_type, exc_value, exc_traceback):
+                if not issubclass(exc_type, KeyboardInterrupt):
+                    message = "An error has occurred:\n"
+                    message += "\n".join([line for line in traceback.format_tb(exc_traceback)])
+                    message += "\n" + str(exc_type) + " : " + str(exc_value)
+                    self.ntfy(message)
+                    sys.exit(1)
+
+            sys.excepthook = print_exception
+            faulthandler.enable()
+
         red(pyfiglet.figlet_format("wdoc"))
 
         # make sure the extra args are valid
         for k in cli_kwargs:
             if k not in self.allowed_extra_args:
                 raise Exception(
-                    red(f"Found unexpected keyword argument: '{k}'"))
+                    red(f"Found unexpected keyword argument: '{k}'\nThe allowed arguments are {','.join(self.allowed_extra_args)}")
+                    )
 
         # type checking of extra args
         if os.environ["WDOC_TYPECHECKING"] in ["crash", "warn"]:
@@ -376,19 +407,6 @@ def p(message: str) -> None:
                 raise Exception(
                     red(f"Can't find the price of {query_eval_modelname}"))
 
-        if notification_callback is not None:
-            @optional_typecheck
-            def ntfy(text: str) -> str:
-                out = notification_callback(text)
-                assert out == text, "The notification callback must return the same string"
-                return out
-            ntfy("Starting WDoc")
-        else:
-            @optional_typecheck
-            def ntfy(text: str) -> str:
-                return text
-        self.ntfy = ntfy
-
         if is_verbose:
             # os.environ["LANGCHAIN_TRACING_V2"] = "true"
             set_verbose(True)
@@ -786,6 +804,7 @@ def summarize_documents(
             if llmcallback.total_tokens != results['doc_total_tokens']:
                 red(
                     f"Cost discrepancy? Tokens used according to the callback: '{llmcallback.total_tokens}' (${total_cost:.5f})")
+            self.summary_results = results
             return results
 
         @optional_typecheck
@@ -1196,6 +1215,7 @@ def query_task(self, query: Optional[str]) -> dict:
             eval_args["n"] = self.query_eval_check_number
         else:
             red(f"Model {self.query_eval_modelname} does not support parameter 'n' so will be called multiple times instead. This might cost more.")
+            assert self.query_eval_modelbackend != "openai"
         if "max_tokens" in self.eval_llm_params:
             eval_args["max_tokens"] = 2
         else:
@@ -1607,6 +1627,7 @@ def retrieve_documents(inputs):
                 f"Number of documents after query eval filter: {len(output['filtered_docs'])}")
             red(
                 f"Number of documents found relevant by eval llm: {len(output['relevant_filtered_docs'])}")
+            red(f"Number of steps to combine intermediate answers: {len(all_intermediate_answers) - 1}")
             red(f"Time took by the chain: {chain_time:.2f}s")
 
             assert len(
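
Review note on the notification changes in `WDoc/WDoc.py`: the commit builds `ntfy` before the debug setup and adds an exception hook that forwards uncaught errors to it. The sketch below isolates those two ideas so the callback contract (it must return the string it received) is easy to see. The factory names `make_ntfy` and `make_excepthook` are hypothetical, and unlike the commit this sketch does not call `sys.exit(1)` or install the hook itself.

```python
import sys
import traceback
from typing import Callable, Optional


def make_ntfy(
    notification_callback: Optional[Callable[[str], str]],
) -> Callable[[str], str]:
    """Wrap a user callback; fall back to a pass-through when none is given."""
    if notification_callback is None:
        return lambda text: text

    def ntfy(text: str) -> str:
        out = notification_callback(text)
        # Same contract as in the diff: the callback must echo its input.
        assert out == text, "The notification callback must return the same string"
        return out

    return ntfy


def make_excepthook(ntfy: Callable[[str], str]):
    """Build a function with the sys.excepthook signature that notifies on errors."""

    def print_exception(exc_type, exc_value, exc_traceback):
        # KeyboardInterrupt is deliberately not reported, as in the diff.
        if not issubclass(exc_type, KeyboardInterrupt):
            message = "An error has occurred:\n"
            message += "".join(traceback.format_tb(exc_traceback))
            message += "\n" + str(exc_type) + " : " + str(exc_value)
            ntfy(message)

    return print_exception
```

A caller would typically install the hook with `sys.excepthook = make_excepthook(make_ntfy(my_callback))`, where `my_callback` is whatever function sends the text to a phone or chat channel and then returns it unchanged.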

WDoc/docs/USAGE.md

Lines changed: 10 additions & 2 deletions
@@ -139,7 +139,7 @@
 
 ---
 
-* `--summary_n_recursion`: int, default `1`
+* `--summary_n_recursion`: int, default `0`
     * after summarizing, will go over the summary that many times to fix
       indentation, repetitions etc.
     * 0 means disabled.
@@ -222,11 +222,19 @@
       the default langchain SentenceTransformerEmbedding implementation
 
 * `--import_mode`: bool, default `False`
-    * if True, will return the answer from query instead of printing it
+    * if True, will return the answer from query instead of printing it.
+      The idea is to use it when you import WDoc instead of running
+      it from the cli. See `--silent`
 
 * `--disable_md_printing`: bool, default `False`
     * if True, instead of using rich to display some information, default to simpler colored prints.
 
+* `--silent`: bool, default False
+    * disable almost all prints. Can be handy if `--import_mode` is used.
+
+* `--version`: bool, default False
+    * display the version and exit
+
 * `--cli_kwargs`: dict, optional
     * Any remaining keyword argument will be parsed as a loader
       specific argument ([see below](#loader-specific-arguments)).

WDoc/utils/flags.py

Lines changed: 9 additions & 13 deletions
@@ -9,18 +9,14 @@
 kwargs = fire.Fire(lambda *args, **kwargs: kwargs)
 is_linux = platform.system() == "Linux"
 
-if "verbose" in kwargs and kwargs["verbose"]:
-    is_verbose = True
-else:
-    is_verbose = False
+def check_kwargs(arg):
+    if arg in kwargs and kwargs[arg]:
+        return True
+    return False
 
-if "debug" in kwargs and kwargs["debug"]:
-    is_debug = True
-    is_verbose = True
-else:
-    is_debug = False
+is_debug = check_kwargs("debug")
+is_verbose = is_debug or check_kwargs("verbose")
 
-if "disable_md_printing" in kwargs and kwargs["disable_md_printing"]:
-    disable_md_printing = True
-else:
-    disable_md_printing = False
+is_silent = check_kwargs("silent")
+
+md_printing_disabled = check_kwargs("disable_md_printing")
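
Review note on `WDoc/utils/flags.py`: the refactor replaces three `if`/`else` blocks with one helper and makes the "debug implies verbose" rule explicit. The sketch below shows the same derivation against a plain dict so it can be exercised without a command line. The wrapper `derive_flags` is hypothetical; the real module reads `kwargs` from `fire.Fire` at import time and exposes the flags as module-level names.

```python
from typing import Dict


def derive_flags(kwargs: dict) -> Dict[str, bool]:
    """Derive boolean flags from already-parsed CLI keyword arguments."""

    def check_kwargs(arg: str) -> bool:
        # True only when the key is present and its value is truthy.
        return bool(kwargs.get(arg))

    is_debug = check_kwargs("debug")
    return {
        "is_debug": is_debug,
        # --debug switches verbosity on even if --verbose was not passed.
        "is_verbose": is_debug or check_kwargs("verbose"),
        "is_silent": check_kwargs("silent"),
        "md_printing_disabled": check_kwargs("disable_md_printing"),
    }
```

For example, `derive_flags({"debug": True})` reports both `is_debug` and `is_verbose` as true, matching the behaviour of the removed `if "debug" in kwargs` block.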

WDoc/utils/logger.py

Lines changed: 4 additions & 3 deletions
@@ -14,7 +14,7 @@
 import warnings
 
 from .typechecker import optional_typecheck
-from .flags import disable_md_printing
+from .flags import md_printing_disabled, is_silent
 
 # ignore warnings from beautiful soup
 warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
@@ -85,7 +85,8 @@ def printer(string: Union[str, Dict, List], **args) -> str:
             for k, v in colors.items():
                 string = string.replace(v, "")
         logger.info(string)
-        tqdm.write(col + string + colors["reset"], **args)
+        if not is_silent:
+            tqdm.write(col + string + colors["reset"], **args)
         return inp
     return printer
 
@@ -100,7 +101,7 @@ def printer(string: Union[str, Dict, List], **args) -> str:
 @optional_typecheck
 def md_printer(message: str, color: Optional[str] = None) -> str:
     "markdown printing"
-    if not disable_md_printing:
+    if not md_printing_disabled:
         logger.info(message)
         md = Markdown(message)
         console.print(md, style=color)
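
Review note on `WDoc/utils/logger.py`: with `--silent`, the message is still sent to the log file and only the console write is skipped. A minimal sketch of that behaviour, assuming a hypothetical factory `make_printer` and an injectable `write` function in place of `tqdm.write` and the colour handling:

```python
import logging
from typing import Callable

logger = logging.getLogger("wdoc_sketch")  # hypothetical logger name


def make_printer(is_silent: bool, write: Callable[[str], None] = print):
    """Return a printer that always logs and only writes when not silent."""

    def printer(string: str) -> str:
        logger.info(string)  # logging is unaffected by the silent flag
        if not is_silent:
            write(string)
        # The input is returned so the printer can be used inline.
        return string

    return printer
```

Returning the input mirrors the `return inp` in the diff, which is what lets calls such as `raise Exception(red(...))` elsewhere in the code base keep working when printing is disabled.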

WDoc/utils/misc.py

Lines changed: 15 additions & 12 deletions
@@ -149,26 +149,29 @@ class DocDict(dict):
     allowed_keys = doc_kwargs_keys
     allowed_types = loader_specific_keys
 
+    def __check_values__(self, key, value):
+        if key not in self.allowed_keys:
+            raise Exception(
+                f"Cannot set key '{key}' in a DocDict. Allowed keys are "
+                f"'{','.join(self.allowed_keys)}'"
+            )
+        if key in self.allowed_types and value is not None:
+            assert isinstance(value, self.allowed_types[key]), (
+                f"Type of key {key} should be {self.allowed_types[key]}, "
+                f"not {type(value)}"
+            )
+
     def __init__(self, *args, **kwargs):
         for arg in args:
             assert isinstance(arg, dict)
             for k, v in arg.items():
-                if k not in self.allowed_keys:
-                    raise Exception(f"Cannot set key '{k}' in a DocDict")
-                if k in self.allowed_types and v is not None:
-                    assert isinstance(v, self.allowed_types[k])
+                self.__check_values__(k, v)
         for k, v in kwargs.items():
-            if k not in self.allowed_keys:
-                raise Exception(f"Cannot set key '{k}' in a DocDict")
-            if k in self.allowed_types and v is not None:
-                assert isinstance(v, self.allowed_types[k])
+            self.__check_values__(k, v)
         super().__init__(*args, **kwargs)
 
     def __setitem__(self, key, value):
-        if key not in self.allowed_keys:
-            raise Exception(f"Cannot set key '{key}' in a DocDict")
-        if key in self.allowed_types and value is not None:
-            assert isinstance(value, self.allowed_types[key])
+        self.__check_values__(key, value)
         super().__setitem__(key, value)
172175
super().__setitem__(key, value)
173176

174177

bumpver.toml

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpver]
-current_version = "1.1.9"
+current_version = "1.1.10"
 version_pattern = "MAJOR.MINOR.PATCH"
 commit_message = "bump version {old_version} -> {new_version}"
 tag_message = "{new_version}"

setup.py

Lines changed: 1 addition & 5 deletions
@@ -31,15 +31,11 @@ def run(self):
     '<p align="center"><img src="https://github.com/thiswillbeyourgithub/WDoc/blob/main/images/icon.png?raw=true" width="256"></p>',
     '![icon](https://github.com/thiswillbeyourgithub/WDoc/blob/main/images/icon.png?raw=true)',
 )
-long_description = long_description.replace(
-    '<p align="center"><img src="https://github.com/thiswillbeyourgithub/WDoc/blob/main/images/summary.png?raw=true" width="256"></p>',
-    '![example](https://github.com/thiswillbeyourgithub/WDoc/blob/main/images/summary.png?raw=true)',
-)
 assert 'align="center"' not in long_description
 
 setup(
     name="wdoc",
-    version="1.1.9",
+    version="1.1.10",
     description="A perfect AI powered RAG for document query and summary. Supports ~all LLM and ~all filetypes (url, pdf, epub, youtube (incl playlist), audio, anki, md, docx, pptx, or any combination!)",
     long_description=long_description,
     long_description_content_type="text/markdown",
