Commit f445717

fix: onlinepdfloader is slow slow its better to just try downloading the file first
1 parent 09202fd commit f445717

2 files changed

Lines changed: 33 additions & 30 deletions

wdoc/docs/USAGE.md

Lines changed: 13 additions & 10 deletions
@@ -50,9 +50,11 @@
 
 * `online_pdf`
 * Same arguments as for `--filetype=pdf`
-Note that the way `online_pdf` are handled is a bit different than `pdf`: we
-first try using langchain's integrated OnlinePDFLoader and if it fails,
-we download the file and parse it like if `--filetype==pdf`.
+Note that the way `online_pdf` are handled is a bit
+different than `pdf`: we first try to download it then
+parse it with `filetype=pdf` and as a last resort we
+use langchain's integrated OnlinePDFLoader as it's
+far slower.
 
 * `anki`
 * Optional:
@@ -381,9 +383,10 @@
 Not all parsers are tried. Instead, after each parsing we check using
 fasttext and heuristics based on doccheck_* args to rank the quality of the parsing.
 When stop if 1 parsing is high enough or take the best if 3 parsing worked.
-Note that the way `online_pdf` are handled is a bit different: we
-first try using langchain's integrated OnlinePDFLoader and if it fails,
-we download the file and parse it like if `--filetype==pdf`.
+Note that the way `online_pdf` are handled is a bit different than
+`pdf`: we first try to download it then parse it with
+`filetype=pdf` and as a last resort we use langchain's
+integrated OnlinePDFLoader as it's far slower.
 
 Currently implemented:
 - Okayish metadata:
@@ -653,10 +656,10 @@
 
 * `WDOC_MAX_PDF_LOADER_TIMEOUT`
 * Number of seconds to wait for each pdf loader before giving up this loader. This includes the `online_pdf` loader.
-Note that it probably makes PDF parsing substantially.
-Default is `-1` to disable.
-Disabled when using `--file_loader_parallel_backend=threading` as python does not allow it.
-Also disabled if <= 0.
+Note that it probably makes PDF parsing substantially.
+Default is `-1` to disable.
+Disabled when using `--file_loader_parallel_backend=threading` as python does not allow it.
+Also disabled if <= 0.
 
 * `WDOC_DEBUGGER`
 * If True, will open the debugger in case of issue. Implied by `--debug`
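
Reviewer note on `WDOC_MAX_PDF_LOADER_TIMEOUT`: the threading restriction documented above is presumably because the timeout relies on `signal`, and Python only lets the main thread install signal handlers. The sketch below is a hypothetical stand-in for wdoc's `signal_timeout` helper, not its actual implementation. The names `signal_timeout` and `TimeoutPdfLoaderError` are borrowed from the diff; `slow_loader` is invented for illustration. It assumes a Unix platform where `SIGALRM` is available.

```python
import signal
import threading
import time
from contextlib import contextmanager


class TimeoutPdfLoaderError(Exception):
    """Raised when a loader exceeds its time budget (name borrowed from the diff)."""


@contextmanager
def signal_timeout(timeout: float, exception: type = TimeoutPdfLoaderError):
    """Hypothetical stand-in: raise `exception` if the block runs longer than `timeout` seconds."""
    # Mirror the documented behaviour: values <= 0 disable the timeout.
    # Handlers can only be installed from the main thread, so skip elsewhere too.
    if timeout <= 0 or threading.current_thread() is not threading.main_thread():
        yield
        return

    def _handler(signum, frame):
        raise exception(f"loader exceeded {timeout}s")

    previous = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the previous handler


def slow_loader(seconds: float) -> str:
    """Hypothetical loader that only sleeps, standing in for a slow PDF parser."""
    time.sleep(seconds)
    return "parsed"
```

Wrapping a call as `with signal_timeout(timeout=10): slow_loader(60)` would raise `TimeoutPdfLoaderError` in the main thread, while the same call inside a worker thread runs to completion without any time limit.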

wdoc/utils/loaders.py

Lines changed: 20 additions & 20 deletions
@@ -802,26 +802,6 @@ def load_online_pdf(
     whi(f"Loading online pdf: '{path}'")
 
     try:
-        loader = OnlinePDFLoader(path)
-        if pdf_loader_max_timeout > 0:
-            with signal_timeout(
-                timeout=pdf_loader_max_timeout,
-                exception=TimeoutPdfLoaderError,
-            ):
-                docs = loader.load()
-            try:
-                signal.alarm(0)  # disable alarm again just in case
-            except Exception:
-                pass
-        else:
-            docs = loader.load()
-        return docs
-
-    except Exception as err:
-        red(
-            f"Failed parsing online PDF {path} using only OnlinePDFLoader be cause '{err}'.\nWill try downloading it directly."
-        )
-
         response = requests.get(path)
         with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp_file:
             temp_file.write(response.content)
@@ -838,6 +818,26 @@ def load_online_pdf(
         )
         return docs
 
+    except Exception as err:
+        red(
+            f"Failed parsing online PDF {path} by downloading it and trying to parse because of error '{err}'. Retrying one last time with OnlinePDFLoader."
+        )
+        loader = OnlinePDFLoader(path)
+        if pdf_loader_max_timeout > 0:
+            with signal_timeout(
+                timeout=pdf_loader_max_timeout,
+                exception=TimeoutPdfLoaderError,
+            ):
+                docs = loader.load()
+            try:
+                signal.alarm(0)  # disable alarm again just in case
+            except Exception:
+                pass
+        else:
+            docs = loader.load()
+
+        return docs
+
 
 @debug_return_empty
 @optional_strip_unexp_args
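
Reviewer note: the reordering in `load_online_pdf` amounts to a "fast path first, slow fallback last" pattern. The sketch below shows that pattern in isolation, under stated assumptions: `load_with_fallback`, `download_then_parse` and `online_pdf_loader` are hypothetical names, and the two loaders are stubs rather than the real `requests` and `OnlinePDFLoader` calls from the diff.

```python
from typing import Callable, List, Sequence


def load_with_fallback(
    path: str, loaders: Sequence[Callable[[str], List[str]]]
) -> List[str]:
    """Try each loader in order and return the first successful result."""
    errors = []
    for loader in loaders:
        try:
            return loader(path)
        except Exception as err:
            # Remember why this loader failed, then move on to the next one.
            errors.append(f"{loader.__name__}: {err}")
    raise RuntimeError(f"All loaders failed for {path}: " + "; ".join(errors))


def download_then_parse(path: str) -> List[str]:
    """Hypothetical fast path: download the file, then parse it as a local PDF."""
    raise ConnectionError("download failed")  # simulated failure for the demo


def online_pdf_loader(path: str) -> List[str]:
    """Hypothetical slow fallback standing in for OnlinePDFLoader."""
    return [f"parsed {path} via fallback"]
```

Calling `load_with_fallback(url, [download_then_parse, online_pdf_loader])` tries the cheap loader first and only pays for the slow one on failure. Unlike the committed code, which lets the fallback's exception propagate as-is, this sketch collects every failure into one `RuntimeError`; that is a design choice of the example only.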
