|
50 | 50 |
|
51 | 51 | * `online_pdf` |
52 | 52 | * Same arguments as for `--filetype=pdf` |
53 | | - Note that the way `online_pdf` are handled is a bit different than `pdf`: we |
54 | | - first try using langchain's integrated OnlinePDFLoader and if it fails, |
55 | | - we download the file and parse it like if `--filetype==pdf`. |
| 53 | + Note that `online_pdf` is handled a bit differently |
| 54 | + from `pdf`: we first try to download the file and |
| 55 | + parse it as with `--filetype=pdf`, and only as a last |
| 56 | + resort use langchain's integrated OnlinePDFLoader, |
| 57 | + because it is far slower. |
56 | 58 |
|
57 | 59 | * `anki` |
58 | 60 | * Optional: |
|
381 | 383 | Not all parsers are tried. Instead, after each parsing we check using |
382 | 384 | fasttext and heuristics based on doccheck_* args to rank the quality of the parsing. |
383 | 385 | We stop if 1 parsing is good enough, or take the best if 3 parsings worked. |
384 | | - Note that the way `online_pdf` are handled is a bit different: we |
385 | | - first try using langchain's integrated OnlinePDFLoader and if it fails, |
386 | | - we download the file and parse it like if `--filetype==pdf`. |
| 386 | + Note that `online_pdf` is handled a bit differently from |
| 387 | + `pdf`: we first try to download the file and parse it as with |
| 388 | + `--filetype=pdf`, and only as a last resort use langchain's |
| 389 | + integrated OnlinePDFLoader, because it is far slower. |
387 | 390 |
|
388 | 391 | Currently implemented: |
389 | 392 | - Okayish metadata: |
|
653 | 656 |
|
654 | 657 | * `WDOC_MAX_PDF_LOADER_TIMEOUT` |
655 | 658 | * Number of seconds to wait for each pdf loader before giving up on this loader. This includes the `online_pdf` loader. |
656 | | - Note that it probably makes PDF parsing substantially. |
657 | | - Default is `-1` to disable. |
658 | | - Disabled when using `--file_loader_parallel_backend=threading` as python does not allow it. |
659 | | - Also disabled if <= 0. |
| 659 | + Note that it probably makes PDF parsing substantially. |
| 660 | + Default is `-1` to disable. |
| 661 | + Disabled when using `--file_loader_parallel_backend=threading` as python does not allow it. |
| 662 | + Also disabled if <= 0. |
660 | 663 |
|
661 | 664 | * `WDOC_DEBUGGER` |
662 | 665 | * If True, will open the debugger in case of issue. Implied by `--debug` |
|
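The new `online_pdf` order described in the hunks above (download and parse as `pdf` first, langchain's OnlinePDFLoader only as a last resort) is a plain ordered fallback. The sketch below illustrates that pattern only; it is not wdoc's actual code. `load_with_fallbacks`, `download_then_parse_as_pdf` and `slow_online_loader` are hypothetical names, and the two loaders are stand-in stubs that do not call langchain or the network.

```python
from typing import Callable, List, Sequence, Tuple

Loader = Callable[[str], str]


def load_with_fallbacks(url: str, loaders: Sequence[Tuple[str, Loader]]) -> Tuple[str, str]:
    """Try each (name, loader) in order and return the first success."""
    errors: List[str] = []
    for name, loader in loaders:
        try:
            return name, loader(url)
        except Exception as err:  # any failure moves on to the next loader
            errors.append(f"{name}: {err}")
    raise RuntimeError("all loaders failed: " + "; ".join(errors))


def download_then_parse_as_pdf(url: str) -> str:
    # Hypothetical stub for "download, then parse like filetype=pdf".
    # It fails here on purpose to show the fallback being taken.
    raise IOError("download failed")


def slow_online_loader(url: str) -> str:
    # Hypothetical stub for the slower last-resort online loader.
    return f"text extracted from {url}"


name, text = load_with_fallbacks(
    "https://example.com/paper.pdf",
    [
        ("download+pdf", download_then_parse_as_pdf),  # tried first
        ("online_loader", slow_online_loader),         # last resort
    ],
)
print(name)
```

Putting the cheaper path first means the slow loader is only paid for when the download path raises.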