libs/README.md (3 additions & 4 deletions)
@@ -233,14 +233,13 @@ TODO: proceed with confluence extractor.
| ms_docs_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py)| Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
| xml_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py)| Extractor used for extracting content from XML documents. |
| all_extractors |`dependency_injector.providers.List[extractor_api_lib.extractors.information_extractor.InformationExtractor]`|`dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)`| List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
-| general_file_extractor |[`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py)|[`extractor_api_lib.impl.extractors.file_extractors.general_extractor.GeneralExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py)| Combines multiple file extractors and decides which one to use for the given file format. |
+| general_file_extractor |[`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py)|[`extractor_api_lib.impl.api_endpoints.general_file_extractor.GeneralFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py)| Combines multiple file extractors and decides which one to use for the given file format. |
| general_source_extractor |[`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py)|[`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api_lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py)| Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
-| confluence_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api_lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py)| Implementation of an extractor for the source `confluence`. |
-| sitemap_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api_lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py)| Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
+| confluence_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py)| Implementation of an extractor for the source `confluence`. |
+| sitemap_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py)| Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
| sitemap_parsing_function |`dependency_injector.providers.Factory[Callable]`|[`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py)| Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. |
| sitemap_meta_function |`dependency_injector.providers.Factory[Callable]`|[`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_meta_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py)| Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |

-<!-- | file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) | [`extractor_api_lib.impl.api_endpoints.default_file_extractor.DefaultFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/default_file_extractor.py) | Implementation of the `/extract_from_file` endpoint. Uses *general_extractor*. | -->
## 4. RAG Core Lib

The rag-core-lib contains components of the `rag-core-api` that are also useful for other services and therefore are packaged in a way that makes it easy to use.
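
As a reading aid for the extractor rows in the hunk above: the table says `general_file_extractor` "combines multiple file extractors and decides which one to use for the given file format", and that new extractors must be added to `all_extractors`. A minimal plain-Python sketch of that dispatch idea might look like the following. The classes `SketchExtractor` and `SketchGeneralFileExtractor`, the `can_handle` method, and the suffix sets are hypothetical illustrations, not the actual `extractor_api_lib` API, and the real project wires these components through `dependency_injector` rather than a plain list.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchExtractor:
    """Hypothetical stand-in for an InformationExtractor implementation."""

    name: str
    suffixes: frozenset  # file suffixes this extractor claims (assumed values)

    def can_handle(self, path: Path) -> bool:
        # Compare case-insensitively so "report.DOCX" matches ".docx".
        return path.suffix.lower() in self.suffixes

    def extract(self, path: Path) -> str:
        # A real extractor would parse the file; this only reports the choice.
        return f"{self.name} handled {path.name}"


class SketchGeneralFileExtractor:
    """Hypothetical dispatcher: picks the first extractor that accepts the file."""

    def __init__(self, extractors):
        self._extractors = list(extractors)

    def extract(self, path: Path) -> str:
        for extractor in self._extractors:
            if extractor.can_handle(path):
                return extractor.extract(path)
        raise ValueError(f"No extractor registered for '{path.suffix}' files")


# Mirrors the `all_extractors` row: a new extractor type only becomes
# reachable once it is appended to this list.
all_extractors = [
    SketchExtractor("pdf_extractor", frozenset({".pdf"})),
    SketchExtractor("ms_docs_extractor", frozenset({".docx"})),
    SketchExtractor("xml_extractor", frozenset({".xml"})),
]
general_file_extractor = SketchGeneralFileExtractor(all_extractors)
```

Under these assumptions, calling `general_file_extractor.extract(Path("feed.xml"))` routes to the `xml_extractor` entry, while a suffix that no list entry claims raises `ValueError`.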