Skip to content

Commit 51634fd

Browse files
committed
fix: update extractor references and improve documentation clarity in README
1 parent 36cb7eb commit 51634fd

1 file changed

Lines changed: 13 additions & 11 deletions

File tree

libs/README.md

Lines changed: 13 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -223,22 +223,24 @@ Technically, all parameters of the `SitemapLoader` from LangChain can be provide
223223

224224
### 3.3 Replaceable parts
225225

226+
TODO: proceed with confluence extractor.
227+
226228
| Name | Type | Default | Notes |
227229
|----------|---------|--------------|--------------|
228-
| file_service | [`extractor_api_lib.file_services.file_service.FileService`](./extractor-api-lib/src/extractor_api_lib/file_services/file_service.py) | [`extractor_api_lib.file_services.s3_service.S3Service`](./extractor-api-lib/src/extractor_api_lib/file_services/s3_service.py) | Handles operations on the connected storage. |
229-
| database_converter | [`extractor_api_lib.document_parser.table_converters.dataframe_converter.DataframeConverter`](./extractor-api-lib/src/extractor_api_lib/document_parser/table_converters/dataframe_converter.py) | [`extractor_api_lib.document_parser.table_converters.dataframe2markdown.DataFrame2Markdown`](./extractor-api-lib/src/extractor_api_lib/document_parser/table_converters/dataframe2markdown.py) | Converts the extracted table from *pandas.DataFrame* to markdown. If you want the table to have another format, this would need to be adjusted. |
230-
| pdf_extractor | [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) |[`extractor_api_lib.document_parser.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
231-
| ms_docs_extractor | [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) |[`extractor_api_lib.document_parser.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
232-
| xml_extractor | [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) | [`extractor_api_lib.document_parser.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/xml_extractor.py) | Extractor used for extracting content from XML documents. |
233-
| all_extractors | `dependency_injector.providers.List[extractor_api_lib.document_parser.information_extractor.InformationExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
234-
| general_extractor | [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) |[`extractor_api_lib.document_parser.general_extractor.GeneralExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/general_extractor.py) | Combines multiple extractors and decides which one to use for the given file format. |
235-
| file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) | [`extractor_api_lib.impl.api_endpoints.default_file_extractor.DefaultFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/default_file_extractor.py) | Implementation of the `/extract_from_file` endpoint. Uses *general_extractor*. |
236-
| general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
237-
| confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/confluence_extractor.py) | Implementation of an esxtractor for the source `confluence`. |
238-
| sitemap_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py) | Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
230+
| file_service | [`extractor_api_lib.file_services.file_service.FileService`](./extractor-api-lib/src/extractor_api_lib/file_services/file_service.py) | [`extractor_api_lib.impl.file_services.s3_service.S3Service`](./extractor-api-lib/src/extractor_api_lib/impl/file_services/s3_service.py) | Handles operations on the connected storage. |
231+
| database_converter | [`extractor_api_lib.table_converter.dataframe_converter.DataframeConverter`](./extractor-api-lib/src/extractor_api_lib/table_converter/dataframe_converter.py) | [`extractor_api_lib.impl.table_converter.dataframe2markdown.DataFrame2Markdown`](./extractor-api-lib/src/extractor_api_lib/impl/table_converter/dataframe2markdown.py) | Converts the extracted table from *pandas.DataFrame* to markdown. If you want the table to have another format, this would need to be adjusted. |
232+
| pdf_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
233+
| ms_docs_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
234+
| xml_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py) | Extractor used for extracting content from XML documents. |
235+
| all_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_extractor.InformationExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
236+
| general_file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.general_extractor.GeneralExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py) | Combines multiple file extractors and decides which one to use for the given file format. |
237+
| general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api_lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
238+
| confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api_lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py) | Implementation of an extractor for the source `confluence`. |
239+
| sitemap_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api_lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py) | Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
239240
| sitemap_parsing_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. |
240241
| sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_meta_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
241242

243+
<!-- | file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) | [`extractor_api_lib.impl.api_endpoints.default_file_extractor.DefaultFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/default_file_extractor.py) | Implementation of the `/extract_from_file` endpoint. Uses *general_extractor*. | -->
242244
## 4. RAG Core Lib
243245

244246
The rag-core-lib contains components of the `rag-core-api` that are also useful for other services and therefore are packaged in a way that makes it easy to use.

0 commit comments

Comments
 (0)