Commit 8ba8617

docs: revise replaceable parts in readme.md for the extractor api lib (#74)
This pull request updates the extractor API library `README.md`, focusing on improving extractor configurations, renaming components for clarity, and enhancing metadata handling.

### Updates to extractor configurations and mappings:

* Replaced `InformationExtractor` with `InformationFileExtractor` for `pdf_extractor`, `ms_docs_extractor`, and `xml_extractor` in the `README.md` file. Additionally, the `all_extractors` list was renamed to `file_extractors` for better specificity.
* Added new mappers: `intern2external`, `confluence_document2information_piece`, and `sitemap_document2information_piece`, which handle specific metadata mapping for Confluence and sitemap sources.

### Renaming for clarity:

* Renamed `confluence_langchain_document2information_piece` to `confluence_document2information_piece` in the `DependencyContainer` class and updated its usage in the `confluence_extractor`.
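The mapper classes themselves are not part of this diff. As a rough illustration of what source-specific metadata mapping can look like, here is a minimal, stdlib-only sketch. Every name in it (`Document`, `InformationPiece`, `SourceDocumentMapper`, `ConfluenceMapper`, `SitemapMapper`) and the metadata keys it reads are assumptions made for the example, not the library's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class InformationPiece:
    """Hypothetical stand-in for the library's information piece."""

    content: str
    metadata: dict


class SourceDocumentMapper(ABC):
    """Shared mapping logic; subclasses supply the source-specific metadata."""

    def map_document(self, document: Document) -> InformationPiece:
        return InformationPiece(
            content=document.page_content,
            metadata=self._map_metadata(document.metadata),
        )

    @abstractmethod
    def _map_metadata(self, metadata: dict) -> dict:
        """Translate loader metadata into the target metadata layout."""


class ConfluenceMapper(SourceDocumentMapper):
    def _map_metadata(self, metadata: dict) -> dict:
        # Assumed Confluence keys: "title" and "source".
        return {
            "document": metadata.get("title", "unknown"),
            "url": metadata.get("source", ""),
        }


class SitemapMapper(SourceDocumentMapper):
    def _map_metadata(self, metadata: dict) -> dict:
        # Assumed sitemap keys: "loc" and "lastmod".
        return {
            "document": metadata.get("loc", "unknown"),
            "last_modified": metadata.get("lastmod"),
        }
```

Keeping one mapper per source means each extractor can be handed the mapper that matches its loader's metadata, which is the role the `mapper=` argument plays for `confluence_extractor` in the container change.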
1 parent f98e973 commit 8ba8617

2 files changed

Lines changed: 12 additions & 8 deletions

libs/README.md

Lines changed: 10 additions & 6 deletions
@@ -227,17 +227,20 @@ Technically, all parameters of the `SitemapLoader` from LangChain can be provide
 |----------|---------|--------------|--------------|
 | file_service | [`extractor_api_lib.file_services.file_service.FileService`](./extractor-api-lib/src/extractor_api_lib/file_services/file_service.py) | [`extractor_api_lib.impl.file_services.s3_service.S3Service`](./extractor-api-lib/src/extractor_api_lib/impl/file_services/s3_service.py) | Handles operations on the connected storage. |
 | database_converter | [`extractor_api_lib.table_converter.dataframe_converter.DataframeConverter`](./extractor-api-lib/src/extractor_api_lib/table_converter/dataframe_converter.py) | [`extractor_api_lib.impl.table_converter.dataframe2markdown.DataFrame2Markdown`](./extractor-api-lib/src/extractor_api_lib/impl/table_converter/dataframe2markdown.py) | Converts the extracted table from *pandas.DataFrame* to markdown. If you want the table to have another format, this would need to be adjusted. |
-| pdf_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
-| ms_docs_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
-| xml_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py) | Extractor used for extracting content from XML documents. |
+| pdf_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
+| ms_docs_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
+| xml_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) | [`extractor_api_lib.impl.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py) | Extractor used for extracting content from XML documents. |
 | epub_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) | [`extractor_api_lib.impl.extractors.file_extractors.epub_extractor.EPUBExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/epub_extractor.py) | Extractor used for extracting content from EPUB documents. |
-| file_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor, epub_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
+| file_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)` | List of all available file extractors. If you add a new type of file extractor you would have to add it to this list. |
+| intern2external | [`extractor_api_lib.impl.mapper.internal2external_information_piece.Internal2ExternalInformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/internal2external_information_piece.py) | [`extractor_api_lib.impl.mapper.internal2external_information_piece.Internal2ExternalInformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/internal2external_information_piece.py) | Maps internal information pieces to external information pieces, converting between internal and external content types. |
+| confluence_document2information_piece | [`extractor_api_lib.mapper.source_langchain_document2information_piece.SourceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py) | [`extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece.ConfluenceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/confluence_langchain_document2information_piece.py) | Maps LangChain documents from Confluence to information pieces with Confluence-specific metadata handling. |
+| sitemap_document2information_piece | [`extractor_api_lib.mapper.source_langchain_document2information_piece.SourceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py) | [`extractor_api_lib.impl.mapper.sitemap_document2information_piece.SitemapLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/sitemap_document2information_piece.py) | Maps LangChain documents from sitemap sources to information pieces with sitemap-specific metadata handling. |
 | general_file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) |[`extractor_api_lib.impl.api_endpoints.general_file_extractor.GeneralFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py) | Combines multiple file extractors and decides which one to use for the given file format. |
-| general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
 | confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py) | Implementation of an extractor for the source `confluence`. |
 | sitemap_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py) | Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
 | sitemap_parsing_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. |
-| sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_meta_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
+| sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_metadata_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
+| source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source and handles available extractors for confluence and sitemap sources. |
 
 ## 4. RAG Core Lib
 
@@ -251,6 +254,7 @@ Examples of included components:
 - ...
 
 ### 4.1 Requirements
+
 All required python libraries can be found in the [pyproject.toml](./extractor-api-lib/pyproject.toml) file.
 In addition to python libraries the following system packages are required:
 
libs/extractor-api-lib/src/extractor_api_lib/dependency_container.py

Lines changed: 2 additions & 2 deletions
@@ -58,15 +58,15 @@ class DependencyContainer(DeclarativeContainer):
     xml_extractor = Singleton(XMLExtractor, file_service)
 
     intern2external = Singleton(Internal2ExternalInformationPiece)
-    confluence_langchain_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece)
+    confluence_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece)
     langchain_document2information_piece = Singleton(LangchainDocument2InformationPiece)
     sitemap_document2information_piece = Singleton(SitemapLangchainDocument2InformationPiece)
     epub_extractor = Singleton(EpubExtractor, file_service, langchain_document2information_piece)
 
     file_extractors = List(pdf_extractor, ms_docs_extractor, xml_extractor, epub_extractor)
 
     general_file_extractor = Singleton(GeneralFileExtractor, file_service, file_extractors, intern2external)
-    confluence_extractor = Singleton(ConfluenceExtractor, mapper=confluence_langchain_document2information_piece)
+    confluence_extractor = Singleton(ConfluenceExtractor, mapper=confluence_document2information_piece)
 
     sitemap_extractor = Singleton(
         SitemapExtractor,
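
The container passes `file_extractors` into `general_file_extractor`, which the README describes as combining multiple file extractors and deciding which one to use for a given file format. The sketch below shows one way such a dispatch could work, using only the standard library. The class names, the `suffixes` attribute, and the suffix-based selection rule are all assumptions for illustration; the real `GeneralFileExtractor` may choose its extractor differently.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileExtractor(ABC):
    """Hypothetical extractor interface; not the library's actual base class."""

    # File suffixes this extractor can handle (assumed selection criterion).
    suffixes: frozenset = frozenset()

    @abstractmethod
    def extract(self, path: Path) -> str:
        """Return the extracted content for the file at `path`."""


class PdfExtractor(FileExtractor):
    suffixes = frozenset({".pdf"})

    def extract(self, path: Path) -> str:
        return f"pdf content of {path.name}"


class XmlExtractor(FileExtractor):
    suffixes = frozenset({".xml"})

    def extract(self, path: Path) -> str:
        return f"xml content of {path.name}"


class GeneralExtractor:
    """Picks the first registered extractor whose suffixes match the file."""

    def __init__(self, extractors):
        self._extractors = list(extractors)

    def extract(self, path: Path) -> str:
        suffix = path.suffix.lower()
        for extractor in self._extractors:
            if suffix in extractor.suffixes:
                return extractor.extract(path)
        raise ValueError(f"No extractor registered for '{suffix}' files")
```

With this design an extractor that is defined but missing from the list is never selected, which is why the README asks that every new file extractor be added to `file_extractors`.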
