refactor: Enhance sitemap extraction and summarization features (#185)
- Added support for configurable sitemap parsing via environment
variable `SITEMAP_PARSER` with options: `docusaurus`, `astro`, and
`generic`.
- Introduced new config map for extractor sitemap settings.
- Updated `PageSummaryEnhancer` to group documents by URL for
non-numeric pages and maintain separation for paged documents.
- Enhanced `LangchainSummarizer` to respect max concurrency settings
during summarization.
- Improved error logging for source uploads in `DefaultSourceUploader`.
- Added comprehensive tests for new sitemap parsing functions and
summarization logic.
- Updated README and documentation to reflect changes and provide
guidance on memory management for backend pods.
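
For readers unfamiliar with the concurrency cap mentioned for `LangchainSummarizer`, the following is a minimal sketch of one common way to bound parallel summarization calls with `asyncio.Semaphore`. It is not the code from this commit: `summarize_all`, `summarize_one`, and `max_concurrency` are hypothetical names chosen for illustration, and the actual summarizer may enforce the limit differently.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def summarize_all(
    texts: Sequence[str],
    summarize_one: Callable[[str], Awaitable[str]],  # hypothetical per-document summarizer
    max_concurrency: int,
) -> list[str]:
    """Summarize every text while running at most `max_concurrency` calls at once."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")

    # The semaphore hands out `max_concurrency` slots; further tasks wait for a free slot.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> str:
        async with semaphore:
            return await summarize_one(text)

    # gather() returns results in the same order as the inputs.
    return list(await asyncio.gather(*(bounded(text) for text in texts)))
```

Capping concurrency this way keeps the number of in-flight summarization requests, and the memory they hold, predictable regardless of how many documents are queued.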
infrastructure/README.md (2 additions & 0 deletions)
@@ -257,6 +257,8 @@ frontend:
The following values should be adjusted for the deployment:
+> ⓘ INFO: If the backend pod gets `OOMKilled` (exit code `137`) on local k3d/Tilt setups, reduce `backend.workers` (each uvicorn worker is a separate Python process), disable reranking `RERANKER_ENABLED: false` or pin a smaller Flashrank model (e.g. `RERANKER_MODEL: ms-marco-TinyBERT-L-2-v2`), and/or increase the memory available to Docker/k3d.
libs/README.md (1 addition & 0 deletions)
@@ -331,6 +331,7 @@ For sitemap sources, additional parameters can be provided, e.g.:
Technically, all parameters of the `SitemapLoader` from LangChain can be provided.
+The HTML parsing logic can be tuned via the `SITEMAP_PARSER` environment variable (default: `docusaurus`; options: `docusaurus`, `astro`, `generic`). For Helm deployments, set `extractor.envs.sitemap.SITEMAP_PARSER` in `infrastructure/rag/values.yaml`. You can also override the parser per upload by passing a `sitemap_parser` key/value pair (same options) in the `/upload_source` request (available as a dropdown in the admin frontend).
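
The precedence described in the added README line (per-upload `sitemap_parser` value first, then the `SITEMAP_PARSER` environment variable, then the `docusaurus` default) could be sketched roughly as follows. This is an illustration under stated assumptions, not the extractor's actual code: `resolve_sitemap_parser` is a hypothetical helper, and the whitespace/case normalization and the `ValueError` on unknown values are assumptions about how invalid input might be handled.

```python
import os
from typing import Mapping, Optional

# Options and default as listed in the README change.
VALID_PARSERS = ("docusaurus", "astro", "generic")
DEFAULT_PARSER = "docusaurus"


def resolve_sitemap_parser(
    upload_params: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the parser name: per-upload override, then environment, then default."""
    env = os.environ if env is None else env
    upload_params = upload_params or {}

    choice = (
        upload_params.get("sitemap_parser")  # per-upload key/value pair
        or env.get("SITEMAP_PARSER")         # deployment-wide setting
        or DEFAULT_PARSER
    )

    # Assumption: tolerate surrounding whitespace and mixed case.
    choice = choice.strip().lower()
    if choice not in VALID_PARSERS:
        raise ValueError(
            f"Unknown sitemap parser {choice!r}; expected one of {VALID_PARSERS}"
        )
    return choice
```

For example, `resolve_sitemap_parser({"sitemap_parser": "generic"}, {"SITEMAP_PARSER": "astro"})` would select `generic`, because the per-upload value takes priority over the environment variable in this sketch.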