diff --git a/scripts/knowledgebase-nav/Architecture.md b/scripts/knowledgebase-nav/Architecture.md index 621dda96ba..83f9e143b5 100644 --- a/scripts/knowledgebase-nav/Architecture.md +++ b/scripts/knowledgebase-nav/Architecture.md @@ -141,11 +141,13 @@ Functions are grouped below the way they appear in the source file. Names refer - **`plain_text`** strips Markdown (including horizontal rules), links, URLs, HTML or MDX tags, and similar so previews stay plain text (U+00A0 to space after entity decode, typographic quotes mapped to ASCII, allowlist keeps `_` and `=` for identifiers). - **`extract_body_preview`** applies `plain_text`, truncates to `BODY_PREVIEW_MAX_LENGTH`, and adds `BODY_PREVIEW_SUFFIX` when needed. +- **`_card_text_from_frontmatter_field`** extracts a usable string from a single front matter key (`docengineDescription` or `description`): returns `None` when the field is missing, not a string, or empty after processing. Processing strips one outer pair of wrapping quotes and collapses internal newlines to a single space. +- **`resolve_body_preview`** resolves the Card preview text using a three-level hierarchy: `docengineDescription` first, then `description`, then `extract_body_preview(body)`. Frontmatter overrides are not passed through `plain_text` or truncation. ### Slugs and crawling - **`tag_slug`** maps a display keyword to a filename or URL segment (lowercase, hyphenated). -- **`crawl_articles`** walks `support//articles/*.mdx` and builds article dicts (`title`, `keywords`, `featured`, `body_preview`, `page_path`, `tag_links`, and others). +- **`crawl_articles`** walks `support//articles/*.mdx` and builds article dicts (`title`, `keywords`, `featured`, `body_preview`, `page_path`, `tag_links`, and others). The `body_preview` field is resolved by `resolve_body_preview` from `docengineDescription`, `description`, or the article body. ### Tag aggregation and featured content diff --git a/scripts/knowledgebase-nav/README.md b/scripts/knowledgebase-nav/README.md index 6502156bd4..5d91ac2d4f 100644 --- a/scripts/knowledgebase-nav/README.md +++ b/scripts/knowledgebase-nav/README.md @@ -98,13 +98,52 @@ Featured articles appear in the "Featured articles" section at the top of the pr 3. There is no hard limit on how many articles can be featured per product, but we recommend keeping it to 3-5 for a clean layout. +### Customizing Card preview text + +By default, the generator creates Card preview text by stripping Markdown and MDX from the article body and truncating to 120 characters. You can override this with front matter fields so the Card shows exactly the text you want. + +The generator resolves preview text using a three-level hierarchy: + +1. **`docengineDescription`** (highest priority). Use this when you want to control the Card preview independently of SEO. This field is not used by Mintlify for anything else. +2. **`description`**. If `docengineDescription` is not set, the generator uses this field. Note that Mintlify also renders `description` as the page's `` tag for search engines. Setting it affects both the Card preview and the SEO metadata. Use `docengineDescription` instead when you want the Card text to differ from the SEO description. +3. **Auto-generated body snippet**. If neither field is set, the generator falls back to the existing behavior: convert the article body to plain text and truncate to 120 characters. + +The Card preview appears in three places: + +- Tag page Cards (for example, `support/models/tags/experiments.mdx`). +- Featured article Cards on the product index page (for example, `support/models.mdx`). +- Featured article Cards on the root support landing page (`support.mdx`). + +**Processing rules.** When using `docengineDescription` or `description`, the generator applies only minimal processing: + +- Outer wrapping quotes (`"` or `'`) are stripped. YAML sometimes preserves them depending on how you quote the value. +- Internal newlines are collapsed to a single space. YAML block scalars (`|`, `>`) can produce multiline strings, but Card bodies must be single-line. If you need precise control over the text, use a single-line quoted string in front matter. +- No other processing is applied. The value is not passed through Markdown stripping, HTML entity decoding, or truncation. + +**MDX safety.** Override text is emitted directly inside `` components without sanitization. Avoid characters or strings that break MDX parsing, such as unmatched `<`, raw ``, or unescaped `{`. + +**Example:** + +```yaml +--- +title: "How do I reset my API key?" +keywords: ["Security", "Administrator"] +docengineDescription: "Step-by-step instructions for resetting your W&B API key from the user settings page." +description: "Reset your W&B API key." +--- +``` + +In this example, the Card preview shows the `docengineDescription` value. The `description` value is used only by Mintlify for SEO. If `docengineDescription` were removed, the Card preview would show the `description` value instead. + ### Front matter quick reference -| Field | Required | Default | Description | -|------------|----------|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `title` | Expected | `""` if omitted | Article title. Used in Cards and listings. The generator does not fail if it is missing. | -| `keywords` | Expected | `[]` if omitted | YAML list of tag names. Each should match an entry in `config.yaml` (case-sensitive). Controls which tag pages the article appears on. If the list is empty or missing, the article is not listed under any tag. If you accidentally use a single string (for example `keywords: "Security"`), the generator treats it as one tag and emits a warning. Other non-list types are ignored with a warning. | -| `featured` | No | `false` | Set to `true` to feature the article on the product index page. | +| Field | Required | Default | Description | +|------------------------|----------|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `title` | Expected | `""` if omitted | Article title. Used in Cards and listings. The generator does not fail if it is missing. | +| `keywords` | Expected | `[]` if omitted | YAML list of tag names. Each should match an entry in `config.yaml` (case-sensitive). Controls which tag pages the article appears on. If the list is empty or missing, the article is not listed under any tag. If you accidentally use a single string (for example `keywords: "Security"`), the generator treats it as one tag and emits a warning. Other non-list types are ignored with a warning. | +| `featured` | No | `false` | Set to `true` to feature the article on the product index page. | +| `docengineDescription` | No | `""` if omitted | Card preview text for tag pages, featured sections, and the support landing page. Takes priority over `description` and the auto-generated body snippet. Use this when you want to set Card text independently of the SEO description. Outer wrapping quotes are stripped and newlines are collapsed to a single space. | +| `description` | No | `""` if omitted | Card preview text if `docengineDescription` is not set. Mintlify also uses this field for the page's `` SEO tag, so setting it affects both the Card preview and search engine metadata. Use `docengineDescription` to decouple the two. Same processing rules: outer quotes stripped, newlines collapsed. | ### Running the generator locally diff --git a/scripts/knowledgebase-nav/generate_tags.py b/scripts/knowledgebase-nav/generate_tags.py index 187ab4eec9..7a09684f00 100644 --- a/scripts/knowledgebase-nav/generate_tags.py +++ b/scripts/knowledgebase-nav/generate_tags.py @@ -482,6 +482,108 @@ def extract_body_preview(body: str, max_len: int = BODY_PREVIEW_MAX_LENGTH) -> s return text +def _card_text_from_frontmatter_field( + frontmatter: Dict[str, Any], + key: str, +) -> Optional[str]: + """ + Extract a usable Card preview string from a single front matter field. + + Returns ``None`` when the field is missing, not a string, or empty + after processing so the caller can fall through to the next candidate. + + Processing (applied only to ``str`` values): + + 1. Remove one outer pair of wrapping quotes when the first and last + characters are both ``"`` or both ``'``. + 2. Collapse internal newlines (and surrounding whitespace) to a + single space so the value is safe for single-line Card bodies. + + No other transformation (no ``plain_text``, no truncation) is applied. + + Parameters + ---------- + frontmatter : dict + Parsed YAML front matter for the article. + key : str + The front matter key to read (for example ``"docengineDescription"`` + or ``"description"``). + + Returns + ------- + str or None + The processed string, or ``None`` if the field is absent, not a + string, or empty after processing. + """ + value = frontmatter.get(key) + if value is None: + return None + if not isinstance(value, str): + return None + + # Strip one outer pair of matching quotes. + if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"): + value = value[1:-1] + + # Collapse newlines (and surrounding whitespace) to a single space. + value = re.sub(r"\s*\n\s*", " ", value) + + return value if value else None + + +def resolve_body_preview( + frontmatter: Dict[str, Any], + body: str, +) -> str: + """ + Resolve the Card preview text for an article. + + Uses a three-level hierarchy: + + 1. ``docengineDescription`` front matter field (highest priority). + Lets writers control the Card preview independently of SEO. + 2. ``description`` front matter field. Mintlify also renders this as + the page's ```` tag, so setting it + affects both the Card text and SEO metadata. + 3. Auto-generated body snippet via ``extract_body_preview`` (existing + behavior: ``plain_text`` conversion and 120-character truncation). + + For levels 1 and 2, only outer wrapping quotes are stripped and + internal newlines are collapsed to a single space. No other + processing is applied. + + Parameters + ---------- + frontmatter : dict + Parsed YAML front matter for the article. + body : str + The raw article body text (Markdown/MDX content). + + Returns + ------- + str + The Card preview string. + + Example + ------- + >>> resolve_body_preview({"docengineDescription": "Custom text."}, "Body.") + 'Custom text.' + >>> resolve_body_preview({"description": "SEO and card."}, "Body.") + 'SEO and card.' + >>> resolve_body_preview({}, "Body content here.") + 'Body content here.' + """ + text = _card_text_from_frontmatter_field(frontmatter, "docengineDescription") + if text is not None: + return text + + text = _card_text_from_frontmatter_field(frontmatter, "description") + if text is not None: + return text + + return extract_body_preview(body) + + # --------------------------------------------------------------------------- # Slug generation # --------------------------------------------------------------------------- @@ -817,7 +919,13 @@ def crawl_articles(repo_root: Path, product_slug: str) -> List[Dict[str, Any]]: ``_normalize_keywords`` (see that function for string and type coercion). - ``featured`` (bool): Whether the article has ``featured: true``. - - ``body_preview`` (str): Truncated plain-text preview of the body. + - ``body_preview`` (str): Card preview text, resolved by + ``resolve_body_preview``: uses ``docengineDescription`` front + matter if present, then ``description``, then falls back to + ``extract_body_preview(body)`` (plain-text conversion and + 120-character truncation). Frontmatter overrides are not + passed through ``plain_text`` or truncation; only outer + wrapping quotes are stripped and newlines are collapsed. - ``page_path`` (str): The URL path without leading slash (for example ``support/models/articles/my-article``). - ``mdx_path`` (str): Repo-relative path to the MDX file using forward @@ -854,7 +962,7 @@ def crawl_articles(repo_root: Path, product_slug: str) -> List[Dict[str, Any]]: raw_keywords = [] keywords = _normalize_keywords(raw_keywords, mdx_file) featured = frontmatter.get("featured", False) - body_preview = extract_body_preview(body) + body_preview = resolve_body_preview(frontmatter, body) file_stem = mdx_file.stem # Build Badge link data for each keyword so templates can render diff --git a/scripts/knowledgebase-nav/tests/test_generate_tags.py b/scripts/knowledgebase-nav/tests/test_generate_tags.py index 44225b1e0f..e1e5896f60 100644 --- a/scripts/knowledgebase-nav/tests/test_generate_tags.py +++ b/scripts/knowledgebase-nav/tests/test_generate_tags.py @@ -530,6 +530,209 @@ def test_extract_body_preview_custom_max_len(self): assert result == "Hello ..." +# =========================================================================== +# Tests: _card_text_from_frontmatter_field +# =========================================================================== + +class TestCardTextFromFrontmatterField: + """Tests for the _card_text_from_frontmatter_field helper.""" + + def test_returns_none_when_key_missing(self): + """A missing key should return None so the resolver falls through.""" + assert generate_tags._card_text_from_frontmatter_field({}, "description") is None + + def test_returns_none_when_value_is_none(self): + """An explicit None value should return None.""" + assert generate_tags._card_text_from_frontmatter_field( + {"description": None}, "description" + ) is None + + def test_returns_none_for_non_string_int(self): + """An integer value should return None (no coercion).""" + assert generate_tags._card_text_from_frontmatter_field( + {"description": 42}, "description" + ) is None + + def test_returns_none_for_non_string_list(self): + """A list value should return None (no coercion).""" + assert generate_tags._card_text_from_frontmatter_field( + {"description": ["a", "b"]}, "description" + ) is None + + def test_returns_none_for_non_string_bool(self): + """A boolean value should return None (no coercion).""" + assert generate_tags._card_text_from_frontmatter_field( + {"description": True}, "description" + ) is None + + def test_strips_outer_double_quotes(self): + """Wrapping double quotes should be removed.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": '"Hello world."'}, "description" + ) + assert result == "Hello world." + + def test_strips_outer_single_quotes(self): + """Wrapping single quotes should be removed.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": "'Hello world.'"}, "description" + ) + assert result == "Hello world." + + def test_does_not_strip_mismatched_quotes(self): + """Mismatched outer quotes should not be stripped.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": "\"Hello world.'"}, "description" + ) + assert result == "\"Hello world.'" + + def test_does_not_strip_inner_quotes(self): + """Quotes inside the string should be preserved.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": 'She said "hello" today.'}, "description" + ) + assert result == 'She said "hello" today.' + + def test_returns_none_when_empty_after_quote_strip(self): + """A value that is just a pair of quotes should return None.""" + assert generate_tags._card_text_from_frontmatter_field( + {"description": '""'}, "description" + ) is None + assert generate_tags._card_text_from_frontmatter_field( + {"description": "''"}, "description" + ) is None + + def test_returns_none_for_empty_string(self): + """An empty string should return None.""" + assert generate_tags._card_text_from_frontmatter_field( + {"description": ""}, "description" + ) is None + + def test_collapses_newlines_to_single_space(self): + """Internal newlines should be collapsed to a single space.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": "Line one.\nLine two.\nLine three."}, "description" + ) + assert result == "Line one. Line two. Line three." + + def test_collapses_newlines_with_surrounding_whitespace(self): + """Whitespace around newlines should collapse too.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": "Hello \n world."}, "description" + ) + assert result == "Hello world." + + def test_preserves_single_line_string(self): + """A normal single-line string passes through unchanged.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": "Simple description."}, "description" + ) + assert result == "Simple description." + + def test_short_string_no_strip(self): + """A single character should not be processed as quotes.""" + result = generate_tags._card_text_from_frontmatter_field( + {"description": "X"}, "description" + ) + assert result == "X" + + +# =========================================================================== +# Tests: resolve_body_preview +# =========================================================================== + +class TestResolveBodyPreview: + """Tests for the resolve_body_preview function.""" + + def test_docengine_description_takes_priority(self): + """docengineDescription should win over description and body.""" + fm = { + "docengineDescription": "From docengine.", + "description": "From description.", + } + result = generate_tags.resolve_body_preview(fm, "Body text here.") + assert result == "From docengine." + + def test_description_used_when_no_docengine(self): + """description should be used when docengineDescription is absent.""" + fm = {"description": "From description."} + result = generate_tags.resolve_body_preview(fm, "Body text here.") + assert result == "From description." + + def test_body_fallback_when_neither_field(self): + """Falls back to extract_body_preview when no override fields exist.""" + result = generate_tags.resolve_body_preview({}, "Short body.") + assert result == "Short body." + + def test_body_fallback_when_both_fields_empty(self): + """Empty strings in both fields should fall through to body.""" + fm = {"docengineDescription": "", "description": ""} + result = generate_tags.resolve_body_preview(fm, "Fallback body.") + assert result == "Fallback body." + + def test_description_used_when_docengine_is_none(self): + """Explicit None for docengineDescription falls through to description.""" + fm = {"docengineDescription": None, "description": "SEO text."} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == "SEO text." + + def test_description_used_when_docengine_is_non_string(self): + """Non-string docengineDescription falls through to description.""" + fm = {"docengineDescription": 123, "description": "Fallback desc."} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == "Fallback desc." + + def test_body_used_when_description_is_non_string(self): + """Non-string description falls through to body.""" + fm = {"description": ["not", "a", "string"]} + result = generate_tags.resolve_body_preview(fm, "Body text.") + assert result == "Body text." + + def test_body_preview_truncation_preserved(self): + """Body fallback still truncates at 120 characters.""" + long_body = "A" * 200 + result = generate_tags.resolve_body_preview({}, long_body) + assert result == "A" * 120 + " ..." + + def test_docengine_not_truncated(self): + """Frontmatter overrides should not be truncated.""" + long_text = "B" * 200 + fm = {"docengineDescription": long_text} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == long_text + + def test_description_not_truncated(self): + """description override should not be truncated.""" + long_text = "C" * 200 + fm = {"description": long_text} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == long_text + + def test_docengine_outer_quotes_stripped(self): + """Outer quotes on docengineDescription are stripped.""" + fm = {"docengineDescription": '"Quoted text."'} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == "Quoted text." + + def test_description_outer_quotes_stripped(self): + """Outer quotes on description are stripped.""" + fm = {"description": "'Quoted text.'"} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == "Quoted text." + + def test_docengine_newlines_collapsed(self): + """Newlines in docengineDescription are collapsed.""" + fm = {"docengineDescription": "Line one.\nLine two."} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == "Line one. Line two." + + def test_docengine_only_quotes_falls_through(self): + """docengineDescription that is just quotes falls through to description.""" + fm = {"docengineDescription": '""', "description": "Actual text."} + result = generate_tags.resolve_body_preview(fm, "Body.") + assert result == "Actual text." + + # =========================================================================== # Tests: tag_slug # =========================================================================== @@ -697,6 +900,56 @@ def test_crawl_non_list_non_string_keywords_empty(self, tmp_path): assert articles[0]["tag_links"] == [] assert any("int" in str(x.message).lower() for x in w) + def test_crawl_body_preview_from_docengine_description(self, tmp_path): + """docengineDescription in front matter overrides the body preview.""" + articles_dir = tmp_path / "support" / "widgets" / "articles" + articles_dir.mkdir(parents=True) + (articles_dir / "test.mdx").write_text(textwrap.dedent("""\ + --- + title: "Test" + keywords: ["Alpha"] + docengineDescription: "Custom card text from docengine." + --- + + This body text should not appear in the preview. + """), encoding="utf-8") + + articles = generate_tags.crawl_articles(tmp_path, "widgets") + assert articles[0]["body_preview"] == "Custom card text from docengine." + + def test_crawl_body_preview_from_description_fallback(self, tmp_path): + """description is used for body_preview when docengineDescription is absent.""" + articles_dir = tmp_path / "support" / "widgets" / "articles" + articles_dir.mkdir(parents=True) + (articles_dir / "test.mdx").write_text(textwrap.dedent("""\ + --- + title: "Test" + keywords: ["Alpha"] + description: "SEO and card text." + --- + + This body text should not appear in the preview. + """), encoding="utf-8") + + articles = generate_tags.crawl_articles(tmp_path, "widgets") + assert articles[0]["body_preview"] == "SEO and card text." + + def test_crawl_body_preview_body_fallback(self, tmp_path): + """Without description fields, body_preview uses the body snippet.""" + articles_dir = tmp_path / "support" / "widgets" / "articles" + articles_dir.mkdir(parents=True) + (articles_dir / "test.mdx").write_text(textwrap.dedent("""\ + --- + title: "Test" + keywords: ["Alpha"] + --- + + Short body text. + """), encoding="utf-8") + + articles = generate_tags.crawl_articles(tmp_path, "widgets") + assert articles[0]["body_preview"] == "Short body text." + # =========================================================================== # Tests: get_featured_articles