🤖 core: Render Arabic / RTL text in TextField #23625

Open
evil1morty wants to merge 2 commits into ruffle-rs:master from evil1morty:arabic-rtl-text-rendering

Conversation

Contributor

@evil1morty evil1morty commented May 4, 2026

Description

Arabic in any path that flows through font::FontLike::evaluate (plain TextField, EditText, FTE TextLine since text_block::create_text_line synthesises an EditText, and TLF since it uses FTE) is unreadable today:

  • font::evaluate walks codepoints one by one through ttf_parser's raw cmap lookup, with no GSUB shaping. Even when a font ships isolated forms in cmap (Tahoma, Segoe UI, etc.) the text appeared as detached letters in logical (LTR) order.
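  • Hebrew and Thaana don't need shaping but suffer the same logical-order problem. Syriac and N'Ko suffer it too; they are cursive joining scripts, but Unicode defines no presentation forms for them, so a cmap-only approach can reorder them without joining them.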

This is a pragmatic fix that does not introduce a full shaping engine. Modern Arabic-aware fonts carry the Arabic Presentation Forms-A and -B blocks in their cmap as a compatibility encoding for exactly this case — the joined initial/medial/final/isolated shapes are addressable directly by codepoint. Reshaping base Arabic letters into Presentation Forms makes them resolvable through the existing cmap-only lookup.
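
To illustrate what "addressable directly by codepoint" means, here is a minimal sketch of a positional lookup into Presentation Forms-B. The `JoinForm` enum, the `presentation_form` function and the three-letter table are hypothetical and only for illustration; the PR itself delegates this mapping to the `arabic_reshaper` crate rather than an in-tree table.

```rust
/// Joining position of a letter within a word.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum JoinForm {
    Isolated,
    Final,
    Initial,
    Medial,
}

/// Returns the Presentation Forms-B codepoint for a base letter in the
/// given position. Only the three letters in this illustrative table are
/// covered; anything else yields `None`.
fn presentation_form(base: char, form: JoinForm) -> Option<char> {
    // (base, isolated, final, initial, medial). Initial/medial are `None`
    // for letters that never join to the letter that follows them.
    const TABLE: &[(char, char, char, Option<char>, Option<char>)] = &[
        ('\u{0627}', '\u{FE8D}', '\u{FE8E}', None, None), // ALEF
        ('\u{0628}', '\u{FE8F}', '\u{FE90}', Some('\u{FE91}'), Some('\u{FE92}')), // BEH
        ('\u{0644}', '\u{FEDD}', '\u{FEDE}', Some('\u{FEDF}'), Some('\u{FEE0}')), // LAM
    ];
    let &(_, isolated, fin, initial, medial) = TABLE.iter().find(|row| row.0 == base)?;
    match form {
        JoinForm::Isolated => Some(isolated),
        JoinForm::Final => Some(fin),
        JoinForm::Initial => initial,
        JoinForm::Medial => medial,
    }
}
```

Because every result is an ordinary `char`, a cmap-only lookup can resolve it without any GSUB support, provided the font maps that codepoint.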

Two commits, both load-bearing

  1. core: Collect every font in the default_font fallback chain. Library::default_font had a stale break; (marked // TODO: Return multiple fonts when it's needed.) that returned at most one font per chain. The downstream FontSet treats element 0 as the main font and the rest as fallbacks, so discarding the rest meant glyph misses on the primary font had nowhere to fall through to, even when the caller had configured a multi-name chain via set_default_font(_, vec![a, b, ...]). Single-element chains behave identically to before, so no caller that configured one font per default sees a behavior change. This is a real latent bug independent of the Arabic work, but the Arabic work needs it (so that a host page's Arabic font becomes a real fallback).

  2. core: Reshape Arabic and reorder RTL runs in font::evaluate. This is the rendering fix. A new maybe_reshape_rtl in core/src/font.rs runs from evaluate before the per-codepoint loop. For text without RTL codepoints it is a single linear scan that returns None, and the existing fast path runs unchanged. For RTL text it reshapes Arabic via arabic_reshaper, runs unicode_bidi, and mirrors paired ASCII punctuation in RTL runs. It returns None for FontType::Embedded, because embedded SWF fonts typically ship only the base Arabic block and substituting Presentation Forms there would resolve to nothing. It skips re-running the joiner on text already in Presentation Forms. Adds arabic_reshaper = "0.4.2" and unicode-bidi = "0.3.18" to core/Cargo.toml (unicode-bidi was already an indirect dependency).
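
The "single linear scan" for RTL codepoints could look roughly like the sketch below. The function names and the exact block ranges are assumptions for illustration, not the PR's code; a real implementation would more likely consult bidi classes than whole Unicode blocks.

```rust
/// Coarse, block-based check for right-to-left script codepoints.
/// This is an approximation: it tests Unicode blocks, not bidi classes,
/// so it also flags e.g. Arabic-Indic digits.
fn is_rtl_char(c: char) -> bool {
    matches!(
        c as u32,
        0x0590..=0x05FF       // Hebrew
            | 0x0600..=0x06FF // Arabic
            | 0x0700..=0x074F // Syriac
            | 0x0750..=0x077F // Arabic Supplement
            | 0x0780..=0x07BF // Thaana
            | 0x07C0..=0x07FF // N'Ko
            | 0xFB1D..=0xFB4F // Hebrew presentation forms
            | 0xFB50..=0xFDFF // Arabic Presentation Forms-A
            | 0xFE70..=0xFEFC // Arabic Presentation Forms-B (excluding U+FEFF, the BOM)
    )
}

/// O(n) scan that stops at the first RTL codepoint it finds.
fn contains_rtl(text: &str) -> bool {
    text.chars().any(is_rtl_char)
}
```

For Latin or CJK text the scan visits every character once and returns `false`, which is the only extra cost on the fast path.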

Where Arabic-bearing glyphs come from

  • Desktop: OS device fonts (Tahoma / Segoe UI on Windows; macOS / Linux equivalents) carry the Arabic Presentation Forms in their cmap, so the shaping is enough.
  • Web: host pages register their own Arabic font (e.g. fetch Noto Sans Arabic and pass to Ruffle's addFont + setDefaultFont JS APIs). The library.rs chain fix above is what makes a multi-name default chain actually fall through on glyph misses.

This PR does not bundle a font itself — that would inflate the universal fallback for every user. (An earlier revision of this PR did bundle Noto Sans Arabic as a separate FALLBACK_DEVICE_FONT_ARABIC const; dropped on review.)

Out of scope (deliberate)

  • No real OpenType shaper (rustybuzz). The right long-term answer but requires reworking Glyph/GlyphSource to be keyed by glyph ID rather than char — much larger change.
  • No GPOS positioning of combining marks (harakat). Vowel marks still get positive advance and lay out as separate spacing characters.
  • No RTL paragraph alignment. Lines of Arabic now read correctly but a default-aligned (left) paragraph stays left-anchored. That's a layout-engine concern (core/src/html/layout.rs), not a font concern.
  • pos no longer round-trips for shaped runs. The pos argument evaluate passes to its glyph callback is the byte offset into the reshaped WString, not the source WStr. Cursor positioning, hit testing and selection highlighting in display_object::edit_text will be off for Arabic spans. Editing Arabic in a TextField is rare in practice; preserving correctness here would need a source↔shaped position map that arabic-reshaper doesn't expose.
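
To make the `pos` problem concrete, here is a toy sketch of why positions shift and what a source-to-shaped map might look like. `shape_with_map` is hypothetical: it only merges LAM + ALEF into the isolated lam-alef ligature (U+FEFB) and works on `char` indices of a `&str`, not on Ruffle's `WStr` positions. Nothing like it exists in the PR.

```rust
/// Toy shaper: replaces LAM + ALEF with the isolated lam-alef ligature and
/// records, for each output char, the index of the source char it starts at.
fn shape_with_map(source: &str) -> (String, Vec<usize>) {
    let chars: Vec<char> = source.chars().collect();
    let mut shaped = String::new();
    let mut map = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '\u{0644}' && chars.get(i + 1) == Some(&'\u{0627}') {
            // Two source chars collapse into one shaped char.
            shaped.push('\u{FEFB}');
            map.push(i);
            i += 2;
        } else {
            shaped.push(chars[i]);
            map.push(i);
            i += 1;
        }
    }
    (shaped, map)
}
```

After a ligature, every later shaped index is one less than its source index, which is the drift that affects cursor placement and hit testing. Bidi reordering would add a further permutation on top of this.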

Transparency for non-RTL content

The evaluate change is two lines:

let shaped_text = maybe_reshape_rtl(text, self.font_type());
let text: &WStr = shaped_text.as_deref().unwrap_or(text);

maybe_reshape_rtl returns None for any text without RTL codepoints, and the rest of evaluate runs against the original text. So English/Latin/CJK content goes through one extra O(n) codepoint scan and otherwise hits the existing path identically.
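
The same borrow-or-own pattern can be sketched with `&str` and `String` standing in for `WStr` and `WString`. Here `maybe_reshape` and `visit_chars` are hypothetical stand-ins, and the reversal is only a placeholder for the real shaping and bidi work.

```rust
/// Stand-in for `maybe_reshape_rtl`: returns `None` when nothing needs to
/// change, so the caller keeps borrowing the original text.
fn maybe_reshape(text: &str) -> Option<String> {
    // Coarse trigger for illustration: any char in the Hebrew or Arabic blocks.
    let needs_work = text.chars().any(|c| ('\u{0590}'..='\u{06FF}').contains(&c));
    if !needs_work {
        return None;
    }
    // Placeholder transform; real shaping and bidi reordering go here.
    Some(text.chars().rev().collect())
}

/// Mirrors the two-line pattern: shadow `text` with the reshaped copy if
/// there is one, otherwise keep the borrowed original.
fn visit_chars(text: &str) -> Vec<char> {
    let shaped = maybe_reshape(text);
    let text: &str = shaped.as_deref().unwrap_or(text);
    text.chars().collect()
}
```

Keeping `shaped` in its own binding is what lets the shadowed `text` borrow from it for the rest of the function; no allocation happens on the `None` path.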

Testing

Verified against safari2025.swf from cdn.safariislandsgame.com (Cocolani), which uses TLF with Direction.RTL and feeds raw logical-order Arabic from a runtime Language table. Before: empty rectangles where Arabic should be. After: correctly joined, right-to-left Arabic on desktop (using OS Tahoma/Segoe UI) and on web with a host-page-supplied Arabic font.

cargo fmt --all clean. cargo clippy -p ruffle_core --features default_font --tests produces zero warnings.

I have not added an automated visual regression test in tests/tests/swfs/. Happy to add one if reviewers point me at an existing visual-text fixture I can adapt.

Notes for reviewers

  • Why arabic_reshaper over rolling our own joining table? The crate is a port of a Python library and adds only about 30 KB to the WASM build. The alternative (an in-tree joining table) would duplicate Unicode property data that we would then need to keep current.

  • The break; in library.rs::default_font is a real latent bug, not just a fix in service of this PR. Anyone who configures set_default_font with a multi-name chain on master today is silently losing every name after the first. The TODO comment from the original author suggests this was planned work; this PR is the smallest patch that does it.

Checklist

  • I, a human, have self-reviewed this PR and fully understand the changes within.
  • I have made or updated tests where possible.
  • All of my commits are properly scoped, compile successfully, and pass all tests.
  • This PR does not make sense to split up into smaller PRs.
  • An LLM was involved in the authoring of this code.

Library::default_font was returning at most one font per chain because of
a stale `// TODO: Return multiple fonts when it's needed.` `break`. The
downstream FontSet treats element 0 as the main font and the rest as
fallbacks, so discarding the rest meant glyph misses on the primary font
had nowhere to fall through to even when the caller had configured a
multi-name chain via `set_default_font(_, vec![a, b, ...])`.

Both passes (exact match, then compatible match) now collect the full
chain. The compatible-match pass deduplicates against the exact-match
pass by FontDescriptor so the same font isn't listed twice.

Single-element chains behave identically to before, so no caller that
configured one font per default sees a behavior change.
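
A rough sketch of the two-pass collection described above, under stated assumptions: `Descriptor` is a simplified stand-in for `FontDescriptor`, `collect_chain` is a hypothetical name, and "compatible" is taken to mean "same name, any style", which may not match Ruffle's actual rule.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a font descriptor (name plus style flags).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Descriptor {
    name: String,
    bold: bool,
    italic: bool,
}

/// Collects every matching font for a chain of names instead of stopping
/// at the first hit. Exact style matches come first, then compatible ones.
fn collect_chain(
    names: &[&str],
    bold: bool,
    italic: bool,
    registered: &[Descriptor],
) -> Vec<Descriptor> {
    let mut seen = HashSet::new();
    let mut chain = Vec::new();
    // Pass 1: exact match on name and style.
    for name in names {
        for font in registered {
            let exact = font.name == *name && font.bold == bold && font.italic == italic;
            if exact && seen.insert(font.clone()) {
                chain.push(font.clone());
            }
        }
    }
    // Pass 2: same name with any style, skipping fonts already listed.
    for name in names {
        for font in registered {
            if font.name == *name && seen.insert(font.clone()) {
                chain.push(font.clone());
            }
        }
    }
    chain
}
```

With a single name and a single registered font the result has one element, which matches the "single-element chains behave identically" claim.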

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dinnerbone Dinnerbone added A-input Area: Input handling T-compat Type: Compatibility with Flash Player llm The PR contains mostly LLM-generated code A-rendering Area: Rendering & Graphics and removed A-input Area: Input handling labels May 4, 2026
Comment thread core/src/player.rs Outdated
/// fall through to it — in particular the Arabic Presentation Forms that
/// `font::evaluate` produces after reshaping base Arabic letters.
#[cfg(feature = "default_font")]
pub const FALLBACK_DEVICE_FONT_ARABIC: &[u8] =
Member

We don't want to support Arabic in the fallback font—as the name suggests, it's a fallback, you should provide Arabic fonts yourself if you need them, the default font has to be small in order to be included everywhere.

This should get resolved automatically when we finish implementing canvas font renderer though.

Comment thread core/src/player.rs Outdated
Comment on lines +87 to +89
/// Arabic strings). Bundled separately from the main Noto Sans fallback so
/// the ~76 KB Arabic glyph table is only carried when the `default_font`
/// feature is enabled. Registered as a second device font and appended to
Member

Bundled separately from the main Noto Sans fallback so the ~76 KB Arabic glyph table is only carried when the default_font feature is enabled.

Bundled separately from the main fallback so the Arabic glyph table is only carried... exactly when the main fallback is?

Arabic in any path that flows through `font::FontLike::evaluate` (plain
TextField, EditText, FTE TextLine since `text_block::create_text_line`
synthesises an EditText, and TLF since it uses FTE) was unreadable
because:

- `font::evaluate` walks codepoints one by one through ttf_parser's raw
  cmap lookup, with no GSUB shaping. Even when a font ships isolated
  forms in cmap (Tahoma, Segoe UI, etc.) the text appeared as detached
  letters in logical (LTR) order.
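- Hebrew and Thaana didn't need shaping but suffered the same
  logical-order problem. Syriac and N'Ko suffered it too; they are
  cursive joining scripts, but Unicode defines no presentation forms
  for them, so this change can reorder them without joining them.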

This is a pragmatic fix that does not introduce a full shaping engine:
modern Arabic-aware fonts carry the Arabic Presentation Forms-A and -B
blocks (U+FB50..U+FDFF, U+FE70..U+FEFF) in their cmap as a compatibility
encoding for exactly this case. The joined initial/medial/final/isolated
shapes are addressable directly by codepoint, so reshaping base Arabic
letters into Presentation Forms makes them resolvable through the
existing cmap-only lookup.

  * core/src/font.rs: new `maybe_reshape_rtl`, called from `evaluate`
    before the per-codepoint loop. Detects RTL codepoints; for runs with
    base Arabic letters, runs `arabic_reshaper::arabic_reshape` to map
    each base letter to its Presentation Forms-B codepoint (and produce
    the lam-alef ligatures, U+FEF5..U+FEFC, which are also in Forms-B).
    Runs `unicode_bidi::BidiInfo`
    and reverses RTL runs into visual order, mirroring paired ASCII
    punctuation. Returns None (fast path) for text without RTL
    codepoints, so the vast majority of content pays only an O(n) scan.
    Returns None for FontType::Embedded — embedded SWF fonts typically
    ship only the base Arabic block, so substituting Presentation Forms
    there would resolve to nothing. Skips re-running the joiner on text
    that is already in Presentation Forms (some SWF-side helpers emit
    those directly; re-shaping corrupts them).

  * core/Cargo.toml: adds `arabic_reshaper = "0.4.2"` and
    `unicode-bidi = "0.3.18"`. unicode-bidi was already an indirect
    dependency via the desktop egui chrome.
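
The "reverse RTL runs into visual order, mirroring paired ASCII punctuation" step can be sketched as follows. This assumes the run boundaries have already been determined by a bidi pass; `mirror` and `rtl_run_to_visual` are hypothetical names, and the naive per-`char` reversal would misplace combining marks relative to their base letters.

```rust
/// Swaps paired ASCII brackets so they still open and close correctly
/// once the run is displayed right to left.
fn mirror(c: char) -> char {
    match c {
        '(' => ')',
        ')' => '(',
        '[' => ']',
        ']' => '[',
        '{' => '}',
        '}' => '{',
        '<' => '>',
        '>' => '<',
        other => other,
    }
}

/// Converts one RTL run from logical order to visual order by reversing
/// its chars and mirroring paired punctuation.
fn rtl_run_to_visual(run: &str) -> String {
    run.chars().rev().map(mirror).collect()
}
```

Applying the function twice returns the original run, since both the reversal and the mirroring are their own inverses.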

The actual font with Presentation Forms in its cmap is not bundled. On
desktop the OS fonts (Tahoma, Segoe UI on Windows; equivalents
elsewhere) cover them. On web, host pages can register an Arabic font
themselves via the `addFont` + `setDefaultFont` JS APIs — the
fallback-chain fix in the previous commit makes a multi-font default
chain actually fall through on glyph misses.

Out of scope for this change, deliberately kept small:

  * No real OpenType shaper (rustybuzz). The right long-term answer but
    requires reworking Glyph/GlyphSource to be keyed by glyph ID rather
    than `char`.
  * No GPOS positioning of combining marks (harakat). Vowel marks still
    get positive advance and lay out as separate spacing characters.
  * No RTL paragraph alignment. Lines of Arabic now read correctly but a
    default-aligned (left) paragraph stays left-anchored. That's a
    layout-engine concern, not a font concern.
  * `pos` in `evaluate`'s callback no longer round-trips for shaped
    runs. Cursor positioning, hit testing and selection highlighting in
    `display_object::edit_text` will be off for Arabic spans. Editing
    Arabic in a TextField is rare; preserving correctness here would
    need a source<->shaped position map that arabic_reshaper doesn't
    expose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@evil1morty evil1morty force-pushed the arabic-rtl-text-rendering branch from 5562f1d to d82f0f9 Compare May 4, 2026 15:42
@evil1morty
Contributor Author

Thanks for the review @kjarosh — agreed on both points. Pushed a force-update that drops the bundled Noto Sans Arabic, the build script and the player.rs registration. The PR is now just the shaping/bidi logic plus the library.rs fallback-chain fix.

Rationale for keeping the rest:

  • On desktop the OS device fonts (Tahoma, Segoe UI, equivalents on macOS/Linux) carry the Arabic Presentation Forms in their cmap, so the shaping is enough to render Arabic correctly without any new bundled assets.
  • On web, host pages can register their own Arabic font via addFont + setDefaultFont — the library.rs chain fix is what makes a multi-name default chain actually fall through on glyph misses (it was silently dropping every name after the first).
  • Once the canvas font renderer ships, none of this is on the hot path anyway.

Also slimmed the doc comments to match the surrounding file's density (the long block on maybe_reshape_rtl is gone). Let me know if you'd like further changes.

Labels

A-rendering Area: Rendering & Graphics llm The PR contains mostly LLM-generated code T-compat Type: Compatibility with Flash Player
