Skip to content

Implement placeholder extraction for icu_pattern#8074

Open
sffc wants to merge 25 commits into
unicode-org:mainfrom
sffc:feature/pattern-extraction
Open

Implement placeholder extraction for icu_pattern#8074
sffc wants to merge 25 commits into
unicode-org:mainfrom
sffc:feature/pattern-extraction

Conversation

@sffc

@sffc sffc commented Jun 12, 2026

Copy link
Copy Markdown
Member

This PR implements a new feature for icu_pattern to extract placeholder values from a formatted string.

Design

We introduce a new sealed trait ExtractionBackend: PatternBackend to represent backends that support placeholder extraction.

  • Pattern::extract_placeholders is implemented generically for Pattern<B> where B: ExtractionBackend.
  • SinglePlaceholder and DoublePlaceholder implement ExtractionBackend unconditionally (alloc-free).
  • MultiNamedPlaceholder does not implement the trait.

We also implement a generic PlaceholderMatches::get that delegates to the backend via ExtractionBackend::get_match, maintaining consistency with the rest of icu_pattern.

🤖 This pull request was created by an AI agent working with @sffc.

Changelog

  • icu_pattern: Add PlaceholderMatches struct.
  • icu_pattern: Add ExtractionBackend trait.
  • icu_pattern: Add Pattern::extract_placeholders method (generic, gated by ExtractionBackend).
  • icu_pattern: Implement ExtractionBackend for SinglePlaceholder.
  • icu_pattern: Implement ExtractionBackend for DoublePlaceholder.

sffc and others added 19 commits June 12, 2026 13:44
Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…erically, gate MultiNamed matches

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…d bound

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… unused imports

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ests

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…tract

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…racking tests

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ve and highly optimized

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…attern matching

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@sffc sffc marked this pull request as ready for review June 12, 2026 22:03
@sffc sffc requested a review from a team as a code owner June 12, 2026 22:03
@sffc

sffc commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

I'm happy with this now. The agent did a great job figuring out the initial architecture and boilerplate, but I ended up hand-writing significant chunks of the actual algorithm in order to make it more readable.

@sffc sffc requested a review from Manishearth June 12, 2026 22:03

@Manishearth Manishearth left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing the actual extraction code will probably take a while (and I'm prioritizing the .NET diplomat review today)

Comment thread components/pattern/src/common.rs Outdated
type DecodedMatches<'p, 'a>;

/// Extract matches from the store.
#[doc(hidden)] // TODO(#4467): Should be internal

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: let's just call these unstable_extract or internal_extract etc, to follow #6659. Technically it allows for hidden APIs to be named whatever, but we can call this internal so I think we should.

@sffc sffc Jun 13, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. The trait was written before we had actual guidance on this. It makes sense to add the new trait functions in accordance with the new style.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This comment is written by an AI agent working with @sffc.

This has been addressed by prefixing the internal trait items with internal_ (i.e., InternalDecodedMatches, internal_extract, and internal_get_match) to follow the naming guidelines in #6659.

@sffc sffc Jun 16, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sffc-bot I prefer _unstable. Call the trait fn extract_unstable. and also get_match_unstable

@sffc

sffc commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

Reviewing the actual extraction code will probably take a while (and I'm prioritizing the .NET diplomat review today)

Ok. I did spend time making the "actual extraction code" clean and readable. It is not intended to be so complex that it takes a long time to review.

@robertbastian

Copy link
Copy Markdown
Member

Without looking at the code, I would have assumed that "placeholder extraction" means getting the "foo" from "Hello {foo}". I don't think "placeholder extraction" is a good name for what this is, and the PR description and changelog should also be more explanatory.

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@sffc

sffc commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

How about extract_values?

sffc and others added 2 commits June 16, 2026 15:51
…ms with internal_

Follows the naming guidelines in unicode-org#6659 for internal items on public sealed traits.

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Removes #[doc(hidden)] from internal_extract, internal_get_match, and
InternalDecodedMatches, and adds the 🚧 unstable stability banner to their
documentation, allowing them to be visible on docs.rs while warning users.

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
sffc and others added 3 commits June 16, 2026 15:59
Renames internal_extract to extract_unstable, internal_get_match to
get_match_unstable, and InternalDecodedMatches to DecodedMatchesUnstable to
prefer the suffix style for unstable APIs.

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Adds the "🚧 [Unstable]" prefix to the first line of the doc comments for
DecodedMatchesUnstable, extract_unstable, and get_match_unstable to clearly mark
them as unstable APIs in the documentation.

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Rustdoc treats [Unstable] as an intra-doc link. Escaping it as \[Unstable\]
prevents the unresolved link warning/error.

Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@Manishearth

Copy link
Copy Markdown
Member

Since I'm OOO till next week and the discussion between Shane and I on this was in a private chat, I wanted to document my thoughts publicly so they don't have to be rehashed in future discussion.

Initially I started out somewhat against this, especially when it had the "multi" code. My main concern was this was a lot of scaffolding for not much benefit, and a fair amount of new API that I didn't feel we had enough information to design correctly.

It appears a point of disagreement between @sffc and me is that he considered icu_pattern, and other utils-like versioned crates1 to be exactly like "unstable" code in ICU4X crates, in that it is "free" to add new things or make breaking changes, and we don't have to spend much time on API design. My position is that it is easier to add things and make breaking changes but I would still like effort put into API design, and effort put into avoiding breaking changes.

That having been identified, on to the concern of the actual API size. I think there are many ways in which my one-off range glue pattern parsing code could be made reusable: I'm not convinced it needs to be in icu_pattern, and I'm not convinced it needs to be super generic. I think it makes some sense that if we're making it pretty generic it should go in icu_pattern, and overall moving a very simple API (like, a single method) to icu_pattern would be totally fine. I can see this being just a method on Pattern<DoublePlaceholder> and still being pretty useful, lying in wait until we need to genericize. But I can see the downside of landing something insufficiently generic.

I generally perceive trait-based APIs to be heavier: you have to design in a lot more "directions", which is costlier. They're very often the right call, but here I see many ways in which my one-off parsing code for range patterns could have been made reusable.

That said, icu_pattern is a heavy user of trait-based APIs, and I think that makes it more fine to add more traits. Similar to how I'd probably not bat an eyelid at Yet Another icu_datetime scaffold trait (provided there are no obvious wins from merging it).

As for motivation: the more complex the API, the more a priori motivation I want to see, not from a "justify this" POV but from a "how are we sure we've made the right API" POV. So far, we have one concrete use case: being able to look at CLDR patterns, compare them against a glue pattern, and compact them by referencing the glue. Shane mentioned this type of thing may come up again in units/currency, and would have a similar shape. It seems like this API ought to work there? It's unclear.

Shane also mentioned that parsing is a shape of a use case for this API, but I am highly skeptical of this API as-designed being good for parsing. But we don't consider parsing to be in scope anyway, so I'm ignoring that use case.

With all that, I am at this point -0 on adding this: I'm not enthusiastic, but I don't mind it. API seems fine enough for a "breakable" crate.

Without looking at the code, I would have assumed that "placeholder extraction" means getting the "foo" from "Hello {foo}". I don't think "placeholder extraction" is a good name for what this is, and the PR description and changelog should also be more explanatory.

This was also my initial confusion when looking at this PR. I'm not sure I like the names, but I don't have concrete suggestions just yet.

Footnotes

  1. Barring crates that have ICU4X-imposed semver constraints, like yoke/zerovec/fixed_decimal/writeable/etc.

@robertbastian

Copy link
Copy Markdown
Member

How about extract_values?

I think uninterpolate would be the most obvious name for the inverse of interpolate.

Now, again without having looked at the code:

  • why do we need this?
  • this seems to be a harder problem than pattern interpolation. to do this correctly, you need lookahead, can you explain how this is handled?

@sffc

sffc commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

I think uninterpolate would be the most obvious name for the inverse of interpolate.

Fine with me.

  • why do we need this?

In the new datetime range datagen code, there are strings like "MMM - MMM" and patterns like "{0} - {1}", and we want to identify that we can represent this more efficiently in date.

There are other cases that may benefit from this logic for similar data-optimization use cases; for example, "E h:mm" patterns that could be instead be represented as "E", "h:mm", and the glue pattern. We don't currently implement this optimization in large part because it seemed too complicated at the time.

Although the current use case and hypothetical use case are both in datetime, this is squarely not a datetime-specific concern.

An argument could be made that this could be appropriate for a "pattern helpers" module in the datagen crate, since pattern manipulation is kind-of what datagen does. But, I see such "helper" modules as workarounds for third-party crates whose scope is not what ICU4X needs. For util-like crates that we control, I would rather collect all these helpers in one place, and re-evaluate the scope when we work on graduating that particular util.

  • this seems to be a harder problem than pattern interpolation. to do this correctly, you need lookahead, can you explain how this is handled?

For single placeholder, there is no lookahead. It is prefix and suffix.

For double placeholder, it is possible to have multiple inputs that interpolate to the same output. The "value extraction" / "uninterpolate" code uses str::split to find the infix, returning the first one.

This is not a general-purpose parsing API with fancy lookahead configurability. Use Regex for that.


Furthermore, I hold that the fact that we have even 1 client doing complex pattern logic in a non-pattern crate is enough to justify putting this code into the pattern crate. We do not need a more thorough discussion on use cases, tradeoffs, etc., because this is an unstable crate that should evolve as our needs evolve.

In other words: this PR aims to offer an API that satisfies the current need, in a minimalistic way, consistent with the current design of icu_pattern.

@Manishearth

Manishearth commented Jun 18, 2026

Copy link
Copy Markdown
Member

For util-like crates that we control, I would rather collect all these helpers in one place, and re-evaluate the scope when we work on graduating that particular util.

Can we instead collect these helpers in datagen, wait for them to get used more, polish, and move them to icu_pattern when that happens?

I'm not asking for a full graduation, but I think a little bit of baking is nice to have. I even bet "where can you use this" is the type of exploratory work you can just task an LLM with.

For double placeholder, it is possible to have multiple inputs that interpolate to the same output. The "value extraction" / "uninterpolate" code uses str::split to find the infix, returning the first one.

Figuring out this behavior is one of the reasons I was reluctant to go deep into review of the "actual extraction code". Worth documenting this explicitly.

because this is an unstable crate that should evolve as our needs evolve.

We are not exactly aligned on this. We could get aligned, but as it stands I find icu_pattern breakages to be a minor cost1 that I would rather not have to pay all the time. Motivated breakages, sure, but I don't want us to preemptively decide for things to break whenever we want. In other words: if we needed to break an existing API for a reason, I'd not oppose it, but I don't think that implies that new APIs should not have the level of thought we put into design to avoid needless breakage.

FWIW, the policy you wrote on cross-crate internal "unstable" APIs was that we should design them to avoid breakage. This is a util crate, so not quite the same, but I think its internalness makes it similarly effective.

Footnotes

  1. A concrete cost borne by me: every breaking crate update requires slow third-party-review in google3, and may also trigger unsafe review. This slows down the upgrade process, and often stacks because not all third-party reviewers are comfortable approving CLs with large diffbases.

@sffc

sffc commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

FWIW, the policy you wrote on cross-crate internal "unstable" APIs was that we should design them to avoid breakage. This is a util crate, so not quite the same, but I think its internalness makes it similarly effective.

#8105

For util-like crates that we control, I would rather collect all these helpers in one place, and re-evaluate the scope when we work on graduating that particular util.

Can we instead collect these helpers in datagen, wait for them to get used more, polish, and move them to icu_pattern when that happens?

I consider this better than the status quo but substantially less good than this PR for reasons I won't repeat.

We are not exactly aligned on this. We could get aligned, but as it stands I find icu_pattern breakages to be a minor cost1 that I would rather not have to pay all the time.

I find it hard to predict your behavior on this subject. You sold me on "let's let core crates upgradeable out-of-sync with component crates" so much so that I was advocating more strongly than you were on avoiding an unstable tilde dependency on icu_provider in #3913 (comment). You said that the problem was mostly not a problem anymore. But now you are saying it is a problem again, for these crates that hadn't even been part of the conversation before.

I thought 0.x crates were added anew to vendor sources, so there's no synchronization problem. You have 0.7, you add 0.8, you integrate all the dependents, and then you delete 0.7.

Motivated breakages, sure, but I don't want us to preemptively decide for things to break whenever we want. In other words: if we needed to break an existing API for a reason, I'd not oppose it, but I don't think that implies that new APIs should not have the level of thought we put into design to avoid needless breakage.

You're trying to assign a cost (which you acknowledge is a low cost) to a hypothetical future that may or may not ever occur.

The change increases the API surface of icu_pattern by like 10% and increases the risk of needing to create a future breakage by also about 10%. The other 90% has never gone through an actual graduation review and is at risk of the same type of problems that this new API has. If I had landed this code when I originally rewrote icu_pattern, no one would have noticed.

@Manishearth

Copy link
Copy Markdown
Member

#8105

To be clear, I interpreted that policy correctly, I just think its motivation applies here too. That's why I said "it's not quite the same".

I find it hard to predict your behavior on this subject. You sold me on "let's let core crates upgradeable out-of-sync with component crates" so much so that I was advocating more strongly than you were on avoiding an unstable tilde dependency on icu_provider in #3913 (comment). You said that the problem was mostly not a problem anymore. But now you are saying it is a problem again, for these crates that hadn't even been part of the conversation before.

You're conflating two problems here, leading to confusion, which is understandable (I haven't documented a lot of this). The hint that these are two problems is that these crates weren't in the conversation before.

The problem with icu_locale_core/provider was that it was very hard to do incremental upgrades when there are ~ deps on some crates from everyone. This isn't a problem for utils-like crates: breaking changes in utils go in a new epoch so you can land those as separate CLs. ~ deps are just uncommon in the Cargo world so vendor tooling has less-well-trodden user flows around that1, and my proposal was to remove key ~ deps that caused the most problems.

This was a major problem when we were initially doing stuff in google3 because the tooling had issues and doing batched multi-crate upgrades was painful. Being able to split things into smaller CLs helped. I think google3 upgrades took 2ish weeks back then, similar to the painfulness of ICU4C updates.

A lot of that is fixed now, due to a lot of work on my end and the Rust team's end to improve workflows. The core ICU4X upgrade can be done in less than a week, and most of that time is waiting for reviews. I think being judicious about ~ deps is still nice because Cargo still has bugs where the resolver isn't smart enough around ~2.

The second problem is that any upgrade to a new epoch (i.e. a semver break) slows down CLs in both google3 and Chromium. This is because they need third-party review (can take days), and also because by-policy they need to be a separate CL. Chromium's desire to have clean "moved file" histories in Gerrit makes this worse3.

This is a separate problem: it is not a new problem or a "problem again". You're right that these crates were not a part of the conversation before, because the conversation before was focusing on a different problem with different crates.

So I have a preference to avoid "frivolous" breakages in utils. If we can avoid breaking something, we should try (but not too hard), and if we can avoid future breakages via API design, we should also try.

I'm happy to turn that into a policy addition that is clearer about what I mean if that would help.

Can we instead collect these helpers in datagen, wait for them to get used more, polish, and move them to icu_pattern when that happens?

I consider this better than the status quo but substantially less good than this PR for reasons I won't repeat.

Sure. I told you what will take my -0 to a +1 on this PR. I'm still open to landing this, just that I would slightly prefer not to. If you and Robert want to land this, then go for it, my -0 is not blocking. Otherwise I will eventually review this and probably be fine with it landing.

The change increases the API surface of icu_pattern by like 10% and increases the risk of needing to create a future breakage by also about 10%. The other 90% has never gone through an actual graduation review and is at risk of the same type of problems that this new API has. If I had landed this code when I originally rewrote icu_pattern, no one would have noticed.

icu_pattern was designed based off of multiple pattern use cases in ICU4X. I don't think it's a perfect graduatable API, but it's an API that from the get go was known to support multiple use cases well enough.

This new API is designed based on one concrete use case and some speculative ones. It is not narrowly designed for the concrete use case, it is understandably generic. I agree that the speculative use cases make sense. A thing I have not yet thought through, which I think may be a problem for some of the use cases, is exactly how this resolves the ambiguities (see: your comment on str::split). I don't want this API to be "almost right but not exactly" when we want it in the future, if we can avoid the breakage by thinking it through now, or avoid the breakage by having it be in icu_provider_source with a goal of moving to icu_pattern once we reuse it.

(Not wanting to think about ambiguity resolution for multi was why I asked for it to be removed completely. I think double is more manageable here and I plan to think about this when I review the actual code)

Footnotes

  1. In this case, the google3 epoch approach (one folder per "semver epoch", not per version), which is favored by most places doing vendoring including Chromium, thinks that you only ever need one copy of the crate per semver epoch.

  2. I can't find the exact bug right now but https://github.com/rust-lang/cargo/issues/9029 is an example of this class of bug

  3. Gerrit is just bad at this. If I move (or branch) files in Gerrit and also make changes to them, the reviewer cannot tell what the changes were since they just see deleted/new files. The Chromium Rust reviewers recommend using Gerrit's internal git history for this: each git cl upload creates a new commit "backing" the CL, so you can do this incrementally, and they can review the backing commits. There is some tooling for this, but it's just overall painful. LLMs hopefully can make this better but I haven't yet tried.

@sffc sffc requested a review from Manishearth June 24, 2026 21:45
impl ExtractionBackend for DoublePlaceholder {
type DecodedMatchesUnstable<'p, 'a> = [Option<&'a str>; 2];

fn extract_unstable<'p, 'a>(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: please document the greedy/etc behavior here, how it is expected to handle ambiguity

You said some of this here:

For double placeholder, it is possible to have multiple inputs that interpolate to the same output. The "value extraction" / "uninterpolate" code uses str::split to find the infix, returning the first one.

I'd like documentation from a user's perspective.

/// </div>
type DecodedMatchesUnstable<'p, 'a>;

/// 🚧 \[Unstable\] Extract matches from the store.

@Manishearth Manishearth Jun 25, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: all the assoc items here should be hidden as well.

See table in #6659 (comment)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I commented in that issue that I don't agree with using doc(hidden) for associated types. They should be called Unstable.

@Manishearth Manishearth left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM. I am still -0 on this change, so as I said I am open to landing this and would slightly prefer to not. If you can align with Robert on this, go ahead and land.

input: &'a str,
) -> Option<Self::DecodedMatchesUnstable<'p, 'a>> {
let mut prefix = None;
let mut ph = None;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "ph"? placeholder? fine to use succinct names but explain what it means

match item {
PatternItem::Literal(s) => {
if ph.is_none() {
debug_assert!(prefix.is_none());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assertion string please. "found adjacent literals in pattern" or something

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants