Implement placeholder extraction for icu_pattern#8074
Conversation
Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…erically, gate MultiNamed matches Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…d bound Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… unused imports Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ests Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…tract Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…racking tests Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ve and highly optimized Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…attern matching Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e mutable references
…ition, consistent with DoublePlaceholder
…n comments and inlined unwraps
|
I'm happy with this now. The agent did a great job figuring out the initial architecture and boilerplate, but I ended up hand-writing significant chunks of the actual algorithm in order to make it more readable. |
Manishearth
left a comment
There was a problem hiding this comment.
Reviewing the actual extraction code will probably take a while (and I'm prioritizing the .NET diplomat review today)
| type DecodedMatches<'p, 'a>; | ||
|
|
||
| /// Extract matches from the store. | ||
| #[doc(hidden)] // TODO(#4467): Should be internal |
There was a problem hiding this comment.
issue: let's just call these unstable_extract or internal_extract etc, to follow #6659. Technically it allows for hidden APIs to be named whatever, but we can call this internal so I think we should.
There was a problem hiding this comment.
Good point. The trait was written before we had actual guidance on this. It makes sense to add the new trait functions in accordance with the new style.
There was a problem hiding this comment.
@sffc-bot I prefer _unstable. Call the trait fn extract_unstable. and also get_match_unstable
Ok. I did spend time making the "actual extraction code" clean and readable. It is not intended to be so complex that it takes a long time to review. |
|
Without looking at the code, I would have assumed that "placeholder extraction" means getting the "foo" from "Hello {foo}". I don't think "placeholder extraction" is a good name for what this is, and the PR description and changelog should also be more explanatory. |
Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
How about |
…ms with internal_ Follows the naming guidelines in unicode-org#6659 for internal items on public sealed traits. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Removes #[doc(hidden)] from internal_extract, internal_get_match, and InternalDecodedMatches, and adds the 🚧 unstable stability banner to their documentation, allowing them to be visible on docs.rs while warning users. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Renames internal_extract to extract_unstable, internal_get_match to get_match_unstable, and InternalDecodedMatches to DecodedMatchesUnstable to prefer the suffix style for unstable APIs. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Adds the "🚧 [Unstable]" prefix to the first line of the doc comments for DecodedMatchesUnstable, extract_unstable, and get_match_unstable to clearly mark them as unstable APIs in the documentation. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Rustdoc treats [Unstable] as an intra-doc link. Escaping it as \[Unstable\] prevents the unresolved link warning/error. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
Since I'm OOO till next week and the discussion between Shane and I on this was in a private chat, I wanted to document my thoughts publicly so they don't have to be rehashed in future discussion. Initially I started out somewhat against this, especially when it had the "multi" code. My main concern was this was a lot of scaffolding for not much benefit, and a fair amount of new API that I didn't feel we had enough information to design correctly. It appears a point of disagreement between @sffc and me is that he considered That having been identified, on to the concern of the actual API size. I think there are many ways in which my one-off range glue pattern parsing code could be made reusable: I'm not convinced it needs to be in I generally perceive trait-based APIs to be heavier: you have to design in a lot more "directions", which is costlier. They're very often the right call, but here I see many ways in which my one-off parsing code for range patterns could have been made reusable. That said, As for motivation: the more complex the API, the more a priori motivation I want to see, not from a "justify this" POV but from a "how are we sure we've made the right API" POV. So far, we have one concrete use case: being able to look at CLDR patterns, compare them against a glue pattern, and compact them by referencing the glue. Shane mentioned this type of thing may come up again in units/currency, and would have a similar shape. It seems like this API ought to work there? It's unclear. Shane also mentioned that parsing is a shape of a use case for this API, but I am highly skeptical of this API as-designed being good for parsing. But we don't consider parsing to be in scope anyway, so I'm ignoring that use case. With all that, I am at this point -0 on adding this: I'm not enthusiastic, but I don't mind it. API seems fine enough for a "breakable" crate.
This was also my initial confusion when looking at this PR. I'm not sure I like the names, but I don't have concrete suggestions just yet. Footnotes
|
I think Now, again without having looked at the code:
|
Fine with me.
In the new datetime range datagen code, there are strings like "MMM - MMM" and patterns like "{0} - {1}", and we want to identify that we can represent this more efficiently in date. There are other cases that may benefit from this logic for similar data-optimization use cases; for example, "E h:mm" patterns that could be instead be represented as "E", "h:mm", and the glue pattern. We don't currently implement this optimization in large part because it seemed too complicated at the time. Although the current use case and hypothetical use case are both in datetime, this is squarely not a datetime-specific concern. An argument could be made that this could be appropriate for a "pattern helpers" module in the datagen crate, since pattern manipulation is kind-of what datagen does. But, I see such "helper" modules as workarounds for third-party crates whose scope is not what ICU4X needs. For util-like crates that we control, I would rather collect all these helpers in one place, and re-evaluate the scope when we work on graduating that particular util.
For single placeholder, there is no lookahead. It is prefix and suffix. For double placeholder, it is possible to have multiple inputs that interpolate to the same output. The "value extraction" / "uninterpolate" code uses This is not a general-purpose parsing API with fancy lookahead configurability. Use Regex for that. Furthermore, I hold that the fact that we have even 1 client doing complex pattern logic in a non-pattern crate is enough to justify putting this code into the pattern crate. We do not need a more thorough discussion on use cases, tradeoffs, etc., because this is an unstable crate that should evolve as our needs evolve. In other words: this PR aims to offer an API that satisfies the current need, in a minimalistic way, consistent with the current design of icu_pattern. |
Can we instead collect these helpers in datagen, wait for them to get used more, polish, and move them to I'm not asking for a full graduation, but I think a little bit of baking is nice to have. I even bet "where can you use this" is the type of exploratory work you can just task an LLM with.
Figuring out this behavior is one of the reasons I was reluctant to go deep into review of the "actual extraction code". Worth documenting this explicitly.
We are not exactly aligned on this. We could get aligned, but as it stands I find FWIW, the policy you wrote on cross-crate internal "unstable" APIs was that we should design them to avoid breakage. This is a util crate, so not quite the same, but I think its internalness makes it similarly effective. Footnotes
|
I consider this better than the status quo but substantially less good than this PR for reasons I won't repeat.
I find it hard to predict your behavior on this subject. You sold me on "let's let core crates upgradeable out-of-sync with component crates" so much so that I was advocating more strongly than you were on avoiding an unstable tilde dependency on icu_provider in #3913 (comment). You said that the problem was mostly not a problem anymore. But now you are saying it is a problem again, for these crates that hadn't even been part of the conversation before. I thought 0.x crates were added anew to vendor sources, so there's no synchronization problem. You have 0.7, you add 0.8, you integrate all the dependents, and then you delete 0.7.
You're trying to assign a cost (which you acknowledge is a low cost) to a hypothetical future that may or may not ever occur. The change increases the API surface of icu_pattern by like 10% and increases the risk of needing to create a future breakage by also about 10%. The other 90% has never gone through an actual graduation review and is at risk of the same type of problems that this new API has. If I had landed this code when I originally rewrote icu_pattern, no one would have noticed. |
|
To be clear, I interpreted that policy correctly, I just think its motivation applies here too. That's why I said "it's not quite the same".
You're conflating two problems here, leading to confusion, which is understandable (I haven't documented a lot of this). The hint that these are two problems is that these crates weren't in the conversation before. The problem with This was a major problem when we were initially doing stuff in google3 because the tooling had issues and doing batched multi-crate upgrades was painful. Being able to split things into smaller CLs helped. I think google3 upgrades took 2ish weeks back then, similar to the painfulness of ICU4C updates. A lot of that is fixed now, due to a lot of work on my end and the Rust team's end to improve workflows. The core ICU4X upgrade can be done in less than a week, and most of that time is waiting for reviews. I think being judicious about The second problem is that any upgrade to a new epoch (i.e. a semver break) slows down CLs in both google3 and Chromium. This is because they need third-party review (can take days), and also because by-policy they need to be a separate CL. Chromium's desire to have clean "moved file" histories in Gerrit makes this worse3. This is a separate problem: it is not a new problem or a "problem again". You're right that these crates were not a part of the conversation before, because the conversation before was focusing on a different problem with different crates. So I have a preference to avoid "frivolous" breakages in utils. If we can avoid breaking something, we should try (but not too hard), and if we can avoid future breakages via API design, we should also try. I'm happy to turn that into a policy addition that is clearer about what I mean if that would help.
Sure. I told you what will take my -0 to a +1 on this PR. I'm still open to landing this, just that I would slightly prefer not to. If you and Robert want to land this, then go for it, my -0 is not blocking. Otherwise I will eventually review this and probably be fine with it landing.
This new API is designed based on one concrete use case and some speculative ones. It is not narrowly designed for the concrete use case, it is understandably generic. I agree that the speculative use cases make sense. A thing I have not yet thought through, which I think may be a problem for some of the use cases, is exactly how this resolves the ambiguities (see: your comment on (Not wanting to think about ambiguity resolution for Footnotes
|
| impl ExtractionBackend for DoublePlaceholder { | ||
| type DecodedMatchesUnstable<'p, 'a> = [Option<&'a str>; 2]; | ||
|
|
||
| fn extract_unstable<'p, 'a>( |
There was a problem hiding this comment.
issue: please document the greedy/etc behavior here, how it is expected to handle ambiguity
You said some of this here:
For double placeholder, it is possible to have multiple inputs that interpolate to the same output. The "value extraction" / "uninterpolate" code uses str::split to find the infix, returning the first one.
I'd like documentation from a user's perspective.
| /// </div> | ||
| type DecodedMatchesUnstable<'p, 'a>; | ||
|
|
||
| /// 🚧 \[Unstable\] Extract matches from the store. |
There was a problem hiding this comment.
issue: all the assoc items here should be hidden as well.
See table in #6659 (comment)
There was a problem hiding this comment.
I think I commented in that issue that I don't agree with using doc(hidden) for associated types. They should be called Unstable.
Manishearth
left a comment
There was a problem hiding this comment.
code LGTM. I am still -0 on this change, so as I said I am open to landing this and would slightly prefer to not. If you can align with Robert on this, go ahead and land.
| input: &'a str, | ||
| ) -> Option<Self::DecodedMatchesUnstable<'p, 'a>> { | ||
| let mut prefix = None; | ||
| let mut ph = None; |
There was a problem hiding this comment.
nit: "ph"? placeholder? fine to use succinct names but explain what it means
| match item { | ||
| PatternItem::Literal(s) => { | ||
| if ph.is_none() { | ||
| debug_assert!(prefix.is_none()); |
There was a problem hiding this comment.
nit: assertion string please. "found adjacent literals in pattern" or something
This PR implements a new feature for
icu_patternto extract placeholder values from a formatted string.Design
We introduce a new sealed trait
ExtractionBackend: PatternBackendto represent backends that support placeholder extraction.Pattern::extract_placeholdersis implemented generically forPattern<B> where B: ExtractionBackend.SinglePlaceholderandDoublePlaceholderimplementExtractionBackendunconditionally (alloc-free).MultiNamedPlaceholderdoes not implement the trait.We also implement a generic
PlaceholderMatches::getthat delegates to the backend viaExtractionBackend::get_match, maintaining consistency with the rest oficu_pattern.🤖 This pull request was created by an AI agent working with @sffc.
Changelog
icu_pattern: AddPlaceholderMatchesstruct.icu_pattern: AddExtractionBackendtrait.icu_pattern: AddPattern::extract_placeholdersmethod (generic, gated byExtractionBackend).icu_pattern: ImplementExtractionBackendforSinglePlaceholder.icu_pattern: ImplementExtractionBackendforDoublePlaceholder.