Add language tag format to registry #79

Open
TimvdLippe wants to merge 1 commit into OAI:main from TimvdLippe:TimvdLippe-patch-1

@TimvdLippe

RFC5646 [1] defines a standardized format for language tags.
This RFC is included in the more commonly known BCP47 [2].

[1]: https://www.rfc-editor.org/info/rfc5646/
[2]: https://www.rfc-editor.org/info/bcp47/
@TimvdLippe TimvdLippe requested a review from a team as a code owner May 28, 2026 08:47
@TimvdLippe

Author

@handrews Would you mind reviewing this PR, or letting me know who else might be able to take a look? Thanks in advance for your time!

@miqui miqui left a comment

Contributor

@handrews I don't see a problem with this.

@handrews

Member

It's unclear to me how this could possibly be validated. And for a pure annotation, an extension keyword specifically about language contents would seem better?

@TimvdLippe

Author

Can you clarify what you mean by "an extension keyword"? I don't follow what your proposed alternative would be for denoting that a particular value indicates a language in the format of this RFC.

@handrews

handrews commented Jun 23, 2026

Member

[EDIT: Never mind, see below]

@TimvdLippe if you are on OAS 3.1+, you can make up your own JSON Schema keywords, which are treated as JSON Schema annotations (like title or readOnly). As with readOnly, your application would then look at the annotation and decide whether to do additional validation.

The format keyword is used for formats whose validation rules either can't be expressed otherwise, or would be overly complex and unclear if expressed that way (e.g. the regular expression for IPv6 addresses is mind-boggling, and probably not the fastest way to validate such things).
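
As an illustration of the IPv6 point (an editorial sketch, not part of the PR), a format check can delegate to a dedicated parser rather than a regular expression. The example below assumes Python and its standard `ipaddress` module; the helper name `is_ipv6` is hypothetical.

```python
import ipaddress


def is_ipv6(value: str) -> bool:
    """Return True if value parses as an IPv6 address."""
    try:
        ipaddress.IPv6Address(value)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


print(is_ipv6("2001:db8::1"))  # True
print(is_ipv6("2001:db8::g"))  # False
```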

There is no clear way to "validate" which language is being used. You could run a heuristic, but it's not going to be interoperable, which is the point of standardizing things.

Really, what is going on here is informing tools and applications what the expected human language usage is. Automatic OpenAPI (or JSON Schema) tools can't do much with that. But an application layer could (for example, it could have a multi-lingual LLM analyze the text if validating the language usage is really important), so using an annotation to communicate the language to the application layer is the correct architecture.
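
A minimal sketch of the annotation approach described above, assuming Python. The keyword name `x-content-language`, the helper `check_with_annotation`, and the callback signature are all hypothetical and only illustrate how an application layer might read an annotation and decide what to do with it.

```python
from typing import Any, Callable

# Hypothetical extension keyword; the name is an assumption for illustration.
LANGUAGE_KEYWORD = "x-content-language"

# A schema carrying the annotation. Schema tooling would not validate it.
schema = {"type": "string", LANGUAGE_KEYWORD: "nl"}


def check_with_annotation(
    schema: dict,
    value: Any,
    language_check: Callable[[Any, str], bool],
) -> bool:
    """Run an application-supplied check only if the annotation is present."""
    expected = schema.get(LANGUAGE_KEYWORD)
    if expected is None:
        return True  # no annotation, nothing for the application to check
    return language_check(value, expected)


# The application decides what "checking the language" means.
# This stand-in callback accepts every text.
print(check_with_annotation(schema, "Titel", lambda text, lang: True))
```

The point of the design is that the schema only communicates the expected language; any actual checking lives in the callback supplied by the application.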

@handrews

Member

@TimvdLippe wait... I may have this completely wrong 😅

Are you just validating that the text is a language tag? Er... yeah that's probably fine... lemme creep off and be embarrassed for a while and then respond properly 🤦

@TimvdLippe

Author

No worries at all, happy to elaborate. Yes, this is indeed to specify that a particular value is in the language tag format. As an example, given an API response:

```json
{
  "header": {
    "lang": "nl-NL",
    "value": "Titel"
  }
}
```

Then we want to say that the lang field in this response has "format": "language". We want to use this as a potential new rule for the Dutch government API Design Rules, where we standardize how to handle languages.
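
A minimal sketch of what that could look like, assuming Python, with the schema written as a plain dict. The schema shape and the helper `values_with_format` are hypothetical, and the format name `language` is taken from the comment above; the name that ends up in the registry may differ.

```python
from typing import Any, Iterator

# Assumed schema for the example response; only "lang" carries the format.
response_schema = {
    "type": "object",
    "properties": {
        "header": {
            "type": "object",
            "properties": {
                "lang": {"type": "string", "format": "language"},
                "value": {"type": "string"},
            },
        }
    },
}

response = {"header": {"lang": "nl-NL", "value": "Titel"}}


def values_with_format(schema: dict, instance: Any, fmt: str) -> Iterator[Any]:
    """Yield instance values whose schema declares the given format.

    Only follows "properties"; $ref, arrays, and composition keywords
    are deliberately ignored to keep the sketch short.
    """
    if schema.get("format") == fmt:
        yield instance
    if isinstance(instance, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                yield from values_with_format(subschema, instance[name], fmt)


print(list(values_with_format(response_schema, response, "language")))  # ['nl-NL']
```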

@karenetheridge

karenetheridge commented Jun 24, 2026

Member

If this is intended to be used as a format, wouldn't it be better to add to the format registry, rather than create a new thing here?

edit: my mistake! I misunderstood what this patch was doing.

@handrews

Member

@TimvdLippe OK great. To clarify, is this format intended to restrict values only to full tags, excluding subtags and codes? Or is it intended to allow any of those three ways of indicating a language? If it applies only to full tags, we might want to add several formats instead (e.g. lang-tag, lang-subtag, lang-code).

@TimvdLippe

Author

It was intended to allow any format allowed by the RFC. I am also okay with splitting it up into more granular formats that reference specific grammar constructs in the RFC.

@handrews

Member

@TimvdLippe I suspect "any of the formats" is pretty common, so I think the best thing to do would be to clarify this PR to explicitly say what ABNF productions or IANA registries define the allowed values (I have not read the RFC in detail, please use whatever terminology makes sense). If someone wants more precise formats, they can add them. I just wanted to avoid tying language (a generic term) to a very specific type of language descriptor.

@TimvdLippe

Author

I wanted to incorporate your feedback, but then I realised I actually already did that. Or at least that was the intent. Language-Tag was the syntax I wanted to allow, which is the whole format, see https://www.rfc-editor.org/info/rfc5646/#section-2.1
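
For reference, a sketch of a syntax-only check against the Language-Tag production of RFC 5646 section 2.1, assuming Python. The function name `is_well_formed_language_tag` is hypothetical. It checks well-formedness only: it does not consult the IANA Language Subtag Registry and does not reject duplicate variants or extension singletons, so it is not a validity check in the RFC's sense.

```python
import re

# Grandfathered tags listed in the RFC 5646 ABNF (irregular and regular).
GRANDFATHERED = frozenset(tag.lower() for tag in (
    "en-GB-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao", "i-tay",
    "i-tsu", "sgn-BE-FR", "sgn-BE-NL", "sgn-CH-DE",
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn", "zh-guoyu", "zh-hakka",
    "zh-min", "zh-min-nan", "zh-xiang",
))

# The "langtag" production, one line per subtag kind.
_LANGTAG = re.compile(
    r"""
    (?: [a-z]{2,3} (?: -[a-z]{3} ){0,3}               # language + extlang
      | [a-z]{4}                                      # reserved
      | [a-z]{5,8} )                                  # registered language
    (?: -[a-z]{4} )?                                  # script
    (?: -(?: [a-z]{2} | [0-9]{3} ) )?                 # region
    (?: -(?: [a-z0-9]{5,8} | [0-9][a-z0-9]{3} ) )*    # variants
    (?: -[0-9a-wy-z] (?: -[a-z0-9]{2,8} )+ )*         # extensions
    (?: -x (?: -[a-z0-9]{1,8} )+ )?                   # private use suffix
    """,
    re.VERBOSE | re.IGNORECASE | re.ASCII,
)

# The standalone "privateuse" production.
_PRIVATEUSE = re.compile(r"x(?:-[a-z0-9]{1,8})+", re.IGNORECASE | re.ASCII)


def is_well_formed_language_tag(tag: str) -> bool:
    """Return True if tag matches the Language-Tag ABNF (syntax only)."""
    if not tag.isascii():
        return False
    return (
        tag.lower() in GRANDFATHERED
        or _LANGTAG.fullmatch(tag) is not None
        or _PRIVATEUSE.fullmatch(tag) is not None
    )


print(is_well_formed_language_tag("nl-NL"))  # True
print(is_well_formed_language_tag("nl_NL"))  # False
```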

@handrews

Member

@TimvdLippe wow I've really not been on the ball in this PR... my apologies. Yes, this looks good to me.

4 participants