Protobuf has two file formats that look superficially similar but are fundamentally different languages:
.protofiles — schema definitions (declare structure: messages, fields, types).textproto/.txtpbfiles — data instances (assign values to fields of a message)
They share vocabulary (field names, type names, enum values) but differ in syntax, purpose,
and — critically — in how meaning is determined. A .proto file is self-contained: its
meaning comes from its own declarations. A .textproto file is meaningless without a
schema — you can't validate field names, check types, or resolve enums unless you know
which message type the file represents.
Reusing the proto parser would mean forcing an instance format into a definition grammar.
The grammars are structurally incompatible: proto has message Foo { int32 bar = 1; }
while textproto has bar: 42. Shared parsing would require so many special cases that the
"shared" code would be harder to maintain than two separate implementations.
Choice: ProtoText is a completely independent IntelliJ language with its own grammar
(prototext.bnf, ~200 lines vs proto's ~600), its own lexer, its own PSI tree hierarchy
(ProtoTextElement vs ProtobufElement), and its own token types (notably # comments
instead of // and /* */).
What would sharing have broken?
- Grammar structure. Proto's top-level is
(Edition|Syntax)? (Import|Package|Option|Message|Enum|Service)*. TextProto's top-level is justField*. Sharing a grammar would mean the textproto parser carries dead weight for every proto construct. - Token types. TextProto uses
#for comments; proto uses//and/* */. Sharing a lexer would mean conditional comment handling everywhere. - PSI semantics. A
FieldNamein proto is a declaration site. AFieldNamein textproto is a usage site that must resolve to a proto declaration. Same name, opposite roles — sharing the class would conflate definition and reference.
What IS shared: The reference resolution infrastructure. TextProto references resolve
to proto PSI classes (ProtobufFieldDefinition, ProtobufEnumValueDefinition,
ProtobufMessageDefinition). Both languages use ProtobufSymbolResolver for scope-based
lookup and the same stub indexes (ShortNameIndex, QualifiedNameIndex) for cross-file
resolution. The separation is at the language boundary; the integration is at the semantic
boundary.
Choice: TextProto files declare their schema using special comments at the top of the file:
# proto-file: path/to/schema.proto
# proto-message: package.MessageName
These comments are parsed by ProtoTextSharpCommentReferenceContributor and create
real IntelliJ references:
ProtoTextHeaderFileReferenceresolves# proto-file:to aProtobufFile(supports both relative paths and module paths viaProtobufRootResolver)ProtoTextHeaderMessageReferenceresolves# proto-message:within the resolved file's scope to aProtobufMessageDefinition
Why comments instead of a language construct? Because the text format specification
doesn't define a schema-linking syntax. These header comments are a de facto convention
(used by Google's internal tools and buf), not part of the grammar. Making them comments
means:
- Files remain valid textproto regardless of tooling
- The linking mechanism is optional — files without headers just lose IDE features
- The
ProtoTextFile.schema()method returnsnullgracefully when headers are missing
What alternatives exist? External configuration (a project-level mapping file) or filename conventions. Both are more fragile and less discoverable than inline comments.
Choice: TextProto references always point outward to proto definitions. Proto never references textproto. This creates a one-directional dependency graph.
The three reference types in textproto are:
| Reference | Source | Resolves To |
|---|---|---|
ProtoTextFieldReference |
Field name in assignment | ProtobufFieldLike in the message schema |
ProtoTextTypeNameReference |
Extension/any type bracket [pkg.Type] |
ProtobufFieldDefinition or global type via stub index |
ProtoTextEnumValueReference |
Enum literal value | ProtobufEnumValueDefinition in the field's enum type |
Resolution strategy: Field references walk up the textproto tree to find the owner
message context, then search that message's items for a matching field name. For map fields,
the synthetic key and value fields are handled specially. Type references for extensions
use ProtobufSymbolResolver; for Any types, they use the global QualifiedNameIndex.
Completion leverages the same resolution: knowing the current message context lets the
plugin suggest all valid field names (with " {}" for message fields, ": " for scalars)
and all valid enum values.
.textproto file
↓
Parse with prototext grammar → ProtoTextFile PSI tree
↓
Read header comments:
# proto-file: → ProtoTextHeaderFileReference → ProtobufFile
# proto-message: → ProtoTextHeaderMessageReference → ProtobufMessageDefinition
↓
For each field assignment:
FieldName → ProtoTextFieldReference → ProtobufFieldLike
If field type is message → recurse into nested message context
If field type is enum → ProtoTextEnumValueReference → ProtobufEnumValueDefinition
If extension syntax [Type] → ProtoTextTypeNameReference → resolve globally
The schema acts as the "type system" for textproto. Without it, field names are just strings. With it, the plugin provides completion, validation, navigation, and rename refactoring — all flowing from the two-line header comment.
The core design tension is: textproto is syntactically independent but semantically dependent on proto. The architecture resolves this by making the languages fully separate (different grammars, lexers, PSI trees) while making the reference system fully integrated (textproto references resolve to proto PSI nodes, using the same symbol resolution infrastructure). This means each language can evolve its parsing independently, but IDE features like go-to-definition and completion work seamlessly across the boundary.