sem uses tree-sitter grammars to extract semantic entities (functions, classes, etc.) from source code. Adding a new language is straightforward: you define a config struct and add a cargo dependency. No parser code needed.
This guide walks through the process step by step.
All language support lives in two files:
- crates/sem-core/Cargo.toml (tree-sitter grammar dependency)
- crates/sem-core/src/parser/plugins/code/languages.rs (language config)
Each language gets a LanguageConfig that tells sem which AST node types represent code entities.
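For orientation, here is a rough sketch of what such a config struct could look like. The field names are taken from the Scala config shown later in this guide; the field types, the stand-in `Language` type, and the `DEMO_CONFIG` / `get_demo` names are assumptions made so the snippet compiles on its own. The real definitions live in sem-core and may differ.

```rust
// Stand-in for tree_sitter::Language so this sketch compiles without the
// tree-sitter crate (assumption; sem uses the real type).
#[derive(Debug, PartialEq)]
pub struct Language;

// Field names follow the HCL example in this guide; the types are assumed.
pub struct SuppressedNestedEntity {
    pub parent_entity_node_type: &'static str,
    pub child_entity_node_type: &'static str,
}

// Field names follow the Scala config in this guide; the types are assumed.
pub struct LanguageConfig {
    pub id: &'static str,
    pub extensions: &'static [&'static str],
    pub entity_node_types: &'static [&'static str],
    pub container_node_types: &'static [&'static str],
    pub call_entity_identifiers: &'static [&'static str],
    pub suppressed_nested_entities: &'static [SuppressedNestedEntity],
    pub get_language: fn() -> Option<Language>,
}

// Hypothetical loader for a made-up "demo" language.
fn get_demo() -> Option<Language> {
    Some(Language)
}

// Hypothetical config, shaped like the real ones in languages.rs.
pub static DEMO_CONFIG: LanguageConfig = LanguageConfig {
    id: "demo",
    extensions: &[".demo"],
    entity_node_types: &["function_definition"],
    container_node_types: &["block"],
    call_entity_identifiers: &[],
    suppressed_nested_entities: &[],
    get_language: get_demo,
};
```

Because every field is a static slice or a plain function pointer, a config like this can be declared as a `static` with no runtime initialization.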
Add the grammar crate to crates/sem-core/Cargo.toml:
[dependencies]
tree-sitter-scala = "0.23"
Most grammars are published on crates.io as tree-sitter-{lang}. Check crates.io for the latest version. The 0.23 series works with tree-sitter 0.26.
In languages.rs, add a function that returns the tree-sitter Language:
fn get_scala() -> Option<Language> {
Some(tree_sitter_scala::LANGUAGE.into())
}
Some crates export the language differently. Check the crate's docs. Common patterns:
- tree_sitter_python::LANGUAGE (most languages)
- tree_sitter_typescript::LANGUAGE_TYPESCRIPT (when a crate has multiple grammars)
- tree_sitter_php::LANGUAGE_PHP (same)
Add a static config in languages.rs:
static SCALA_CONFIG: LanguageConfig = LanguageConfig {
id: "scala",
extensions: &[".scala", ".sc"],
entity_node_types: &[
"function_definition",
"class_definition",
"object_definition",
"trait_definition",
"val_definition",
"var_definition",
"type_definition",
],
container_node_types: &["template_body", "block"],
call_entity_identifiers: &[],
suppressed_nested_entities: &[],
get_language: get_scala,
};
Add a reference to the ALL_CONFIGS array:
static ALL_CONFIGS: &[&LanguageConfig] = &[
// ... existing configs ...
&SCALA_CONFIG,
];
Add all file extensions to get_all_code_extensions():
static EXTENSIONS: &[&str] = &[
// ... existing extensions ...
".scala", ".sc",
];
Add a test in crates/sem-core/src/parser/plugins/code/mod.rs:
#[test]
fn test_scala_entity_extraction() {
let code = r#"
class UserService {
def getUsers(): List[User] = {
db.findAll()
}
}
object AppConfig {
val version = "1.0"
}
trait Repository[T] {
def findById(id: String): Option[T]
}
"#;
let plugin = CodeParserPlugin;
let entities = plugin.extract_entities(code, "UserService.scala");
let names: Vec<&str> = entities.iter().map(|e| e.name.as_str()).collect();
eprintln!("Scala entities: {:?}", entities.iter().map(|e| (&e.name, &e.entity_type)).collect::<Vec<_>>());
assert!(names.contains(&"UserService"), "got: {:?}", names);
assert!(names.contains(&"AppConfig"), "got: {:?}", names);
assert!(names.contains(&"Repository"), "got: {:?}", names);
}
Run the test suite:
cd crates && cargo test
All existing tests plus your new test should pass.
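Extensions are listed in two places: in each config's extensions field and in the list behind get_all_code_extensions(). A small consistency check can catch an extension that was added to one but not the other. The sketch below is an illustration only: it uses a trimmed-down stand-in for `LanguageConfig`, and `missing_extensions` is a hypothetical helper, not something sem provides.

```rust
// Minimal stand-in for sem's LanguageConfig; only the fields this check needs.
pub struct LanguageConfig {
    pub id: &'static str,
    pub extensions: &'static [&'static str],
}

/// Returns (language id, extension) pairs that a config declares but the
/// global extension list does not contain.
pub fn missing_extensions(
    configs: &[&LanguageConfig],
    all_extensions: &[&str],
) -> Vec<(&'static str, &'static str)> {
    let mut missing = Vec::new();
    for config in configs {
        for ext in config.extensions {
            if !all_extensions.contains(ext) {
                missing.push((config.id, *ext));
            }
        }
    }
    missing
}
```

In a real test you would pass ALL_CONFIGS and the extension list, then assert that the returned vector is empty.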
entity_node_types lists the tree-sitter AST node types that represent top-level code entities. These are the things sem tracks: functions, classes, interfaces, etc.
For most languages, this is all you need. Examples:
| Language | Node types |
|---|---|
| Python | function_definition, class_definition, decorated_definition |
| Rust | function_item, struct_item, enum_item, impl_item, trait_item |
| Go | function_declaration, method_declaration, type_declaration |
container_node_types lists the AST nodes that can contain nested entities. When sem finds a container, it looks inside for child entities and sets up parent-child relationships.
For example, in Java a class_body contains method declarations. Setting container_node_types: &["class_body"] lets sem extract methods as children of the class.
Common containers: block, class_body, declaration_list, compound_statement.
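To make the interplay between entity node types and container node types concrete, here is a toy model of the traversal. It is a sketch under stated assumptions, not sem's implementation: `Node`, `Entity`, and `collect_entities` are hypothetical, and the tree is built by hand rather than produced by tree-sitter.

```rust
// Toy AST node used only for this sketch; sem walks real tree-sitter nodes.
pub struct Node {
    pub kind: &'static str,
    pub name: Option<&'static str>,
    pub children: Vec<Node>,
}

#[derive(Debug, PartialEq)]
pub struct Entity {
    pub name: String,
    pub kind: String,
    pub parent: Option<String>,
}

/// Walks the children of `node`. A child whose kind is in `entity_kinds`
/// becomes an entity. Nested entities are searched for only inside those of
/// its children whose kind is in `container_kinds`.
pub fn collect_entities(
    node: &Node,
    entity_kinds: &[&str],
    container_kinds: &[&str],
    parent: Option<&str>,
    out: &mut Vec<Entity>,
) {
    for child in &node.children {
        if !entity_kinds.contains(&child.kind) {
            continue;
        }
        let name = child.name.unwrap_or("<anonymous>");
        out.push(Entity {
            name: name.to_string(),
            kind: child.kind.to_string(),
            parent: parent.map(str::to_string),
        });
        // Descend only through container nodes, recording this entity as parent.
        for inner in &child.children {
            if container_kinds.contains(&inner.kind) {
                collect_entities(inner, entity_kinds, container_kinds, Some(name), out);
            }
        }
    }
}
```

In this model, a Java-like tree with class_declaration > class_body > method_declaration yields the method as a child of the class only when "class_body" is listed as a container. With an empty container list, only the class is found.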
call_entity_identifiers is for languages where entities are defined via function calls rather than dedicated syntax. Elixir is the primary example:
defmodule MyApp do # "defmodule" is a call, not a keyword
def greet(name) do # "def" is a call
"Hello #{name}"
end
end
Set entity_node_types to &[] and list the call identifiers instead:
call_entity_identifiers: &["defmodule", "def", "defp", "defmacro", ...],
Most languages don't need this. Leave it as &[].
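The idea behind call-based matching can be sketched as follows. This is a simplified illustration, not sem's code: `CallNode` and `entity_from_call` are hypothetical, and the call target and name are assumed to have already been read from the AST.

```rust
// Toy representation of a call node (hypothetical; not a tree-sitter type).
pub struct CallNode {
    pub target: &'static str,         // identifier being called, e.g. "defmodule"
    pub first_argument: &'static str, // already-resolved name, e.g. "MyApp"
}

/// Returns (entity type, entity name) when the call target is one of the
/// configured identifiers, and None for any other call.
pub fn entity_from_call(
    call: &CallNode,
    call_entity_identifiers: &[&str],
) -> Option<(String, String)> {
    if call_entity_identifiers.contains(&call.target) {
        Some((call.target.to_string(), call.first_argument.to_string()))
    } else {
        None
    }
}
```

An ordinary call whose target is not in the list is ignored, which is why the identifier list has to name every defining form you want tracked.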
suppressed_nested_entities prevents double extraction when a child entity type shouldn't be extracted inside a parent entity type. HCL uses it to suppress attribute nodes nested inside block nodes, since the block already captures that content.
suppressed_nested_entities: &[SuppressedNestedEntity {
parent_entity_node_type: "block",
child_entity_node_type: "attribute",
}],
Most languages don't need this. Leave it as &[].
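One way to picture how such a rule is applied is shown below. The struct mirrors the field names used in the HCL example; `is_suppressed` is a hypothetical helper written for this guide, so treat it as a sketch of the rule rather than sem's actual logic.

```rust
// Field names follow the HCL example above; the types are assumed.
pub struct SuppressedNestedEntity {
    pub parent_entity_node_type: &'static str,
    pub child_entity_node_type: &'static str,
}

/// Returns true when `child_kind` should be skipped because it is nested
/// inside an entity of `parent_kind` and a rule covers that pair.
pub fn is_suppressed(
    rules: &[SuppressedNestedEntity],
    parent_kind: Option<&str>,
    child_kind: &str,
) -> bool {
    match parent_kind {
        Some(parent) => rules.iter().any(|rule| {
            rule.parent_entity_node_type == parent
                && rule.child_entity_node_type == child_kind
        }),
        // No enclosing entity, so there is nothing to suppress against.
        None => false,
    }
}
```

With the HCL rule, an attribute nested in a block is skipped, while a top-level attribute or a block nested in a block is still extracted.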
The hardest part is figuring out which AST node types your language uses. Here's how:
Go to tree-sitter.github.io/tree-sitter/playground. Paste some sample code and look at the parse tree. The node type names in the tree are exactly what you put in entity_node_types.
Every tree-sitter grammar has a grammar.js or src/node-types.json in its repo. Search for the node types you need. The GitHub repos are usually at tree-sitter/tree-sitter-{lang} or tree-sitter-grammars/tree-sitter-{lang}.
If you have the tree-sitter CLI installed:
tree-sitter parse sample.scala
This prints the full AST with node types.
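For a large file, the printed tree can be long. If you only want the set of distinct node type names, a small helper can pull them out of the S-expression text. The function below is a hypothetical convenience, not part of sem or the tree-sitter CLI. It assumes every node type appears directly after an opening parenthesis, which matches the usual shape of the printed tree.

```rust
use std::collections::BTreeSet;

/// Collects the distinct node type names from S-expression text. A node type
/// is taken to be the identifier that directly follows an opening parenthesis.
pub fn node_kinds(sexp: &str) -> BTreeSet<String> {
    let mut kinds = BTreeSet::new();
    let mut chars = sexp.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '(' {
            continue;
        }
        // Read identifier characters until something else shows up.
        let mut kind = String::new();
        while let Some(&next) = chars.peek() {
            if next.is_alphanumeric() || next == '_' {
                kind.push(next);
                chars.next();
            } else {
                break;
            }
        }
        if !kind.is_empty() {
            kinds.insert(kind);
        }
    }
    kinds
}
```

Feed it the saved output of the parse command and compare the resulting names against your entity_node_types and container_node_types lists.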
- Start with the obvious ones: function_definition, class_definition, etc.
- Use eprintln! in your test to see what entities are extracted. The existing tests all do this.
- If something isn't extracted, the node type name is probably different. Check the AST.
- If too many things are extracted, you may be including container nodes or low-level syntax.
Open an issue or check the existing language configs in languages.rs for reference. The simplest configs (Python, Bash) are good starting points. The Elixir config shows the call-based approach.