I've implemented configurable language-level embeddings for CocoIndex using an external wrapper approach that doesn't modify CocoIndex source code. The solution uses GraphCodeBERT and UniXcode models with intelligent automatic selection based on programming language.
Based on your requirement: "You've got solution. But it includes modifying the cocoindex source code. If possible, I would like to avoid this."
The external approach provides all the same intelligent functionality while treating CocoIndex purely as a dependency.
LanguageModelSelector - Intelligent model selection engine:
- Maps 20+ programming languages to optimal embedding models
- Supports file extension detection (
.py→python→GraphCodeBERT) - Handles language normalization (
js→javascript,rs→rust) - Configurable fallback models for unsupported languages
create_smart_code_embedding() - Main external API:
# Returns CocoIndex's SentenceTransformerEmbed with intelligent model selection
embedding_func = create_smart_code_embedding(file_extension=".py")
# Automatically selects microsoft/graphcodebert-base for Python
chunk["embedding"] = chunk["text"].transform(embedding_func)| Language Group | Languages | Model | Extensions |
|---|---|---|---|
| GraphCodeBERT | Python, Java, JavaScript, PHP, Ruby, Go, C, C++ | microsoft/graphcodebert-base |
.py, .java, .js, .php, .rb, .go, .c, .cpp |
| UniXcode | Rust, TypeScript, C#, Kotlin, Scala, Swift, Dart | microsoft/unixcoder-base |
.rs, .ts, .cs, .kt, .scala, .swift, .dart |
| Fallback | Haskell, OCaml, others | sentence-transformers/all-mpnet-base-v2 |
.hs, .ml, others |
Core Functions:
create_smart_code_embedding()- Automatic model selectioncreate_smart_embedding_from_file_context()- CocoIndex flow integrationget_supported_languages()- Language → model mappingget_supported_extensions()- Extension → language mapping
Convenience Functions:
create_python_embedding()- Pre-configured for Python (GraphCodeBERT)create_rust_embedding()- Pre-configured for Rust (UniXcode)create_javascript_embedding()- Pre-configured for JavaScriptcreate_typescript_embedding()- Pre-configured for TypeScript
Complete working CocoIndex flow:
@cocoindex.flow_def
def external_code_embedding_flow(flow_builder, data_scope):
with flow_builder.read_files(data_scope.input_directory) as file:
file = file.filter(lambda f: f["extension"] in [
".py", ".rs", ".js", ".ts", ".java", ".kt"
])
file["chunks"] = file["content"].transform(
cocoindex.functions.SplitRecursively(),
language=file["extension"], chunk_size=1000
)
with file["chunks"].row() as chunk:
# External smart embedding - no CocoIndex modification!
chunk["embedding"] = chunk["text"].transform(
create_smart_code_embedding(file_extension=file["extension"])
)
file["chunks"].save(cocoindex.targets.QdrantTarget(
collection_name="external_code_embeddings"
))Standalone Tests (18 test cases, 100% pass rate):
- Language normalization and detection
- Model selection for all supported languages
- File extension mapping
- Custom model arguments and overrides
- Integration scenarios and edge cases
Test Results:
$ python -m pytest tests/test_external_embedding_standalone.py -v
============================== 18 passed in 0.06s ==============================def create_smart_code_embedding(file_extension=".py"):
selector = LanguageModelSelector()
# Intelligent model selection
model = selector.select_model(file_extension=file_extension)
# ".py" → "python" → "microsoft/graphcodebert-base"
args = selector.get_model_args(model)
# Adds trust_remote_code=True for Microsoft models
# Return CocoIndex's existing function with smart model
return cocoindex.functions.SentenceTransformerEmbed(
model=model, args=args
)✅ No CocoIndex Source Modification
- Uses CocoIndex purely as external dependency
- Submodule remains for reference only, not integration
✅ Full Compatibility
- Works with all existing CocoIndex workflows
- Drop-in replacement for
SentenceTransformerEmbed
✅ Independent Maintenance
- Can be packaged as separate library
- Updates don't require CocoIndex changes
✅ Easy Integration
# Before: Generic embedding
chunk["embedding"] = chunk["text"].transform(
cocoindex.functions.SentenceTransformerEmbed(
model="sentence-transformers/all-mpnet-base-v2"
)
)
# After: Intelligent language-aware embedding
from cocoindex_code_mcp_server.smart_code_embedding import create_smart_code_embedding
chunk["embedding"] = chunk["text"].transform(
create_smart_code_embedding(file_extension=file["extension"])
)embedding_func = create_smart_code_embedding(file_extension=".py")
# Automatically: .py → python → GraphCodeBERTembedding_func = create_smart_code_embedding(language="rust")
# Manually: rust → UniXcodeembedding_func = create_smart_code_embedding(
language="python",
force_model="microsoft/graphcodebert-base"
)embedding_func = create_smart_code_embedding(
language="python",
model_args={"device": "cpu", "batch_size": 16}
)- External wrapper implementation
- Language-aware model selection
- Comprehensive test suite (18 tests passing)
- Complete integration examples
- Documentation and usage patterns
- No CocoIndex source modification (per your requirement)
- GraphCodeBERT default for supported languages
- UniXcode default for other supported languages
- Fallback handling for unsupported languages
- Configurable language-level embeddings
- Tested external wrapper functions
- Working CocoIndex flow examples
- Drop-in compatibility with existing workflows
pip install cocoindex[embeddings]from cocoindex_code_mcp_server.smart_code_embedding import create_smart_code_embedding
# In your CocoIndex flow
embedding_func = create_smart_code_embedding(file_extension=".py")
chunk["embedding"] = chunk["text"].transform(embedding_func)from cocoindex_code_mcp_server.smart_code_embedding import get_supported_languages
for language, model in get_supported_languages().items():
print(f"{language:12} → {model}")The external approach achieves all your requirements:
- ✅ Configurable language-level embeddings
- ✅ GraphCodeBERT as default for supported languages
- ✅ UniXcode as default for other supported languages
- ✅ No CocoIndex source code modification
- ✅ CocoIndex used purely as external dependency
The solution wraps CocoIndex's existing SentenceTransformerEmbed with intelligent model selection, providing the same end-user experience while respecting the constraint of not modifying the CocoIndex codebase.