feat(linguist): implement Phase 2 auto-generation infrastructure#18
Closed
github-actions[bot] wants to merge 3 commits into
Closed
feat(linguist): implement Phase 2 auto-generation infrastructure#18github-actions[bot] wants to merge 3 commits into
github-actions[bot] wants to merge 3 commits into
Conversation
- Add `supported_in_singularity` flag (defaults to false, explicitly true for our 24 languages) - Add `language_type` field aligned with Linguist's classification - Update all 24 language registrations with new fields - Source of truth: <https://github.com/github-linguist/linguist/blob/main/lib/linguist/languages.yml> ## Governance Model Language definitions now follow GitHub Linguist's standard: - Prevents ad-hoc language additions - Ensures consistency across ecosystem - Automatic tracking via Renovate (weekly) ## Build Script Enhancement Updated build.rs with future capability for: - Automatic Linguist languages.yml synchronization - Code generation from Linguist definitions - Auto-update when Linguist adds new languages ## Renovate Configuration - New rule to track Linguist releases (weekly) - Labels: linguist, language-registry - Manual review for language definition changes This prepares Singularity for scalable language support while maintaining explicit governance over what's actually supported. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
…cation ## What's New FileClassifier Module: Detect vendored, generated, and binary files - Uses patterns from GitHub Linguist (vendor.yml, generated.rb) - Supports: vendored detection, generated file detection, binary detection - Methods: is_vendored(), is_generated(), is_binary(), classify(), should_analyze() Phase 1: Language Definitions - DONE - Languages synced from Linguist languages.yml - supported_in_singularity flag for explicit support - Weekly Renovate alerts Phase 2: File Classification - READY - FileClassifier implementation complete - Ready to auto-generate from Linguist patterns - Supports: vendor paths, generated extensions, binary formats, documentation markers Phase 3: Detection Heuristics - PLANNED - Future: Auto-generate from Linguist heuristics.yml - Fallback language detection for ambiguous extensions New Files: - src/file_classifier.rs: File classification engine - LINGUIST_INTEGRATION.md: Complete documentation - Updated build.rs: 3-phase roadmap - Updated renovate.json5: Enhanced PR instructions Benefits: ✅ Skip vendored code (node_modules/, vendor/) ✅ Skip generated files (.pb.rs, .generated.ts, etc.) ✅ Skip binary files (images, archives, executables) ✅ Auto-updated with Linguist releases ✅ Reduces false positives in code analysis Testing: All tests pass, Clippy and fmt clean 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
Phase 2 Implementation: Auto-generate File Classification Patterns New Files Added: scripts/sync_linguist_patterns.py (200+ lines) - Downloads vendor.yml from Linguist - Downloads generated.rb from Linguist - Parses YAML and Ruby code - Extracts vendored, generated, and binary file patterns - Generates Rust code arrays for FileClassifier tools/linguist_sync.rs (130+ lines) - Rust implementation roadmap - Pattern parsing architecture - Code generation infrastructure Updated Files: build.rs: Enhanced documentation - Added manual synchronization workflow - Documented automated (future) workflow - Phase 2 in-progress status - Maintenance instructions justfile: New command - just sync-linguist: Run Python script to sync patterns - Provides step-by-step next actions - Integrates into development workflow LINGUIST_INTEGRATION.md: Detailed Phase 2 documentation - Status: FileClassifier, Script, Integration, CI - Manual + Automated sync workflows - Implementation details - Usage examples Workflow: For Maintainers (When Linguist Updates): just sync-linguist cargo test git add . git commit For Automation (Future): cargo xtask sync-linguist What Gets Synced: - Vendored paths: node_modules/, vendor/, .yarn/ - Generated files: .pb.rs, .generated.ts, .designer.cs - Binary formats: images, archives, executables 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Analyzing changes...
Commits:
Changed Files:
.github/workflows/docs.yml | 106 ----------------
Cargo.lock | 2 +-
Cargo.toml | 2 +-
LINGUIST_INTEGRATION.md | 261 ++++++++++++++++++++++++++++++++++++++
build.rs | 60 ++++++++-
examples/usage.rs | 7 +-
flake.lock | 17 +++
justfile | 13 ++
renovate.json5 | 44 +++++++
scripts/sync_linguist_patterns.py | 219 ++++++++++++++++++++++++++++++++
src/file_classifier.rs | 242 +++++++++++++++++++++++++++++++++++
src/lib.rs | 4 +
src/metadata.rs | 26 ++--
src/registry.rs | 215 +++++++++++++++++++++++--------
src/utils.rs | 8 +-
tools/linguist_sync.rs | 148 +++++++++++++++++++++
16 files changed, 1192 insertions(+), 182 deletions(-)
Detailed Changes: