
Commit a3017e6

mikkihugo and claude committed
feat(linguist): implement Phase 2 auto-generation infrastructure
Phase 2 Implementation: Auto-generate File Classification Patterns

New Files Added:

scripts/sync_linguist_patterns.py (200+ lines)
- Downloads vendor.yml from Linguist
- Downloads generated.rb from Linguist
- Parses YAML and Ruby code
- Extracts vendored, generated, and binary file patterns
- Generates Rust code arrays for FileClassifier

tools/linguist_sync.rs (130+ lines)
- Rust implementation roadmap
- Pattern parsing architecture
- Code generation infrastructure

Updated Files:

build.rs: Enhanced documentation
- Added manual synchronization workflow
- Documented automated (future) workflow
- Phase 2 in-progress status
- Maintenance instructions

justfile: New command
- just sync-linguist: Run Python script to sync patterns
- Provides step-by-step next actions
- Integrates into development workflow

LINGUIST_INTEGRATION.md: Detailed Phase 2 documentation
- Status: FileClassifier, Script, Integration, CI
- Manual + Automated sync workflows
- Implementation details
- Usage examples

Workflow:

For Maintainers (When Linguist Updates):
  just sync-linguist
  cargo test
  git add .
  git commit

For Automation (Future):
  cargo xtask sync-linguist

What Gets Synced:
- Vendored paths: node_modules/, vendor/, .yarn/
- Generated files: .pb.rs, .generated.ts, .designer.cs
- Binary formats: images, archives, executables

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 431613e commit a3017e6

5 files changed

Lines changed: 490 additions & 31 deletions

LINGUIST_INTEGRATION.md

Lines changed: 67 additions & 8 deletions
@@ -45,7 +45,13 @@ if lang.supported_in_singularity {
 - **Action**: Manual review required before merge
 - **Update**: When Linguist releases a new version

-## Phase 2: File Classification (Ready)
+## Phase 2: File Classification (In Progress)
+
+### Status
+- **FileClassifier module**: Implemented with 5 tests
+- **Synchronization script**: Created (`scripts/sync_linguist_patterns.py`)
+- 🔧 **Integration in progress**: Add `sync-linguist` justfile command
+- 📋 **Next**: Add to CI workflow

 ### What Will Be Added

@@ -57,6 +63,7 @@ Auto-skip third-party dependencies:
 - .yarn/
 - Pods/
 - third_party/
+- Carthage/
 ```

 #### Generated File Detection
@@ -78,8 +85,52 @@ Skip non-text files:
 - *.pdf, *.docx (Documents)
 ```

-### Implementation
-Using `FileClassifier` (already implemented):
+### How It Works
+
+#### Step 1: Manual Synchronization (Current)
+When Linguist updates (Renovate alert):
+```bash
+# Sync patterns from Linguist to Rust code
+python3 scripts/sync_linguist_patterns.py > src/file_classifier_generated.rs
+
+# Run tests to validate patterns
+cargo test
+
+# Commit the generated patterns
+git add src/file_classifier_generated.rs
+git commit -m "chore(linguist): sync file classification patterns"
+```
+
+#### Step 2: Automated Synchronization (Future)
+```bash
+# Automatic sync via justfile
+just sync-linguist
+
+# Or via cargo xtask
+cargo xtask sync-linguist
+```
+
+### Implementation Details
+
+#### Synchronization Script (`scripts/sync_linguist_patterns.py`)
+1. **Downloads from Linguist**:
+   - `vendor.yml`: Vendored code patterns (6.5KB)
+   - `generated.rb`: Generated file detection logic (29.8KB)
+   - `heuristics.yml`: Language detection rules (35KB, Phase 3)
+
+2. **Parses patterns**:
+   - YAML parsing for `vendor.yml`
+   - Ruby AST parsing for `generated.rb`
+   - Regex extraction and normalization
+
+3. **Generates Rust code**:
+   - Static arrays: `VENDORED_PATTERNS_FROM_LINGUIST`
+   - Static arrays: `GENERATED_PATTERNS_FROM_LINGUIST`
+   - Static arrays: `BINARY_PATTERNS_FROM_LINGUIST`
+
+4. **Output**: `src/file_classifier_generated.rs` (auto-generated)
+
+#### FileClassifier Usage
 ```rust
 use singularity_language_registry::FileClassifier;
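Reviewer sketch: the parse and generate steps listed under "Synchronization Script" could look roughly like the following. This is not the content of `scripts/sync_linguist_patterns.py`, which this page does not show. The function names, the `&[&str]` array type, and the line-based list parsing (used here instead of a YAML library to stay stdlib-only) are assumptions made for illustration.

```python
def parse_vendor_patterns(yaml_text: str) -> list[str]:
    """Collect the entries of a flat YAML list such as Linguist's vendor.yml.

    Assumption: every pattern sits on its own line as "- <regex>", optionally
    wrapped in matching quotes. Comments and blank lines are skipped.
    """
    patterns = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("- "):
            continue
        value = stripped[2:].strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        patterns.append(value)
    return patterns


def rust_raw_string(value: str) -> str:
    """Render value as a Rust raw string literal, adding '#' until it is unambiguous."""
    hashes = "#"
    while f'"{hashes}' in value:
        hashes += "#"
    return f'r{hashes}"{value}"{hashes}'


def render_rust_array(name: str, patterns: list[str]) -> str:
    """Emit a static Rust slice holding the patterns (hypothetical output shape)."""
    lines = [f"pub static {name}: &[&str] = &["]
    lines.extend(f"    {rust_raw_string(p)}," for p in patterns)
    lines.append("];")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "# Vendored dependencies\n- (^|/)node_modules/\n- '(^|/)vendor/'\n"
    print(render_rust_array("VENDORED_PATTERNS_FROM_LINGUIST", parse_vendor_patterns(sample)))
```

Raw string literals keep the regex backslashes intact on the Rust side, so the generated file needs no extra escaping pass.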

@@ -99,11 +150,19 @@ if classifier.should_analyze(path) {

 ### Source Data
 - **`vendor.yml`**: Vendored code patterns (6.5KB)
-- **`generated.rb`**: Generated file detection logic (29.8KB)
-  - File path patterns (node_modules/, dist/, build/)
-  - Extension patterns (.pb.rs, .generated.ts)
-  - Content markers (Generated by, DO NOT EDIT)
-  - Structural analysis (minified lines, closure patterns)
+  - Dependency manager directories
+  - IDE/editor artifacts
+  - Build output directories
+  - Framework-specific paths
+
+- **`generated.rb`**: Generated file detection (29.8KB)
+  - File path patterns
+  - Extension matching
+  - Content header signatures (Generated by, DO NOT EDIT)
+  - Minification detection
+  - Metadata inspection
+
+- **`heuristics.yml`**: Language detection rules (Phase 3)

 ## Phase 3: Detection Heuristics (Planned)
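Reviewer sketch: two of the `generated.rb` signals named under "Source Data", content header signatures and minification detection, can be illustrated as below. The marker strings come from the diff; the function names, the five-line header window, and the average-line-length threshold are assumptions for illustration and are not taken from this commit.

```python
# Marker strings as listed in the diff; matching is case-sensitive here.
GENERATED_MARKERS = ("Generated by", "DO NOT EDIT")


def has_generated_header(text: str, max_lines: int = 5) -> bool:
    """Return True if any of the first max_lines lines contains a marker."""
    head = text.splitlines()[:max_lines]
    return any(marker in line for line in head for marker in GENERATED_MARKERS)


def looks_minified(text: str, max_avg_line_length: int = 110) -> bool:
    """Return True if the average line length exceeds the (assumed) threshold."""
    lines = text.splitlines()
    if not lines:
        return False
    return sum(len(line) for line in lines) / len(lines) > max_avg_line_length


if __name__ == "__main__":
    header = "// Code generated by protoc-gen-go. DO NOT EDIT.\npackage api\n"
    print(has_generated_header(header))
    print(looks_minified("var a=1;" * 100))
```

Restricting the marker search to the top of the file avoids flagging hand-written sources that merely mention the phrases deeper in a comment or string.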

build.rs

Lines changed: 43 additions & 23 deletions
@@ -8,35 +8,55 @@
 //! This ensures Singularity language definitions stay consistent with GitHub's standard.
 //! Renovate automatically alerts when Linguist updates (weekly schedule).
 //!
-//! ## Future: Extended Linguist Integration (Option 2 Roadmap)
+//! ## Extended Linguist Integration (Option 2 - In Progress)
 //!
-//! In the future, this build script can be extended to automatically synchronize with Linguist:
-//!
-//! ### Phase 1: Language Definitions (DONE)
+//! ### Phase 1: Language Definitions (✅ DONE)
 //! - ✅ `languages.yml` synced to registry
 //! - ✅ `supported_in_singularity` flag for explicit support
-//! - ✅ Weekly Renovate alerts for updates
+//! - ✅ Weekly Renovate alerts
+//!
+//! ### Phase 2: File Classification (🔧 IN PROGRESS)
+//!
+//! #### Implementation Step 1: Manual Synchronization (Current)
+//! Run the synchronization script when Linguist updates:
+//! ```bash
+//! python3 scripts/sync_linguist_patterns.py > src/file_classifier_generated.rs
+//! cargo test
+//! git add src/file_classifier_generated.rs
+//! git commit -m "chore(linguist): sync file classification patterns"
+//! ```
+//!
+//! #### Implementation Step 2: Automated Synchronization (Future)
+//! This build script can be extended to:
+//! ```bash
+//! cargo xtask sync-linguist
+//! ```
+//!
+//! Which will:
+//! 1. Download `vendor.yml` from Linguist
+//! 2. Download `generated.rb` from Linguist
+//! 3. Parse and extract patterns
+//! 4. Generate Rust code arrays
+//! 5. Update `src/file_classifier_generated.rs`
+//! 6. Run tests to validate
 //!
-//! ### Phase 2: File Classification (READY FOR IMPLEMENTATION)
-//! - Extract `vendor.yml` patterns from Linguist
-//! - Extract `generated.rb` heuristics from Linguist
-//! - Auto-generate:
-//!   - Vendored path patterns (`node_modules/`, `vendor/`, `.yarn/`, etc.)
-//!   - Generated file extensions (`.pb.rs`, `.generated.ts`, etc.)
-//!   - Generated file content markers
-//! - Result: `FileClassifier` is kept in sync with Linguist
+//! #### Patterns Extracted
+//! - **Vendored**: `node_modules/`, `vendor/`, `.yarn/`, `Pods/`, `dist/`, `build/`
+//! - **Generated**: `.pb.rs`, `.pb.go`, `.generated.ts`, `.designer.cs`, `.meta`
+//! - **Binary**: `.png`, `.jpg`, `.zip`, `.exe`, `.dll`, `.pdf`
 //!
-//! ### Phase 3: Detection Heuristics (Future)
-//! - Extract `heuristics.yml` from Linguist
-//! - Generate language detection rules for ambiguous extensions
-//! - Support fallback detection when extension alone is unclear
+//! ### Phase 3: Detection Heuristics (📋 PLANNED)
+//! - Extract `heuristics.yml` from Linguist (35KB)
+//! - Generate fallback language detection for ambiguous extensions
+//! - Support: `.pl` (Perl vs Prolog), `.m` (Objective-C vs Matlab), etc.
 //!
-//! ### Implementation
-//! When Renovate detects a Linguist update (weekly):
-//! 1. Review the changes in the PR
-//! 2. If significant: regenerate file classification and language definitions
-//! 3. Run full test suite
-//! 4. Merge and release new registry version
+//! ### Maintenance Workflow
+//! When Renovate creates a Linguist update PR:
+//! 1. Review language definition changes
+//! 2. Run: `python3 scripts/sync_linguist_patterns.py`
+//! 3. Run: `cargo test`
+//! 4. Commit changes: `git add . && git commit`
+//! 5. Merge and create release
 //!
 //! This can be used to ensure registry metadata matches actual library capabilities.
 //! Run with: cargo build --features validate-metadata
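Reviewer sketch: the three pattern groups listed under "Patterns Extracted" suggest a path classifier along the following lines. The pattern values are copied from the doc comment. The `classify` function, its return labels, the case-insensitive suffix match, and the vendored-before-generated-before-binary precedence are assumptions, not the `FileClassifier` API.

```python
from pathlib import PurePosixPath

# Values taken from the "Patterns Extracted" list in the build.rs doc comment.
VENDORED_DIRS = {"node_modules", "vendor", ".yarn", "Pods", "dist", "build"}
GENERATED_SUFFIXES = (".pb.rs", ".pb.go", ".generated.ts", ".designer.cs", ".meta")
BINARY_SUFFIXES = (".png", ".jpg", ".zip", ".exe", ".dll", ".pdf")


def classify(path: str) -> str:
    """Label a repo-relative POSIX path (hypothetical labels and precedence)."""
    p = PurePosixPath(path)
    # Only directory components count, so a file named build.rs is not vendored.
    if any(part in VENDORED_DIRS for part in p.parts[:-1]):
        return "vendored"
    name = p.name.lower()
    if name.endswith(GENERATED_SUFFIXES):
        return "generated"
    if name.endswith(BINARY_SUFFIXES):
        return "binary"
    return "source"


if __name__ == "__main__":
    for candidate in ("node_modules/left-pad/index.js", "src/api/user.pb.rs",
                      "assets/logo.PNG", "build.rs"):
        print(candidate, "->", classify(candidate))
```

Matching directory names against path components rather than substrings keeps files such as `src/rebuild/mod.rs` from being skipped by accident.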

justfile

Lines changed: 13 additions & 0 deletions
@@ -118,6 +118,19 @@ ci-local:
 changelog:
     git log --pretty=format:"- %s (%h)" --reverse > CHANGELOG.md

+# Sync file classification patterns from GitHub Linguist (Phase 2)
+sync-linguist:
+    #!/usr/bin/env bash
+    set -e
+    echo "Synchronizing file classification patterns from GitHub Linguist..."
+    python3 scripts/sync_linguist_patterns.py > src/file_classifier_generated.rs
+    echo "✅ Patterns synced to src/file_classifier_generated.rs"
+    echo ""
+    echo "Next steps:"
+    echo "  1. cargo test"
+    echo "  2. git add src/file_classifier_generated.rs"
+    echo "  3. git commit -m 'chore(linguist): sync file classification patterns'"
+
 # Verify everything before PR
 verify: fmt clippy test audit renovate-validate doc
     @echo "✅ All checks passed!"
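Reviewer note: the `sync-linguist` recipe redirects the script's stdout straight into `src/file_classifier_generated.rs`. The shell truncates that file before the script starts, so a failed download would leave it empty or partial even with `set -e`. One possible mitigation, sketched below, is for the script to write its own output through a temporary file and rename it into place. The `write_atomically` helper is hypothetical and is not part of this commit.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(target: Path, content: str) -> None:
    """Write content to target via a temp file in the same directory.

    os.replace swaps the file in one step, so readers see either the old
    file or the complete new one, never a truncated intermediate.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file and leave any existing target untouched.
        os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())
    write_atomically(out_dir / "file_classifier_generated.rs", "// generated\n")
    print(sorted(p.name for p in out_dir.iterdir()))
```

With a helper like this the recipe would call the script without a redirect, and a failure partway through would leave the previously committed patterns in place.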
