AI Language Identification System

A machine-learning system for identifying the programming language of source code files. It supports both a legacy TensorFlow-based model and a newer scikit-learn-based model with better accuracy-per-resource: faster predictions and a much smaller memory footprint.

Features

  • Supports 8 programming languages: Python, Java, C++, Groovy, JavaScript, XML, JSON, and YAML
  • Resource-efficient implementation (runs on CPU with <512MB RAM)
  • Fast prediction (more than 4 files/second)
  • Handles class imbalance through balanced class weights
  • Provides confidence scores for predictions
  • Supports batch processing of multiple files

Requirements

  • Python 3.6+
  • Dependencies listed in requirements.txt

Installation

  1. Clone the repository
  2. Install dependencies:
    pip install -r requirements.txt

Usage

Training the Model

python ai_predict_lang.py --train --train-dir path/to/training-data/

Predicting Language (CLI)

Single file:

python ai_predict_lang.py --file path/to/file.txt

Directory (batch) prediction:

python ai_predict_lang.py --dir path/to/directory/

Show top-k predictions:

python ai_predict_lang.py --file path/to/file.txt --top-k 3

Other options:

  • --model-dir: Directory containing trained model (default: 'models')
  • --output: Output file for batch results
  • --validate: Validate predictions against file extensions (for batch)
  • --legacy: Use legacy TensorFlow model

Implementation Details

Model Architecture

The new implementation uses:

  • HashingVectorizer with character n-grams (1-3) for feature extraction
  • SGDClassifier (logistic regression) with balanced class weights
  • Lightweight, language-specific features (keywords, syntax markers, comment styles)
  • Efficient memory usage and fast inference

Resource Constraints

The implementation is optimized for:

  • Single quad-core CPU
  • 512MB RAM
  • Local file storage
  • Processing > 4 files/second
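One way to sanity-check the memory budget is the standard library's `tracemalloc`. This is a sketch only, and it tracks Python-level allocations, not native (e.g. NumPy buffer) memory:

```python
# Sketch: measure peak Python-level memory during a workload and
# compare it against the 512MB budget.
import tracemalloc

tracemalloc.start()
data = [str(i) * 100 for i in range(1000)]  # stand-in for real work
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / 1024 / 1024:.2f} MB")
```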

Results

  • Test accuracy: ~93%
  • Validation accuracy: ~94%
  • Macro F1-score: ~0.78 (test set)
  • Weighted F1-score: ~0.94 (test set)
  • Resource usage: Peak memory ~139MB, training time ~75s, prediction speed >4 files/sec
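A simple harness for the throughput figure might look like the following. It is illustrative; `predict` stands in for whatever prediction callable the project actually exposes:

```python
# Sketch: measure prediction throughput in files/second for any
# single-file prediction callable.
import time

def files_per_second(predict, paths):
    """Time `predict` over `paths` and return the throughput."""
    start = time.perf_counter()
    for path in paths:
        predict(path)
    elapsed = time.perf_counter() - start
    return len(paths) / elapsed if elapsed > 0 else float("inf")
```

A run over a representative sample of files should report a rate above the 4 files/second target.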

Limitations

  1. May struggle with:

    • Very short files
    • Files with mixed languages
    • Files with non-standard extensions
    • Binary files or non-UTF-8 encoded files
  2. Resource constraints:

    • Limited to CPU processing
    • Memory usage must stay under 512MB
    • Processing speed target of at least 4 files/second
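Binary and non-UTF-8 files could be filtered out before prediction. The helper below is a hedged sketch of such a pre-filter, not code from the project:

```python
# Sketch: cheap pre-filter that skips files the model is known to
# struggle with (binary or non-UTF-8 content).
from pathlib import Path

def is_probably_text(path, sample_size=4096):
    """Return True if a leading sample decodes as UTF-8 with no NUL bytes."""
    data = Path(path).read_bytes()[:sample_size]
    if b"\x00" in data:
        return False  # NUL bytes strongly suggest a binary file
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Tolerate a multibyte character cut off at the sample boundary,
        # but only when we actually truncated the file.
        return len(data) == sample_size and exc.start >= sample_size - 4
    return True
```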

Future Improvements

  1. Model improvements:

    • Add file extension as a feature
    • Implement ensemble methods
    • Add confidence thresholds
    • Create language-specific preprocessing rules
  2. Performance optimizations:

    • Implement batch processing for training
    • Add caching for frequently accessed files
    • Optimize feature extraction pipeline
  3. Additional features:

    • Support for more languages
    • Better handling of mixed-language files
    • Improved error handling and logging
    • API for integration with other tools