AI Language Identification System

A machine-learning system for identifying the programming language of source code files. It supports both a legacy TensorFlow-based model and a newer scikit-learn-based model with better accuracy-per-resource: faster predictions and a much smaller memory footprint.

Features

  • Supports 8 programming languages: Python, Java, C++, Groovy, JavaScript, XML, JSON, and YAML
  • Resource-efficient implementation (runs on CPU with <512MB RAM)
  • Fast prediction (more than 4 files/second)
  • Handles class imbalance through balanced class weights
  • Provides confidence scores for predictions
  • Supports batch processing of multiple files

Requirements

  • Python 3.6+
  • Dependencies listed in requirements.txt

Installation

  1. Clone the repository
  2. Install dependencies:
    pip install -r requirements.txt

Usage

Training the Model

python ai_predict_lang.py --train --train-dir path/to/training-data/

Predicting Language (CLI)

Single file:

python ai_predict_lang.py --file path/to/file.txt

Directory (batch) prediction:

python ai_predict_lang.py --dir path/to/directory/

Show top-k predictions:

python ai_predict_lang.py --file path/to/file.txt --top-k 3

Other options:

  • --model-dir: Directory containing trained model (default: 'models')
  • --output: Output file for batch results
  • --validate: Validate predictions against file extensions (for batch)
  • --legacy: Use legacy TensorFlow model

Implementation Details

Model Architecture

The new implementation uses:

  • HashingVectorizer with character n-grams (1-3) for feature extraction
  • SGDClassifier (logistic regression) with balanced class weights
  • Lightweight, language-specific features (keywords, syntax markers, comment styles)
  • Efficient memory usage and fast inference

Resource Constraints

The implementation is optimized for:

  • Single quad-core CPU
  • 512MB RAM
  • Local file storage
  • Processing > 4 files/second
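One way to sanity-check the memory budget is the standard library's `tracemalloc`. This is a sketch only, and it tracks Python-level allocations, not native (e.g. NumPy buffer) memory:

```python
# Sketch: measure peak Python-level memory during a workload and
# compare it against the 512MB budget.
import tracemalloc

tracemalloc.start()
data = [str(i) * 100 for i in range(1000)]  # stand-in for real work
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / 1024 / 1024:.2f} MB")
```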

Results

  • Test accuracy: ~93%
  • Validation accuracy: ~94%
  • Macro F1-score: ~0.78 (test set)
  • Weighted F1-score: ~0.94 (test set)
  • Resource usage: Peak memory ~139MB, training time ~75s, prediction speed >4 files/sec
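A simple harness for the throughput figure might look like the following. It is illustrative; `predict` stands in for whatever prediction callable the project actually exposes:

```python
# Sketch: measure prediction throughput in files/second for any
# single-file prediction callable.
import time

def files_per_second(predict, paths):
    """Time `predict` over `paths` and return the throughput."""
    start = time.perf_counter()
    for path in paths:
        predict(path)
    elapsed = time.perf_counter() - start
    return len(paths) / elapsed if elapsed > 0 else float("inf")
```

A run over a representative sample of files should report a rate above the 4 files/second target.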

Limitations

  1. May struggle with:

    • Very short files
    • Files with mixed languages
    • Files with non-standard extensions
    • Binary files or non-UTF-8 encoded files
  2. Resource constraints:

    • Limited to CPU processing
    • Memory usage must stay under 512MB
    • Processing speed target of at least 4 files/second
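Binary and non-UTF-8 files could be filtered out before prediction. The helper below is a hedged sketch of such a pre-filter, not code from the project:

```python
# Sketch: cheap pre-filter that skips files the model is known to
# struggle with (binary or non-UTF-8 content).
from pathlib import Path

def is_probably_text(path, sample_size=4096):
    """Return True if a leading sample decodes as UTF-8 with no NUL bytes."""
    data = Path(path).read_bytes()[:sample_size]
    if b"\x00" in data:
        return False  # NUL bytes strongly suggest a binary file
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Tolerate a multibyte character cut off at the sample boundary,
        # but only when we actually truncated the file.
        return len(data) == sample_size and exc.start >= sample_size - 4
    return True
```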

Future Improvements

  1. Model improvements:

    • Add file extension as a feature
    • Implement ensemble methods
    • Add confidence thresholds
    • Create language-specific preprocessing rules
  2. Performance optimizations:

    • Implement batch processing for training
    • Add caching for frequently accessed files
    • Optimize feature extraction pipeline
  3. Additional features:

    • Support for more languages
    • Better handling of mixed-language files
    • Improved error handling and logging
    • API for integration with other tools