A machine learning-based OCR solution that can detect, classify, and accurately recognize text from multiple Indian regional scripts and convert it into editable Unicode text.
- Title: Multi-Language OCR System for Indian Regional Scripts (Indic-OCR)
- Author: Dhananjayan H
- Roll No: AA.SC.P2MCA24070151
- Course: MCA Minor Project (21CSA697A)
- Develop a multi-language OCR system that identifies the script and extracts text from handwritten or printed documents in Indian regional languages
- Convert extracted text into Unicode digital text
- Support applications like digitization of academic notes, historical records, government documents, and accessibility enhancement
- Multi-Script Support: Devanagari (Hindi), Malayalam, Tamil
- Script Detection: Automatic identification of the input script
- Image Preprocessing: Cleanup steps such as thresholding and deskewing to improve recognition accuracy
- Deep Learning OCR: Text recognition using PaddleOCR/TrOCR
- Web Interface: User-friendly Streamlit-based interface
- Evaluation Tools: CER/WER metrics for performance assessment
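
Since the output is Unicode text in one of three scripts, one cheap sanity check is to see which Unicode block the recognized characters fall into. The sketch below is an illustration only, not the project's image-based script classifier; `SCRIPT_RANGES` and `dominant_script` are hypothetical names introduced here, and the ranges are the standard Unicode blocks for Devanagari, Tamil, and Malayalam.

```python
import unicodedata
from collections import Counter

# Standard Unicode block ranges for the three supported scripts.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "tamil": (0x0B80, 0x0BFF),
    "malayalam": (0x0D00, 0x0D7F),
}


def dominant_script(text):
    """Return the script whose block covers the most characters, or None."""
    # NFC normalization gives a canonical code point sequence before counting.
    text = unicodedata.normalize("NFC", text)
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if not counts:
        return None  # no character from a supported script was found
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(dominant_script("नमस्ते"))
    print(dominant_script("வணக்கம்"))
```

A check like this could be compared against the script predicted from the image, for example to flag samples where the two disagree.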
```
Indic-OCR/
├── app/
│   └── streamlit_app.py      # Web interface
├── config/
│   └── config.yaml           # Configuration file
├── data/
│   ├── raw/                  # Raw dataset
│   ├── processed/            # Processed images
│   ├── train/                # Training set
│   ├── val/                  # Validation set
│   └── test/                 # Test set
├── models/                   # Saved models
├── notebooks/                # Jupyter notebooks
├── outputs/                  # OCR outputs
├── src/
│   ├── __init__.py
│   ├── preprocessing.py      # Image preprocessing
│   ├── script_classifier.py  # Script identification model
│   ├── ocr_engine.py         # OCR recognition
│   ├── dataset.py            # Dataset utilities
│   ├── evaluation.py         # Evaluation metrics
│   └── utils.py              # Utility functions
├── tests/                    # Unit tests
├── requirements.txt          # Dependencies
├── train.py                  # Training script
├── inference.py              # Inference script
└── README.md                 # Documentation
```
- Python 3.8 or higher
- pip package manager
- GPU (optional, for faster training)
- Clone the repository and change into the project directory:

  ```bash
  cd "MCA Project/Indic-OCR"
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- For GPU support (optional):

  ```bash
  pip install paddlepaddle-gpu  # For CUDA 11.x
  # or
  pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
  ```

- Public Datasets:
  - Devanagari Handwritten Character Dataset
  - Tamil Handwritten Dataset
  - BHaratWrites Dataset
- Self-collected Data:
  - Scanned documents
  - Handwritten notes
  - Printed materials
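
The project layout keeps separate `data/train`, `data/val`, and `data/test` directories, so the collected samples need to be partitioned at some point. The following is a minimal sketch of a reproducible split, assuming an 80/10/10 ratio and a fixed seed; `split_items` and its parameters are hypothetical and are not the actual `DatasetManager.split_dataset()` implementation.

```python
import random


def split_items(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains
    }


if __name__ == "__main__":
    files = [f"img_{i}.png" for i in range(100)]  # placeholder file names
    splits = split_items(files)
    print({name: len(part) for name, part in splits.items()})
```

Fixing the seed means the same files land in the same split on every run, which keeps evaluation numbers comparable between experiments.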
```bash
# Split the dataset
python -c "from src.dataset import DatasetManager; dm = DatasetManager(); dm.split_dataset()"

# Train the script classifier
python train.py --task script_classifier --epochs 50 --batch_size 32

# Train the OCR model
python train.py --task ocr --model paddleocr --language hindi

# Run inference on an image
python inference.py --image path/to/image.png --output results.txt
```

```python
from src import ImagePreprocessor, ScriptClassifier, OCREngine

# Initialize components
preprocessor = ImagePreprocessor()
classifier = ScriptClassifier()
ocr = OCREngine()

# Process image
image = preprocessor.preprocess("document.png")
script, confidence = classifier.predict(image)
result = ocr.recognize(image, language=script)

print(f"Detected Script: {script}")
print(f"Extracted Text: {result['text']}")
```

Launch the Streamlit web application:

```bash
cd Indic-OCR
streamlit run app/streamlit_app.py
```

Access the interface at http://localhost:8501
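```python
from src.evaluation import evaluate_ocr_results

# predictions and ground_truths are placeholders; define them with your own
# OCR outputs and reference texts before making this call.
evaluate_ocr_results(predictions, ground_truths)
```

- CER (Character Error Rate): Character-level edit distance between the prediction and the ground truth, divided by the ground-truth length (lower is better)
- WER (Word Error Rate): The same measure computed over words instead of characters (lower is better)
- Accuracy: Percentage of correctly recognized samples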
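
To make the metric definitions concrete, here is a minimal sketch based on Levenshtein edit distance. It is not the code in `src/evaluation.py`; the function names `edit_distance`, `cer`, `wer`, and `accuracy` are assumptions introduced for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def accuracy(references, hypotheses):
    """Fraction of samples where the prediction matches the reference exactly."""
    matches = sum(r == h for r, h in zip(references, hypotheses))
    return matches / len(references)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))
    print(wer("the cat sat", "the bat sat"))
```

Note that this sketch counts Unicode code points, so in Indic scripts a consonant and its combining vowel sign count as two characters. Both rates can exceed 1.0 when the prediction is much longer than the reference.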
| Week | Milestone |
|---|---|
| 1 | Literature review & requirements analysis |
| 2 | Dataset collection & preprocessing |
| 3 | Script identification model training |
| 4 | OCR engine integration & testing |
| 5 | UI development & deployment |
| 6 | Accuracy evaluation & improvements |
| 7 | Final testing, documentation & presentation |
| Category | Tools |
|---|---|
| Language | Python |
| Libraries | OpenCV, PaddleOCR/TrOCR, TensorFlow/PyTorch |
| IDE | VS Code, Jupyter Notebook |
| Interface | Streamlit |
| Version Control | GitHub |
| Hardware | Laptop (8GB+ RAM) + Google Colab GPU |
- ML & DL: Model training, image classification, evaluation
- Computer Vision: Preprocessing, thresholding, deskewing, feature extraction
- OCR Systems: Text detection, recognition, and Unicode conversion
- Research Methodology: Dataset preparation, benchmarking, literature review
- Web Development: Creating a functional front-end for OCR usage
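
As an illustration of the thresholding step named under Computer Vision above, the sketch below implements Otsu's method in plain Python on a flat list of 8-bit grayscale values. It is an assumption that the project uses a method of this kind; `otsu_threshold` and `binarize` are hypothetical names, and a real pipeline would more likely rely on OpenCV.

```python
def otsu_threshold(pixels):
    """Return the threshold (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # no pixels left for the upper class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 255 and the rest to 0."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    sample = [10] * 50 + [200] * 50  # synthetic dark and bright pixels
    t = otsu_threshold(sample)
    print(t, binarize(sample, t)[:3], binarize(sample, t)[-3:])
```

Choosing the threshold from the image histogram rather than using a fixed value helps when scans differ in brightness and contrast.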
- PaddleOCR Documentation: https://github.com/PaddlePaddle/PaddleOCR
- TrOCR Paper: https://arxiv.org/abs/2109.10282
- OpenCV Documentation: https://docs.opencv.org/
- Streamlit Documentation: https://docs.streamlit.io/
This project is developed purely for academic and research-related purposes.
Dhananjayan H
Roll No: AA.SC.P2MCA24070151
Department of Computer Science
Last Updated: November 2024