BytesByJay/Indic-OCR

Indic-OCR: Multi-Language OCR System for Indian Regional Scripts

A machine learning-based OCR solution that can detect, classify, and accurately recognize text from multiple Indian regional scripts and convert it into editable Unicode text.

📋 Project Information

  • Title: Multi-Language OCR System for Indian Regional Scripts (Indic-OCR)
  • Author: Dhananjayan H
  • Roll No: AA.SC.P2MCA24070151
  • Course: MCA Minor Project (21CSA697A)

🎯 Objectives

  1. Develop a multi-language OCR system that identifies the script and extracts text from handwritten or printed documents in Indian regional languages
  2. Convert extracted text into Unicode digital text
  3. Support applications like digitization of academic notes, historical records, government documents, and accessibility enhancement

✨ Features

  • Multi-Script Support: Devanagari (Hindi), Malayalam, Tamil
  • Script Detection: Automatic identification of the input script
  • Image Preprocessing: Cleanup steps such as thresholding and deskewing applied before recognition
  • Deep Learning OCR: Text recognition using PaddleOCR/TrOCR
  • Web Interface: User-friendly Streamlit-based interface
  • Evaluation Tools: CER/WER metrics for performance assessment

๐Ÿ—๏ธ Project Structure

Indic-OCR/
โ”œโ”€โ”€ app/
โ”‚   โ””โ”€โ”€ streamlit_app.py       # Web interface
โ”œโ”€โ”€ config/
โ”‚   โ””โ”€โ”€ config.yaml            # Configuration file
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ raw/                   # Raw dataset
โ”‚   โ”œโ”€โ”€ processed/             # Processed images
โ”‚   โ”œโ”€โ”€ train/                 # Training set
โ”‚   โ”œโ”€โ”€ val/                   # Validation set
โ”‚   โ””โ”€โ”€ test/                  # Test set
โ”œโ”€โ”€ models/                    # Saved models
โ”œโ”€โ”€ notebooks/                 # Jupyter notebooks
โ”œโ”€โ”€ outputs/                   # OCR outputs
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ preprocessing.py       # Image preprocessing
โ”‚   โ”œโ”€โ”€ script_classifier.py   # Script identification model
โ”‚   โ”œโ”€โ”€ ocr_engine.py          # OCR recognition
โ”‚   โ”œโ”€โ”€ dataset.py             # Dataset utilities
โ”‚   โ”œโ”€โ”€ evaluation.py          # Evaluation metrics
โ”‚   โ””โ”€โ”€ utils.py               # Utility functions
โ”œโ”€โ”€ tests/                     # Unit tests
โ”œโ”€โ”€ requirements.txt           # Dependencies
โ”œโ”€โ”€ train.py                   # Training script
โ”œโ”€โ”€ inference.py               # Inference script
โ””โ”€โ”€ README.md                  # Documentation

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • GPU (optional, for faster training)

Setup

  1. Clone the repository and change into it:
git clone https://github.com/BytesByJay/Indic-OCR.git
cd Indic-OCR
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. For GPU support (optional), install the GPU build of the framework you use:
pip install paddlepaddle-gpu  # For CUDA 11.x
# or
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
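
After installation, a quick way to see which of the optional backends are present is to ask Python whether each package can be found. The sketch below is only a convenience check, not part of the repository; the import names (cv2, paddleocr, torch, streamlit) are assumed from the libraries this README mentions, and the helper name available is hypothetical.

```python
import importlib.util


def available(modules):
    """Map each top-level import name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Import names assumed for the libraries this README mentions.
for name, ok in available(["cv2", "paddleocr", "torch", "streamlit"]).items():
    print(f"{name:12s} {'found' if ok else 'missing'}")
```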

📊 Dataset

Supported Datasets

  1. Public Datasets:

  2. Self-collected Data:

    • Scanned documents
    • Handwritten notes
    • Printed materials

Data Preparation

python -c "from src.dataset import DatasetManager; dm = DatasetManager(); dm.split_dataset()"
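
The internals of DatasetManager.split_dataset are not shown here. As a minimal sketch of what a reproducible train/val/test split can look like, the function below shuffles file names with a fixed seed and cuts the list by ratio. The 80/10/10 ratios, the seed, and the name split_items are assumptions for illustration only.

```python
import random


def split_items(items, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically and cut into train/val/test lists."""
    items = sorted(items)                 # same starting order regardless of input order
    random.Random(seed).shuffle(items)    # fixed seed makes the split repeatable
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],  # whatever remains goes to the test set
    }


files = [f"img_{i:03d}.png" for i in range(20)]
parts = split_items(files)
print({name: len(part) for name, part in parts.items()})
```

Sorting before shuffling keeps the split stable even if the file system lists files in a different order between runs.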

🎓 Training

Train Script Classifier

python train.py --task script_classifier --epochs 50 --batch_size 32

Train OCR Model (Fine-tuning)

python train.py --task ocr --model paddleocr --language hindi
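
For readers who want to see how the flags in the two commands above could be wired up, here is a hedged argparse sketch. It covers only the flags that appear in this README; the defaults and the choice list are assumptions, and the real train.py may define them differently.

```python
import argparse


def build_parser():
    """Parser covering only the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--task", choices=["script_classifier", "ocr"], required=True)
    parser.add_argument("--epochs", type=int, default=50)      # default is an assumption
    parser.add_argument("--batch_size", type=int, default=32)  # default is an assumption
    parser.add_argument("--model", default="paddleocr")        # default is an assumption
    parser.add_argument("--language", default="hindi")         # default is an assumption
    return parser


args = build_parser().parse_args(
    ["--task", "script_classifier", "--epochs", "50", "--batch_size", "32"]
)
print(args.task, args.epochs, args.batch_size)
```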

🔮 Inference

Command Line

python inference.py --image path/to/image.png --output results.txt

Python API

from src import ImagePreprocessor, ScriptClassifier, OCREngine

# Initialize components
preprocessor = ImagePreprocessor()
classifier = ScriptClassifier()
ocr = OCREngine()

# Process image
image = preprocessor.preprocess("document.png")
script, confidence = classifier.predict(image)
result = ocr.recognize(image, language=script)

print(f"Detected Script: {script}")
print(f"Extracted Text: {result['text']}")
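
Once text has been extracted, it usually needs to be stored as Unicode. The same visible Indic text can sometimes be encoded as more than one code point sequence, so normalizing before saving makes later comparison and search more consistent. The sketch below assumes the result is a dictionary with a text key, as in the example above; save_text is a hypothetical helper, not part of the repository.

```python
import tempfile
import unicodedata
from pathlib import Path


def save_text(text, path):
    """Normalize to NFC and write as UTF-8; returns the normalized string."""
    normalized = unicodedata.normalize("NFC", text)
    Path(path).write_text(normalized, encoding="utf-8")
    return normalized


# "result" stands in for the dictionary returned by ocr.recognize above.
result = {"text": "नमस्ते"}
out_path = Path(tempfile.gettempdir()) / "results.txt"
save_text(result["text"], out_path)
print(f"Saved to {out_path}")
```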

๐ŸŒ Web Interface

Launch the Streamlit web application:

cd Indic-OCR
streamlit run app/streamlit_app.py

Access the interface at http://localhost:8501

📈 Evaluation

Run Evaluation

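With predictions and ground_truths already defined in a Python session (presumably as parallel lists of recognized and reference strings):

from src.evaluation import evaluate_ocr_results
evaluate_ocr_results(predictions, ground_truths)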

Metrics

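  • CER (Character Error Rate): Character-level edit distance (insertions, deletions, substitutions) between prediction and reference, divided by the reference length; lower is better
  • WER (Word Error Rate): The same ratio computed over words instead of characters; lower is better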
  • Accuracy: Percentage of correctly recognized samples
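
The two error rates can be sketched with a plain Levenshtein distance. This is a generic illustration, not the code in src/evaluation.py; the function names are hypothetical. Note that it counts Unicode code points, so a single visible Indic syllable made of several code points contributes more than one unit to CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("abcd", "abxd"))
print(wer("the quick brown fox", "the quick brown dog"))
```

Both rates can exceed 1.0 when the prediction is much longer than the reference, which is why they are reported as error rates and not as accuracy percentages.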

📅 Timeline & Milestones

| Week | Milestone |
|------|-----------|
| 1 | Literature review & requirements analysis |
| 2 | Dataset collection & preprocessing |
| 3 | Script identification model training |
| 4 | OCR engine integration & testing |
| 5 | UI development & deployment |
| 6 | Accuracy evaluation & improvements |
| 7 | Final testing, documentation & presentation |

๐Ÿ› ๏ธ Tools & Technologies

Category Tools
Language Python
Libraries OpenCV, PaddleOCR/TrOCR, TensorFlow/PyTorch
IDE VS Code, Jupyter Notebook
Interface Streamlit
Version Control GitHub
Hardware Laptop (8GB+ RAM) + Google Colab GPU

📚 Learning Outcomes

  • ML & DL: Model training, image classification, evaluation
  • Computer Vision: Preprocessing, thresholding, deskewing, feature extraction
  • OCR Systems: Text detection, recognition, and Unicode conversion
  • Research Methodology: Dataset preparation, benchmarking, literature review
  • Web Development: Creating a functional front-end for OCR usage

📖 References

  1. PaddleOCR Documentation: https://github.com/PaddlePaddle/PaddleOCR
  2. TrOCR Paper: https://arxiv.org/abs/2109.10282
  3. OpenCV Documentation: https://docs.opencv.org/
  4. Streamlit Documentation: https://docs.streamlit.io/

📄 License

This project is developed purely for academic and research-related purposes.

👤 Author

Dhananjayan H
Roll No: AA.SC.P2MCA24070151
Department of Computer Science


Last Updated: November 2024
