Python OCR Extractor

A web application that extracts text from scanned PDF documents using OCR (Optical Character Recognition) technology. The application consists of a Python Flask backend and a frontend interface.

Features

  • PDF file upload functionality
  • OCR text extraction from scanned PDFs
  • Real-time text extraction processing
  • Cross-Origin Resource Sharing (CORS) enabled
  • Clean and simple API endpoint

Prerequisites

Before running this application, make sure you have the following installed:

  • Python and pip
  • Tesseract OCR
  • Poppler
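The configuration step later in this README refers to Tesseract and Poppler, so it can help to confirm both are present before going further. The sketch below is a minimal check, not part of the project: it assumes the tools are reachable through PATH and looks for the `tesseract` binary and for `pdftoppm`, one of the utilities that ships with Poppler. The function name `find_missing_tools` is hypothetical.

```python
import shutil

# Executables this check looks for. "tesseract" is the Tesseract OCR
# command-line binary; "pdftoppm" is one of the tools bundled with Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were both found on PATH.")
```

A tool reported as missing here is not necessarily a problem: the installation steps set explicit paths in backend/app.py, which works even when the binaries are not on PATH.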

Installation

  1. Clone the repository:
git clone https://github.com/cozyCodr/python-ocr-extractor.git
cd python-ocr-extractor
  2. Install backend dependencies:
cd backend
pip install -r requirements.txt
  3. Configure Tesseract and Poppler paths:
  • Open backend/app.py
  • Update the following paths according to your system:
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    POPPLER_PATH = r"C:\poppler-24.08.0\Library\bin"
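Hard-coded Windows paths mean editing the source on every machine. One possible alternative, sketched below, reads the two locations from environment variables and falls back to the example paths above. This is an assumption, not how the project is currently written: the variable names `TESSERACT_CMD` and `POPPLER_PATH` and the helper `resolve_ocr_paths` are hypothetical.

```python
import os

# Defaults copied from the Windows example paths in this README.
DEFAULT_TESSERACT_CMD = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
DEFAULT_POPPLER_PATH = r"C:\poppler-24.08.0\Library\bin"


def resolve_ocr_paths(env=os.environ):
    """Return (tesseract_cmd, poppler_path), preferring environment overrides.

    TESSERACT_CMD and POPPLER_PATH are hypothetical variable names chosen
    for this sketch; the project itself does not define them.
    """
    tesseract_cmd = env.get("TESSERACT_CMD", DEFAULT_TESSERACT_CMD)
    poppler_path = env.get("POPPLER_PATH", DEFAULT_POPPLER_PATH)
    return tesseract_cmd, poppler_path


tesseract_cmd, POPPLER_PATH = resolve_ocr_paths()
# In backend/app.py the first value would then be assigned as before:
# pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
```

With this in place, a Linux or macOS user could export the two variables in the shell instead of editing app.py.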

Project Structure

python-ocr-extractor/
├── backend/
│   ├── app.py              # Flask application
│   └── requirements.txt    # Python dependencies
├── frontend-app/           # Frontend application
└── .gitignore

API Endpoints

POST /extract_text

Extracts text from an uploaded PDF file.

Request:

Method: POST
Content-Type: multipart/form-data
Body: pdf_file (PDF file)
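The request described above can be sent with only the Python standard library. The sketch below is one way to do it: it builds a multipart/form-data body with the file in the `pdf_file` field and posts it to the endpoint on the default local address. The helper names `build_multipart` and `extract_text` and the placeholder filename `upload.pdf` are hypothetical, and because the response format is not documented here, the body is returned as plain text without further parsing.

```python
import urllib.request
import uuid


def build_multipart(field_name, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_text(pdf_path, url="http://localhost:5000/extract_text"):
    """POST a PDF to the endpoint and return the response body as text."""
    with open(pdf_path, "rb") as fh:
        body, header = build_multipart("pdf_file", "upload.pdf", fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("scan.pdf")` with the backend running would return whatever the server sends back; `scan.pdf` stands in for any local scanned PDF.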

Usage

  1. Start the backend server:
cd backend
python app.py
  2. The server starts on http://localhost:5000.