Notebook in Google Colab #302

Gamalmohamed2016 · 2026-03-19T15:19:54Z

I tried hard to test opendataloader-pdf in Google Colab on a complex PDF document. I followed the steps but failed. I am not a coder, so if anyone has tested this parser, please share your code.

PedroParro1902 · 2026-03-19T20:19:51Z

Hi, I am Pedro
Can I help?

Gamalmohamed2016 (Author) · 2026-03-19T20:58:12Z

Hi Pedro,

I am interested in testing opendataloader-pdf to parse a PDF into Markdown. My test case is quite complex, as it contains LaTeX math, tables, and images. I am using Google Colab for this project. If you have a code example or a notebook tailored for use in Google Colab, could you please share it with me?

Best regards,
Gamal

hnc-jglee (Maintainer) · 2026-03-19T23:55:54Z

Hi @Gamalmohamed2016,

Welcome! No coding experience needed — I'll walk you through it step by step.

What you'll need

A Google account (you already have one if you use Gmail)
Your PDF file ready to upload

Instructions

Go to https://colab.research.google.com and click "New Notebook"
Copy the code below and paste it into the first cell
Press the ▶ play button (or Ctrl+Enter)
A file picker will pop up — select your PDF and wait

That's it! The code will install everything, convert your PDF, show a preview, and download the results.

# ============================================================
# OpenDataLoader PDF — Google Colab Quick Start
# Just press ▶ and upload your PDF when the file picker appears
# ============================================================

# --- 1. Install (takes about 1 minute) ---
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U opendataloader-pdf -q
print("✅ Step 1/4: Installation complete!\n")

# --- 2. Upload your PDF ---
print("📂 Step 2/4: Select your PDF file below:\n")
from google.colab import files
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")

# --- 3. Convert ---
print("⏳ Step 3/4: Converting... please wait.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown,json",
    image_output="embedded",
)
print("✅ Step 3/4: Conversion complete!\n")

# --- 4. Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0], encoding="utf-8") as f:
        print("📄 Step 4/4: Here's a preview of your Markdown output:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5000 characters)")

!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")

What's included

Tables and images — Supported out of the box with the code above.
LaTeX math formulas — This requires an advanced setup called "hybrid mode" which downloads additional AI models (~1-2 GB). If you need this, let me know and I can provide those instructions too.

I'd recommend trying the code above first and seeing how the output looks. For many documents, the basic mode already does a great job with tables and images.

Let us know how it goes! 🙂

Gamalmohamed2016 (Author) · 2026-03-20T12:35:30Z

Hi Jonggyu,

Thanks for the quick start guide.

I’ve tested the code, and while it is very fast, it did not capture the LaTeX math or the tables. Additionally, the images were included as base64-encoded strings, and the image count was doubled: the 4-page PDF resulted in 8 images (two for each page).

I assume some adjustments are needed to properly extract the LaTeX, tables, and diagrams. I have attached the output for your reference.
Test_02 (1).md

hnc-jglee (Maintainer) · 2026-03-23T04:59:23Z

Hi @Gamalmohamed2016,

Thanks for testing and sharing the output! I can see what happened — the first code I shared was a basic version that doesn't support LaTeX math or complex tables. Let me give you an upgraded version that handles everything.

What went wrong

No LaTeX math — The basic version can't extract math formulas. You need an upgraded setup that includes AI models for formula recognition.
No tables — Your PDF likely has complex or borderless tables that need the AI-powered detection.
Double images — The first code generated both Markdown and JSON files, each with their own copy of the images. The new code below fixes this.
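One way to check whether extra images really are byte-for-byte duplicates is to hash every image file in the output folder and group identical hashes. This is a minimal sketch, assuming the images were written as separate files under `output/`. The helper name `find_duplicate_files` is made up for illustration and is not part of opendataloader-pdf.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(folder, pattern="*.png"):
    """Group files under `folder` whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(folder).rglob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
    # Keep only groups with more than one file: those are duplicates.
    return [group for group in by_hash.values() if len(group) > 1]


if __name__ == "__main__":
    # "output" is the folder name used by the Colab code in this thread.
    for group in find_duplicate_files("output"):
        print("Identical files:", ", ".join(str(p) for p in group))
```

If this prints nothing while the image count is still doubled, the extra images differ in content (for example, a page rendered at two sizes), which would point to a different cause than duplicated copies.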

Updated Code

Just like before — paste this into a new Colab cell and press ▶. It will take longer than the basic version because it downloads AI models (~1-2 GB) on the first run, but the results will be much better.

⚠️ Important: Use a GPU runtime for best results. Go to Runtime → Change runtime type → T4 GPU before running.

# ============================================================
# OpenDataLoader PDF — Google Colab (Full Version)
# Supports: LaTeX math, complex tables, and images
# Just press ▶ and upload your PDF when prompted
# ============================================================

# --- Step 1: Install (takes 1-2 minutes) ---
print("⏳ Step 1/5: Installing... please wait.\n")
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U "opendataloader-pdf[hybrid]" -q
print("✅ Step 1/5: Installation complete!\n")

# --- Step 2: Start the AI engine for math & tables ---
# This runs in the background. First time takes a while to download models.
import subprocess, time
print("⏳ Step 2/5: Loading AI models (first run downloads ~1-2 GB, please be patient)...\n")
proc = subprocess.Popen(
    ["opendataloader-pdf-hybrid", "--enrich-formula"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
)
time.sleep(60)
print("✅ Step 2/5: AI engine ready!\n")

# --- Step 3: Upload your PDF ---
from google.colab import files
print("📂 Step 3/5: Select your PDF file below:\n")
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")

# --- Step 4: Convert ---
print("⏳ Step 4/5: Converting... this may take a minute.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown",
    hybrid="docling-fast",
    hybrid_mode="full",
    image_output="external",
    image_format="png",
)
print("✅ Step 4/5: Conversion complete!\n")

# --- Step 5: Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0], encoding="utf-8") as f:
        print("📄 Step 5/5: Preview:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5,000 characters)")

!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")

What's different from the first version?

| | First version | This version |
|---|---|---|
| LaTeX math | ❌ Not supported | ✅ Extracted as `$$...$$` |
| Tables | Basic (borders only) | ✅ AI-powered (borderless too) |
| Images | Doubled in output | ✅ Saved once as separate files |
| Speed | Very fast | Slower (AI processing) |
| Install size | ~200 MB | ~1-2 GB (AI models) |

How it works (technical details)

For those curious about what's happening under the hood:

Hybrid mode (hybrid="docling-fast"): OpenDataLoader PDF has two engines — a fast Java-based engine for basic parsing, and an AI-powered Python backend (Docling) for advanced features. The "hybrid" option combines both: Java handles the fast parts, while the AI backend handles math formulas and complex table structures.
hybrid_mode="full": This tells the tool to send every page through the AI backend. Without this, the tool tries to decide automatically which pages need AI processing — but formula detection can be missed in "auto" mode. If your PDF has math on every page, "full" ensures nothing is skipped.
--enrich-formula: This flag activates the formula recognition model on the AI backend. It detects mathematical expressions in your PDF and converts them to LaTeX notation (e.g., $$\frac{a}{b}$$).
image_output="external": Instead of embedding images as Base64 text inside the Markdown file, images are saved as separate .png files in an images/ folder. The Markdown file references them with image links like ![](images/image1.png). This avoids duplication and keeps the output clean.
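To see which image mode a given Markdown file actually ended up with, the image links can be scanned and split into embedded Base64 data URIs and external file paths. This is a rough sketch using a simple regular expression. The function name `classify_images` is hypothetical, and the pattern only covers the plain `![alt](target)` form.

```python
import re

# Matches Markdown image syntax: ![alt text](target)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def classify_images(markdown_text):
    """Split image targets into embedded data URIs and external file paths."""
    embedded, external = [], []
    for target in IMAGE_LINK.findall(markdown_text):
        if target.startswith("data:image/"):
            # Keep only a short prefix; Base64 payloads can be very large.
            embedded.append(target[:40])
        else:
            external.append(target)
    return embedded, external


sample = "![](images/image1.png)\n![chart](data:image/png;base64,iVBORw0KGgo=)"
embedded, external = classify_images(sample)
print(len(embedded), "embedded,", len(external), "external")
```

With `image_output="external"` one would expect the embedded list to be empty. If it is not, the file was probably produced by an earlier run with different settings.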

Let us know how it goes! If you run into any issues, feel free to share the output here and we'll help. 🙂

Gamalmohamed2016 (Author) · 2026-03-23T14:31:26Z

Hi Jonggyu,

Thanks for the updated code and the detailed explanation. I tested the full version as suggested, but unfortunately it still isn't working as expected: I am still not seeing any LaTeX math or tables in the output, and the Markdown output still comes with two .png files for each PDF page.

Please let me know if there are any other adjustments I should try to get these elements to extract correctly. I have attached the original 4-page test PDF so you can try it yourself.

Best regards,
Gamal

hnc-jglee (Maintainer) · 2026-03-25T09:54:09Z

Hi @Gamalmohamed2016,

Thanks for your patience — I looked into what's going wrong and I think I can help.

About the attached PDF

It looks like your last comment was sent via email reply, which unfortunately doesn't support file attachments on GitHub. The PDF didn't come through. Could you post a new comment directly on this page (not via email) and drag-and-drop your PDF into the comment box? That way I can test with your exact file.

What I found from your Markdown output

I analyzed the Test_02 (1).md file you shared earlier. Here's what happened:

| Issue | What I found |
|---|---|
| 8 images (2 per page) | The first code used `format="markdown,json"`, which generates both formats, each with its own copy of the images. That's why you got double. |
| No LaTeX math | The basic Java engine can't recognize math formulas. It extracted them as broken text like `F = + 2a(b - 1) X,` instead of proper LaTeX like `$$F = \hat{\alpha} + 2\hat{\alpha}(\hat{\beta}-1)\sum X_i$$` |
| No tables | Your PDF (a statistics textbook with regression/ANOVA content) has complex tables that need AI-powered detection. |
| Headings wrong | Some math expressions were misidentified as headings. |

Why the second code (hybrid mode) also didn't work

The updated code I gave you had a critical problem: the hybrid server may not have started successfully, but all errors were hidden. Here's what likely happened:

Server logs were suppressed — stdout=DEVNULL, stderr=DEVNULL hid all error messages, so you couldn't see if it failed
Fixed 60-second wait — On the first run, Docling needs to download ~1-2 GB of AI models. 60 seconds may not be enough, especially on Colab's network
No health check — The code assumed the server was ready after 60 seconds, but never verified it
Silent fallback — If the server wasn't running, the Java engine processed everything alone (same as basic mode), which explains why the output looked identical

Fixed Colab Code

Here's a corrected version with proper server startup verification:

# ============================================================
# OpenDataLoader PDF — Google Colab (Full Version, Fixed)
# Supports: LaTeX math, complex tables, and images
# ============================================================

# --- Step 1: Install ---
print("⏳ Step 1/5: Installing... please wait.\n")
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U "opendataloader-pdf[hybrid]" -q
print("✅ Step 1/5: Installation complete!\n")

# --- Step 2: Start the AI engine and VERIFY it's ready ---
import subprocess, time, urllib.request

print("⏳ Step 2/5: Starting AI engine (first run downloads ~1-2 GB of models)...\n")
print("   This may take 5-10 minutes on the first run. Please be patient.\n")

# Start server WITH visible logs so you can see what's happening
log_file = open("hybrid_server.log", "w")
proc = subprocess.Popen(
    ["opendataloader-pdf-hybrid", "--enrich-formula"],
    stdout=log_file, stderr=subprocess.STDOUT
)

# Poll the health endpoint until the server is actually ready
max_wait = 600  # 10 minutes max for first-time model download
for i in range(max_wait // 5):
    time.sleep(5)
    try:
        resp = urllib.request.urlopen("http://localhost:5002/health", timeout=3)
        if resp.status == 200:
            print(f"✅ Step 2/5: AI engine ready! (took {(i+1)*5} seconds)\n")
            break
    except Exception:
        if i % 6 == 0:  # Print progress every 30 seconds
            print(f"   Still loading... ({(i+1)*5}s elapsed)")
        if proc.poll() is not None:
            print("❌ Server crashed! Check the log below:\n")
            !cat hybrid_server.log
            raise RuntimeError("Hybrid server failed to start")
else:
    print("❌ Server did not start within 10 minutes. Log output:\n")
    !cat hybrid_server.log
    raise RuntimeError("Timeout waiting for hybrid server")

# --- Step 3: Upload your PDF ---
from google.colab import files
print("📂 Step 3/5: Select your PDF file below:\n")
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")

# --- Step 4: Convert ---
print("⏳ Step 4/5: Converting... this may take a few minutes.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown",
    hybrid="docling-fast",
    hybrid_mode="full",
    image_output="external",
    image_format="png",
)
print("✅ Step 4/5: Conversion complete!\n")

# --- Step 5: Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0], encoding="utf-8") as f:
        print("📄 Step 5/5: Preview:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5,000 characters)")

!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")

Key changes from the previous version:

Server logs are saved to hybrid_server.log (visible if something goes wrong)
Health check polling — waits until the server actually responds before proceeding
Up to 10 minutes for first-time model download
Progress updates every 30 seconds so you know it's not frozen
Clear error messages if the server crashes or times out
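The same polling idea can be pulled out into a small reusable function, which makes it easier to try without a real server. This is only a sketch: `wait_until_ready` and `hybrid_server_is_up` are illustrative names, and the health URL is the one used in the code above.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout_s=600, interval_s=5, sleep=time.sleep):
    """Call `probe()` every `interval_s` seconds until it returns True.

    Returns the number of seconds waited, or raises TimeoutError.
    """
    waited = 0
    while waited < timeout_s:
        sleep(interval_s)
        waited += interval_s
        try:
            if probe():
                return waited
        except OSError:
            # Connection refused, URL errors, and socket timeouts are all
            # OSError subclasses: the server is simply not up yet.
            pass
    raise TimeoutError(f"not ready after {timeout_s} seconds")


def hybrid_server_is_up(url="http://localhost:5002/health"):
    """Probe the health endpoint; True only on an HTTP 200 response."""
    with urllib.request.urlopen(url, timeout=3) as resp:
        return resp.status == 200
```

In a notebook this would be called as `wait_until_ready(hybrid_server_is_up)`. Passing `sleep` as a parameter is a design choice that lets the loop be exercised instantly with a fake probe.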

Alternative: Test locally (recommended for debugging)

If Colab keeps giving you trouble, testing on your own machine gives you full visibility into what's happening. Here are the minimum requirements:

| Requirement | Minimum |
|---|---|
| Java | 11+ |
| Python | 3.10+ |
| RAM | 8 GB |
| Disk | ~2 GB (for AI model cache) |
| GPU | Not required; runs entirely on CPU |
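Before installing, the Java and Python requirements in the table can be checked with a short script. This is a hedged sketch, not an official tool: the function names are invented here, and only the Java and Python rows are checked. It relies on `java -version` printing its version string to stderr, and on Java 8 and older reporting themselves as `1.x`.

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as "1.8", "1.7", ...
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major


def check_prerequisites():
    """Return a list of problems; an empty list means the basics look fine."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if shutil.which("java") is None:
        problems.append("java was not found on PATH")
    else:
        # `java -version` writes to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr or result.stdout)
        if major is None or major < 11:
            problems.append("Java 11+ is required")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("All good" if not issues else "\n".join(issues))
```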

Step 1 — Install:

pip install "opendataloader-pdf[hybrid]"

Step 2 — Open two terminal windows:

# Terminal 1: Start the AI server (keep this running)
opendataloader-pdf-hybrid --enrich-formula
# Wait until you see: "DocumentConverter initialized in XXs"

# Terminal 2: Convert your PDF
opendataloader-pdf your_file.pdf -o output/ \
    --format markdown \
    --hybrid docling-fast \
    --hybrid-mode full \
    --image-output external

The advantage of running locally is that you can see the server logs in real time. If something goes wrong, the error will be visible immediately in Terminal 1.

Troubleshooting

If it still doesn't work after these changes, please share:

The output you see in the Colab cell (or Terminal 1 logs if local)
Whether you're using a GPU runtime in Colab (Runtime → Change runtime type → T4 GPU)
Your PDF file — please attach it by posting a comment directly on this GitHub page (not via email reply), and drag-and-drop the file into the comment box

We want to make sure this works for you! 🙂

Gamalmohamed2016 (Author) · 2026-03-25T13:48:49Z

This is the PDF test file I used; it is attached.
Test_02.pdf

hnc-jglee (Maintainer) · 2026-03-27T06:55:44Z

Hi @Gamalmohamed2016,

Thank you for uploading the PDF! I tested it locally and found the root cause of your problem.

The root cause: timeout

The hybrid server needs about 100 seconds to process your 4-page PDF, but the default timeout is only 30 seconds. When the timeout expires, the tool silently falls back to basic Java-only processing — which is why your output looked the same both times.

The previous Colab code I shared was missing the hybrid_timeout parameter. Sorry about that!
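For other documents, a rough timeout can be estimated from the page count. The sketch below takes the roughly 100 seconds for 4 pages mentioned above (about 25 seconds per page) and adds headroom. The linear scaling, the 3x safety factor, and the function name `estimate_hybrid_timeout_ms` are all assumptions for illustration; actual processing time depends on page content and hardware.

```python
def estimate_hybrid_timeout_ms(pages, seconds_per_page=25.0, safety_factor=3.0):
    """Rough timeout estimate in milliseconds, assuming time grows linearly with pages."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return int(pages * seconds_per_page * safety_factor * 1000)


# 4 pages at ~25 s/page with 3x headroom gives 300000 ms (5 minutes).
print(estimate_hybrid_timeout_ms(4))
```

The result is in milliseconds. Since the timeout is passed as a string in this thread's code, it would be wrapped as `str(estimate_hybrid_timeout_ms(4))`.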

Actual results from your PDF

I ran your Test_02.pdf through both modes. Here's the comparison:

| | Basic mode | Hybrid mode (`--enrich-formula`) |
|---|---|---|
| Images | 8 (2 per page, duplicates) | 1 (only the actual chart, Figure 3.3.1) |
| LaTeX math | 0 (formulas broken into plain text) | 38 formula blocks in `$$...$$` |
| Tables | 0 (data extracted as bullet lists) | 2 proper Markdown tables (Table 3.3.1 and 3.3.2) |
| Headings | 0 | 1 (`# Example 3.3.1`) |

You can see the actual output files here:

Basic mode result — broken math, no tables, 8 duplicate images
Hybrid mode result — LaTeX formulas, proper tables, 1 image
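Counts like the ones in the comparison above can be approximated directly from a Markdown file, which is a quick way to tell whether a run silently fell back to basic mode. This is a heuristic sketch, assuming formulas are delimited by `$$` and tables are written as pipe tables. `summarize_markdown` is a made-up helper name, and it counts table rows (header and separator included), not whole tables.

```python
import re


def summarize_markdown(text):
    """Count formula blocks, table rows, images, and headings in Markdown text."""
    lines = text.splitlines()
    return {
        # Each display formula uses one opening and one closing "$$".
        "formula_blocks": text.count("$$") // 2,
        # Pipe-table rows start and end with a pipe character.
        "table_rows": sum(1 for ln in lines if re.match(r"^\s*\|.*\|\s*$", ln)),
        "images": len(re.findall(r"!\[[^\]]*\]\(", text)),
        "headings": sum(1 for ln in lines if re.match(r"^#{1,6}\s", ln)),
    }


if __name__ == "__main__":
    import glob

    # "output" is the folder name used by the Colab code in this thread.
    for md_path in glob.glob("output/**/*.md", recursive=True):
        with open(md_path, encoding="utf-8") as f:
            print(md_path, summarize_markdown(f.read()))
```

If `formula_blocks` is 0 for a PDF that clearly contains math, the hybrid backend was most likely not used for that run.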

Fixed Colab Code

The key fix is adding hybrid_timeout="300000" (5 minutes). Here's the corrected version:

# ============================================================
# OpenDataLoader PDF — Google Colab (Full Version, Fixed)
# Supports: LaTeX math, complex tables, and images
# ============================================================

# --- Step 1: Install ---
print("⏳ Step 1/5: Installing... please wait.\n")
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U "opendataloader-pdf[hybrid]" -q
print("✅ Step 1/5: Installation complete!\n")

# --- Step 2: Start the AI engine and VERIFY it's ready ---
import subprocess, time, urllib.request

print("⏳ Step 2/5: Starting AI engine (first run downloads ~1-2 GB of models)...\n")
print("   This may take 5-10 minutes on the first run. Please be patient.\n")

# Start server with visible logs
log_file = open("hybrid_server.log", "w")
proc = subprocess.Popen(
    ["opendataloader-pdf-hybrid", "--enrich-formula"],
    stdout=log_file, stderr=subprocess.STDOUT
)

# Poll the health endpoint until the server is actually ready
max_wait = 600  # 10 minutes max for first-time model download
for i in range(max_wait // 5):
    time.sleep(5)
    try:
        resp = urllib.request.urlopen("http://localhost:5002/health", timeout=3)
        if resp.status == 200:
            print(f"✅ Step 2/5: AI engine ready! (took {(i+1)*5} seconds)\n")
            break
    except Exception:
        if i % 6 == 0:
            print(f"   Still loading... ({(i+1)*5}s elapsed)")
        if proc.poll() is not None:
            print("❌ Server crashed! Check the log below:\n")
            !cat hybrid_server.log
            raise RuntimeError("Hybrid server failed to start")
else:
    print("❌ Server did not start within 10 minutes. Log output:\n")
    !cat hybrid_server.log
    raise RuntimeError("Timeout waiting for hybrid server")

# --- Step 3: Upload your PDF ---
from google.colab import files
print("📂 Step 3/5: Select your PDF file below:\n")
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")

# --- Step 4: Convert ---
print("⏳ Step 4/5: Converting... this may take a few minutes.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown",
    hybrid="docling-fast",
    hybrid_mode="full",
    hybrid_timeout="300000",       # ← THIS WAS MISSING! 5 min timeout
    image_output="external",
    image_format="png",
)
print("✅ Step 4/5: Conversion complete!\n")

# --- Step 5: Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0], encoding="utf-8") as f:
        print("📄 Step 5/5: Preview:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5,000 characters)")

!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")

Alternative: Test on your own machine

If Colab keeps giving you trouble, running locally gives you full control. Minimum requirements:

| Requirement | Minimum |
|---|---|
| Java | 11+ |
| Python | 3.10+ |
| RAM | 8 GB |
| Disk | ~2 GB (for AI model cache) |
| GPU | Not required; runs entirely on CPU |

# Install
pip install "opendataloader-pdf[hybrid]"

# Terminal 1: Start the AI server (keep running)
opendataloader-pdf-hybrid --enrich-formula
# Wait for: "DocumentConverter initialized in XXs"

# Terminal 2: Convert your PDF
opendataloader-pdf Test_02.pdf -o output/ \
    --format markdown \
    --hybrid docling-fast \
    --hybrid-mode full \
    --hybrid-timeout 300000 \
    --image-output external

Note on LaTeX quality

The hybrid mode successfully extracts formulas into $$...$$ blocks, which is a big improvement. However, some complex multi-line equations (like summations with limits) may still appear slightly garbled — this is a known limitation of the current formula recognition model. Simple formulas like $$F > F(\alpha, 2, n-2)$$ and $$\hat{Y} = 0.009 + 0.986X$$ come through cleanly.
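To find the formulas most worth checking by hand, the `$$...$$` blocks can be pulled out of the Markdown and run through a simple sanity check. The sketch below flags formula bodies that are empty or have unbalanced braces. It is only a heuristic with invented helper names: a formula can pass this check and still be wrong, and unusual escaping may confuse it.

```python
import re

# Non-greedy match of everything between a pair of "$$" delimiters.
FORMULA = re.compile(r"\$\$(.*?)\$\$", re.DOTALL)


def braces_balanced(latex):
    """Return True if every '{' has a matching '}' (escaped braces are ignored)."""
    depth = 0
    stripped = latex.replace(r"\{", "").replace(r"\}", "")
    for ch in stripped:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # a closing brace appeared before its opener
    return depth == 0


def suspicious_formulas(markdown_text):
    """List formula bodies that are empty or have unbalanced braces."""
    flagged = []
    for body in FORMULA.findall(markdown_text):
        body = body.strip()
        if not body or not braces_balanced(body):
            flagged.append(body)
    return flagged


text = r"$$\hat{Y} = 0.009 + 0.986X$$ and $$\frac{a}{b$$"
print(suspicious_formulas(text))
```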

Let me know how it goes! 🙂

Gamalmohamed2016 (Author) · 2026-03-27T15:21:39Z

Thank you very much for your help, @hnc-jglee.
