Notebook in Google Colab #302
Replies: 10 comments
Hi, I am Pedro. Can I help?
Hi Pedro,
I am interested in testing opendataloader-pdf to parse a PDF into Markdown.
My test case is quite complex, as it contains LaTeX math, tables, and
images.
I am using Google Colab for this project. If you have a code example or a
notebook tailored for use in Google Colab, could you please share it with
me?
Best Regards
Gamal
…On Thu, Mar 19, 2026 at 11:20 PM PedroParro1902 ***@***.***> wrote:
Hi, I am Pedro
*Can I help?*
Welcome! No coding experience needed. I'll walk you through it step by step.
What you'll need
Instructions
That's it! The code will install everything, convert your PDF, show a preview, and download the results.
# ============================================================
# OpenDataLoader PDF — Google Colab Quick Start
# Just press ▶ and upload your PDF when the file picker appears
# ============================================================
# --- 1. Install (takes about 1 minute) ---
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U opendataloader-pdf -q
print("✅ Step 1/4: Installation complete!\n")
# --- 2. Upload your PDF ---
print("📂 Step 2/4: Select your PDF file below:\n")
from google.colab import files
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")
# --- 3. Convert ---
print("⏳ Step 3/4: Converting... please wait.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown,json",
    image_output="embedded",
)
print("✅ Step 3/4: Conversion complete!\n")
# --- 4. Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0]) as f:
        print("📄 Step 4/4: Here's a preview of your Markdown output:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5000 characters)")
!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")
What's included
I'd recommend trying the code above first and seeing how the output looks. For many documents, the basic mode already does a great job with tables and images. Let us know how it goes! 🙂
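To see at a glance what the conversion produced before opening the zip, a small sketch like the following may help. It assumes the `output/` directory used in the code above; `summarize_output` is a hypothetical helper name for illustration and is not part of opendataloader-pdf.

```python
from collections import Counter
from pathlib import Path


def summarize_output(output_dir="output"):
    """Count the files under output_dir, grouped by lowercase file extension."""
    return Counter(
        path.suffix.lower() or "(none)"      # files without an extension
        for path in Path(output_dir).rglob("*")
        if path.is_file()
    )
```

In Colab, running `print(summarize_output("output"))` in a new cell after the conversion step shows how many Markdown, JSON, and image files were written, which makes an unexpected image count easy to spot.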
Hi Jonggyu, Thanks for the quick start guide. I've tested the code, and while it is very fast, it did not capture the LaTeX math or the tables. The images were included as base64-encoded strings, but the image count was double the page count: the 4-page PDF produced 8 images (two for each page). I assume some adjustments are needed to properly extract the LaTeX, tables, and diagrams. I have attached the output for your reference.
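Thanks for testing and sharing the output! I can see what happened: the first code I shared was a basic version that doesn't support LaTeX math or complex tables. Let me give you an upgraded version that handles everything.
What went wrong
- No LaTeX math: The basic version can't extract math formulas. You need an upgraded setup that includes AI models for formula recognition.
- No tables: Your PDF likely has complex or borderless tables that need the AI-powered detection.
- Double images: The first code generated both Markdown and JSON files, each with their own copy of the images. The new code below fixes this.
Updated Code
Just like before, paste this into a new Colab cell and press ▶. It will take longer than the basic version because it downloads AI models (~1-2 GB) on the first run, but the results will be much better.
# ============================================================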
# OpenDataLoader PDF — Google Colab (Full Version)
# Supports: LaTeX math, complex tables, and images
# Just press ▶ and upload your PDF when prompted
# ============================================================
# --- Step 1: Install (takes 1-2 minutes) ---
print("⏳ Step 1/5: Installing... please wait.\n")
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U "opendataloader-pdf[hybrid]" -q
print("✅ Step 1/5: Installation complete!\n")
# --- Step 2: Start the AI engine for math & tables ---
# This runs in the background. First time takes a while to download models.
import subprocess, time
print("⏳ Step 2/5: Loading AI models (first run downloads ~1-2 GB, please be patient)...\n")
proc = subprocess.Popen(
    ["opendataloader-pdf-hybrid", "--enrich-formula"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
)
time.sleep(60)
print("✅ Step 2/5: AI engine ready!\n")
# --- Step 3: Upload your PDF ---
from google.colab import files
print("📂 Step 3/5: Select your PDF file below:\n")
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")
# --- Step 4: Convert ---
print("⏳ Step 4/5: Converting... this may take a minute.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown",
    hybrid="docling-fast",
    hybrid_mode="full",
    image_output="external",
    image_format="png",
)
print("✅ Step 4/5: Conversion complete!\n")
# --- Step 5: Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0]) as f:
        print("📄 Step 5/5: Preview:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5,000 characters)")
!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")
What's different from the first version?
How it works (technical details)
For those curious about what's happening under the hood:
Let us know how it goes! If you run into any issues, feel free to share the output here and we'll help. 🙂
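One quick way to tell whether math and tables actually made it into the Markdown is to count a few tell-tale markers in the text. The sketch below is only a rough heuristic: it assumes tables appear as pipe tables with a `|---|` separator row and display formulas are wrapped in `$$`, which may not match every output setting. `markdown_feature_counts` is a hypothetical helper name, not part of opendataloader-pdf.

```python
import re

# A pipe-table separator row such as |---|---| or | :--- | ---: |
TABLE_SEPARATOR = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$")
# A Markdown image: ![alt](target), capturing the target
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def markdown_feature_counts(markdown_text):
    """Count rough signs of tables, display math, and images in Markdown text."""
    lines = markdown_text.splitlines()
    image_targets = IMAGE_LINK.findall(markdown_text)
    return {
        "tables": sum(1 for line in lines if TABLE_SEPARATOR.match(line)),
        "display_math": markdown_text.count("$$") // 2,  # each formula has two delimiters
        "images": len(image_targets),
        "embedded_images": sum(
            1 for target in image_targets if target.startswith("data:image")
        ),
    }
```

Passing the contents of the generated `.md` file to this function gives a small dictionary of counts. If `tables` and `display_math` are both zero for a PDF that clearly contains them, the conversion probably ran in basic mode.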
Hi Jonggyu,
Thanks for the updated code and the detailed explanation.
I've tested the full version as suggested, but unfortunately, it still
isn't working as expected. I am still not seeing any LaTeX math or tables
in the output. Additionally, the Markdown file output continues to generate
two .png files for each PDF page.
Please let me know if there are any other adjustments I should try to get
these elements to extract correctly.
I have attached the original 4-page test PDF so you can try it.
Best Regards
Gamal
…On Mon, Mar 23, 2026 at 7:59 AM Jonggyu Lee ***@***.***> wrote:
Hi @Gamalmohamed2016 <https://github.com/Gamalmohamed2016>,
Thanks for testing and sharing the output! I can see what happened — the
first code I shared was a basic version that doesn't support LaTeX math or
complex tables. Let me give you an upgraded version that handles everything.
What went wrong
- *No LaTeX math* — The basic version can't extract math formulas. You
need an upgraded setup that includes AI models for formula recognition.
- *No tables* — Your PDF likely has complex or borderless tables that
need the AI-powered detection.
- *Double images* — The first code generated both Markdown and JSON
files, each with their own copy of the images. The new code below fixes
this.
Updated Code
Just like before — paste this into a new Colab cell and press ▶. It will
take longer than the basic version because it downloads AI models (~1-2 GB)
on the first run, but the results will be much better.
Thanks for your patience. I looked into what's going wrong and I think I can help.
About the attached PDF
It looks like your last comment was sent via email reply, which unfortunately doesn't support file attachments on GitHub. The PDF didn't come through. Could you post a new comment directly on this page (not via email) and drag-and-drop your PDF into the comment box? That way I can test with your exact file.
What I found from your Markdown output
I analyzed the
Why the second code (hybrid mode) also didn't work
The updated code I gave you had a critical problem: the hybrid server may not have started successfully, but all errors were hidden. Here's what likely happened:
Fixed Colab Code
Here's a corrected version with proper server startup verification:
# ============================================================
# OpenDataLoader PDF — Google Colab (Full Version, Fixed)
# Supports: LaTeX math, complex tables, and images
# ============================================================
# --- Step 1: Install ---
print("⏳ Step 1/5: Installing... please wait.\n")
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U "opendataloader-pdf[hybrid]" -q
print("✅ Step 1/5: Installation complete!\n")
# --- Step 2: Start the AI engine and VERIFY it's ready ---
import subprocess, time, urllib.request
print("⏳ Step 2/5: Starting AI engine (first run downloads ~1-2 GB of models)...\n")
print(" This may take 5-10 minutes on the first run. Please be patient.\n")
# Start server WITH visible logs so you can see what's happening
log_file = open("hybrid_server.log", "w")
proc = subprocess.Popen(
    ["opendataloader-pdf-hybrid", "--enrich-formula"],
    stdout=log_file, stderr=subprocess.STDOUT
)
# Poll the health endpoint until the server is actually ready
max_wait = 600  # 10 minutes max for first-time model download
for i in range(max_wait // 5):
    time.sleep(5)
    try:
        resp = urllib.request.urlopen("http://localhost:5002/health", timeout=3)
        if resp.status == 200:
            print(f"✅ Step 2/5: AI engine ready! (took {(i+1)*5} seconds)\n")
            break
    except Exception:
        if i % 6 == 0:  # Print progress every 30 seconds
            print(f" Still loading... ({(i+1)*5}s elapsed)")
        if proc.poll() is not None:
            print("❌ Server crashed! Check the log below:\n")
            !cat hybrid_server.log
            raise RuntimeError("Hybrid server failed to start")
else:
    print("❌ Server did not start within 10 minutes. Log output:\n")
    !cat hybrid_server.log
    raise RuntimeError("Timeout waiting for hybrid server")
# --- Step 3: Upload your PDF ---
from google.colab import files
print("📂 Step 3/5: Select your PDF file below:\n")
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")
# --- Step 4: Convert ---
print("⏳ Step 4/5: Converting... this may take a few minutes.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown",
    hybrid="docling-fast",
    hybrid_mode="full",
    image_output="external",
    image_format="png",
)
print("✅ Step 4/5: Conversion complete!\n")
# --- Step 5: Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0]) as f:
        print("📄 Step 5/5: Preview:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5,000 characters)")
!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")
Key changes from the previous version:
Alternative: Test locally (recommended for debugging)
If Colab keeps giving you trouble, testing on your own machine gives you full visibility into what's happening. Here are the minimum requirements:
Step 1. Install:
pip install "opendataloader-pdf[hybrid]"
Step 2. Open two terminal windows:
# Terminal 1: Start the AI server (keep this running)
opendataloader-pdf-hybrid --enrich-formula
# Wait until you see: "DocumentConverter initialized in XXs"
# Terminal 2: Convert your PDF
opendataloader-pdf -i your_file.pdf -o output/ \
    --format markdown \
    --hybrid docling-fast \
    --hybrid-mode full \
    --image-output external
The advantage of running locally is that you can see the server logs in real time. If something goes wrong, the error will be visible immediately in Terminal 1.
Troubleshooting
If it still doesn't work after these changes, please share:
We want to make sure this works for you! 🙂
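The startup check in the Colab code can also be written as a small reusable function, which may be handy when testing locally. This is a minimal sketch, assuming the server answers HTTP 200 on the `http://localhost:5002/health` URL used above; `wait_for_health` and its parameters are hypothetical names chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout_s=600.0, interval_s=5.0):
    """Poll url until it answers HTTP 200.

    Returns the number of seconds waited, or raises TimeoutError
    if the server is still not ready after timeout_s seconds.
    """
    start = time.monotonic()
    while True:
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or a non-200 reply
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"{url} not ready after {timeout_s:.0f}s")
        time.sleep(interval_s)
```

A call such as `wait_for_health("http://localhost:5002/health")` after starting the server blocks until the engine is ready. Unlike a fixed `time.sleep(60)`, it returns as soon as the server responds and fails loudly if it never does.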
This is the PDF test file I used; it is attached.
Thank you for uploading the PDF! I tested it locally and found the root cause of your problem.
The root cause: timeout
The hybrid server needs about 100 seconds to process your 4-page PDF, but the default timeout is only 30 seconds. When the timeout expires, the tool silently falls back to basic Java-only processing, which is why your output looked the same both times. The previous Colab code I shared was missing the
Actual results from your PDF
I ran your
You can see the actual output files here:
Fixed Colab Code
The key fix is adding
# ============================================================
# OpenDataLoader PDF — Google Colab (Full Version, Fixed)
# Supports: LaTeX math, complex tables, and images
# ============================================================
# --- Step 1: Install ---
print("⏳ Step 1/5: Installing... please wait.\n")
!apt-get update -qq && apt-get install -y -qq openjdk-17-jdk > /dev/null 2>&1
!pip install -U "opendataloader-pdf[hybrid]" -q
print("✅ Step 1/5: Installation complete!\n")
# --- Step 2: Start the AI engine and VERIFY it's ready ---
import subprocess, time, urllib.request
print("⏳ Step 2/5: Starting AI engine (first run downloads ~1-2 GB of models)...\n")
print(" This may take 5-10 minutes on the first run. Please be patient.\n")
# Start server with visible logs
log_file = open("hybrid_server.log", "w")
proc = subprocess.Popen(
    ["opendataloader-pdf-hybrid", "--enrich-formula"],
    stdout=log_file, stderr=subprocess.STDOUT
)
# Poll the health endpoint until the server is actually ready
max_wait = 600  # 10 minutes max for first-time model download
for i in range(max_wait // 5):
    time.sleep(5)
    try:
        resp = urllib.request.urlopen("http://localhost:5002/health", timeout=3)
        if resp.status == 200:
            print(f"✅ Step 2/5: AI engine ready! (took {(i+1)*5} seconds)\n")
            break
    except Exception:
        if i % 6 == 0:
            print(f" Still loading... ({(i+1)*5}s elapsed)")
        if proc.poll() is not None:
            print("❌ Server crashed! Check the log below:\n")
            !cat hybrid_server.log
            raise RuntimeError("Hybrid server failed to start")
else:
    print("❌ Server did not start within 10 minutes. Log output:\n")
    !cat hybrid_server.log
    raise RuntimeError("Timeout waiting for hybrid server")
# --- Step 3: Upload your PDF ---
from google.colab import files
print("📂 Step 3/5: Select your PDF file below:\n")
uploaded = files.upload()
pdf_name = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_name}\n")
# --- Step 4: Convert ---
print("⏳ Step 4/5: Converting... this may take a few minutes.\n")
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=pdf_name,
    output_dir="output/",
    format="markdown",
    hybrid="docling-fast",
    hybrid_mode="full",
    hybrid_timeout="300000",  # ← THIS WAS MISSING! 5 min timeout
    image_output="external",
    image_format="png",
)
print("✅ Step 4/5: Conversion complete!\n")
# --- Step 5: Preview & Download ---
import glob
md_files = glob.glob("output/**/*.md", recursive=True)
if md_files:
    with open(md_files[0]) as f:
        print("📄 Step 5/5: Preview:\n")
        print(f.read()[:5000])
        print("\n... (preview limited to 5,000 characters)")
!zip -r -q output.zip output/
files.download("output.zip")
print("\n✅ Done! Check your Downloads folder for output.zip")
Alternative: Test on your own machine
If Colab keeps giving you trouble, running locally gives you full control. Minimum requirements:
# Install
pip install "opendataloader-pdf[hybrid]"
# Terminal 1: Start the AI server (keep running)
opendataloader-pdf-hybrid --enrich-formula
# Wait for: "DocumentConverter initialized in XXs"
# Terminal 2: Convert your PDF
opendataloader-pdf Test_02.pdf -o output/ \
    --format markdown \
    --hybrid docling-fast \
    --hybrid-mode full \
    --hybrid-timeout 300000 \
    --image-output external
Note on LaTeX quality
The hybrid mode successfully extracts formulas into
Let me know how it goes! 🙂
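For larger PDFs the timeout may need to be raised again. The arithmetic behind the value used above can be sketched as follows, assuming (as the `300000` value and its "5 min" comment suggest) that the timeout is given in milliseconds. `hybrid_timeout_ms` and the safety factor of 3 are illustrative choices, not part of opendataloader-pdf.

```python
import math


def hybrid_timeout_ms(expected_seconds, safety_factor=3.0):
    """Return a timeout in milliseconds, as a string, with headroom over the expected time."""
    return str(math.ceil(expected_seconds * safety_factor * 1000))


# Figures reported in this thread:
DEFAULT_TIMEOUT_MS = 30 * 1000   # 30-second default
OBSERVED_MS = 100 * 1000         # about 100 seconds for the 4-page test PDF

# The default is shorter than the observed processing time,
# so the request times out before the server finishes.
default_is_too_short = DEFAULT_TIMEOUT_MS < OBSERVED_MS
```

With the roughly 100 seconds observed for the 4-page test PDF, `hybrid_timeout_ms(100)` gives `"300000"`, the same 5-minute value passed as `hybrid_timeout` in the fixed code.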
Thank you very much for your help, @hnc-jglee.
I tried hard to test opendataloader-pdf in Google Colab on a complex PDF document. I followed the steps but failed. I am not a coder, so if anyone has tested this parser, please share your code.