Step-by-step guide for converting .pptx presentations to Slidev slides.md.
sudo apt install libreoffice-impress poppler-utils
Ensure the repo's .gitignore includes the migration working directory and the
LibreOffice lock file (created while the PPTX is open):
presentation/migration/
.~lock.*.pptx#
The presentation/migration/ entry covers both the extraction artefacts (PDF, HTML,
PNGs, JPGs) and PPTX_REFERENCE.md, which is generated there in Step 7.
Steps 1 and 6 (PDF conversion + speaker notes extraction) can run in parallel since they use different source data. Step 2 depends on Step 1. Steps 3-5 are sequential.
mkdir -p presentation/migration
soffice --headless --convert-to pdf --outdir presentation/migration MyPresentation.pptx
The --headless flag runs LibreOffice without a GUI.
Ignore the "failed to launch javaldx" warning; it is harmless.
pdftohtml -s -noframes -nomerge -fmt png \
  presentation/migration/MyPresentation.pdf \
  presentation/migration/MyPresentation.html
Flags:
- `-s` -- single HTML document (all pages in one file)
- `-noframes` -- no HTML frameset wrapper
- `-nomerge` -- keep paragraphs separate (better for parsing)
- `-fmt png` -- write the rendered page images as PNG
Output files:
- MyPresentation.html -- single HTML file with all slides
- MyPresentation001.png ... MyPresentationNNN.png -- full-page slide renders (background images)
- MyPresentation-{page}_{seq}.jpg -- extracted sub-images (logos, memes, diagrams, photos)
Important: The HTML file is typically too large to read directly (~30K+ tokens for a 60-slide deck). Always use the Python parser script below rather than reading the HTML manually.
Use this script to extract all slide text with positions and CSS classes:
import re
from html import unescape
html_path = "presentation/migration/MyPresentation.html"
with open(html_path) as f:
    html = f.read()
pages = re.split(r'<!-- Page (\d+) -->', html)
for i in range(1, len(pages), 2):
    page_num = pages[i]
    content = pages[i+1]
    texts = re.findall(
        r'<p style="position:absolute;top:(\d+)px;left:(\d+)px;[^"]*"[^>]*class="([^"]*)"[^>]*>(.*?)</p>',
        content
    )
    print(f"\n{'='*60}")
    print(f"SLIDE {page_num}")
    print(f"{'='*60}")
    if not texts:
        print("[NO TEXT - IMAGE ONLY SLIDE]")
        continue
    for top, left, css_class, text_html in sorted(texts, key=lambda x: (int(x[0]), int(x[1]))):
        text = re.sub(r'<[^>]+>', '', text_html)
        text = unescape(text).replace('\xa0', ' ').strip()
        if text:
            print(f" [{top:>4},{left:>4}] ({css_class:>6}) {text}")
Then only visually inspect (Read the .png) slides that are [NO TEXT - IMAGE ONLY SLIDE]
or where the text content is ambiguous.
The HTML structure per slide:
<!-- Page N -->
<div id="pageN-div" style="position:relative;width:2160px;height:1215px;">
  <img src="MyPresentation00N.png" alt="background image"/>
  <p style="position:absolute;..." class="ftXY">Text content</p>
  ...
</div>
- Each <!-- Page N --> comment marks a slide boundary
- The <img> is a full render of the slide (useful for visual reference)
- <p> tags contain the text content, positioned absolutely
- Font classes (.ftXY) encode font-size, font-family, and color
- Large font sizes = titles, medium = bullet content, small = subtitles
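Because the font classes carry the size information, a small helper can rank classes by font size to spot likely titles. This is a minimal sketch that assumes pdftohtml writes CSS rules shaped like `.ft00{font-size:48px;...}` into the HTML; check your own output before relying on it. The names `font_sizes` and `largest_class` are hypothetical helpers, not part of any tool.

```python
import re
from typing import Dict, Optional

def font_sizes(html: str) -> Dict[str, int]:
    """Map each ftXY class name to its font size in px (assumed CSS shape)."""
    rules = re.findall(r'\.(ft\w+)\{[^}]*?font-size:(\d+)px', html)
    return {name: int(size) for name, size in rules}

def largest_class(html: str) -> Optional[str]:
    """Return the class with the biggest font size, a likely title style."""
    sizes = font_sizes(html)
    return max(sizes, key=sizes.get) if sizes else None
```

Feeding the class of each parsed `<p>` through `font_sizes` gives a rough title / bullet / subtitle ranking without opening the PNG renders.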
| Clue | Slide Type |
|---|---|
| Page 1, large title text | Cover slide |
| Bullet list with • | Menu/Agenda |
| Only a title + image in background | Section divider |
| Title + bulleted content | Default content slide |
| Code-colored text (blue, green, orange classes) | Code slide |
| Two columns of content | Comparison slide |
| Large centered text, no bullets | Quote slide |
The sub-images (MyPresentation-{page}_{seq}.jpg) are the useful ones:
- Memes, diagrams, book covers, photos
- Avoid the small repeated ones (~14KB, ~1.8KB) -- these are typically theme logos/icons
List all sub-images sorted by file size. Images with identical file sizes are almost certainly the same theme element repeated across slides (logos, decorative corners). Filter these out first:
ls -la presentation/migration/*.jpg | awk '{print $5}' | sort | uniq -c | sort -rn | head -20
Sizes appearing 5+ times are theme elements. Only visually inspect images that:
- Are larger than ~30KB
- Have a unique or rare file size (appears 1-3 times)
One extracted image will be the venue/building photo used in the theme's cover and
break layouts. This is typically from slide 1 or 2. Copy it as cover-art.jpg.
For talks that only have a generic placeholder cover image (e.g. the old building photo), generate a custom one with Midjourney. Follow this style guide:
- Palette: Warm orange, amber, gold tones against dark backgrounds. Must complement the theme's orange (#E78200) cover layout.
- Style: Abstract, geometric, stylized digital illustration. Clean and modern. Not photographic.
- Format: Portrait ~2:3 ratio (~400×600px final). Save as cover-art.jpg (or .png).
- Suffix: --ar 2:3 --v 6.1 --no text words letters
Approved examples for reference:
| Talk | Prompt |
|---|---|
| MicroServices | A murmuration of small luminous geometric birds forming a larger shape in the sky, each bird autonomous yet part of the whole, warm orange and gold against a deep twilight sky, abstract digital painting, ethereal and dynamic |
| UnitTesting/TDD | Abstract illustration of a shield protecting a crystalline code structure, green checkmarks floating around it, warm orange and amber palette with dark background, stylized geometric digital art, clean and modern |
| NTier-Hex-Onion | A stylized cross-section of concentric architectural rings, the innermost core glowing gold, surrounded by hexagonal lattice layers, each layer a different muted earth tone, abstract geometric style, clean lines, dark background with warm orange accents, digital illustration, minimal, elegant |
Copy meaningful images to presentation/images/ with descriptive names:
cp presentation/migration/MyPresentation-3_7.jpg presentation/images/meme-descriptive-name.jpg
The full-page background renders (MyPresentation00N.png) are useful as reference
but should NOT be used in slides.md -- they contain the old PowerPoint theme chrome.
Every PPTX slide with a non-theme image must have a corresponding image in slides.md. After extracting images, build a mapping table:
import os, re
migration_dir = "presentation/migration"
html_path = f"{migration_dir}/MyPresentation.html"
with open(html_path) as f:
    html = f.read()
# Get page boundaries
pages = [(m.group(1), m.start()) for m in re.finditer(r'<!-- Page (\d+) -->', html)]
# Theme element sizes (bytes) -- these repeat across slides, skip them
theme_sizes = set()
size_counts = {}
for f in os.listdir(migration_dir):
    if f.endswith('.jpg'):
        sz = os.path.getsize(f"{migration_dir}/{f}")
        size_counts[sz] = size_counts.get(sz, 0) + 1
for sz, count in size_counts.items():
    if count >= 5:
        theme_sizes.add(sz)
# Map images to pages
for f in sorted(os.listdir(migration_dir)):
    if not f.endswith('.jpg'):
        continue
    m = re.match(r'.*-(\d+)_(\d+)\.jpg', f)
    if not m:
        continue
    page = int(m.group(1))
    sz = os.path.getsize(f"{migration_dir}/{f}")
    if sz in theme_sizes:
        continue
    print(f"Page {page:3d}: {f} ({sz:,} bytes)")
Use this table to verify that every image is copied and referenced in slides.md.
After writing slides.md, re-run the slide index script (see below) and cross-check that
each slide with a PPTX image has the correct ::image:: or image frontmatter.
Step 4b: Extract hidden slides from PPTX
Hidden slides are NOT exported to PDF. Extract them directly from the PPTX XML:
import xml.etree.ElementTree as ET
import os
ns = {'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
      'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}
# First extract the PPTX
os.system('mkdir -p /tmp/pptx-slides && unzip -o MyPresentation.pptx "ppt/slides/*.xml" -d /tmp/pptx-slides -x "*/rels/*"')
slide_files = sorted(
    [f for f in os.listdir('/tmp/pptx-slides/ppt/slides') if f.startswith('slide') and f.endswith('.xml')],
    key=lambda f: int(f.replace('slide','').replace('.xml',''))
)
for sf in slide_files:
    tree = ET.parse(f'/tmp/pptx-slides/ppt/slides/{sf}')
    root = tree.getroot()
    if root.get('show') == '0':
        texts = [t.text.strip() for t in tree.findall('.//a:t', ns) if t.text and t.text.strip()]
        title = ' '.join(texts[:8]) if texts else '(no text)'
        num = sf.replace('slide','').replace('.xml','')
        print(f'Hidden slide {num}: {title}')
Hidden slides should be added to slides.md with disabled: true in frontmatter.
Place them at their original position in the slide order:
---
layout: default
disabled: true
---
# Hidden Slide Title
Content here
Add session metadata to the YAML frontmatter of slides.md:
---
theme: ./theme
title: 2022-08-25-UnitTesting
transition: fade
session-time: 70min
track: Architecture
type: Theoretical
---
- session-time: duration of the presentation
- track: which track it belongs to (e.g. Architecture, Frontend, Backend)
- type: Theoretical, Hands-on, Workshop, etc.
Every talk repo needs an ElevatorPitch.md in the git root (NOT inside presentation/).
The showcase index site at itenium-be.github.io/Presentations/ reads it to render the
talk's card description. A migration of an existing PPTX must create this file if it
doesn't exist yet — bun run scaffold would have created it, but a migrated talk
typically pre-dates the scaffold.
Required structure:
# Talk Title
## Abstract
A brief description (2-3 sentences). Displayed as the card description on the index site.
## Target Audience
Who should attend this talk?
## Key Takeaways
- Takeaway 1
- Takeaway 2
## Session Format
45-60 minutes
When migrating, draft the Abstract from the cover slide + agenda items, the Target Audience from the talk's track/type, and the Key Takeaways from the section dividers. Ask the user to review before committing: these fields are user-facing on the showcase site and benefit from a human pass.
| PowerPoint slide | Slidev layout |
|---|---|
| Title slide | cover |
| Table of contents | agenda (items in frontmatter, pick size by item count — see sizing guide) |
| Section divider (photo bg + title) | section |
| Bullet content, NO image in PPTX | default with <v-clicks> |
| Bullet content + image in PPTX | default-aside with ::image:: slot (circled top-right) |
| Image left + content right | two-col-image-text with ::image:: + ::content:: slots |
| Two-column with pros/cons | comparison with .cols/.col (preserve emojis!) |
| Content left + informational image right | two-col-image-text (retired two-col-text-image; always use image-left layout now) |
| 1-2 bold statements, no bullets | statement (no author needed). Supports ::image:: for circled corner image |
| Full-screen meme/image (no bullet content) | quote-image with ::image:: slot (green bg, image centered) |
| Large quote/meme | quote |
| Break slide | break with <Timer> |
| Social links | socials |
| Thank you | end |
Use this to pick the right layout for each slide:
- Does the slide have only an image (no bullets, maybe just a title)?
  → quote-image (centered image, green bg)
- Does the slide have 1-2 short statements (no bullets)?
  → statement (+ ::image:: if PPTX had an image)
- Does the slide have bullet content + a decorative image in the PPTX?
  → default-aside (circled image top-right, bullet content left)
- Does the slide have bullet content + an informational diagram/chart?
  → two-col-image-text (image left, text right)
- Does the slide have bullet content, no image in the PPTX?
  → default
Critical rule: If the PPTX slide had an image, the Slidev slide MUST have that
image. Use the image-to-slide mapping table to verify. Never use default layout
for a slide that had an image in the PPTX — use default-aside instead.
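The decision list and the critical rule can be sketched as one function. This is an illustrative sketch only: `pick_layout` and the `"decorative"` / `"informational"` labels are hypothetical names, and classifying an image as one or the other still needs a human look.

```python
from typing import Optional

def pick_layout(has_bullets: bool, statements: int = 0,
                image: Optional[str] = None) -> str:
    """image is None, "decorative" or "informational" (labels assumed here)."""
    if not has_bullets:
        if 1 <= statements <= 2:
            return "statement"       # add ::image:: when image is not None
        if image is not None:
            return "quote-image"     # image-only slide
    if image == "informational":
        return "two-col-image-text"  # diagram/chart left, text right
    if image is not None:
        return "default-aside"       # critical rule: never drop a PPTX image
    return "default"
```

The image checks come before the plain `default` fallback, so a slide that had an image in the PPTX can never end up on the `default` layout.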
PowerPoint Source slide: The last slide should use the dedicated source layout
linking to the repo. Get the org/repo from git remote -v and pass it via frontmatter:
---
layout: source
source: itenium-be/REPO_NAME
---
The layout renders the title, QR code, and clickable GitHub link from the single
source: field. Don't reach for the old inline <QRCode> HTML; that pattern is
deprecated.
These patterns from the PPTX HTML can be detected automatically:
1. Circular corner images (default-aside): Slides with the "orange lines top-left" template have a circular decorative image. Detect by checking the background image for the orange-lines pattern OR by finding 3+ extracted sub-images per slide where the largest (>100KB) is the circle photo. Use default-aside layout with ::image::.
   - Default position is top-right. If the image center-Y is >40% of slide height, use image-position: middle-right.
2. Emojis: The PPTX XML <a:t> tags preserve emoji characters (🧐, ⚠️, etc.). These are lost in the PDF→HTML pipeline because pdftohtml renders them as images. Extract emoji text directly from ppt/slides/slideN.xml instead.
3. Quote slides (statement): Slides with only 1-2 <p> tags of large font (>100px) and no bullet markers (•) are statement/quote slides. Use statement.
4. Comparison slides (comparison): Slides with two distinct columns of content (detected by two groups of <p> tags with very different x-offsets). Preserve emoji prefixes (🧐/⚠️) as-is in the markdown.
5. Image-only slides (two-col-image-text without ::content::): Slides where the only text is a title and the rest is a large image/meme. The two-col-image-text layout auto-centers when no ::content:: slot is provided. Prefer image frontmatter over the ::image:: slot for large images: the frontmatter renders a direct <img> tag that the layout CSS can properly constrain. Slot-based images get wrapped in <p> tags by Slidev's markdown renderer, which breaks max-height containment.
6. Size frontmatter: Add size: sm or size: xs when content is dense:
   - agenda: ≤5 items → lg, 6 items → md (default), 7-8 items → sm, 9+ items → xs
   - default / default-aside: 6+ bullet points → size: sm
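The numeric thresholds in the rules above are easy to get wrong by hand, so they can be captured in small helpers. This is a sketch; `agenda_size`, `content_size` and `image_position` are hypothetical names and only restate the thresholds listed above.

```python
from typing import Optional

def agenda_size(items: int) -> str:
    """Size for an agenda slide by item count."""
    if items <= 5:
        return "lg"
    if items == 6:
        return "md"   # the layout default
    if items <= 8:
        return "sm"
    return "xs"

def content_size(bullets: int) -> Optional[str]:
    """Size for default / default-aside; None means leave the default."""
    return "sm" if bullets >= 6 else None

def image_position(center_y: float, slide_height: float) -> str:
    """Corner image placement from the image's vertical center."""
    return "middle-right" if center_y / slide_height > 0.40 else "top-right"
```

Remember that the frontmatter value is the bare size (`sm`), not `size-sm`.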
Generate a slide index to map Slidev slide numbers to titles. Use this when the user references slides by number. Run from the presentation directory:
import re
with open('slides.md') as f:
    content = f.read()
# Split on slide boundaries: --- possibly followed by frontmatter ---
parts = re.split(r'\n---\n', content)
# First part is global frontmatter, skip it
slides = []
i = 1
while i < len(parts):
    part = parts[i].strip()
    lines = part.split('\n')
    has_layout = any(re.match(r'^[a-z].*:', l) for l in lines[:10])
    if has_layout and i + 1 < len(parts):
        slides.append(part + '\n---\n' + parts[i+1])
        i += 2
    else:
        slides.append(part)
        i += 1
for idx, slide in enumerate(slides, 1):
    lines = slide.strip().split('\n')
    title = ''
    subtitle = ''
    layout = ''
    for line in lines:
        if line.startswith('layout:'):
            layout = line.split(':',1)[1].strip()
        if line.startswith('# ') and not title:
            title = line.strip()[2:]
        elif line.startswith('## ') and not subtitle:
            subtitle = line.strip()[3:]
    display = title if title else '(untitled)'
    if subtitle:
        display += ' — ' + subtitle
    if layout:
        display = f'[{layout}] {display}'
    print(f'{idx:3d}. {display}')
Always run this at the start of a finetuning session so that slide numbers from the user match exactly what Slidev displays in the browser.
| Old name | New name |
|---|---|
| default-image | default-aside |
| image-content | two-col-image-text |
| content-image | two-col-text-image → deleted (use two-col-image-text instead) |
| quote-alt | statement |
Add code-inspired decorations (h1/h2 frontmatter) to roughly 40% of content slides.
Pick a random mix of types, colors, and positions per slide. Vary them — don't repeat
the same decoration on consecutive slides.
Which layouts get decorations:
| Layout | Decorate? | Notes |
|---|---|---|
| default | ~50% | Good candidate, especially slides with bullet content |
| default-aside | ~60% | Most common decorated layout |
| comparison | ~50% | Decorate h1; h2 optional |
| two-col-image-text | ~40% | Only when slide has text content (not image-only) |
| section | never | Section dividers stay clean |
| statement | never | No h1/h2 to decorate |
| quote | never | Special styling, don't decorate |
| quote-image | never | Special styling, don't decorate |
| agenda | never | Structured layout, don't decorate |
| cover | optional | At most the h1 |
| end / socials / break | never | No content titles |
h2 decorations: Only add to ~30% of slides that already have an h1 decoration. Never add h2 decoration without h1 decoration on the same slide.
Available types and rules:
| Type | Position | Colors |
|---|---|---|
| dot | end only | primary, muted |
| slashes | end only | primary, muted |
| brackets | all, or word range (e.g. 2 or 2-3) | primary, muted |
| braces | all, or word range | primary, muted |
| hash | start only | primary, muted |
| semicolon | start or end | muted only |
white color is only available on layout: statement. All other layouts use
primary or muted only.
Colors: primary = #f1b06c (orange), muted = #6ebca5 (teal), white = #ffffff
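The type, position and color rules can be checked mechanically before writing frontmatter. The sketch below transcribes the table into a lookup; `RULES` and `is_valid` are hypothetical names, and allowing white for every type on `layout: statement` is an assumption, since the rules only say where white is available.

```python
import re

# type -> (allowed positions, allowed colors), transcribed from the table above
RULES = {
    "dot":       ({"end"}, {"primary", "muted"}),
    "slashes":   ({"end"}, {"primary", "muted"}),
    "brackets":  ({"all", "range"}, {"primary", "muted"}),
    "braces":    ({"all", "range"}, {"primary", "muted"}),
    "hash":      ({"start"}, {"primary", "muted"}),
    "semicolon": ({"start", "end"}, {"muted"}),
}

def is_valid(dtype, position, color, layout="default"):
    """Return True when the decoration obeys the position and color rules."""
    if dtype not in RULES:
        return False
    positions, colors = RULES[dtype]
    pos = str(position)
    # A word index ("2") or a word range ("2-3") counts as a range position.
    kind = "range" if re.fullmatch(r"\d+(-\d+)?", pos) else pos
    if kind not in positions:
        return False
    if color == "white":
        return layout == "statement"  # assumption: any type may be white there
    return color in colors
```

For example, `is_valid("braces", 2, "primary")` accepts a braces decoration on the second word, while `is_valid("semicolon", "end", "primary")` is rejected because semicolon is muted only.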
Example frontmatter:
h1:
  type: braces
  color: primary
  position: 2
h2:
  type: dot
  color: muted
  position: end
- Line breaks in HTML: pdftohtml splits long lines into multiple <p> tags. Reassemble them by checking if the next <p> has a similar x-offset (continuation) or is indented further (sub-bullet).
- Embedded images vs text: Some slides have all content baked into the background image with no text <p> tags. View the .png background to understand the content.
- Font colors as semantic hints: Colored text (green #6a9955, blue #569cd6) typically indicates code syntax highlighting in the original.
- &nbsp;: Non-breaking spaces are used liberally -- strip them when converting.
- Emojis are lost in PDF export: Extract them from the PPTX slide XML directly (see auto-detection rule #2). Do not rely on the HTML output for emoji content.
- Speaker notes: PowerPoint speaker notes are NOT preserved in the PDF export. Extract them directly from the PPTX (see Step 6).
- Size prefix: The size frontmatter value should be sm, xs, etc., NOT size-sm. The layout template adds the size- prefix automatically.
A .pptx file is a ZIP containing XML. Speaker notes live in ppt/notesSlides/notesSlideN.xml.
CRITICAL: notesSlideN.xml is NOT linked to slideN.xml by filename. The mapping lives in ppt/slides/_rels/slideN.xml.rels. Iterating notes by filename gives wrong results: slide 7 might have its notes in notesSlide12.xml. Always resolve via the .rels file. Extract the rels too:
unzip -o MyPresentation.pptx -d /tmp/pptx-extract
# (full extraction; we need slides/, slides/_rels/, notesSlides/, _rels/, presentation.xml)
Resolve slide order from presentation.xml (the sldIdLst order is the presentation
order, which may differ from filename order), then for each slide find its notes via
the rels file:
import os
import re
import xml.etree.ElementTree as ET
ns = {
    'a': 'http://schemas.openxmlformats.org/drawingml/2006/main',
    'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
}
RID = '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id'
base = '/tmp/pptx-extract/ppt'
# 1) Get slide order from presentation.xml
pres = ET.parse(f'{base}/presentation.xml').getroot()
pres_rels = ET.parse(f'{base}/_rels/presentation.xml.rels').getroot()
rid_to_target = {r.get('Id'): r.get('Target') for r in pres_rels}
slide_order = []
for sldId in pres.findall('.//p:sldIdLst/p:sldId', ns):
    target = rid_to_target[sldId.get(RID)]
    slide_order.append(os.path.basename(target))  # e.g. 'slide1.xml'
# 2) For each slide, find notes via slideN.xml.rels
# Preserve **bold**/*italic* formatting from <a:rPr b="1"/> / i="1".
# Each <a:p> is a paragraph (separate line); each <a:r> is a run with optional rPr.
A = '{http://schemas.openxmlformats.org/drawingml/2006/main}'
def _wrap(text, bold, italic):
    if not text.strip():
        return text
    m = re.match(r'^(\s*)(.*?)(\s*)$', text, re.DOTALL)
    lead, core, trail = m.group(1), m.group(2), m.group(3)
    if bold and italic: core = f'***{core}***'
    elif bold: core = f'**{core}**'
    elif italic: core = f'*{core}*'
    return lead + core + trail
def _format_paragraph(p_elem):
    runs = []
    for r in p_elem.findall(f'{A}r'):
        t = r.find(f'{A}t')
        if t is None or t.text is None:
            continue
        rpr = r.find(f'{A}rPr')
        bold = rpr is not None and rpr.get('b') == '1'
        italic = rpr is not None and rpr.get('i') == '1'
        runs.append((t.text, bold, italic))
    # CRITICAL: merge adjacent runs with the same formatting before wrapping,
    # otherwise PowerPoint's word-by-word runs become "**word1** **word2**".
    merged = []
    for text, b, i in runs:
        if merged and merged[-1][1] == b and merged[-1][2] == i:
            merged[-1] = (merged[-1][0] + text, b, i)
        else:
            merged.append((text, b, i))
    return ''.join(_wrap(t, b, i) for t, b, i in merged).rstrip()
def get_notes(slide_xml):
    rels_path = f'{base}/slides/_rels/{slide_xml}.rels'
    if not os.path.exists(rels_path):
        return []
    rels = ET.parse(rels_path).getroot()
    for rel in rels:
        if 'notesSlide' in rel.get('Type', ''):
            notes_file = os.path.basename(rel.get('Target'))
            ntree = ET.parse(f'{base}/notesSlides/{notes_file}')
            for sp in ntree.findall('.//p:sp', ns):
                ph = sp.find('.//p:ph', ns)
                if ph is not None and ph.get('idx') == '1':
                    paras = []
                    for p in sp.findall('.//a:p', ns):
                        line = _format_paragraph(p)
                        if line:
                            paras.append(line)
                    return paras
    return []
# 3) Also use the slide XML's `show="0"` attribute to detect HIDDEN slides
def is_hidden(slide_xml):
    return ET.parse(f'{base}/slides/{slide_xml}').getroot().get('show') == '0'
for pos, sf in enumerate(slide_order, 1):
    notes = get_notes(sf)
    h = ' [HIDDEN]' if is_hidden(sf) else ''
    if notes:
        print(f'=== SLIDE {pos}{h} ({sf}) ===')
        print('\n'.join(notes))
        print()
The pos (1-based index into slide_order) is the presentation slide number the
user sees. Use this number, not the slide XML filename suffix.
Speaker notes render as markdown in Slidev's presenter mode, so preserve PPTX formatting as markdown:
- Bold runs (<a:rPr b="1"/>) → **text**
- Italic runs (i="1") → *text*
- Each <a:p> paragraph → its own line, joined with blank lines so markdown treats them as separate paragraphs
When writing the notes into the slide's <!-- ... --> comment, join the paragraph
list with '\n\n'.join(paras) (not '\n'). Single newlines inside markdown collapse
into one paragraph.
Replace bullet point characters (•) with markdown list markers (-) in notes.
<!-- and --> must always be on their own line — never inline with note content.
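The formatting rules above can be combined into one helper that turns the paragraph list into a ready-to-paste comment block. This is a minimal sketch; `notes_comment` is a hypothetical name, and it assumes the input is a list of paragraphs like the one returned by the Step 6 extraction.

```python
def notes_comment(paras):
    """Build the <!-- ... --> block for one slide from note paragraphs."""
    cleaned = []
    for p in paras:
        p = p.strip()
        if p.startswith("•"):
            # Swap the PowerPoint bullet for a markdown list marker.
            p = "- " + p[1:].lstrip()
        if p:
            cleaned.append(p)
    if not cleaned:
        return ""  # no notes: emit nothing rather than an empty comment
    # Blank lines between paragraphs; comment markers on their own lines.
    return "<!--\n" + "\n\n".join(cleaned) + "\n-->"
```

Returning an empty string for slides without notes avoids littering `slides.md` with empty comment blocks.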
Add the extracted notes as HTML comments at the end of each slide in slides.md:
# Slide Title
- Content here
<!--
Speaker notes go here. Only visible in presenter mode.
-->
After completing the migration, generate presentation/migration/PPTX_REFERENCE.md, a per-slide
snapshot of the original PPTX content. This serves as the source of truth for verifying
the migration and debugging missing content without re-extracting the PPTX.
Combine data from Steps 3 (HTML text), 4 (images), and 6 (speaker notes) into a single file. For each PPTX slide, extract:
- Text: all text content from the slide XML
- Images: PPTX media filenames → imported filenames (or NOT IMPORTED if missing)
- Notes: speaker notes from notesSlideN.xml
- Slidev slide: corresponding slide number in slides.md
## PPTX Slide 1 → Slidev Slide 1
**Layout template:** Title slide
**Text:**
- UnitTesting
- TDD
**Images:**
| PPTX media | Imported as | Type |
|------------|-------------|------|
| image1.jpeg | cover-art.jpg | cover photo |
| image2.png | (template decoration) | skip |
**Notes:**
Quote from "Working Effectively with Legacy Code": ...
Run from the project root (where the .pptx lives). Requires the PPTX and
the completed slides.md. Combine the extraction scripts from Steps 3, 4, and 6:
- Extract slide text from each ppt/slides/slideN.xml
- Map images via ppt/slides/_rels/slideN.xml.rels; filter out template decorations (SVGs, small PNGs <10KB) to focus on content images
- For each content image, check if a matching file exists in presentation/images/ (compare by visual inspection or file hash) and record the mapping
- Extract speaker notes from ppt/notesSlides/notesSlideN.xml, resolved through the slide's .rels file as in Step 6
- Match PPTX slide numbers to Slidev slide numbers using the Slide Index script
- After migration: verify no images or text were lost
- During finetuning: when a user says "this slide is missing something", check the reference file before re-extracting the PPTX
- Debugging: compare the reference entry against the current slides.md slide to spot discrepancies
The problem: slidev slide numbers do NOT match PPTX slide numbers, because the migration adds section dividers, reorders hidden slides, adds a QR/PowerPoint Source slide, etc. Any approach that maps notes by slide index will silently produce an off-by-one (or off-by-N) cascade. The user notices weeks later when notes don't match the slide content.
The rule: Match slidev slides to PPTX slides by content overlap, not by index.
- Speaker notes "feel related but wrong" (e.g., the next slide's notes are showing)
- disabled: true is on the wrong slide
- A specific slide has the right notes but everything after it is shifted
1. Re-extract PPTX data using the rels-based script in Step 6 (filename mapping is wrong).
2. Parse slides.md into slide blocks, capturing each slide's frontmatter and body. Splitter: re.split(r'(?m)^---\s*$', content). A block is "frontmatter" if its first non-empty line matches ^[a-zA-Z_][a-zA-Z0-9_-]*\s*: (key: value).
3. Skip slides that have no PPTX equivalent when matching:
   - layout: section (added during migration as chapter dividers)
   - layout: end, layout: socials, layout: source (the dedicated source layout, the final QR-code slide added at the end of every talk)
4. Match each remaining slidev slide to a PPTX slide using a normalized word-set overlap of the slide text, with a position bias to break ties when titles repeat (e.g. many slides titled "Interprocess Communication"):

       def normalize(text):
           text = text.lower().replace('…','').replace('\u2019',"'")
           text = re.sub(r'[^a-z0-9]+', ' ', text)
           return set(w for w in text.split() if len(w) > 2)
       # For each slidev slide (in order), search ALL unused PPTX slides:
       #   cscore = |bag ∩ pbag| / min(|bag|, |pbag|)
       #   score = cscore - 0.03 * abs(pptx_idx - expected_next_pos)
       # Pick best, mark used, advance expected_next_pos.
       #
       # Special case: if BOTH bags are empty (image-only meme slide vs PPTX slide
       # with no text), give cscore = 0.95; position bias will pick the right one.
       # If only one is empty, give cscore = 0.3 (let position bias decide).

   The 0.03 * dist penalty is small enough that strong content matches still win across reorderings (e.g. the Chapters slide moved to before the book slide), but large enough to disambiguate the dozen near-identical "Async Messaging" slides.
5. Apply two fixes per matched pair:
   - Speaker notes: strip ALL existing <!-- ... --> from the slide body and append the PPTX notes as a fresh comment block. (Don't try to preserve old notes: if reconciliation is needed, the existing ones are wrong by definition.)
   - disabled: true: ensure the slidev slide has it iff the matched PPTX slide has show="0".
6. Verify with 3-4 ground-truth pairs the user gives you ("slide X with title Y should have note Z"). If any fail, the matching is still off; do NOT write changes.
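The matching step can be sketched as a small greedy matcher. This is one possible reading of the scoring rule, not a definitive implementation: `content_score` and `match_slides` are hypothetical names, indices are 0-based, and advancing the expected position to "best match + 1" is an assumption about what "advance expected_next_pos" means.

```python
import re

def normalize(text):
    text = text.lower().replace('…', '').replace('\u2019', "'")
    text = re.sub(r'[^a-z0-9]+', ' ', text)
    return set(w for w in text.split() if len(w) > 2)

def content_score(bag, pbag):
    if not bag and not pbag:
        return 0.95   # both image-only: let position decide
    if not bag or not pbag:
        return 0.3    # only one side has text
    return len(bag & pbag) / min(len(bag), len(pbag))

def match_slides(slidev_texts, pptx_texts, penalty=0.03):
    """For each Slidev slide, return the 0-based index of its best PPTX match."""
    pbags = [normalize(t) for t in pptx_texts]
    used, result, expected = set(), [], 0
    for text in slidev_texts:
        bag = normalize(text)
        best, best_score = None, float('-inf')
        for idx, pbag in enumerate(pbags):
            if idx in used:
                continue
            score = content_score(bag, pbag) - penalty * abs(idx - expected)
            if score > best_score:
                best, best_score = idx, score
        result.append(best)       # None only when PPTX slides ran out
        if best is not None:
            used.add(best)
            expected = best + 1   # assumed advance rule
    return result
```

Pass only the Slidev slides that survive the skip list in step 3, and log `best_score` in real use so that weak matches can be flagged for review instead of being written silently.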
The trap: re.split(r'(?m)^---\s*$', content) followed by '---'.join(parts)
strips the entire --- line on split, so on rejoin the blank lines that surrounded
it (both the blank line before the separator and any blank lines after frontmatter
closing) collapse. This produces a 200-line whitespace-only diff on top of the
actual fix and the user gets annoyed.
The fix: Edit slides.md by line ranges, not by string split.
lines = content.split('\n')
sep_lines = [i for i, l in enumerate(lines) if l.strip() == '---']
# Walk separators to identify slide bodies as (slidev_idx, start_line, end_line)
# Then for each body, find the existing trailing <!-- ... --> comment by scanning
# backwards for a line ending in '-->' and the matching '<!--' opener.
# Replace the comment in place using lines[body_start:body_end] = new_body_lines.
# Process slides in REVERSE order so earlier line numbers stay valid.
Key rules when editing in place:
- Strip blank lines immediately before the old comment, but preserve blank lines after (the trailing whitespace before the next --- separator).
- When inserting a new comment, prepend exactly one blank line as separator from the slide body.
- Never touch frontmatter blocks (--- ... key: value ... ---); they should stay exactly as-is.
- After the rewrite, run git diff --stat HEAD slides.md. If it reports far more changed lines than the few slides you edited can account for, your line-range logic is wrong.
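The per-slide part of these rules can be sketched as a function that rewrites the lines of a single slide body. `replace_trailing_comment` is a hypothetical name; it assumes the body contains neither the `---` separators nor the frontmatter block, and that the notes comment is the last non-blank content of the slide.

```python
def replace_trailing_comment(body, notes):
    """body: lines of one slide body. notes: list of note lines.
    Returns a new list of lines; the input list is not modified."""
    # Keep the blank lines that trail the body (before the next '---').
    end = len(body)
    while end > 0 and body[end - 1].strip() == '':
        end -= 1
    trailing = body[end:]
    core = body[:end]
    # If the body ends with a comment, cut it off at its '<!--' opener.
    if core and core[-1].rstrip().endswith('-->'):
        start = len(core) - 1
        while start >= 0 and '<!--' not in core[start]:
            start -= 1
        if start >= 0:
            core = core[:start]
    # Strip blank lines that sat immediately before the old comment.
    while core and core[-1].strip() == '':
        core.pop()
    if notes:
        # Exactly one blank line between the slide body and the new comment.
        core = core + ['', '<!--'] + list(notes) + ['-->']
    return core + trailing
```

Apply it with `lines[body_start:body_end] = replace_trailing_comment(lines[body_start:body_end], notes)`, walking the slides in reverse order so that earlier line numbers stay valid.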
- The user reorders, hides, and inserts slides during finetuning.
- Section dividers and the QR slide have no PPTX counterpart, so any index-based mapping must hard-code their positions — and breaks the moment the user adds another section.
- Content overlap is self-correcting: if slide titles change, you'll see the score drop and can flag it instead of silently writing wrong notes.