fix(paste): detect heading levels from Google Docs styled paragraphs#2178
fix(paste): detect heading levels from Google Docs styled paragraphs#2178ErickPetru wants to merge 5 commits intosuperdoc-dev:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 167f5aa820
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
caio-pizzol
left a comment
There was a problem hiding this comment.
nice work on this, thanks for picking it up! the code is clean, tests are thorough, and the prior bot feedback (bold false positives + list-item corruption) is both addressed in the current version.
two things worth discussing:
multi-span headings won't be detected
getHeadingStyleProps only reads styles from a single child <span> (children.length === 1). a heading that contains a link, bookmark, or line break will produce multiple child elements — and since Google Docs usually puts the styles on the spans (not the <p>), those headings will be silently skipped.
e.g. this wouldn't be detected:
<p>
<span style="font-size:20pt;font-weight:700">Heading with </span>
<a href="..."><span style="font-size:20pt;font-weight:700">a link</span></a>
</p>one option: check that all child elements share the same font-size and bold status instead of requiring exactly one span. up to you whether that's worth the added complexity for this PR or a follow-up.
minor: closest('li') instead of parentElement
the list-item filter only checks the immediate parent. the rest of the pipeline makes the same assumption (<li> > <p> direct child), so it's consistent. but !p.closest('li') would be a zero-cost safety net in case Google Docs ever wraps list content differently.
|
Thanks for the thorough review, @caio-pizzol! Both points addressed:
|
caio-pizzol
left a comment
There was a problem hiding this comment.
@ErickPetru the full-paragraph bold check and the closest('li') fix both look good — nice work on those.
one thing still worth a look: left an inline comment on the paragraph-level bold case in fromSpans. it's not a blocker since the pattern is uncommon in real Google Docs output, but an easy one-liner if you want to close the gap.
caio-pizzol
left a comment
There was a problem hiding this comment.
@ErickPetru the bold detection fix looks good — all feedback from previous rounds is addressed.
two small things:
- the size thresholds in
headingSizeMapcould use a short comment saying they come from Google Docs defaults - a couple of guard paths in
fromSpansdon't have tests — left an inline comment with example inputs
neither is blocking, but easy wins.
| import { getLvlTextForGoogleList, googleNumDefMap } from '../../helpers/pasteListHelpers.js'; | ||
| import { wrapTextsInRuns } from '../docx-paste/docx-paste.js'; | ||
|
|
||
| // Ordered largest → smallest; first match wins. |
There was a problem hiding this comment.
a comment saying these numbers come from Google Docs default heading sizes (H1=20pt, H2=16pt, etc.) would help — not obvious where they came from otherwise.
There was a problem hiding this comment.
Good suggestion; what's obvious to one developer might not be to another.
|
|
||
| // if inconsistent sizes, mixed body text, not a heading | ||
| const [firstSpanSize] = sizes; | ||
| if (sizes.some((size) => size !== firstSpanSize)) return notHeading; |
There was a problem hiding this comment.
these two checks don't have tests yet. if they broke, paragraphs with mixed spans would quietly turn into headings. two quick cases to add:
<!-- one span has no font-size — should stay as <p> -->
<p><span style="font-size:20pt;font-weight:700">A</span><span style="font-weight:700">B</span></p>
<!-- spans have different sizes — should stay as <p> -->
<p><span style="font-size:20pt;font-weight:700">A</span><span style="font-size:14pt;font-weight:700">B</span></p>There was a problem hiding this comment.
Yes, more coverage makes perfect sense. Thank you for the patient review, I've sent the new tests.
caio-pizzol
left a comment
There was a problem hiding this comment.
Hey @ErickPetru, I found the corner case we're not covering here.
When you copy text from Google Docs that's styled to look like a heading (bold + large font) but is set as "Normal text" in the styles dropdown, it gets wrongly converted into a heading in SuperDoc. For example, any bold text at 11pt (Google Docs' default size) becomes an h5.
The thing is, Google Docs already puts real headings as <h1>–<h6> in the clipboard — so the font-size detection only runs on normal paragraphs, which shouldn't be converted.
The fix is to skip the convertStyledHeadings step entirely — headings already work without it. Left an inline suggestion.
| const tempDivWithHeadings = convertStyledHeadings(tempDiv); | ||
|
|
||
| const htmlWithMergedLists = mergeSeparateLists(tempDivWithHeadings); |
There was a problem hiding this comment.
Google Docs clipboard already uses <h1>–<h6> for headings set via the styles dropdown — ProseMirror parses these natively. This conversion only runs on <p> elements (normal text), so bold body text at 11pt gets wrongly turned into h5.
| const tempDivWithHeadings = convertStyledHeadings(tempDiv); | |
| const htmlWithMergedLists = mergeSeparateLists(tempDivWithHeadings); | |
| // Google Docs already outputs semantic <h1>–<h6> for headings set via | |
| // the paragraph style dropdown — ProseMirror handles them natively. | |
| const htmlWithMergedLists = mergeSeparateLists(tempDiv); |
There was a problem hiding this comment.
Hey @caio-pizzol, I want to make sure I'm not missing something here, because this seems to contradict the issue #2152 itself.
The issue description states:
Google Docs converts most heading levels to
<p>tags with inlinefont-size/font-weightstyling instead of semantic<h1>-<h6>tags. Theparagraph.parseDOMhas rules forh1-h6but they never fire for these styled<p>elements.
That's exactly the pattern this PR detects. So could you clarify? Did something change in how Google Docs serializes to clipboard, or was the issue description inaccurate?
If Google Docs already outputs semantic heading tags upfront, then the issue itself would be invalid and need to be closed, but the repro steps suggest otherwise.
There was a problem hiding this comment.
Hey @ErickPetru, you're right to question it - the issue description is wrong, and that's on us for not verifying the assumption before writing it up.
I just tested both paste flows with debug logging on the clipboard HTML:
- Google Docs into SuperDoc: 6 semantic
<h1>–<h6>tags, 2<p>elements (body text) - Word into SuperDoc: 6 semantic
<h1>–<h6>tags, 5<p>elements (body text)
Both sources already output proper heading tags in the clipboard. ProseMirror handles them natively — no conversion needed.
The convertStyledHeadings function only operates on <p> elements, which in practice are body text. That's why bold text at 11pt (Google Docs' default body size) gets incorrectly promoted to h5.
Sorry for the confusion on this - but appreciate your patience working through the reviews.
If you're gained, there are other good first issue here - feel free to pick any from the list (I will also double check if we might have any there are misinterpreted
Problem
Closes #2152
Cause
Google Docs converts most heading levels to
<p>tags with inlinefont-size/font-weightstyling instead of semantic<h1>–<h6>tags. ProseMirror'sparagraph.parseDOMhas rules forh1–h6but they never fire for these styled<p>elements, so the heading level is silently dropped.Fix
convertStyledHeadingsto the Google Docs paste pipeline, running before ProseMirror parses the pasted DOM<p>elements carrying bold + a recognised font-size (checked on both the<p>itself and its first child<span>, covering both Google Docs style placements) and replaces them with the corresponding<h1>–<h5>elementh1)