Skip to content

fix(paste): detect heading levels from Google Docs styled paragraphs#2178

Closed
ErickPetru wants to merge 5 commits intosuperdoc-dev:mainfrom
ErickPetru:fix/google-docs-heading-detection
Closed

fix(paste): detect heading levels from Google Docs styled paragraphs#2178
ErickPetru wants to merge 5 commits intosuperdoc-dev:mainfrom
ErickPetru:fix/google-docs-heading-detection

Conversation

@ErickPetru
Copy link
Copy Markdown

@ErickPetru ErickPetru commented Feb 25, 2026

Problem

When pasting content with H1, H3-H6 headings from Google Docs, they lose their heading level and paste as plain paragraphs. Only H2 is reliably preserved as a heading tag.

Closes #2152

SuperDoc with pasted Google Docs headings after the fix is applied

Cause

Google Docs converts most heading levels to <p> tags with inline font-size/font-weight styling instead of semantic <h1><h6> tags. ProseMirror's paragraph.parseDOM has rules for h1h6 but they never fire for these styled <p> elements, so the heading level is silently dropped.

Fix

  • Added convertStyledHeadings to the Google Docs paste pipeline, running before ProseMirror parses the pasted DOM
  • Detects <p> elements carrying bold + a recognised font-size (checked on both the <p> itself and its first child <span>, covering both Google Docs style placements) and replaces them with the corresponding <h1><h5> element
  • Font-size thresholds accommodate the default Google Docs theme and alternate themes (e.g. 20 pt and 24 pt both resolve to h1)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 167f5aa820

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/super-editor/src/core/inputRules/google-docs-paste/google-docs-paste.js Outdated
Comment thread packages/super-editor/src/core/inputRules/google-docs-paste/google-docs-paste.js Outdated
Copy link
Copy Markdown
Contributor

@caio-pizzol caio-pizzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work on this, thanks for picking it up! the code is clean, tests are thorough, and the prior bot feedback (bold false positives + list-item corruption) is both addressed in the current version.

two things worth discussing:

multi-span headings won't be detected

getHeadingStyleProps only reads styles from a single child <span> (children.length === 1). a heading that contains a link, bookmark, or line break will produce multiple child elements — and since Google Docs usually puts the styles on the spans (not the <p>), those headings will be silently skipped.

e.g. this wouldn't be detected:

<p>
  <span style="font-size:20pt;font-weight:700">Heading with </span>
  <a href="..."><span style="font-size:20pt;font-weight:700">a link</span></a>
</p>

one option: check that all child elements share the same font-size and bold status instead of requiring exactly one span. up to you whether that's worth the added complexity for this PR or a follow-up.

minor: closest('li') instead of parentElement

the list-item filter only checks the immediate parent. the rest of the pipeline makes the same assumption (<li> > <p> direct child), so it's consistent. but !p.closest('li') would be a zero-cost safety net in case Google Docs ever wraps list content differently.

@ErickPetru
Copy link
Copy Markdown
Author

Thanks for the thorough review, @caio-pizzol!

Both points addressed:

  1. Multi-span headings: replaced the singleSpan guard with an all-spans approach which uses querySelectorAll('span') then requires them all to share the same font-size and bold status. This detects headings with links/bookmarks while still rejecting body paragraphs with a single bold opening word (since inconsistent bold across spans is not a heading).

  2. closest('li'): swapped the parentElement check for p.closest('li') as suggested.

Copy link
Copy Markdown
Contributor

@caio-pizzol caio-pizzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ErickPetru the full-paragraph bold check and the closest('li') fix both look good — nice work on those.

one thing still worth a look: left an inline comment on the paragraph-level bold case in fromSpans. it's not a blocker since the pattern is uncommon in real Google Docs output, but an easy one-liner if you want to close the gap.

Comment thread packages/super-editor/src/core/inputRules/google-docs-paste/google-docs-paste.js Outdated
Copy link
Copy Markdown
Contributor

@caio-pizzol caio-pizzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ErickPetru the bold detection fix looks good — all feedback from previous rounds is addressed.

two small things:

  1. the size thresholds in headingSizeMap could use a short comment saying they come from Google Docs defaults
  2. a couple of guard paths in fromSpans don't have tests — left an inline comment with example inputs

neither is blocking, but easy wins.

import { getLvlTextForGoogleList, googleNumDefMap } from '../../helpers/pasteListHelpers.js';
import { wrapTextsInRuns } from '../docx-paste/docx-paste.js';

// Ordered largest → smallest; first match wins.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a comment saying these numbers come from Google Docs default heading sizes (H1=20pt, H2=16pt, etc.) would help — not obvious where they came from otherwise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion; what's obvious to one developer might not be to another.


// if inconsistent sizes, mixed body text, not a heading
const [firstSpanSize] = sizes;
if (sizes.some((size) => size !== firstSpanSize)) return notHeading;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two checks don't have tests yet. if they broke, paragraphs with mixed spans would quietly turn into headings. two quick cases to add:

<!-- one span has no font-size — should stay as <p> -->
<p><span style="font-size:20pt;font-weight:700">A</span><span style="font-weight:700">B</span></p>

<!-- spans have different sizes — should stay as <p> -->
<p><span style="font-size:20pt;font-weight:700">A</span><span style="font-size:14pt;font-weight:700">B</span></p>

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, more coverage makes perfect sense. Thank you for the patient review, I've sent the new tests.

Copy link
Copy Markdown
Contributor

@caio-pizzol caio-pizzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ErickPetru, I found the corner case we're not covering here.

When you copy text from Google Docs that's styled to look like a heading (bold + large font) but is set as "Normal text" in the styles dropdown, it gets wrongly converted into a heading in SuperDoc. For example, any bold text at 11pt (Google Docs' default size) becomes an h5.

The thing is, Google Docs already puts real headings as <h1><h6> in the clipboard — so the font-size detection only runs on normal paragraphs, which shouldn't be converted.

The fix is to skip the convertStyledHeadings step entirely — headings already work without it. Left an inline suggestion.

Comment on lines +36 to +38
const tempDivWithHeadings = convertStyledHeadings(tempDiv);

const htmlWithMergedLists = mergeSeparateLists(tempDivWithHeadings);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google Docs clipboard already uses <h1><h6> for headings set via the styles dropdown — ProseMirror parses these natively. This conversion only runs on <p> elements (normal text), so bold body text at 11pt gets wrongly turned into h5.

Suggested change
const tempDivWithHeadings = convertStyledHeadings(tempDiv);
const htmlWithMergedLists = mergeSeparateLists(tempDivWithHeadings);
// Google Docs already outputs semantic <h1>–<h6> for headings set via
// the paragraph style dropdown — ProseMirror handles them natively.
const htmlWithMergedLists = mergeSeparateLists(tempDiv);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @caio-pizzol, I want to make sure I'm not missing something here, because this seems to contradict the issue #2152 itself.

The issue description states:

Google Docs converts most heading levels to <p> tags with inline font-size/font-weight styling instead of semantic <h1>-<h6> tags. The paragraph.parseDOM has rules for h1-h6 but they never fire for these styled <p> elements.

That's exactly the pattern this PR detects. So could you clarify? Did something change in how Google Docs serializes to clipboard, or was the issue description inaccurate?

If Google Docs already outputs semantic heading tags upfront, then the issue itself would be invalid and need to be closed, but the repro steps suggest otherwise.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ErickPetru, you're right to question it - the issue description is wrong, and that's on us for not verifying the assumption before writing it up.

I just tested both paste flows with debug logging on the clipboard HTML:

  • Google Docs into SuperDoc: 6 semantic <h1>–<h6> tags, 2 <p> elements (body text)
  • Word into SuperDoc: 6 semantic <h1>–<h6> tags, 5 <p> elements (body text)

Both sources already output proper heading tags in the clipboard. ProseMirror handles them natively — no conversion needed.

The convertStyledHeadings function only operates on <p> elements, which in practice are body text. That's why bold text at 11pt (Google Docs' default body size) gets incorrectly promoted to h5.

Sorry for the confusion on this - but appreciate your patience working through the reviews.

If you're gained, there are other good first issue here - feel free to pick any from the list (I will also double check if we might have any there are misinterpreted

@caio-pizzol caio-pizzol self-requested a review March 3, 2026 22:27
@caio-pizzol caio-pizzol closed this Mar 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Detect heading levels from Google Docs styled paragraphs

2 participants