@@ -5,6 +5,18 @@ import { createSingleItemList } from '../html/html-helpers.js';
import { getLvlTextForGoogleList, googleNumDefMap } from '../../helpers/pasteListHelpers.js';
import { wrapTextsInRuns } from '../docx-paste/docx-paste.js';

// Match Google Docs default heading sizes (H1=20pt, H2=16pt, H3=14pt, H4=12pt, H5=11pt).
// Entries are lower bounds in descending order, so an oversized font (e.g. 24pt)
// still resolves to the closest heading (h1). The h5 bound is 10pt, just below the 11pt default.
const headingSizeMap = [
  { minPt: 20, tag: 'h1' },
  { minPt: 16, tag: 'h2' },
  { minPt: 14, tag: 'h3' },
  { minPt: 12, tag: 'h4' },
  { minPt: 10, tag: 'h5' },
];

const boldWeightRegex = /^(bold|700|800|900)$/i;

/**
 * Main handler for pasted Google Docs content.
 *
@@ -21,7 +33,9 @@ export const handleGoogleDocsHtml = (html, editor, view) => {
  const tempDiv = document.createElement('div');
  tempDiv.innerHTML = cleanedHtml;

- const htmlWithMergedLists = mergeSeparateLists(tempDiv);
+ const tempDivWithHeadings = convertStyledHeadings(tempDiv);
+
+ const htmlWithMergedLists = mergeSeparateLists(tempDivWithHeadings);
Comment on lines +36 to +38
Contributor

The Google Docs clipboard already uses <h1>–<h6> for headings set via the styles dropdown, and ProseMirror parses these natively. This conversion only runs on <p> elements (normal text), so bold body text at 11pt is wrongly turned into an h5.

Suggested change
- const tempDivWithHeadings = convertStyledHeadings(tempDiv);
- const htmlWithMergedLists = mergeSeparateLists(tempDivWithHeadings);
+ // Google Docs already outputs semantic <h1>–<h6> for headings set via
+ // the paragraph style dropdown; ProseMirror handles them natively.
+ const htmlWithMergedLists = mergeSeparateLists(tempDiv);
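
The false positive described in this comment can be reproduced without a DOM. The sketch below copies `headingSizeMap` from the diff and wraps the lookup in a hypothetical helper, `resolveHeadingTag`, which is not part of the PR. It assumes the font size (in pt) and the bold flag have already been extracted from the element.

```javascript
// Thresholds copied from the diff under review.
const headingSizeMap = [
  { minPt: 20, tag: 'h1' },
  { minPt: 16, tag: 'h2' },
  { minPt: 14, tag: 'h3' },
  { minPt: 12, tag: 'h4' },
  { minPt: 10, tag: 'h5' },
];

// Hypothetical helper (not in the PR): mirrors the lookup inside
// convertStyledHeadings for an already-extracted size and bold flag.
function resolveHeadingTag(fontSizePt, isBold) {
  if (!isBold || fontSizePt === null) return null;
  // The map is sorted descending, so the first bound the size reaches wins.
  const match = headingSizeMap.find(({ minPt }) => fontSizePt >= minPt);
  return match ? match.tag : null;
}

console.log(resolveHeadingTag(11, true));  // h5: bold body text at the default size is promoted
console.log(resolveHeadingTag(11, false)); // null: non-bold body text is left alone
console.log(resolveHeadingTag(24, true));  // h1: oversized fonts fall into the first bucket
```

Because 11pt is the default body size named in the review, any fully bold body paragraph clears the 10pt bound and is treated as an h5.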

Author

Hey @caio-pizzol, I want to make sure I'm not missing something here, because this seems to contradict the issue #2152 itself.

The issue description states:

Google Docs converts most heading levels to <p> tags with inline font-size/font-weight styling instead of semantic <h1>-<h6> tags. The paragraph.parseDOM has rules for h1-h6 but they never fire for these styled <p> elements.

That's exactly the pattern this PR detects. So could you clarify? Did something change in how Google Docs serializes to clipboard, or was the issue description inaccurate?

If Google Docs already outputs semantic heading tags upfront, then the issue itself would be invalid and need to be closed, but the repro steps suggest otherwise.

Contributor

Hey @ErickPetru, you're right to question it - the issue description is wrong, and that's on us for not verifying the assumption before writing it up.

I just tested both paste flows with debug logging on the clipboard HTML:

  • Google Docs into SuperDoc: 6 semantic <h1>–<h6> tags, 2 <p> elements (body text)
  • Word into SuperDoc: 6 semantic <h1>–<h6> tags, 5 <p> elements (body text)

Both sources already output proper heading tags in the clipboard. ProseMirror handles them natively — no conversion needed.

The convertStyledHeadings function only operates on <p> elements, which in practice are body text. That's why bold text at 11pt (Google Docs' default body size) gets incorrectly promoted to h5.

Sorry for the confusion on this - but appreciate your patience working through the reviews.

If you're game, there are other good first issues here - feel free to pick any from the list (I will also double-check whether any of them are misinterpreted).
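
The clipboard check mentioned in this thread (counting heading and paragraph tags in the pasted HTML) can be approximated with a few lines. This is a rough sketch: `countOpeningTags` is a hypothetical debugging helper, the sample string is made up for illustration and is not the payload the reviewer logged, and a regex is only good enough for a quick census, not for general HTML parsing.

```javascript
// Hypothetical debugging helper: counts opening tags by name in an HTML string.
function countOpeningTags(html) {
  const counts = {};
  // Matches "<name ...>" but not closing tags such as "</name>".
  const tagPattern = /<([a-z][a-z0-9]*)\b[^>]*>/gi;
  for (const [, name] of html.matchAll(tagPattern)) {
    const key = name.toLowerCase();
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Made-up sample standing in for clipboard HTML.
const sample = '<h1>Title</h1><p>Body</p><h2>Section</h2><p><span>More body</span></p>';
const counts = countOpeningTags(sample);
console.log(counts); // { h1: 1, p: 2, h2: 1, span: 1 }
```

In a browser, the real input would typically come from `event.clipboardData.getData('text/html')` inside a paste handler; where exactly to hook that into the editor is left open here.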

  const flattenHtml = flattenListsInHtml(htmlWithMergedLists, editor);

  let doc = DOMParser.fromSchema(editor.schema).parse(flattenHtml);
@@ -253,3 +267,88 @@ function buildListPath(level, map) {
  }
  return path;
}

/**
 * Converts Google Docs styled <p> elements that represent headings into proper
 * <h1>–<h5> tags before ProseMirror parsing.
 *
 * Google Docs converts heading levels to <p> tags with inline font-size /
 * font-weight styling instead of semantic heading tags. This function detects
 * that pattern and replaces the elements in-place.
 *
 * @param {HTMLElement} container
 * @returns {HTMLElement} The same container, mutated in place.
 */
function convertStyledHeadings(container) {
  // Skip paragraphs inside list items so list structure is left untouched.
  const paragraphs = Array.from(container.querySelectorAll('p')).filter((p) => !p.closest('li'));

  paragraphs.forEach((p) => {
    const { fontSize, isBold } = getHeadingStyleProps(p);
    if (!isBold || fontSize === null) return;

    // headingSizeMap is sorted descending, so the first match is the largest heading that fits.
    const match = headingSizeMap.find(({ minPt }) => fontSize >= minPt);
    if (!match) return;

    const heading = document.createElement(match.tag);
    heading.innerHTML = p.innerHTML;
    Array.from(p.attributes).forEach((attr) => heading.setAttribute(attr.name, attr.value));
    p.replaceWith(heading);
  });

  return container;
}

/**
 * Reads font-size (in pt) and bold status from an element's inline style.
 * When font-size is on the root element, bold is accepted from the root or
 * all child spans. When font-size is only on child spans, all spans must
 * share the same size, and bold is from the root or all child spans.
 *
 * @param {HTMLElement} el
 * @returns {{ fontSize: number|null, isBold: boolean }}
 */
function getHeadingStyleProps(el) {
  const elFontSize = parsePtValue(el.style.fontSize);
  const elIsBold = boldWeightRegex.test(el.style.fontWeight || '');
  const spans = Array.from(el.querySelectorAll('span'));
  const spanIsBold = (span) => boldWeightRegex.test(span.style.fontWeight || '');
  const notHeading = { fontSize: null, isBold: false };

  // font-size declared on root element: bold from itself or if all child spans are bold
  const fromElement = () => {
    const isBold = elIsBold || (spans.length > 0 && spans.every(spanIsBold));
    return { fontSize: elFontSize, isBold };
  };

  // font-size only on child spans: all must be same size, then bold from root or all spans
  const fromSpans = () => {
    // no span children, size is indeterminate
    if (spans.length === 0) return notHeading;

    // if not all spans declare a font-size, not a heading
    const sizes = spans.map((span) => parsePtValue(span.style.fontSize));
    if (sizes.some((size) => size === null)) return notHeading;

    // if inconsistent sizes, mixed body text, not a heading
    const [firstSpanSize] = sizes;
    if (sizes.some((size) => size !== firstSpanSize)) return notHeading;
Contributor

these two checks don't have tests yet. if they broke, paragraphs with mixed spans would quietly turn into headings. two quick cases to add:

<!-- one span has no font-size — should stay as <p> -->
<p><span style="font-size:20pt;font-weight:700">A</span><span style="font-weight:700">B</span></p>

<!-- spans have different sizes — should stay as <p> -->
<p><span style="font-size:20pt;font-weight:700">A</span><span style="font-size:14pt;font-weight:700">B</span></p>

Author

Yes, more coverage makes perfect sense. Thank you for the patient review, I've sent the new tests.
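
The two guards discussed in this thread can be exercised in isolation. The sketch below is a DOM-free approximation, not the PR's code: `uniformSpanSize` is a hypothetical stand-in for the size checks in `fromSpans`, and each span is modelled as a plain object holding its inline `fontSize` string.

```javascript
// Same parsing rule as parsePtValue in the diff: only "<number>pt" is accepted.
function parsePt(cssValue) {
  if (!cssValue) return null;
  const m = cssValue.match(/^([\d.]+)pt$/i);
  return m ? parseFloat(m[1]) : null;
}

// Hypothetical stand-in for the size checks in fromSpans.
// Returns the shared size in pt, or null when the spans do not agree.
function uniformSpanSize(spans) {
  if (spans.length === 0) return null;
  const sizes = spans.map((span) => parsePt(span.fontSize));
  if (sizes.some((size) => size === null)) return null;     // a span without a pt size
  if (sizes.some((size) => size !== sizes[0])) return null; // mixed sizes
  return sizes[0];
}

// The two review cases, plus a uniform control case.
const missingSize = [{ fontSize: '20pt' }, { fontSize: '' }];
const mixedSizes = [{ fontSize: '20pt' }, { fontSize: '14pt' }];
const uniform = [{ fontSize: '20pt' }, { fontSize: '20pt' }];

console.log(uniformSpanSize(missingSize)); // null
console.log(uniformSpanSize(mixedSizes));  // null
console.log(uniformSpanSize(uniform));     // 20
```

If either guard were removed, the first two cases would fall through and report 20, which is the silent regression the reviewer is worried about.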


    // otherwise, first span size, and root element or all spans bold
    const isBold = elIsBold || spans.every(spanIsBold);
    return { fontSize: firstSpanSize, isBold };
  };

  return elFontSize !== null ? fromElement() : fromSpans();
}

/**
 * Parses a CSS font-size value in pt units, e.g. "20pt" → 20. Returns null
 * for any other format.
 *
 * @param {string|undefined} cssValue
 * @returns {number|null}
 */
function parsePtValue(cssValue) {
  if (!cssValue) return null;
  const m = cssValue.match(/^([\d.]+)pt$/i);
  return m ? parseFloat(m[1]) : null;
}
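
A note on the parser above: the character class `[\d.]+` also accepts malformed numbers. The sketch below reproduces `parsePtValue` as written and compares it with a hypothetical stricter variant, `parsePtStrict`, which is a possible hardening and not something the PR contains.

```javascript
// Same regex as parsePtValue in the diff.
function parsePtValue(cssValue) {
  if (!cssValue) return null;
  const m = cssValue.match(/^([\d.]+)pt$/i);
  return m ? parseFloat(m[1]) : null;
}

// Hypothetical stricter variant (not in the PR): digits with an optional
// fractional part, so strings such as ".pt" or "1.2.3pt" are rejected outright.
function parsePtStrict(cssValue) {
  if (!cssValue) return null;
  const m = cssValue.match(/^(\d+(?:\.\d+)?)pt$/i);
  return m ? parseFloat(m[1]) : null;
}

console.log(parsePtValue('20pt'));    // 20
console.log(parsePtValue('16px'));    // null
console.log(parsePtValue('.pt'));     // NaN: the regex matches, but parseFloat('.') cannot parse
console.log(parsePtValue('1.2.3pt')); // 1.2: parseFloat stops at the second dot
console.log(parsePtStrict('.pt'));    // null
console.log(parsePtStrict('10.5pt')); // 10.5
```

In the PR's flow a `NaN` size is harmless, since `NaN >= minPt` is always false and no heading is matched, but it does violate the documented `number|null` return type. The strict variant also rejects a leading-dot value such as `.5pt`, which may or may not be desirable.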
@@ -95,4 +95,138 @@ describe('handleGoogleDocsHtml', () => {
    expect(replaceSelectionWith).toHaveBeenCalledWith(parseResult, true);
    expect(dispatch).toHaveBeenCalledWith('next');
  });

  describe('convertStyledHeadings', () => {
    function makeEditor(dispatch, replaceSelectionWith) {
      return {
        editor: { schema: {}, view: { dispatch }, options: {} },
        view: { state: { tr: { replaceSelectionWith } } },
      };
    }

    function parseHeadings(html) {
      const dispatch = vi.fn();
      const replaceSelectionWith = vi.fn(() => 'next');
      const { editor, view } = makeEditor(dispatch, replaceSelectionWith);
      handleGoogleDocsHtml(html, editor, view);
      return parseSpy.mock.calls[0][0];
    }

    it('converts bold <p> with large font-size to heading tags', () => {
      const html = `
        <p style="font-size:20pt;font-weight:700">Heading 1</p>
        <p style="font-size:16pt;font-weight:bold">Heading 2</p>
        <p style="font-size:14pt;font-weight:700">Heading 3</p>
        <p style="font-size:12pt;font-weight:700">Heading 4</p>
        <p style="font-size:11pt;font-weight:700">Heading 5</p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')?.textContent?.trim()).toBe('Heading 1');
      expect(dom.querySelector('h2')?.textContent?.trim()).toBe('Heading 2');
      expect(dom.querySelector('h3')?.textContent?.trim()).toBe('Heading 3');
      expect(dom.querySelector('h4')?.textContent?.trim()).toBe('Heading 4');
      expect(dom.querySelector('h5')?.textContent?.trim()).toBe('Heading 5');
    });

    it('converts when style is on a child <span> instead of the <p>', () => {
      const html = `
        <p><span style="font-size:20pt;font-weight:700">Heading from span</span></p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')?.textContent?.trim()).toBe('Heading from span');
      expect(dom.querySelector('p')).toBeNull();
    });

    it('does not convert non-bold paragraphs', () => {
      const html = `<p style="font-size:20pt">Not a heading</p>`;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')).toBeNull();
      expect(dom.querySelector('p')?.textContent?.trim()).toBe('Not a heading');
    });

    it('does not convert bold paragraphs with small font-size', () => {
      const html = `<p style="font-size:9pt;font-weight:700">Small bold</p>`;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1,h2,h3,h4,h5')).toBeNull();
    });

    it('handles large font-sizes from alternate Google Docs themes (e.g. 24pt → h1)', () => {
      const html = `<p style="font-size:24pt;font-weight:700">Big Heading</p>`;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')?.textContent?.trim()).toBe('Big Heading');
    });

    it('does not convert a paragraph where only the first of multiple spans is bold', () => {
      // Body paragraph with a bold opening word — must not become a heading.
      const html = `
        <p>
          <span style="font-size:11pt;font-weight:700">Bold word</span>
          <span style="font-size:11pt;">rest of text</span>
        </p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1,h2,h3,h4,h5')).toBeNull();
    });

    it('does not convert <p> elements inside <li> to avoid corrupting list structure', () => {
      const html = `
        <ul>
          <li><p style="font-size:20pt;font-weight:700">List item</p></li>
        </ul>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')).toBeNull();
      expect(dom.querySelector('p[data-num-id]')).not.toBeNull();
    });

    it('converts when font-weight is on <p> but font-size is only on the child spans', () => {
      const html = `
        <p style="font-weight:700"><span style="font-size:20pt">Bold p, size on span</span></p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')?.textContent?.trim()).toBe('Bold p, size on span');
    });

    it('does not convert when one span is missing a font-size', () => {
      const html = `
        <p><span style="font-size:20pt;font-weight:700">A</span><span style="font-weight:700">B</span></p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1,h2,h3,h4,h5')).toBeNull();
    });

    it('does not convert when spans have inconsistent font sizes', () => {
      const html = `
        <p><span style="font-size:20pt;font-weight:700">A</span><span style="font-size:14pt;font-weight:700">B</span></p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1,h2,h3,h4,h5')).toBeNull();
    });

    it('converts when font-size is on <p> but font-weight is only on the child <span>', () => {
      const html = `
        <p style="font-size:20pt"><span style="font-weight:700">Split style heading</span></p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')?.textContent?.trim()).toBe('Split style heading');
    });

    it('converts a heading with multiple child spans (e.g. text + anchor)', () => {
      const html = `
        <p>
          <span style="font-size:20pt;font-weight:700">Heading with </span>
          <a href="#"><span style="font-size:20pt;font-weight:700">a link</span></a>
        </p>
      `;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')?.textContent?.replace(/\s+/g, ' ').trim()).toBe('Heading with a link');
      expect(dom.querySelector('p')).toBeNull();
    });

    it('preserves attributes from the original <p> on the new heading element', () => {
      const html = `<p style="font-size:20pt;font-weight:700" data-custom="yes">With attr</p>`;
      const dom = parseHeadings(html);
      expect(dom.querySelector('h1')?.getAttribute('data-custom')).toBe('yes');
    });
  });
});