Commit a24e9c0

develop (#1)

* chore(planning): initialize markdown conversion fidelity improvement plan

  12 tasks across 5 phases to fix TOC/anchor links, text extraction artifacts, and conversion fidelity issues identified by comparing DOCX-to-markdown output with live Microsoft Learn HTML.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(docx): rewrite table cell text extraction to use run-aware parsing

  Replaced flat w:t node collection with paragraph/run-aware extraction in Get-OpenSpecOpenXmlNodeText. The old approach joined all text nodes with spaces, causing mid-word artifacts (e.g., 'W EBAUTHN', '10/8/20 10', 'technica l'). The new approach walks w:p > w:r structure and delegates to ConvertFrom-OpenSpecOpenXmlRunText which correctly handles w:br, w:tab, w:cr elements.

  Tasks: TASK-001, TASK-002 | Phase: 1/5 | Progress: 2/12 (17%)

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(docx): rewrite TOC links to Section_X.Y anchors and strip page numbers

  Rewrote Add-OpenSpecSectionAnchorsFromToc to replace _Toc anchor targets with Section_X.Y in TOC links and strip trailing DOCX page numbers from labels. TOC entries now read '[1 Introduction](#Section_1)' instead of '[1 Introduction 5](#_Toc164822728)'.

  Tasks: TASK-003, TASK-004 | Phase: 2/5 | Progress: 4/12 (33%)

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(docx): remove _Toc anchor tags from heading output

  Keep _Toc bookmarks during initial conversion for Section_X.Y anchor placement, then strip all _Toc anchor tags with regex. Each heading now has only bookmark GUID + Section_X.Y anchors.

  Task: TASK-005 | Phase: 2/5 | Progress: 5/12 (42%)

  Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(planning): phase 2 complete, TASK-006 cancelled (not needed)

  All MS Open Specs headings are numbered - slug anchors for non-numbered headings not needed. Phase 2 complete: 3 tasks done, 1 cancelled. Moving to Phase 3.

  Phase: 2/5 complete | Progress: 6/12 (50%)

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(docx): prepend section numbers to heading text from TOC mapping

  Word auto-numbers headings but the number isn't in the paragraph text. Added post-processing in Add-OpenSpecSectionAnchorsFromToc to inject section numbers from the TOC map into heading lines. Headings now show '# 1 Introduction' matching the live Microsoft Learn HTML.

  Task: TASK-007 finding | Phase: 3/5 | Progress: 7/12 (58%)

  Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(docx): add inline formatting and table cell formatting support

  Add bold/italic/code detection from OpenXML run properties (w:rPr) to markdown output. Uses Unicode noncharacter placeholders for safe marker merging of adjacent same-format runs. Whitespace moved outside markers for CommonMark compliance. Bold stripped from headings.

  Table cell extraction upgraded from plain text to paragraph-aware rendering, preserving bold formatting and hyperlinks within table cells.

  Results across 41 specs: 20,258 bold pairs, 669 bold table rows, 796 linked table rows, 0 conversion errors.

  Completes TASK-009 and TASK-010. All 12 plan tasks now completed.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(tests): add anchor validation to Test-OpenSpecMarkdownFidelity

  Extended fidelity tests to validate: Section_X.Y anchors present, no _Toc anchors remain, TOC links resolve to existing anchors, numbered headings exist, bold formatting detected. Fixed CRLF regex issue in table detection. All 41 specs pass.

  Updated plan.md to reflect completed project status with all checkboxes checked.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* improve processing

* feat(docx): convert packet diagram tables to mermaid packet-beta syntax

  Detect DOCX packet layout tables by their PacketDiagramHeaderText style and convert them to mermaid packet-beta diagrams instead of wide 32-column markdown tables. Continuation rows are merged into the previous field's bit range for correct multi-row field representation.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(docx): detect additional packet diagram styles (Definition-Field, Packetdiagramheaderrow)

  Extend packet diagram detection to match Packetdiagramheaderrow and Definition-Field/Definition-Field2 styles in addition to PacketDiagramHeaderText. This catches 230 additional packet diagrams across the RDP specs.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: rename output to <name>.md and add root index generation

  Change per-spec output filename from index.md to <ProtocolId>.md for unique editor tab names. Update cross-document link generation to match. Add Update-OpenSpecIndex command that generates a README.md catalog of all converted specs with titles and links.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cleanup

* Add convert-and-publish workflow and Prepare-Publish script

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
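
The TOC link rewrite described in the commit message ('[1 Introduction 5](#_Toc164822728)' becoming '[1 Introduction](#Section_1)') can be illustrated with a small sketch. This is not the code from Add-OpenSpecSectionAnchorsFromToc, whose diff is not rendered on this page; the function name `Convert-TocLineSketch` is hypothetical, and the sketch assumes every TOC label starts with its section number and ends with a page number.

```powershell
# Hypothetical helper, for illustration only (not part of the module).
# Assumption: label = "<section number> <title> <page number>".
function Convert-TocLineSketch {
    param(
        [Parameter(Mandatory)]
        [string]$Line
    )

    # num   : dotted section number at the start of the label
    # title : lazy match, so the trailing page number is left out of it
    $pattern = '\[(?<num>\d+(?:\.\d+)*)\s+(?<title>.+?)\s+\d+\]\(#_Toc\d+\)'

    [regex]::Replace($Line, $pattern, {
        param($m)
        $num = $m.Groups['num'].Value
        $title = $m.Groups['title'].Value
        # Rebuild the link without the page number, targeting Section_<num>.
        "[$num $title](#Section_$num)"
    })
}

Convert-TocLineSketch -Line '[1 Introduction 5](#_Toc164822728)'
# -> [1 Introduction](#Section_1)
```

Deriving the anchor from the label only works when the label carries the number; the commit message indicates the real implementation keeps a TOC map instead.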
1 parent cf6e4eb commit a24e9c0

9 files changed

Lines changed: 875 additions & 45 deletions
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# Convert all Windows protocol specs to markdown, build a clean publish tree,
+# then force-push it to an orphaned 'publish' branch (e.g. for GitHub Pages).
+name: Convert and publish
+
+on:
+  workflow_dispatch:
+
+jobs:
+  convert-and-publish:
+    runs-on: windows-latest
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install OpenXML module
+        shell: pwsh
+        run: |
+          Set-PSRepository -Name PSGallery -InstallationPolicy Trusted
+          Install-Module -Name OpenXML -Force -Scope CurrentUser
+
+      - name: Import module and convert all specs
+        shell: pwsh
+        run: |
+          Import-Module .\AwakeCoding.OpenSpecs -Force
+          Get-OpenSpecCatalog |
+            Save-OpenSpecDocument -Format DOCX -OutputPath ./downloads-convert -Force |
+            Where-Object { $_.Status -in 'Downloaded', 'Exists' } |
+            Convert-OpenSpecToMarkdown -OutputPath ./converted-specs -Force
+
+      - name: Build publish directory and index
+        shell: pwsh
+        run: |
+          Import-Module .\AwakeCoding.OpenSpecs -Force
+          .\scripts\Prepare-Publish.ps1 -ConvertedSpecsPath ./converted-specs -PublishPath ./publish
+          Update-OpenSpecIndex -Path ./publish
+
+      - name: Zip publish contents
+        shell: pwsh
+        run: |
+          Compress-Archive -Path ./publish/* -DestinationPath ./publish.zip -Force
+
+      - name: Upload publish artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: publish
+          path: publish.zip
+
+      - name: Push to orphaned publish branch
+        shell: pwsh
+        working-directory: publish
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          $RemoteRepo = "https://${Env:GITHUB_ACTOR}:${Env:GITHUB_TOKEN}@github.com/${Env:GITHUB_REPOSITORY}.git"
+          git init
+          git config user.name "GitHub Actions"
+          git config user.email "github-actions-bot@users.noreply.github.com"
+          git add .
+          git commit -m "Publish converted Open Specs markdown (${Env:GITHUB_REPOSITORY})"
+          git push --force "${RemoteRepo}" "HEAD:publish"

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -1,2 +1,4 @@
 artifacts/
-downloads/
+downloads*/
+converted*/
+reports*/

AwakeCoding.OpenSpecs/AwakeCoding.OpenSpecs.psd1

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,8 @@
         'Convert-OpenSpecToMarkdown',
         'Invoke-OpenSpecConversionPipeline',
         'Get-OpenSpecConversionReport',
-        'Test-OpenSpecMarkdownFidelity'
+        'Test-OpenSpecMarkdownFidelity',
+        'Update-OpenSpecIndex'
     )
     CmdletsToExport = @()
     VariablesToExport = @()

AwakeCoding.OpenSpecs/Private/ConvertFrom-OpenSpecDocx.ps1

Lines changed: 556 additions & 35 deletions
Large diffs are not rendered by default.

AwakeCoding.OpenSpecs/Private/Invoke-OpenSpecMarkdownCleanup.ps1

Lines changed: 98 additions & 4 deletions
@@ -31,6 +31,10 @@ function Invoke-OpenSpecMarkdownCleanup {
     $result = $tocResult.Markdown
     foreach ($issue in $tocResult.Issues) { [void]$issues.Add($issue) }

+    $guidResult = Resolve-OpenSpecGuidSectionAnchors -Markdown $result
+    $result = $guidResult.Markdown
+    foreach ($issue in $guidResult.Issues) { [void]$issues.Add($issue) }
+
     $mathResult = ConvertTo-OpenSpecNormalizedMathText -Markdown $result
     $result = $mathResult.Markdown
     foreach ($issue in $mathResult.Issues) { [void]$issues.Add($issue) }
@@ -672,7 +676,7 @@ function ConvertTo-OpenSpecInternalLinks {
                 if ($frag) { $frag } else { '#' }
             }
             else {
-                "../$id/index.md$frag"
+                "../$id/$id.md$frag"
             }
             $rewriteCount++
             if ($rewriteSamples.Count -lt $sampleCap) {
@@ -817,6 +821,96 @@ function ConvertTo-OpenSpecNormalizedMathText {
     }
 }

+function Resolve-OpenSpecGuidSectionAnchors {
+    [CmdletBinding()]
+    param(
+        [Parameter(Mandatory)]
+        [string]$Markdown
+    )
+
+    $issues = New-Object System.Collections.Generic.List[object]
+    $result = $Markdown
+    $rewriteCount = 0
+
+    # Build a mapping from GUID-based anchors to human-readable Section_X.Y.Z
+    # anchors. In the converted markdown, each heading is preceded by a pair of
+    # anchor tags:
+    #   <a id="section_<GUID>"></a>
+    #   <a id="Section_X.Y.Z"></a>
+    # Cross-reference links in the body text reference sections using the GUID
+    # form (#Section_<GUID> or #section_<GUID>), which is both unreadable and
+    # may not resolve due to a case mismatch (the bookmark anchor uses
+    # lowercase "section_" while the hyperlink uses "Section_"). Replacing
+    # these with the Section_X.Y.Z form fixes both issues.
+    $guidToSection = @{}
+
+    # Order 1: GUID anchor followed by Section anchor (most common)
+    $pairRegex1 = [regex]::new(
+        '<a\s+id="section_(?<guid>[0-9a-f]{32})"></a>\s*\r?\n<a\s+id="(?<section>Section_\d+(?:\.\d+)*)"></a>',
+        [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
+    )
+    foreach ($m in $pairRegex1.Matches($result)) {
+        $guid = $m.Groups['guid'].Value.ToLowerInvariant()
+        if (-not $guidToSection.ContainsKey($guid)) {
+            $guidToSection[$guid] = $m.Groups['section'].Value
+        }
+    }
+
+    # Order 2: Section anchor followed by GUID anchor (fallback)
+    $pairRegex2 = [regex]::new(
+        '<a\s+id="(?<section>Section_\d+(?:\.\d+)*)"></a>\s*\r?\n<a\s+id="section_(?<guid>[0-9a-f]{32})"></a>',
+        [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
+    )
+    foreach ($m in $pairRegex2.Matches($result)) {
+        $guid = $m.Groups['guid'].Value.ToLowerInvariant()
+        if (-not $guidToSection.ContainsKey($guid)) {
+            $guidToSection[$guid] = $m.Groups['section'].Value
+        }
+    }
+
+    if ($guidToSection.Count -eq 0) {
+        return [pscustomobject]@{
+            Markdown = $result
+            Issues = $issues.ToArray()
+        }
+    }
+
+    # Rewrite all link targets that reference GUID-based section anchors.
+    # Matches both (#Section_GUID) and (#section_GUID) forms.
+    $rewriteCounter = @{ Value = 0 }
+    $result = [regex]::Replace(
+        $result,
+        '\(#[Ss]ection_(?<guid>[0-9a-f]{32})\)',
+        {
+            param($m)
+            $guid = $m.Groups['guid'].Value.ToLowerInvariant()
+            if ($guidToSection.ContainsKey($guid)) {
+                $rewriteCounter.Value++
+                "(#$($guidToSection[$guid]))"
+            }
+            else {
+                $m.Value
+            }
+        }
+    )
+    $rewriteCount = $rewriteCounter.Value
+
+    if ($rewriteCount -gt 0) {
+        [void]$issues.Add([pscustomobject]@{
+            Type = 'GuidAnchorResolved'
+            Severity = 'Info'
+            Count = $rewriteCount
+            MappedAnchors = $guidToSection.Count
+            Reason = 'GUID-based section anchors were resolved to section number anchors.'
+        })
+    }
+
+    [pscustomobject]@{
+        Markdown = $result
+        Issues = $issues.ToArray()
+    }
+}
+
 function Resolve-OpenSpecLinkTarget {
     [CmdletBinding()]
     param(
@@ -841,7 +935,7 @@ function Resolve-OpenSpecLinkTarget {
             return [pscustomobject]@{ Url = if ($fragment) { $fragment } else { '#' }; Rewritten = $true }
         }

-        return [pscustomobject]@{ Url = "../$targetId/index.md$fragment"; Rewritten = $true }
+        return [pscustomobject]@{ Url = "../$targetId/$targetId.md$fragment"; Rewritten = $true }
     }

     if ($decoded -match '(?i)(?:https?://learn\.microsoft\.com)?/?openspecs/windows_protocols/(?<slug>(?:ms|mc)-[a-z0-9-]+)(?:/[^#?]+)?') {
@@ -850,7 +944,7 @@ function Resolve-OpenSpecLinkTarget {
             return [pscustomobject]@{ Url = if ($fragment) { $fragment } else { '#' }; Rewritten = $true }
         }

-        return [pscustomobject]@{ Url = "../$targetId/index.md$fragment"; Rewritten = $true }
+        return [pscustomobject]@{ Url = "../$targetId/$targetId.md$fragment"; Rewritten = $true }
     }

     if ($decoded -match '(?i)%5b(?<id>(?:MS|MC)-[A-Z0-9-]+)%5d\.(?:pdf|docx)$') {
@@ -859,7 +953,7 @@ function Resolve-OpenSpecLinkTarget {
             return [pscustomobject]@{ Url = if ($fragment) { $fragment } else { '#' }; Rewritten = $true }
         }

-        return [pscustomobject]@{ Url = "../$targetId/index.md$fragment"; Rewritten = $true }
+        return [pscustomobject]@{ Url = "../$targetId/$targetId.md$fragment"; Rewritten = $true }
     }

     return [pscustomobject]@{ Url = $Url; Rewritten = $false }
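
To make the effect of Resolve-OpenSpecGuidSectionAnchors easier to see, here is a condensed, self-contained sketch of the same two steps (collect GUID-to-section pairs, then rewrite link targets) applied to a tiny sample. The sample markdown and the GUID are made up for illustration, only the "GUID anchor first" ordering is handled, and the private function itself is not called.

```powershell
# Made-up sample: a heading with its anchor pair, followed by a
# cross-reference that uses the GUID form with a capital "S".
$guid = '0123456789abcdef0123456789abcdef'
$markdown = @(
    "<a id=`"section_$guid`"></a>"
    '<a id="Section_2.2.1"></a>'
    '## 2.2.1 Message Syntax'
    ''
    "See [Message Syntax](#Section_$guid) for details."
) -join "`n"

# Step 1: map each GUID to the Section_X.Y.Z anchor on the next line.
$map = @{}
$pairPattern = '<a\s+id="section_(?<guid>[0-9a-f]{32})"></a>\s*\r?\n<a\s+id="(?<section>Section_\d+(?:\.\d+)*)"></a>'
$options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
foreach ($m in [regex]::Matches($markdown, $pairPattern, $options)) {
    $map[$m.Groups['guid'].Value.ToLowerInvariant()] = $m.Groups['section'].Value
}

# Step 2: rewrite (#Section_<GUID>) / (#section_<GUID>) link targets.
# Links whose GUID is not in the map are left unchanged.
$resolved = [regex]::Replace($markdown, '\(#[Ss]ection_(?<guid>[0-9a-f]{32})\)', {
    param($m)
    $key = $m.Groups['guid'].Value.ToLowerInvariant()
    if ($map.ContainsKey($key)) { "(#$($map[$key]))" } else { $m.Value }
})

($resolved -split "`n")[-1]
# -> See [Message Syntax](#Section_2.2.1) for details.
```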

AwakeCoding.OpenSpecs/Public/Convert-OpenSpecToMarkdown.ps1

Lines changed: 3 additions & 2 deletions
@@ -89,7 +89,7 @@ function Convert-OpenSpecToMarkdown {
                 [void](New-Item -Path $artifactDirectory -ItemType Directory -Force)
             }

-            $markdownPath = Join-Path -Path $specDirectory -ChildPath 'index.md'
+            $markdownPath = Join-Path -Path $specDirectory -ChildPath "$safeProtocol.md"
             if ((Test-Path -LiteralPath $markdownPath) -and -not $Force) {
                 [pscustomobject]@{
                     PSTypeName = 'AwakeCoding.OpenSpecs.ConversionResult'
@@ -112,7 +112,8 @@ function Convert-OpenSpecToMarkdown {
             if ($resolvedFormat -eq 'DOCX') {
                 $toolchain = Get-OpenSpecToolchain -RequireDocxConverter
                 $rawMarkdownPath = Join-Path -Path $artifactDirectory -ChildPath 'raw-docx.md'
-                $conversionStep = ConvertFrom-OpenSpecDocx -InputPath $sourcePath -OutputPath $rawMarkdownPath -Toolchain $toolchain
+                $mediaDirectory = Join-Path -Path $specDirectory -ChildPath 'media'
+                $conversionStep = ConvertFrom-OpenSpecDocx -InputPath $sourcePath -OutputPath $rawMarkdownPath -Toolchain $toolchain -MediaOutputDirectory $mediaDirectory
             }
             elseif ($resolvedFormat -eq 'PDF') {
                 $toolchain = Get-OpenSpecToolchain -RequirePdfConverter

AwakeCoding.OpenSpecs/Public/Test-OpenSpecMarkdownFidelity.ps1

Lines changed: 39 additions & 2 deletions
@@ -15,10 +15,39 @@ function Test-OpenSpecMarkdownFidelity {
     }

     [bool]$hasHeadings = $markdown -match '(?m)^#'
-    [bool]$hasTables = $markdown -match '(?m)^\|.+\|$'
+    [bool]$hasTables = $markdown -match '(?m)^\|.+\|\r?$'
     [bool]$hasNormative = $markdown -match '\b(MUST|SHOULD|MAY|REQUIRED|OPTIONAL)\b'

-    $pass = $hasHeadings -and $hasTables
+    # Anchor validation: check that TOC links resolve and anchors are correct
+    $sectionAnchors = [regex]::Matches($markdown, '<a id="Section_[^"]+"></a>')
+    $tocAnchors = [regex]::Matches($markdown, '<a id="_Toc\d+"></a>')
+    $tocLinks = [regex]::Matches($markdown, '\]\(#Section_[^)]+\)')
+    $numberedHeadings = [regex]::Matches($markdown, '(?m)^#{1,6} \d+')
+    $boldPairs = [int]([regex]::Matches($markdown, '\*\*').Count / 2)
+
+    [bool]$hasSectionAnchors = $sectionAnchors.Count -gt 0
+    [bool]$noTocAnchors = $tocAnchors.Count -eq 0
+    [bool]$hasTocLinks = $tocLinks.Count -gt 0
+    [bool]$hasNumberedHeadings = $numberedHeadings.Count -gt 0
+
+    # Validate that TOC links resolve to existing anchors
+    $anchorIds = [System.Collections.Generic.HashSet[string]]::new(
+        [System.StringComparer]::OrdinalIgnoreCase
+    )
+    foreach ($m in [regex]::Matches($markdown, '<a id="([^"]+)"></a>')) {
+        [void]$anchorIds.Add($m.Groups[1].Value)
+    }
+
+    $unresolvedLinks = 0
+    foreach ($m in [regex]::Matches($markdown, '\]\(#([^)]+)\)')) {
+        $target = $m.Groups[1].Value
+        if (-not $anchorIds.Contains($target)) {
+            $unresolvedLinks++
+        }
+    }
+
+    $pass = $hasHeadings -and $hasTables -and $hasSectionAnchors -and
+        $noTocAnchors -and $hasTocLinks -and $hasNumberedHeadings

     [pscustomobject]@{
         PSTypeName = 'AwakeCoding.OpenSpecs.FidelityResult'
@@ -27,6 +56,14 @@ function Test-OpenSpecMarkdownFidelity {
         HasHeadings = $hasHeadings
         HasTables = $hasTables
         HasNormativeKeywords = $hasNormative
+        HasSectionAnchors = $hasSectionAnchors
+        SectionAnchorCount = $sectionAnchors.Count
+        NoTocAnchors = $noTocAnchors
+        TocAnchorCount = $tocAnchors.Count
+        TocLinkCount = $tocLinks.Count
+        NumberedHeadingCount = $numberedHeadings.Count
+        BoldPairCount = $boldPairs
+        UnresolvedLinkCount = $unresolvedLinks
         IssueCount = $report.IssueCount
         MarkdownPath = $report.MarkdownPath
         ReportPath = $report.ReportPath
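
The unresolved-link check added to Test-OpenSpecMarkdownFidelity can be tried in isolation. The sketch below reuses the two regexes from the diff on a made-up sample and lists the unresolved targets instead of only counting them; the sample text is an assumption for illustration.

```powershell
# Made-up sample: one anchor, two in-document links.
$markdown = @'
<a id="Section_1"></a>
# 1 Introduction

See [Overview](#Section_1.1) and [Introduction](#section_1).
'@

# Collect every anchor id, compared case-insensitively as in the diff.
$anchorIds = [System.Collections.Generic.HashSet[string]]::new(
    [System.StringComparer]::OrdinalIgnoreCase
)
foreach ($m in [regex]::Matches($markdown, '<a id="([^"]+)"></a>')) {
    [void]$anchorIds.Add($m.Groups[1].Value)
}

# Emit each fragment link whose target has no matching anchor.
$unresolved = foreach ($m in [regex]::Matches($markdown, '\]\(#([^)]+)\)')) {
    $target = $m.Groups[1].Value
    if (-not $anchorIds.Contains($target)) { $target }
}

$unresolved
# -> Section_1.1
```

Because the set is case-insensitive, `#section_1` counts as resolved here even though HTML fragment identifiers are matched case-sensitively, so this check is more lenient than a renderer would be. In the diff, UnresolvedLinkCount is reported but does not feed into `$pass`.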
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+function Update-OpenSpecIndex {
+    [CmdletBinding()]
+    param(
+        [Parameter(Mandatory)]
+        [string]$Path
+    )
+
+    if (-not (Test-Path -LiteralPath $Path)) {
+        throw "Output directory not found: $Path"
+    }
+
+    $specDirs = Get-ChildItem -LiteralPath $Path -Directory | Sort-Object Name
+    $entries = New-Object System.Collections.Generic.List[pscustomobject]
+
+    foreach ($dir in $specDirs) {
+        $specName = $dir.Name
+        $mdFile = Join-Path -Path $dir.FullName -ChildPath "$specName.md"
+
+        # Fall back to index.md for specs not yet reconverted.
+        if (-not (Test-Path -LiteralPath $mdFile)) {
+            $mdFile = Join-Path -Path $dir.FullName -ChildPath 'index.md'
+        }
+
+        if (-not (Test-Path -LiteralPath $mdFile)) {
+            continue
+        }
+
+        $mdFileName = [System.IO.Path]::GetFileName($mdFile)
+
+        # Extract the title from line 3 of the markdown.
+        # Expected format:
+        #   Line 1: **[MS-RDPECLIP]:**
+        #   Line 2: (blank)
+        #   Line 3: **Remote Desktop Protocol: Clipboard Virtual Channel Extension**
+        $lines = Get-Content -LiteralPath $mdFile -TotalCount 5
+        $title = ''
+        if ($lines.Count -ge 3) {
+            $rawTitle = $lines[2]
+            # Strip surrounding bold markers (**...**)
+            $title = $rawTitle -replace '^\*\*(.+)\*\*$', '$1'
+            $title = $title.Trim()
+        }
+
+        if ([string]::IsNullOrWhiteSpace($title)) {
+            $title = $specName
+        }
+
+        [void]$entries.Add([pscustomobject]@{
+            Name = $specName
+            Title = $title
+            Link = "$specName/$mdFileName"
+        })
+    }
+
+    $sb = New-Object System.Text.StringBuilder
+    [void]$sb.AppendLine('# Microsoft Open Specifications')
+    [void]$sb.AppendLine()
+    [void]$sb.AppendLine("$($entries.Count) protocol specifications converted to Markdown.")
+    [void]$sb.AppendLine()
+    [void]$sb.AppendLine('| Protocol | Title |')
+    [void]$sb.AppendLine('|---|---|')
+
+    foreach ($entry in $entries) {
+        [void]$sb.AppendLine("| [$($entry.Name)]($($entry.Link)) | $($entry.Title) |")
+    }
+
+    $readmePath = Join-Path -Path $Path -ChildPath 'README.md'
+    $sb.ToString() | Set-Content -LiteralPath $readmePath -Encoding UTF8
+
+    Write-Verbose "Generated index at $readmePath with $($entries.Count) entries."
+
+    [pscustomobject]@{
+        PSTypeName = 'AwakeCoding.OpenSpecs.IndexResult'
+        Path = $readmePath
+        EntryCount = $entries.Count
+    }
+}
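
The title extraction in Update-OpenSpecIndex depends on the converted markdown having the title on line 3. A minimal sketch of that logic, isolated so it can be run without any files on disk, might look like this; `Get-SpecTitleSketch` and its parameters are hypothetical names, not part of the module.

```powershell
# Hypothetical helper mirroring the title logic from the diff above.
function Get-SpecTitleSketch {
    param(
        [string[]]$Lines,     # first few lines of the converted markdown
        [string]$Fallback     # used when no title can be extracted
    )

    $title = ''
    if ($Lines.Count -ge 3) {
        # Line 3 is expected to be "**<title>**"; strip the bold markers.
        $title = ($Lines[2] -replace '^\*\*(.+)\*\*$', '$1').Trim()
    }

    if ([string]::IsNullOrWhiteSpace($title)) { $Fallback } else { $title }
}

$sample = @(
    '**[MS-RDPECLIP]:**'
    ''
    '**Remote Desktop Protocol: Clipboard Virtual Channel Extension**'
)

Get-SpecTitleSketch -Lines $sample -Fallback 'MS-RDPECLIP'
# -> Remote Desktop Protocol: Clipboard Virtual Channel Extension

Get-SpecTitleSketch -Lines @('only one line') -Fallback 'MS-XYZ'
# -> MS-XYZ
```

A title containing a `|` character would split the generated table row; nothing on this page suggests that any spec title does, so treat this as an edge case to keep in mind rather than a known defect.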
