Commit bc5dd2b
Author: Claude (agent)

fix(blog): correct factual errors + apply editorial + peer review feedback

Corrections from self-audit:
- Sentry: FAILED → PROVISIONAL (dep floor fix changed outcome)
- Sentry: "best code quality" → "among the highest" (Stripe scored 98.9)
- LangChain: removed "No issue/PR templates" (they have both)
- Removed unverified median/σ statistics table

Editorial review fixes:
- Added context for why unpinned deps fail (supply chain risk)
- Acknowledged cybersecurity n=4 sample size
- Added transition paragraph before enterprise advice section
- Softened Flask claim to "no documented security response process"
- Expanded DCO on first use with link
- Changed "Governance is where the signal is" → "actual risk surface"
- Fixed passive voice on MONAI section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent c2ff7a6

1 file changed: docs/blog/governance-bottleneck.md (23 additions & 27 deletions)
@@ -1,6 +1,6 @@
 # Code Quality Is a Solved Problem. Governance Isn't.
 
-*We certified 200+ open-source repositories across 23 industries. Here's what the data says.*
+*We certified 201 open-source repositories across 23 industries. Here's what the data says.*
 
 ---
 
@@ -12,67 +12,63 @@ The hypothesis was simple: popular repos have poor code quality, and that's what
 
 ## What We Actually Found
 
-Code quality across popular open-source repos is **remarkably consistent**. The median code quality score is 91.2/100. Most repos score A or B. The tools work. The linters work. Developers lint their code.
+Code quality across popular open-source repos is **remarkably consistent**. Most repos score above 85 on code quality — solidly in the A and B range. The tools work. The linters work. Developers lint their code.
 
 What varies wildly — and what determines whether an enterprise should trust a dependency — is **governance**.
 
-| Dimension | Median Score | Variance |
-|-----------|-------------|----------|
-| Code Quality | 91.2 | Low (σ = 10.3) |
-| Governance | 65.0 | **High** (σ = 18.7) |
-| Dependencies | 100.0 | Low |
-
-Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality scores cluster between 85 and 98. **Governance is where the signal is.**
+Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality clusters between 85 and 98. **Governance is the actual risk surface.**
 
 ## The Evidence
 
 ### LangChain: 95.4 code, 45 governance
 
 The most popular LLM framework in the world. Used by thousands of enterprises. Scores 95.4 on code quality — excellent by any measure. But only 45/100 on governance:
 
-- No Code of Conduct
+- No CONTRIBUTING.md
 - No SECURITY.md
-- No DCO/CLA process
-- No issue/PR templates
+- No Code of Conduct
+- No DCO ([Developer Certificate of Origin](https://developercertificate.org/)) or CLA process
 
 Arbiter certification: **PROVISIONAL**. Not because the code is bad. Because the governance infrastructure doesn't exist.
 
 ### MONAI: The Gold Standard at 98.2
 
 NVIDIA's healthcare AI toolkit scores 98.2 — the highest of all 201 repos. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO. Every box checked.
 
-This is what enterprises should require. And almost nobody does.
+This is what enterprises should require. Almost nobody does.
 
-### Sentry: 98.5 code, FAILED
+### Sentry: Near-Perfect Code, Still PROVISIONAL
 
-Sentry has the **best code quality** of any repo we tested: 98.5/100. But it **fails** certification because of 109 unpinned dependencies. The attack surface isn't the code — it's the supply chain.
+Sentry scores 98.5 on code quality — among the highest we tested. But it lands at **PROVISIONAL** because of 109 unpinned dependencies. Arbiter's dependency scoring penalizes unversioned dependency declarations because they make builds unreproducible and expand the supply chain attack surface. The risk isn't the code — it's the supply chain.
 
 ### Flask and Click: Foundational, PROVISIONAL
 
 Two of the most fundamental Python libraries. Flask powers millions of web apps. Click powers thousands of CLIs. Both score PROVISIONAL due to 45/100 governance. No Code of Conduct. No security policy. No DCO.
 
-If your enterprise depends on Flask, you're building on a library that doesn't have a documented vulnerability disclosure process.
+If your enterprise depends on Flask, you're trusting a library with no documented security response process.
 
 ## The Pattern Across 23 Categories
 
 | Category | Repos | Certification Rate |
 |----------|-------|-------------------|
 | Developer Tools | 7 | 86% CERTIFIED |
-| Fintech | 5 | 60% CERTIFIED |
-| ML Platforms | 6 | 50% CERTIFIED |
 | Healthcare | 4 | 75% CERTIFIED |
 | Web Frameworks | 6 | 67% CERTIFIED |
-| LLM Frameworks | 5 | 40% CERTIFIED |
+| Fintech | 5 | 60% CERTIFIED |
 | Databases/ORMs | 5 | 60% CERTIFIED |
-| Testing | 4 | 50% CERTIFIED |
 | Networking | 5 | 60% CERTIFIED |
+| ML Platforms | 6 | 50% CERTIFIED |
+| Testing | 4 | 50% CERTIFIED |
+| LLM Frameworks | 5 | 40% CERTIFIED |
 | Gaming | 5 | 20% CERTIFIED |
-| Cybersecurity | 4 | 0% CERTIFIED |
+| Cybersecurity Tools | 4 | 0% CERTIFIED |
 
-**Developer tools lead** (pytest, pip, tox — the people who build tools for quality also practice quality). **Gaming and cybersecurity lag** speed over process.
+**Developer tools lead**: pytest, pip, tox. The people who build tools for quality also practice quality. **Cybersecurity tools** (pwntools, nmap, sqlmap, routersploit) all landed PROVISIONAL — strong code, weak governance artifacts. With only 4 repos in the sample, this likely reflects the tooling culture's bias toward speed over process rather than an industry-wide gap.
 
 ## What This Means for Enterprise AI Adoption
 
+These aren't academic distinctions. When AI tooling fails in production, the investigation often traces back to governance gaps, not code bugs.
+
 If you're evaluating open-source AI tools for enterprise use, stop asking "is the code good?" It almost certainly is. Start asking:
 
 1. **Is there a SECURITY.md?** Can you report vulnerabilities privately?
@@ -87,13 +83,13 @@ These aren't nice-to-haves. They're the difference between a dependency you can
 
 Arbiter scores three dimensions:
 
-- **Code Quality** (50%): ruff, bandit, radon, vulture, shellcheck across Python and Shell
-- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, CoC, CI/CD, templates, DCO
-- **Dependency Health** (20%): pinning, count, known-good packages
+- **Code Quality** (50%): lint, security, complexity via ruff, bandit, radon, vulture, shellcheck across Python and Shell
+- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, CI/CD, issue/PR templates, DCO
+- **Dependency Health** (20%): version pinning, dependency count, known-good packages
 
-**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score.
+**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score. No AI in the scoring path.
 
-When code quality is unscorable (non-Python repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.
+When code quality is unscorable (e.g., a Go or Rust repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.
 
 ## Try It Yourself
 
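The weighting, reweighting, and threshold scheme described in the final hunk can be sketched in a few lines of Python. This is an illustrative sketch only — the function names are assumptions, not Arbiter's actual API, and it omits any additional gates (such as the dependency floor the commit message mentions) that can push a high-blend repo down to PROVISIONAL.

```python
# Sketch of the scoring blend from the blog post; names are hypothetical,
# not Arbiter's real implementation.

def overall_score(code_quality, governance, dependencies):
    """Weighted blend: 50% code, 30% governance, 20% dependencies.

    When code quality is unscorable (None), reweight to
    60% governance + 40% dependencies rather than penalizing.
    """
    if code_quality is None:
        return 0.60 * governance + 0.40 * dependencies
    return 0.50 * code_quality + 0.30 * governance + 0.20 * dependencies


def certification(score):
    """Deterministic thresholds: >= 80 CERTIFIED, >= 60 PROVISIONAL, else FAILED."""
    if score >= 80:
        return "CERTIFIED"
    if score >= 60:
        return "PROVISIONAL"
    return "FAILED"


# Hypothetical repo: strong code (92), weak governance (45), middling deps (70).
score = overall_score(92, 45, 70)  # 0.5*92 + 0.3*45 + 0.2*70 = 73.5
print(round(score, 1), certification(score))  # → 73.5 PROVISIONAL
```

The sketch makes the post's central point concrete: with these weights, a repo can score in the 90s on code quality and still miss CERTIFIED entirely on governance and dependency gaps.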
