Commit bc5dd2b
Author: Claude (agent)

fix(blog): correct factual errors + apply editorial + peer review feedback

Corrections from self-audit:
- Sentry: FAILED → PROVISIONAL (dep floor fix changed outcome)
- Sentry: "best code quality" → "among the highest" (Stripe scored 98.9)
- LangChain: removed "No issue/PR templates" (they have both)
- Removed unverified median/σ statistics table

Editorial review fixes:
- Added context for why unpinned deps fail (supply chain risk)
- Acknowledged cybersecurity n=4 sample size
- Added transition paragraph before enterprise advice section
- Softened Flask claim to "no documented security response process"
- Expanded DCO on first use with link
- Changed "Governance is where the signal is" → "actual risk surface"
- Fixed passive voice on MONAI section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent c2ff7a6

1 file changed: docs/blog/governance-bottleneck.md (23 additions & 27 deletions)
@@ -1,6 +1,6 @@
 # Code Quality Is a Solved Problem. Governance Isn't.
 
-*We certified 200+ open-source repositories across 23 industries. Here's what the data says.*
+*We certified 201 open-source repositories across 23 industries. Here's what the data says.*
 
 ---
 
@@ -12,67 +12,63 @@ The hypothesis was simple: popular repos have poor code quality, and that's what
 
 ## What We Actually Found
 
-Code quality across popular open-source repos is **remarkably consistent**. The median code quality score is 91.2/100. Most repos score A or B. The tools work. The linters work. Developers lint their code.
+Code quality across popular open-source repos is **remarkably consistent**. Most repos score above 85 on code quality — solidly in the A and B range. The tools work. The linters work. Developers lint their code.
 
 What varies wildly — and what determines whether an enterprise should trust a dependency — is **governance**.
 
-| Dimension | Median Score | Variance |
-|-----------|-------------|----------|
-| Code Quality | 91.2 | Low (σ = 10.3) |
-| Governance | 65.0 | **High** (σ = 18.7) |
-| Dependencies | 100.0 | Low |
-
-Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality scores cluster between 85 and 98. **Governance is where the signal is.**
+Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality clusters between 85 and 98. **Governance is the actual risk surface.**
 
 ## The Evidence
 
 ### LangChain: 95.4 code, 45 governance
 
 The most popular LLM framework in the world. Used by thousands of enterprises. Scores 95.4 on code quality — excellent by any measure. But only 45/100 on governance:
 
-- No Code of Conduct
+- No CONTRIBUTING.md
 - No SECURITY.md
-- No DCO/CLA process
-- No issue/PR templates
+- No Code of Conduct
+- No DCO ([Developer Certificate of Origin](https://developercertificate.org/)) or CLA process
 
 Arbiter certification: **PROVISIONAL**. Not because the code is bad. Because the governance infrastructure doesn't exist.
 
 ### MONAI: The Gold Standard at 98.2
 
 NVIDIA's healthcare AI toolkit scores 98.2 — the highest of all 201 repos. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO. Every box checked.
 
-This is what enterprises should require. And almost nobody does.
+This is what enterprises should require. Almost nobody does.
 
-### Sentry: 98.5 code, FAILED
+### Sentry: Near-Perfect Code, Still PROVISIONAL
 
-Sentry has the **best code quality** of any repo we tested: 98.5/100. But it **fails** certification because of 109 unpinned dependencies. The attack surface isn't the code — it's the supply chain.
+Sentry scores 98.5 on code quality — among the highest we tested. But it lands at **PROVISIONAL** because of 109 unpinned dependencies. Arbiter's dependency scoring penalizes unversioned dependency declarations because they make builds unreproducible and expand the supply chain attack surface. The risk isn't the code — it's the supply chain.
 
 ### Flask and Click: Foundational, PROVISIONAL
 
 Two of the most fundamental Python libraries. Flask powers millions of web apps. Click powers thousands of CLIs. Both score PROVISIONAL due to 45/100 governance. No Code of Conduct. No security policy. No DCO.
 
-If your enterprise depends on Flask, you're building on a library that doesn't have a documented vulnerability disclosure process.
+If your enterprise depends on Flask, you're trusting a library with no documented security response process.
 
 ## The Pattern Across 23 Categories
 
 | Category | Repos | Certification Rate |
 |----------|-------|-------------------|
 | Developer Tools | 7 | 86% CERTIFIED |
-| Fintech | 5 | 60% CERTIFIED |
-| ML Platforms | 6 | 50% CERTIFIED |
 | Healthcare | 4 | 75% CERTIFIED |
 | Web Frameworks | 6 | 67% CERTIFIED |
-| LLM Frameworks | 5 | 40% CERTIFIED |
+| Fintech | 5 | 60% CERTIFIED |
 | Databases/ORMs | 5 | 60% CERTIFIED |
-| Testing | 4 | 50% CERTIFIED |
 | Networking | 5 | 60% CERTIFIED |
+| ML Platforms | 6 | 50% CERTIFIED |
+| Testing | 4 | 50% CERTIFIED |
+| LLM Frameworks | 5 | 40% CERTIFIED |
 | Gaming | 5 | 20% CERTIFIED |
-| Cybersecurity | 4 | 0% CERTIFIED |
+| Cybersecurity Tools | 4 | 0% CERTIFIED |
 
-**Developer tools lead** (pytest, pip, tox — the people who build tools for quality also practice quality). **Gaming and cybersecurity lag** speed over process.
+**Developer tools lead**: pytest, pip, tox. The people who build tools for quality also practice quality. **Cybersecurity tools** (pwntools, nmap, sqlmap, routersploit) all landed PROVISIONAL — strong code, weak governance artifacts. With only 4 repos in the sample, this likely reflects the tooling culture's bias toward speed over process rather than an industry-wide gap.
 
 ## What This Means for Enterprise AI Adoption
 
+These aren't academic distinctions. When AI tooling fails in production, the investigation often traces back to governance gaps, not code bugs.
+
 If you're evaluating open-source AI tools for enterprise use, stop asking "is the code good?" It almost certainly is. Start asking:
 
 1. **Is there a SECURITY.md?** Can you report vulnerabilities privately?
@@ -87,13 +83,13 @@ These aren't nice-to-haves. They're the difference between a dependency you can
 
 Arbiter scores three dimensions:
 
-- **Code Quality** (50%): ruff, bandit, radon, vulture, shellcheck across Python and Shell
-- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, CoC, CI/CD, templates, DCO
-- **Dependency Health** (20%): pinning, count, known-good packages
+- **Code Quality** (50%): lint, security, complexity via ruff, bandit, radon, vulture, shellcheck across Python and Shell
+- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, CI/CD, issue/PR templates, DCO
+- **Dependency Health** (20%): version pinning, dependency count, known-good packages
 
-**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score.
+**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score. No AI in the scoring path.
 
-When code quality is unscorable (non-Python repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.
+When code quality is unscorable (e.g., a Go or Rust repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.
 
 ## Try It Yourself
 
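The weighting, reweighting, and threshold scheme described in the final hunk can be sketched in a few lines of Python. This is an illustrative sketch only — the function names are assumptions, not Arbiter's actual API, and it omits any additional gates (such as the dependency floor the commit message mentions) that can push a high-blend repo down to PROVISIONAL.

```python
# Sketch of the scoring blend from the blog post; names are hypothetical,
# not Arbiter's real implementation.

def overall_score(code_quality, governance, dependencies):
    """Weighted blend: 50% code, 30% governance, 20% dependencies.

    When code quality is unscorable (None), reweight to
    60% governance + 40% dependencies rather than penalizing.
    """
    if code_quality is None:
        return 0.60 * governance + 0.40 * dependencies
    return 0.50 * code_quality + 0.30 * governance + 0.20 * dependencies


def certification(score):
    """Deterministic thresholds: >= 80 CERTIFIED, >= 60 PROVISIONAL, else FAILED."""
    if score >= 80:
        return "CERTIFIED"
    if score >= 60:
        return "PROVISIONAL"
    return "FAILED"


# Hypothetical repo: strong code (92), weak governance (45), middling deps (70).
score = overall_score(92, 45, 70)  # 0.5*92 + 0.3*45 + 0.2*70 = 73.5
print(round(score, 1), certification(score))  # → 73.5 PROVISIONAL
```

The sketch makes the post's central point concrete: with these weights, a repo can score in the 90s on code quality and still miss CERTIFIED entirely on governance and dependency gaps.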
