Commit c2ff7a6

hummbl-dev, Claude (agent), and claude authored
docs: update blog + cert report to 201 repos across 23 categories (#64)
Rewritten blog: "Code Quality Is a Solved Problem. Governance Isn't."

- Stronger title, evidence-led structure
- 201 repos, 23 categories (up from 170/20)
- Key data: median code quality 91.2, median governance 65.0
- LangChain, MONAI, Sentry, Flask as case studies
- Category certification rates table
- Updated certification report to match

Co-authored-by: Claude (agent) <claude@agents.hummbl.io>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent be15ebe commit c2ff7a6

2 files changed

Lines changed: 89 additions & 42 deletions

docs/CERTIFICATION_REPORT.md

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
-# Arbiter Certification Report — 170+ Repos Across 20 Categories
+# Arbiter Certification Report — 201 Repos Across 23 Categories

 *Generated 2026-04-19 by HUMMBL Arbiter v0.6.0*

 ## Executive Summary

-We scored and certified **170+ open-source repositories** across 20 industry categories using Arbiter's deterministic quality scoring engine. The data reveals a consistent pattern:
+We scored and certified **201 open-source repositories** across 20 industry categories using Arbiter's deterministic quality scoring engine. The data reveals a consistent pattern:

 **Code quality is NOT the bottleneck. Governance is.**

@@ -109,7 +109,7 @@ Popular repos consistently score 85+ on code quality. What separates CERTIFIED f

 ### 1. Governance is the differentiator

-Across 170+ repos, code quality is consistently high (85+). The factor that separates CERTIFIED from PROVISIONAL is governance maturity — the exact dimension enterprises care about and the exact gap HUMMBL fills.
+Across 201 repos, code quality is consistently high (85+). The factor that separates CERTIFIED from PROVISIONAL is governance maturity — the exact dimension enterprises care about and the exact gap HUMMBL fills.

 ### 2. The governance gap is universal

docs/blog/governance-bottleneck.md

Lines changed: 86 additions & 39 deletions
@@ -1,69 +1,116 @@
-# We Certified 170+ Repos — Governance Is the Bottleneck, Not Code Quality
+# Code Quality Is a Solved Problem. Governance Isn't.

-*By Reuben Bowlby | HUMMBL | April 2026*
+*We certified 200+ open-source repositories across 23 industries. Here's what the data says.*

-We built [Arbiter](https://github.com/hummbl-dev/arbiter), a deterministic code quality scoring tool, and used it to certify **173 open-source repositories** across 20 industry categories. The results surprised us.
+---
+
+We built [Arbiter](https://github.com/hummbl-dev/arbiter) — a deterministic code quality scoring tool — and ran it against **201 open-source repositories** spanning AI governance, LLM frameworks, ML platforms, healthcare, fintech, developer tools, databases, testing, networking, media processing, and 13 other categories.
+
+The hypothesis was simple: popular repos have poor code quality, and that's what holds back enterprise adoption.
+
+**The hypothesis was wrong.**
+
+## What We Actually Found
+
+Code quality across popular open-source repos is **remarkably consistent**. The median code quality score is 91.2/100. Most repos score A or B. The tools work. The linters work. Developers lint their code.
+
+What varies wildly — and what determines whether an enterprise should trust a dependency — is **governance**.
+
+| Dimension | Median Score | Variance |
+|-----------|-------------|----------|
+| Code Quality | 91.2 | Low (σ = 10.3) |
+| Governance | 65.0 | **High** (σ = 18.7) |
+| Dependencies | 100.0 | Low |
+
+Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality scores cluster between 85 and 98. **Governance is where the signal is.**
+
+## The Evidence

-## The Finding
+### LangChain: 95.4 code, 45 governance

-**Code quality is not the bottleneck. Governance is.**
+The most popular LLM framework in the world. Used by thousands of enterprises. Scores 95.4 on code quality — excellent by any measure. But only 45/100 on governance:

-Popular open-source repos consistently score 85+ on code quality. What separates CERTIFIED from PROVISIONAL is governance maturity: CONTRIBUTING.md, SECURITY.md, Code of Conduct, DCO/CLA processes, and CI/CD configuration.
+- No Code of Conduct
+- No SECURITY.md
+- No DCO/CLA process
+- No issue/PR templates

-## The Data
+Arbiter certification: **PROVISIONAL**. Not because the code is bad. Because the governance infrastructure doesn't exist.

-### LLM Frameworks — HUMMBL's Target Market
+### MONAI: The Gold Standard at 98.2

-| Framework | Code Quality | Governance | Certification |
-|-----------|-------------|-----------|---------------|
-| LlamaIndex | 96.4 | 90/100 | **CERTIFIED** |
-| Instructor | 93.4 | 65/100 | CERTIFIED |
-| **LangChain** | **95.4** | **45/100** | **PROVISIONAL** |
-| Guidance | 90.7 | 55/100 | PROVISIONAL |
-| Outlines | 89.9 | 45/100 | PROVISIONAL |
+NVIDIA's healthcare AI toolkit scores 98.2 — the highest of all 201 repos. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO. Every box checked.

-LangChain — the most popular LLM framework in the world — scores 95.4 on code quality but only 45 on governance. That's a D grade on the dimension enterprises care about most.
+This is what enterprises should require. And almost nobody does.

-### The Gold Standard
+### Sentry: 98.5 code, FAILED

-Project MONAI (NVIDIA's healthcare AI toolkit) scored **98.2** — the highest of any repo we tested. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO — everything. That's what CERTIFIED looks like.
+Sentry has the **best code quality** of any repo we tested: 98.5/100. But it **fails** certification because of 109 unpinned dependencies. The attack surface isn't the code — it's the supply chain.

-### The Surprise Failures
+### Flask and Click: Foundational, PROVISIONAL

-- **Sentry** — 98.5 code quality (best we tested), but **FAILED** certification due to 109 unpinned dependencies
-- **Prefect** — 97.8 code quality, but FAILED due to dependency governance
-- **Flask** — foundational Python library, PROVISIONAL due to 45/100 governance
+Two of the most fundamental Python libraries. Flask powers millions of web apps. Click powers thousands of CLIs. Both score PROVISIONAL due to 45/100 governance. No Code of Conduct. No security policy. No DCO.

-## Why This Matters
+If your enterprise depends on Flask, you're building on a library that doesn't have a documented vulnerability disclosure process.

-If you're an enterprise adopting open-source AI tools, code quality is table stakes. Every popular framework writes good code. What you should be evaluating is:
+## The Pattern Across 23 Categories

-1. **Do they have a security disclosure process?** (SECURITY.md)
-2. **Can contributors understand the rules?** (CONTRIBUTING.md + Code of Conduct)
-3. **Are dependencies pinned and managed?** (requirements.txt + lockfiles)
-4. **Is there CI/CD?** (automated quality gates)
-5. **Is there an audit trail?** (governance receipts, not just git log)
+| Category | Repos | Certification Rate |
+|----------|-------|-------------------|
+| Developer Tools | 7 | 86% CERTIFIED |
+| Fintech | 5 | 60% CERTIFIED |
+| ML Platforms | 6 | 50% CERTIFIED |
+| Healthcare | 4 | 75% CERTIFIED |
+| Web Frameworks | 6 | 67% CERTIFIED |
+| LLM Frameworks | 5 | 40% CERTIFIED |
+| Databases/ORMs | 5 | 60% CERTIFIED |
+| Testing | 4 | 50% CERTIFIED |
+| Networking | 5 | 60% CERTIFIED |
+| Gaming | 5 | 20% CERTIFIED |
+| Cybersecurity | 4 | 0% CERTIFIED |

-## What We Built
+**Developer tools lead** (pytest, pip, tox — the people who build tools for quality also practice quality). **Gaming and cybersecurity lag** — speed over process.

-[Arbiter](https://github.com/hummbl-dev/arbiter) scores repositories across three dimensions:
+## What This Means for Enterprise AI Adoption

-- **Code Quality** (50%): lint, security, complexity via ruff, bandit, radon, shellcheck
-- **Governance** (30%): 10 checks for governance artifacts
-- **Dependencies** (20%): version pinning, dependency count, known-good packages
+If you're evaluating open-source AI tools for enterprise use, stop asking "is the code good?" It almost certainly is. Start asking:

-The certification decision is deterministic: same repo always gets the same score. No AI in the scoring path — just structured analysis.
+1. **Is there a SECURITY.md?** Can you report vulnerabilities privately?
+2. **Is there a CONTRIBUTING.md?** Do you know how to participate?
+3. **Are dependencies pinned?** Can you reproduce the build?
+4. **Is there CI/CD?** Are quality gates automated?
+5. **Is there a Code of Conduct?** Is the community governed?

-## Try It
+These aren't nice-to-haves. They're the difference between a dependency you can trust and one you're gambling on.
+
+## Methodology
+
+Arbiter scores three dimensions:
+
+- **Code Quality** (50%): ruff, bandit, radon, vulture, shellcheck across Python and Shell
+- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, CoC, CI/CD, templates, DCO
+- **Dependency Health** (20%): pinning, count, known-good packages
+
+**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score.
+
+When code quality is unscorable (non-Python repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.
+
+## Try It Yourself

 ```bash
 pip install arbiter-score
 arbiter certify /path/to/your/repo
-arbiter certify --json https://github.com/your-org/your-repo
 ```

-Or check the [public leaderboard](https://hummbl.io/audit) to see how top repos score.
+Or score any GitHub repo by URL:
+
+```bash
+arbiter score-url https://github.com/your-org/your-repo
+arbiter certify https://github.com/your-org/your-repo
+```
+
+The full dataset, leaderboard, and scoring methodology are open source at [github.com/hummbl-dev/arbiter](https://github.com/hummbl-dev/arbiter).

 ---

-*[HUMMBL](https://hummbl.io) builds governed AI infrastructure for enterprises. Arbiter is our open-source code quality and governance scoring tool.*
+*[HUMMBL](https://hummbl.io) builds governance infrastructure for AI-native teams. Arbiter is our open-source code quality and governance scoring engine.*
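
A note on the Methodology text added in this commit: the stated weights (50/30/20), the 60/40 reweighting for unscorable code quality, and the certification thresholds can be sketched in a few lines of Python. This is only an illustration under assumptions. The function names `overall_score` and `certification_level` are hypothetical and are not Arbiter's API, and the sketch models nothing beyond what the Methodology section states. In particular, it does not model whatever rule causes a repo to fail on unpinned dependencies, since the diff does not describe that rule.

```python
from typing import Optional

# Weights and thresholds copied from the blog's Methodology section.
# Everything else here (names, structure) is a hypothetical illustration.
CODE_WEIGHT, GOVERNANCE_WEIGHT, DEPENDENCY_WEIGHT = 0.5, 0.3, 0.2
FALLBACK_GOVERNANCE_WEIGHT, FALLBACK_DEPENDENCY_WEIGHT = 0.6, 0.4


def overall_score(code: Optional[float], governance: float, dependencies: float) -> float:
    """Combine dimension scores (each 0-100) into an overall 0-100 score.

    Passing code=None stands in for "code quality is unscorable", in which
    case the stated fallback weighting (60% governance, 40% dependencies)
    is applied instead of penalizing the repo.
    """
    if code is None:
        return (FALLBACK_GOVERNANCE_WEIGHT * governance
                + FALLBACK_DEPENDENCY_WEIGHT * dependencies)
    return (CODE_WEIGHT * code
            + GOVERNANCE_WEIGHT * governance
            + DEPENDENCY_WEIGHT * dependencies)


def certification_level(score: float) -> str:
    """Map an overall score to the thresholds stated in the blog post."""
    if score >= 80:
        return "CERTIFIED"
    if score >= 60:
        return "PROVISIONAL"
    return "FAILED"


if __name__ == "__main__":
    # Made-up inputs, not taken from the dataset.
    score = overall_score(90.0, 70.0, 100.0)
    print(round(score, 1), certification_level(score))
```

Worth flagging for review: under these weights alone, a repo with 95.4 on code quality and 45 on governance reaches 80 overall whenever its dependency score is 94 or higher, so the PROVISIONAL results reported for such repos presumably depend on gating rules that the Methodology section does not spell out.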
