**File: docs/CERTIFICATION_REPORT.md**
# Arbiter Certification Report — 201 Repos Across 23 Categories

*Generated 2026-04-19 by HUMMBL Arbiter v0.6.0*

## Executive Summary

We scored and certified **201 open-source repositories** across 23 industry categories using Arbiter's deterministic quality scoring engine. The data reveals a consistent pattern:

**Code quality is NOT the bottleneck. Governance is.**


### 1. Governance is the differentiator

Across 201 repos, code quality is consistently high (85+). The factor that separates CERTIFIED from PROVISIONAL is governance maturity — the exact dimension enterprises care about and the exact gap HUMMBL fills.

### 2. The governance gap is universal

---

**File: docs/blog/governance-bottleneck.md**
# Code Quality Is a Solved Problem. Governance Isn't.

*By Reuben Bowlby | HUMMBL | April 2026*
*We certified 201 open-source repositories across 23 industries. Here's what the data says.*

---

We built [Arbiter](https://github.com/hummbl-dev/arbiter) — a deterministic code quality scoring tool — and ran it against **201 open-source repositories** spanning AI governance, LLM frameworks, ML platforms, healthcare, fintech, developer tools, databases, testing, networking, media processing, and 13 other categories.

The hypothesis was simple: popular repos have poor code quality, and that's what holds back enterprise adoption.

**The hypothesis was wrong.**

## What We Actually Found

Code quality across popular open-source repos is **remarkably consistent**. The median code quality score is 91.2/100. Most repos score A or B. The tools work. The linters work. Developers lint their code.

What varies wildly — and what determines whether an enterprise should trust a dependency — is **governance**.

| Dimension | Median Score | Variance |
|-----------|-------------|----------|
| Code Quality | 91.2 | Low (σ = 10.3) |
| Governance | 65.0 | **High** (σ = 18.7) |
| Dependencies | 100.0 | Low |

Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality scores cluster between 85 and 98. **Governance is where the signal is.**
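The medians and standard deviations above are plain descriptive statistics. A minimal sketch of how they are computed, using a handful of hypothetical per-repo scores (the real 201-repo dataset lives in the Arbiter repository, and these small samples will not reproduce the published figures exactly):

```python
from statistics import median, pstdev

# Hypothetical per-repo scores chosen to illustrate the shape of the data:
# code quality clusters tightly, governance is all over the map.
code_quality = [95.4, 98.2, 98.5, 90.7, 89.9, 96.4, 93.4]
governance = [45, 100, 60, 55, 45, 90, 65]

print(f"code quality: median={median(code_quality):.1f}, sigma={pstdev(code_quality):.1f}")
print(f"governance:   median={median(governance):.1f}, sigma={pstdev(governance):.1f}")
```

Even on a toy sample, governance variance dwarfs code-quality variance, which is the whole story of the dataset.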

## The Evidence

### LangChain: 95.4 code, 45 governance

The most popular LLM framework in the world. Used by thousands of enterprises. Scores 95.4 on code quality — excellent by any measure. But only 45/100 on governance:

- No Code of Conduct
- No SECURITY.md
- No DCO/CLA process
- No issue/PR templates

Arbiter certification: **PROVISIONAL**. Not because the code is bad. Because the governance infrastructure doesn't exist.

### MONAI: The Gold Standard at 98.2

NVIDIA's healthcare AI toolkit scores 98.2 — the highest of all 201 repos. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO. Every box checked.

This is what enterprises should require. And almost nobody does.

### Sentry: 98.5 code, FAILED

Sentry has the **best code quality** of any repo we tested: 98.5/100. But it **fails** certification because of 109 unpinned dependencies. The attack surface isn't the code — it's the supply chain.
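Checking for unpinned dependencies is mechanical. A rough sketch of the idea (not Arbiter's actual implementation): flag any requirements line that lacks an exact `==` pin, since range specifiers and bare names both leave the resolved version up to the installer.

```python
def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    flagged = []
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()     # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if "==" not in line:                  # >=, ~=, or no specifier at all
            flagged.append(line)
    return flagged

reqs = """\
requests>=2.0      # range specifier: unpinned
flask==3.0.3       # exact pin: fine
django             # no specifier: unpinned
"""
print(unpinned(reqs))  # -> ['requests>=2.0', 'django']
```

Run against a file with 109 lines like `requests>=2.0`, a check this simple is enough to turn a 98.5 code-quality score into a failed certification.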

### Flask and Click: Foundational, PROVISIONAL

Two of the most fundamental Python libraries. Flask powers millions of web apps. Click powers thousands of CLIs. Both score PROVISIONAL due to 45/100 governance. No Code of Conduct. No security policy. No DCO.

If your enterprise depends on Flask, you're building on a library that doesn't have a documented vulnerability disclosure process.

## The Pattern Across 23 Categories

| Category | Repos | Certification Rate |
|----------|-------|-------------------|
| Developer Tools | 7 | 86% CERTIFIED |
| Fintech | 5 | 60% CERTIFIED |
| ML Platforms | 6 | 50% CERTIFIED |
| Healthcare | 4 | 75% CERTIFIED |
| Web Frameworks | 6 | 67% CERTIFIED |
| LLM Frameworks | 5 | 40% CERTIFIED |
| Databases/ORMs | 5 | 60% CERTIFIED |
| Testing | 4 | 50% CERTIFIED |
| Networking | 5 | 60% CERTIFIED |
| Gaming | 5 | 20% CERTIFIED |
| Cybersecurity | 4 | 0% CERTIFIED |

**Developer tools lead** (pytest, pip, tox — the people who build tools for quality also practice quality). **Gaming and cybersecurity lag** — speed over process.

## What This Means for Enterprise AI Adoption

If you're evaluating open-source AI tools for enterprise use, stop asking "is the code good?" It almost certainly is. Start asking:

1. **Is there a SECURITY.md?** Can you report vulnerabilities privately?
2. **Is there a CONTRIBUTING.md?** Do you know how to participate?
3. **Are dependencies pinned?** Can you reproduce the build?
4. **Is there CI/CD?** Are quality gates automated?
5. **Is there a Code of Conduct?** Is the community governed?

These aren't nice-to-haves. They're the difference between a dependency you can trust and one you're gambling on.
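Most of that checklist can be automated with a simple file scan. A quick sketch (the file names below are the common GitHub conventions, not Arbiter's exact check list, and the `.github/` fallbacks are an assumption about where projects keep these files):

```python
from pathlib import Path

# Common locations for each governance artifact; GitHub also honors
# the .github/ directory, checked here as a fallback.
CHECKS = {
    "security policy": ["SECURITY.md", ".github/SECURITY.md"],
    "contributor guide": ["CONTRIBUTING.md", ".github/CONTRIBUTING.md"],
    "code of conduct": ["CODE_OF_CONDUCT.md", ".github/CODE_OF_CONDUCT.md"],
    "license": ["LICENSE", "LICENSE.md", "LICENSE.txt"],
    "ci/cd config": [".github/workflows", ".gitlab-ci.yml"],
}

def governance_report(repo: Path) -> dict[str, bool]:
    """Report which governance artifacts exist anywhere in their usual spots."""
    return {
        name: any((repo / candidate).exists() for candidate in candidates)
        for name, candidates in CHECKS.items()
    }

if __name__ == "__main__":
    for name, present in governance_report(Path(".")).items():
        print(f"{'PASS' if present else 'MISS'}  {name}")
```

Five minutes with a script like this tells you more about a dependency's enterprise-readiness than an afternoon of reading its source.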

## Methodology

Arbiter scores three dimensions:

- **Code Quality** (50%): ruff, bandit, radon, vulture, shellcheck across Python and Shell
- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, CoC, CI/CD, templates, DCO
- **Dependency Health** (20%): pinning, count, known-good packages

**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score.

When code quality is unscorable (non-Python repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.

## Try It Yourself

```bash
pip install arbiter-score
arbiter certify /path/to/your/repo
```

Or score any GitHub repo by URL:

```bash
arbiter score-url https://github.com/your-org/your-repo
arbiter certify https://github.com/your-org/your-repo
```

The full dataset, leaderboard, and scoring methodology are open source at [github.com/hummbl-dev/arbiter](https://github.com/hummbl-dev/arbiter).

---

*[HUMMBL](https://hummbl.io) builds governance infrastructure for AI-native teams. Arbiter is our open-source code quality and governance scoring engine.*