# Code Quality Is a Solved Problem. Governance Isn't.

*By Reuben Bowlby | HUMMBL | April 2026*

*We certified 200+ open-source repositories across 23 industries. Here's what the data says.*

---

We built [Arbiter](https://github.com/hummbl-dev/arbiter) — a deterministic code quality scoring tool — and ran it against **201 open-source repositories** spanning AI governance, LLM frameworks, ML platforms, healthcare, fintech, developer tools, databases, testing, networking, media processing, and 13 other categories.

The hypothesis was simple: popular repos have poor code quality, and that's what holds back enterprise adoption.

**The hypothesis was wrong.**

## What We Actually Found

Code quality across popular open-source repos is **remarkably consistent**. The median code quality score is 91.2/100. Most repos score an A or a B. The tools work. The linters work. Developers lint their code.

What varies wildly — and what determines whether an enterprise should trust a dependency — is **governance**.

| Dimension | Median Score | Variance |
|--------------|--------------|------------------|
| Code Quality | 91.2 | Low (σ = 10.3) |
| Governance | 65.0 | **High** (σ = 18.7) |
| Dependencies | 100.0 | Low |

Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality scores cluster between 85 and 98. **Governance is where the signal is.**
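
To see why σ is the interesting number here, a minimal sketch (using made-up illustrative scores, **not** our actual dataset) of how two dimensions can share a healthy median while one of them hides huge spread:

```python
import statistics

# Illustrative scores only -- NOT the actual Arbiter dataset.
code_quality = [88, 90, 91, 92, 93, 95, 97]   # tight cluster
governance   = [20, 45, 45, 65, 80, 90, 100]  # wide spread

for name, scores in [("code quality", code_quality), ("governance", governance)]:
    med = statistics.median(scores)
    sigma = statistics.pstdev(scores)  # population standard deviation
    print(f"{name}: median={med:.1f}, sigma={sigma:.1f}")
```

The governance σ dwarfs the code quality σ even when the medians look respectable side by side, which is exactly the pattern the real dataset shows.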

## The Evidence

### LangChain: 95.4 code, 45 governance

The most popular LLM framework in the world. Used by thousands of enterprises. It scores 95.4 on code quality — excellent by any measure. But only 45/100 on governance:

- No Code of Conduct
- No SECURITY.md
- No DCO/CLA process
- No issue/PR templates

Arbiter certification: **PROVISIONAL**. Not because the code is bad. Because the governance infrastructure doesn't exist.

### MONAI: The Gold Standard at 98.2

NVIDIA's healthcare AI toolkit scores 98.2 — the highest of all 201 repos. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO. Every box checked.

This is what enterprises should require. And almost nobody does.

### Sentry: 98.5 code, FAILED

Sentry has the **best code quality** of any repo we tested: 98.5/100. But it **fails** certification because of 109 unpinned dependencies. The attack surface isn't the code — it's the supply chain.
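
The idea behind an unpinned-dependency check is simple enough to sketch. This is a simplified illustration, not Arbiter's actual implementation: treat a requirement as pinned only when it fixes an exact version with `==`.

```python
import re

# Simplified unpinned-dependency check -- NOT Arbiter's actual code.
# A requirement counts as "pinned" only if it fixes an exact version with ==.
PINNED = re.compile(r"^[A-Za-z0-9._-]+\s*==\s*[0-9]")

def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version."""
    lines = [
        ln.strip() for ln in requirements_text.splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    return [ln for ln in lines if not PINNED.match(ln)]

example = """\
# requirements.txt
requests==2.31.0
flask>=2.0
click
"""
print(unpinned(example))  # ['flask>=2.0', 'click']
```

A range specifier like `>=2.0` or a bare name means the build can resolve to a different artifact tomorrow than it did today, which is the supply-chain exposure Sentry's 109 flagged lines represent.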

### Flask and Click: Foundational, PROVISIONAL

Two of the most fundamental Python libraries. Flask powers millions of web apps; Click powers thousands of CLIs. Both score PROVISIONAL due to 45/100 governance. No Code of Conduct. No security policy. No DCO.

If your enterprise depends on Flask, you're building on a library that doesn't have a documented vulnerability disclosure process.

## The Pattern Across 23 Categories

Certification rates for a representative subset of the 23 categories:

| Category | Repos | Certification Rate |
|-----------------|-------|--------------------|
| Developer Tools | 7 | 86% CERTIFIED |
| Fintech | 5 | 60% CERTIFIED |
| ML Platforms | 6 | 50% CERTIFIED |
| Healthcare | 4 | 75% CERTIFIED |
| Web Frameworks | 6 | 67% CERTIFIED |
| LLM Frameworks | 5 | 40% CERTIFIED |
| Databases/ORMs | 5 | 60% CERTIFIED |
| Testing | 4 | 50% CERTIFIED |
| Networking | 5 | 60% CERTIFIED |
| Gaming | 5 | 20% CERTIFIED |
| Cybersecurity | 4 | 0% CERTIFIED |

**Developer tools lead** (pytest, pip, tox): the people who build tools for quality also practice quality. **Gaming and cybersecurity lag**, favoring speed over process.

## What This Means for Enterprise AI Adoption

If you're evaluating open-source AI tools for enterprise use, stop asking "is the code good?" It almost certainly is. Start asking:

1. **Is there a SECURITY.md?** Can you report vulnerabilities privately?
2. **Is there a CONTRIBUTING.md?** Do you know how to participate?
3. **Are dependencies pinned?** Can you reproduce the build?
4. **Is there CI/CD?** Are quality gates automated?
5. **Is there a Code of Conduct?** Is the community governed?

These aren't nice-to-haves. They're the difference between a dependency you can trust and one you're gambling on.
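
Most of the checklist above can be approximated with a few lines of shell against a local clone. This is a rough presence check, not Arbiter's scoring logic, and it assumes the conventional GitHub file locations (repo root or `.github/`):

```shell
#!/bin/sh
# Rough governance presence check against a local clone.
# Not Arbiter's scoring logic; assumes conventional GitHub file locations.
repo="${1:-.}"

check() {
    if [ -e "$repo/$1" ] || [ -e "$repo/.github/$1" ]; then
        echo "OK      $1"
    else
        echo "MISSING $1"
    fi
}

check SECURITY.md
check CONTRIBUTING.md
check CODE_OF_CONDUCT.md
[ -d "$repo/.github/workflows" ] && echo "OK      CI/CD workflows" \
                                 || echo "MISSING CI/CD workflows"
```

Run it as `sh govcheck.sh /path/to/clone`; a repo with everything present prints four `OK` lines.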

## Methodology

Arbiter scores three dimensions:

- **Code Quality** (50%): ruff, bandit, radon, vulture, and shellcheck across Python and Shell
- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, CI/CD, templates, and DCO
- **Dependency Health** (20%): version pinning, dependency count, known-good packages

**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Scoring is deterministic — the same repo always gets the same score. No AI in the scoring path, just structured analysis.

When code quality is unscorable (a non-Python repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing the repo.
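
The weighting, the thresholds, and the reweighting fallback can be sketched in a few lines. This is a simplified model of the rules as published above, not Arbiter's actual source:

```python
# Simplified model of Arbiter's published weighting and thresholds.
# NOT the actual implementation -- real scoring applies more rules.

def overall_score(code_quality, governance, dependencies):
    """Weighted overall score; reweights when code quality is unscorable."""
    if code_quality is None:  # e.g. non-Python repo without analyzers
        return 0.60 * governance + 0.40 * dependencies
    return 0.50 * code_quality + 0.30 * governance + 0.20 * dependencies

def certification(score):
    if score >= 80:
        return "CERTIFIED"
    if score >= 60:
        return "PROVISIONAL"
    return "FAILED"

# Hypothetical repo: strong code, weak governance, middling dependencies.
s = overall_score(90.0, 45.0, 70.0)
print(s, certification(s))  # 72.5 PROVISIONAL

# Unscorable code quality: governance and dependencies carry the weight.
s = overall_score(None, 80.0, 100.0)
print(s, certification(s))  # 88.0 CERTIFIED
```

Note how a 90-point codebase still lands in PROVISIONAL territory once governance drags on 30% of the weight; that arithmetic is the whole thesis of this post.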

## Try It Yourself

```bash
pip install arbiter-score
arbiter certify /path/to/your/repo
```

Or score any GitHub repo by URL:

```bash
arbiter score-url https://github.com/your-org/your-repo
arbiter certify https://github.com/your-org/your-repo
```

The full dataset, leaderboard, and scoring methodology are open source at [github.com/hummbl-dev/arbiter](https://github.com/hummbl-dev/arbiter).

---

*[HUMMBL](https://hummbl.io) builds governance infrastructure for AI-native teams. Arbiter is our open-source code quality and governance scoring engine.*