**File: docs/CERTIFICATION_REPORT.md**
# Arbiter Certification Report — 201 Repos Across 23 Categories

*Generated 2026-04-19 by HUMMBL Arbiter v0.6.0*

## Executive Summary

We scored and certified **201 open-source repositories** across 23 industry categories using Arbiter's deterministic quality scoring engine. The data reveals a consistent pattern:

**Code quality is NOT the bottleneck. Governance is.**


### 1. Governance is the differentiator

Across 201 repos, code quality is consistently high (85+). The factor that separates CERTIFIED from PROVISIONAL is governance maturity — the exact dimension enterprises care about and the exact gap HUMMBL fills.

### 2. The governance gap is universal

---

**File: docs/blog/governance-bottleneck.md**
# Code Quality Is a Solved Problem. Governance Isn't.

*By Reuben Bowlby | HUMMBL | April 2026*
*We certified 201 open-source repositories across 23 industries. Here's what the data says.*

---

We built [Arbiter](https://github.com/hummbl-dev/arbiter) — a deterministic code quality scoring tool — and ran it against **201 open-source repositories** spanning AI governance, LLM frameworks, ML platforms, healthcare, fintech, developer tools, databases, testing, networking, media processing, and 13 other categories.

The hypothesis was simple: popular repos have poor code quality, and that's what holds back enterprise adoption.

**The hypothesis was wrong.**

## What We Actually Found

Code quality across popular open-source repos is **remarkably consistent**. The median code quality score is 91.2/100. Most repos score A or B. The tools work. The linters work. Developers lint their code.

What varies wildly — and what determines whether an enterprise should trust a dependency — is **governance**.

| Dimension | Median Score | Variance |
|-----------|-------------|----------|
| Code Quality | 91.2 | Low (σ = 10.3) |
| Governance | 65.0 | **High** (σ = 18.7) |
| Dependencies | 100.0 | Low |

Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality scores cluster between 85 and 98. **Governance is where the signal is.**
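The medians and standard deviations above are plain descriptive statistics. A minimal sketch of how they are computed, using a handful of hypothetical per-repo scores (the real 201-repo dataset lives in the Arbiter repository, and these small samples will not reproduce the published figures exactly):

```python
from statistics import median, pstdev

# Hypothetical per-repo scores chosen to illustrate the shape of the data:
# code quality clusters tightly, governance is all over the map.
code_quality = [95.4, 98.2, 98.5, 90.7, 89.9, 96.4, 93.4]
governance = [45, 100, 60, 55, 45, 90, 65]

print(f"code quality: median={median(code_quality):.1f}, sigma={pstdev(code_quality):.1f}")
print(f"governance:   median={median(governance):.1f}, sigma={pstdev(governance):.1f}")
```

Even on a toy sample, governance variance dwarfs code-quality variance, which is the whole story of the dataset.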

## The Evidence

### LangChain: 95.4 code, 45 governance

The most popular LLM framework in the world. Used by thousands of enterprises. Scores 95.4 on code quality — excellent by any measure. But only 45/100 on governance:

- No Code of Conduct
- No SECURITY.md
- No DCO/CLA process
- No issue/PR templates

Arbiter certification: **PROVISIONAL**. Not because the code is bad. Because the governance infrastructure doesn't exist.

### MONAI: The Gold Standard at 98.2

NVIDIA's healthcare AI toolkit scores 98.2 — the highest of all 201 repos. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO. Every box checked.

This is what enterprises should require. And almost nobody does.

### Sentry: 98.5 code, FAILED

Sentry has the **best code quality** of any repo we tested: 98.5/100. But it **fails** certification because of 109 unpinned dependencies. The attack surface isn't the code — it's the supply chain.
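Checking for unpinned dependencies is mechanical. A rough sketch of the idea (not Arbiter's actual implementation): flag any requirements line that lacks an exact `==` pin, since range specifiers and bare names both leave the resolved version up to the installer.

```python
def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    flagged = []
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()     # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if "==" not in line:                  # >=, ~=, or no specifier at all
            flagged.append(line)
    return flagged

reqs = """\
requests>=2.0      # range specifier: unpinned
flask==3.0.3       # exact pin: fine
django             # no specifier: unpinned
"""
print(unpinned(reqs))  # -> ['requests>=2.0', 'django']
```

Run against a file with 109 lines like `requests>=2.0`, a check this simple is enough to turn a 98.5 code-quality score into a failed certification.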

### Flask and Click: Foundational, PROVISIONAL

Two of the most fundamental Python libraries. Flask powers millions of web apps. Click powers thousands of CLIs. Both score PROVISIONAL due to 45/100 governance. No Code of Conduct. No security policy. No DCO.

If your enterprise depends on Flask, you're building on a library that doesn't have a documented vulnerability disclosure process.

## The Pattern Across 23 Categories

| Category | Repos | Certification Rate |
|----------|-------|-------------------|
| Developer Tools | 7 | 86% CERTIFIED |
| Fintech | 5 | 60% CERTIFIED |
| ML Platforms | 6 | 50% CERTIFIED |
| Healthcare | 4 | 75% CERTIFIED |
| Web Frameworks | 6 | 67% CERTIFIED |
| LLM Frameworks | 5 | 40% CERTIFIED |
| Databases/ORMs | 5 | 60% CERTIFIED |
| Testing | 4 | 50% CERTIFIED |
| Networking | 5 | 60% CERTIFIED |
| Gaming | 5 | 20% CERTIFIED |
| Cybersecurity | 4 | 0% CERTIFIED |

**Developer tools lead** (pytest, pip, tox — the people who build tools for quality also practice quality). **Gaming and cybersecurity lag** — speed over process.

## What This Means for Enterprise AI Adoption

If you're evaluating open-source AI tools for enterprise use, stop asking "is the code good?" It almost certainly is. Start asking:

1. **Is there a SECURITY.md?** Can you report vulnerabilities privately?
2. **Is there a CONTRIBUTING.md?** Do you know how to participate?
3. **Are dependencies pinned?** Can you reproduce the build?
4. **Is there CI/CD?** Are quality gates automated?
5. **Is there a Code of Conduct?** Is the community governed?

These aren't nice-to-haves. They're the difference between a dependency you can trust and one you're gambling on.
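Most of that checklist can be automated with a simple file scan. A quick sketch (the file names below are the common GitHub conventions, not Arbiter's exact check list, and the `.github/` fallbacks are an assumption about where projects keep these files):

```python
from pathlib import Path

# Common locations for each governance artifact; GitHub also honors
# the .github/ directory, checked here as a fallback.
CHECKS = {
    "security policy": ["SECURITY.md", ".github/SECURITY.md"],
    "contributor guide": ["CONTRIBUTING.md", ".github/CONTRIBUTING.md"],
    "code of conduct": ["CODE_OF_CONDUCT.md", ".github/CODE_OF_CONDUCT.md"],
    "license": ["LICENSE", "LICENSE.md", "LICENSE.txt"],
    "ci/cd config": [".github/workflows", ".gitlab-ci.yml"],
}

def governance_report(repo: Path) -> dict[str, bool]:
    """Report which governance artifacts exist anywhere in their usual spots."""
    return {
        name: any((repo / candidate).exists() for candidate in candidates)
        for name, candidates in CHECKS.items()
    }

if __name__ == "__main__":
    for name, present in governance_report(Path(".")).items():
        print(f"{'PASS' if present else 'MISS'}  {name}")
```

Five minutes with a script like this tells you more about a dependency's enterprise-readiness than an afternoon of reading its source.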

## Methodology

Arbiter scores three dimensions:

- **Code Quality** (50%): ruff, bandit, radon, vulture, shellcheck across Python and Shell
- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, CoC, CI/CD, templates, DCO
- **Dependency Health** (20%): pinning, count, known-good packages

**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score.

When code quality is unscorable (non-Python repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.

## Try It Yourself

```bash
pip install arbiter-score
arbiter certify /path/to/your/repo
```

Or score any GitHub repo by URL:

```bash
arbiter score-url https://github.com/your-org/your-repo
arbiter certify https://github.com/your-org/your-repo
```

The full dataset, leaderboard, and scoring methodology are open source at [github.com/hummbl-dev/arbiter](https://github.com/hummbl-dev/arbiter).

---

*[HUMMBL](https://hummbl.io) builds governance infrastructure for AI-native teams. Arbiter is our open-source code quality and governance scoring engine.*