The AI tools were used for planning, debugging, idea validation, and enhancing the language and structure of our documentation
We used the following AI-based tools during this project:
ChatGPT / Claude: Helped us plan the overall pipeline architecture and brainstorm how to structure the medallion layers. We also used it to debug issues in our ingestion scripts and Spark notebooks when we ran into errors. The architecture diagrams used in the documentation were generated with AI assistance. We also used these LLMs for sparring on ideas e.g. which sub-score weights made sense, discussing trade-offs (like why we skipped Z-ordering) etc. These LLMs also helped us improve and polish the project documentation and README files.
Grammarly: Used it to improve and enhance the project documentation.
OpenLens is a data pipeline that collects data from GitHub, PyPI, and Stack Overflow, runs it through a medallion architecture (Bronze -> Silver -> Gold) on Databricks with Delta Lake, and produces a health score for Python packages. We serve the results through a FastAPI backend and a Streamlit dashboard.
We wanted to build this because picking a Python library usually means checking stars on GitHub, maybe downloads on PyPI, or browsing Stack Overflow. But none of that gives you the full picture - a package can have 40k stars and still be unmaintained. So we combined everything into one score.
| Member | Contribution |
|---|---|
| Behroz Karim | Implemented Bronze layer, profiling, performance tuning, tests. Documented the project. |
| Hasnain Ajmal | Implemented data ingestion scripts (GitHub, PyPI, StackOverflow APIs), Databricks workspace setup, Gold layer (health scoring & sentiment aggregation), FastAPI REST endpoints, and Streamlit dashboard with including pipeline stats. |
| Talha Rizwan | Implemented Silver layer, and Automated Job for ETL in Databricks. Helped a little in Documentation |
Here's what we did and where to find each part.
| Task | Where to find it |
|---|---|
| Solution description with diagrams, data schemas, pipeline description | This README; docs/pipeline_architecture.md; |
| Repository organization & documentation | See folder structure below; every directory has its own README |
| Presentation & language | We used Markdown throughout with diagrams, tables, and screenshots |
| Scope & originality | Three-source health score; DistilBERT sentiment analysis; automated orchestration pipeline |
| Task | Where to find it |
|---|---|
| Structured data (batch) | GitHub ingestor and PyPI ingestor pull structured JSON from REST APIs |
| Unstructured data (batch) | Stack Overflow ingestor fetches raw HTML question/answer bodies; the GitHub ingestor also pulls README Markdown text |
| Automated batch ingestion | Cron orchestrator runs all three ingestors, uploads to DBFS, and triggers the Databricks job automatically |
| Task | Where to find it |
|---|---|
| ETL to clean, transform, and standardize | Silver notebooks: 02_silver_github, 03_silver_pypi, 04_silver_stackoverflow |
| Deduplication, type validation, basic metadata | Silver notebooks deduplicate on package name; Bronze envelope carries _ingested_at, _source, _endpoint metadata on every record |
| Advanced/automated metadata (profiling, schema evolution) | 04b_silver_profiling profiles all silver tables; 01b_bronze_schema_evolution tracks schema changes across runs |
| Task | Where to find it |
|---|---|
| Medallion architecture (Bronze-Silver-Gold) | Pipeline architecture doc; notebooks 01 through 05 implement the three layers |
| Data lineage | Data Lineage Bronze envelopes have _source and _endpoint fields on every record; schema evolution snapshots; YAML contracts list upstream table dependencies. Databricks also provide data lineage feature. |
| AI/ML model in the pipeline | 05_gold_health_score runs DistilBERT SST-2 sentiment analysis as a Spark UDF |
| Performance tuning | 06_optimize_vacuum runs OPTIMIZE (compaction) and VACUUM (stale file cleanup) on all Delta tables |
| Task | Where to find it |
|---|---|
| Visualizations and dashboard | Streamlit dashboard with a leaderboard, radar charts, and grouped bar charts |
| Pipeline statistics | Pipeline Stats page reads the structured JSONL logs and displays ingestion metrics |
| Task | Where to find it |
|---|---|
| External data usage via API | FastAPI REST API with 6 endpoints serving Gold-layer data as JSON over HTTP |
| Task | Where to find it |
|---|---|
| Data contracts | YAML contracts for package_health_scores, package_sentiment, and package_unified_features |
| Access data product through an API | FastAPI endpoints |
| Automated testing of schema/data | test/ contains 4 test suites that cover models, API endpoints, scoring logic, and log schema validation |
| Task | Where to find it |
|---|---|
| Structured logging & error handling | Every ingestor writes a human-readable .log file plus a structured .jsonl file; see data/logs/ and the Logging section below |
When you pick a Python library you usually check GitHub stars or maybe download numbers. But that only shows part of the picture. A package might have tons of stars but be abandoned, or have low visibility but great community support on Stack Overflow.
We wanted a single score that pulls together all these signals so you can compare packages across multiple dimensions at once.
The pipeline currently tracks 8 Python packages:
| PyPI Name | GitHub Repository | Stack Overflow Tag |
|---|---|---|
| pandas | pandas-dev/pandas | pandas |
| requests | psf/requests | python-requests |
| fastapi | tiangolo/fastapi | fastapi |
| numpy | numpy/numpy | numpy |
| flask | pallets/flask | flask |
| django | django/django | django |
| scikit-learn | scikit-learn/scikit-learn | scikit-learn |
| torch | pytorch/pytorch | pytorch |
To add a new package you just add an entry to TARGET_PACKAGES in the ingestor files.
High-level data flow:
┌──────────────────────────────────────────────────────────────────────────────┐
│ OpenLens Data Pipeline │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ GitHub │ │ PyPI │ │ Stack │ DATA SOURCES │
│ │ REST API │ │ JSON API │ │ Overflow API │ │
│ └────┬─────┘ └────┬─────┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Local Ingestion Layer │ code/ingestion/ │
│ │ github_ingestor pypi_ingestor so_ingestor│ │
│ └────────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ data/bronze/ (local JSON) │ Raw API responses │
│ └────────────────────┬────────────────────────┘ │
│ │ upload_to_dbfs.py │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ DBFS (dbfs:/FileStore/bronze/) │ Azure Databricks │
│ └────────────────────┬────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ │
│ │ BRONZE │ │ SILVER │ │ GOLD │ Delta Lake Tables │
│ │ (raw) │───▶│ (clean) │───▶│ (scored)│ ddc_databricks catalog │
│ └─────────┘ └──────────┘ └────┬─────┘ │
│ │ │
│ ┌──────────────┼──────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ FastAPI REST │ │ Streamlit │ │
│ │ API (6 routes) │◀────────│ Dashboard │ │
│ └─────────────────┘ └──────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
More detail in docs/pipeline_architecture.md.
Raw API responses are stored as-is in JSON files under data/bronze/. Each record is wrapped in a metadata envelope so we can track where it came from:
{
"_ingested_at": "2026-03-22T10:00:00Z",
"_source": "github_rest_api",
"_pypi_package": "pandas",
"_endpoint": "https://api.github.com/repos/pandas-dev/pandas",
"data": { "...raw GitHub API response..." }
}Those underscore-prefixed fields are our own addition. They're not part of the original API response. They let us trace every single record back to its source, which is essentially our data lineage mechanism. After ingestion, [upload_to_dbfs.py](code/upload_to_dbfs.py) pushes these files to Databricks and [01_bronze_to_delta.ipynb](code/databricks/01_bronze_to_delta.ipynb) converts them into Delta tables.
[01b_bronze_schema_evolution.ipynb](code/databricks/01b_bronze_schema_evolution.ipynb) tracks whether the column structure changes between runs. This matters because external APIs sometimes add or drop fields.
The silver notebooks clean and transform the raw data:
| Notebook | Source | What it does |
|---|---|---|
| 02_silver_github | GitHub | Parses READMEs with BeautifulSoup, deduplicates records, aggregates contributor stats, extracts features like stars, forks, days since last push, README quality metrics |
| 03_silver_pypi | PyPI | Flattens the nested metadata, normalizes download stats, computes download momentum (30 day vs 90 day trend) |
| 04_silver_stackoverflow | Stack Overflow | Strips HTML from Q&A bodies, computes answered rates and average scores, measures community size |
| 04b_silver_profiling | All Silver | Profiles all 16 silver tables automatically: row counts, null rates, value distributions |
05_gold_health_score.ipynb joins the silver feature tables and produces three gold tables:
gold.package_unified_features- wide table with all GitHub, PyPI, and SO features per package (30+ columns)gold.package_sentiment- sentiment scores from DistilBERT (SST-2), run as a Spark UDF, with a keyword fallbackgold.package_health_scores- final composite scores and tier assignments
06_optimize_vacuum.ipynb runs after the pipeline finishes. It runs OPTIMIZE (compacts small files) and VACUUM (removes stale files older than 7 days) on every table across all three schemas. It discovers tables dynamically with SHOW TABLES so we don't need to update it when adding new tables.
Each package gets scored on four dimensions (0-100 each), then combined with weights:
| Sub-Score | Weight | Criteria |
|---|---|---|
| GitHub | 30% | Stars (25%), recency of last push (20%), contributor count (15%), forks (10%), contributor diversity (10%), README quality (10%), archived flag (5%), wiki flag (5%) |
| PyPI | 25% | Monthly downloads (35%), release maturity (25%), download momentum 30d vs 90d (20%), total releases (20%) |
| Community | 25% | SO question volume (25%), answered rate (20%), answer quality (20%), community size (20%), 30-day activity (15%) |
| Sentiment | 20% | DistilBERT compound sentiment across SO questions (30%), SO answers (30%), README text (25%), PyPI description (15%) |
Overall score: 0.30 * GitHub + 0.25 * PyPI + 0.25 * Community + 0.20 * Sentiment
Tier assignment:
| Tier | Range |
|---|---|
| A | 65 and above |
| B | 52-64.99 |
| C | 40-51.99 |
| D | Below 40 |
There was no science behind choosing these weights since our main focus was to build a data pipeline.
Every record in Bronze follows this format:
| Field | Type | What it is |
|---|---|---|
_ingested_at |
string (ISO 8601) | When we pulled this data |
_source |
string | Which API it came from (e.g. github_rest_api, pypi_json_api, stackexchange_api_v2.3) |
_pypi_package |
string | The package name |
_endpoint |
string | The exact API URL we called |
data / questions / answers / ... |
object or array | The raw API response, untouched |
We wrote formal column-level contracts in YAML in [code/contracts/](code/contracts/):
[package_health_scores.yaml](code/contracts/package_health_scores.yaml)- composite scores[package_sentiment.yaml](code/contracts/package_sentiment.yaml)- per-source sentiment[package_unified_features.yaml](code/contracts/package_unified_features.yaml)- wide feature table (30+ columns)
| Column | Type | Description |
|---|---|---|
package_name |
string | PyPI package name |
github_score |
double | GitHub sub-score [0-100] |
pypi_score |
double | PyPI sub-score [0-100] |
community_score |
double | SO community sub-score [0-100] |
sentiment_score |
double | Sentiment sub-score [0-100] |
overall_health_score |
double | Weighted composite [0-100] |
health_tier |
string | A, B, C, or D |
scored_at |
timestamp | When this scoring run happened |
| Column | Type | Description |
|---|---|---|
package_name |
string | PyPI package name |
so_question_sentiment_avg |
double | Avg sentiment of SO questions [-1, +1] |
so_answer_sentiment_avg |
double | Avg sentiment of SO answers [-1, +1] |
readme_sentiment_compound |
double | README sentiment [-1, +1] |
pypi_desc_sentiment_compound |
double | PyPI description sentiment [-1, +1] |
overall_sentiment |
double | Weighted aggregate [-1, +1] |
We have YAML data contracts in [code/contracts/](code/contracts/) that define the schema, ownership, update frequency, upstream dependencies, and which API endpoints serve each table. Shortened example:
contract:
name: package_health_scores
catalog: ddc_databricks
schema: gold
table: package_health_scores
owner: openlens-team
update_frequency: on-demand (manual pipeline run)
freshness_sla: 7 days
columns: [...]
upstream:
- ddc_databricks.gold.package_unified_features
- ddc_databricks.gold.package_sentiment
served_by:
- endpoint: "GET /packages"
- endpoint: "GET /scores/leaderboard"The FastAPI app ([code/api/main.py](code/api/main.py)) reads Gold tables from Databricks at startup, caches them as pandas DataFrames, and serves them as JSON. No Databricks calls on each request.
| Method | Path | What it returns |
|---|---|---|
GET |
/health |
Liveness check with cached package count and timestamp |
GET |
/packages |
All packages ranked by health score (highest first) |
GET |
/packages/{name} |
Full health score + sentiment breakdown for one package |
GET |
/packages/{name}/scores |
Just the health score breakdown |
GET |
/packages/{name}/sentiment |
Just the sentiment breakdown |
GET |
/scores/leaderboard |
All packages with full score details, ranked |
Response models are defined with Pydantic in [code/api/models.py](code/api/models.py): PackageSummary, HealthScore, SentimentDetail, PackageDetail.
Swagger UI of the APIs can be accessed at https://openlensbackend.onrender.com/docs
uvicorn code.api.main:app --reload --port 8000The Streamlit dashboard ([code/dashboard/app.py](code/dashboard/app.py)) talks to the FastAPI backend and shows Plotly charts. The dashbaord can be accessed at https://openlensapp.streamlit.app/
| Page | File | What's on it |
|---|---|---|
| Leaderboard | [app.py](code/dashboard/app.py) |
Tier counts at the top, a grouped bar chart showing all sub-scores, and a ranked table with progress bars |
| Package Detail | [1_Package_Detail.py](code/dashboard/pages/1_Package_Detail.py) |
Pick a package and see its radar chart, score cards, and a sentiment breakdown bar chart |
| Compare | [2_Compare.py](code/dashboard/pages/2_Compare.py) |
Select two packages for a side-by-side comparison with overlapping radar and a delta table |
| Pipeline Stats | [3_Pipeline_Stats.py](code/dashboard/pages/3_Pipeline_Stats.py) |
Reads the JSONL logs directly (no API needed) and shows ingestion run history, stars, downloads, and SO question counts |
# Make sure the API is running first, then:
streamlit run code/dashboard/app.pyThe dashboard can also be accessed at https://openlensapp.streamlit.app/
Leaderboard Page Showing Package Health and Ranking
Grouped Bar Chart of Sub-scores
Package Comparision - Radar Overlay
The orchestrator ([code/orchestration/cron_pipeline.py](code/orchestration/cron_pipeline.py)) handles the full pipeline:
- Runs all three ingestors (GitHub, PyPI, Stack Overflow) in sequence
- Uploads bronze data to DBFS via
[upload_to_dbfs.py](code/upload_to_dbfs.py) - Triggers the Databricks Workflows job (runs notebooks 01-06) through the Jobs API
- Polls until the job finishes and writes a summary to
data/logs/nightly_pipeline_summary.jsonl
The Databricks job interaction is in [DatabricksJobsClient](code/orchestration/databricks_job.py), a wrapper around the Databricks REST API 2.1.
Set these in a .env file (copy from .env.example):
| Variable | What it's for |
|---|---|
GITHUB_TOKEN |
GitHub personal access token |
SO_API_KEY |
Stack Exchange API key for higher quotas |
DATABRICKS_HOST |
Your Databricks workspace URL |
DATABRICKS_TOKEN |
Databricks personal access token |
DATABRICKS_CLUSTER_ID |
The cluster ID that databricks-connect should use |
Each ingestor writes two log files:
| File | Format | What it's for |
|---|---|---|
data/logs/<source>_ingestor.log |
Plain text | Readable log for debugging |
data/logs/<source>_ingestor_structured.jsonl |
JSON Lines | Machine-readable telemetry that feeds the Pipeline Stats dashboard |
Example structured log entry:
{
"timestamp": "2026-03-22T10:00:05Z",
"event": "repo_metadata",
"package": "pandas",
"status": "success",
"file": "data/bronze/github/repos/pandas.json",
"bytes": 14823,
"stars": 43000,
"forks": 17800,
"open_issues": 3500
}Each run ends with a run_complete event that has success/error counts. The Pipeline Stats page reads these logs.
We have four test files in [test/](test/). You can run them all with:
pytest test/ -v| Test File | Type | What it checks |
|---|---|---|
[test_score_logic.py](test/test_score_logic.py) |
Unit | The min-max scaling function, that weights sum to 1.0, the overall score formula, and all the tier boundary conditions |
[test_api_models.py](test/test_api_models.py) |
Unit | That our Pydantic models handle required fields, optional fields, type coercion, and JSON round-trips correctly |
[test_api_endpoints.py](test/test_api_endpoints.py) |
Integration | All 6 API routes, but with Databricks completely stubbed out and mock DataFrames injected instead |
[test_log_schema.py](test/test_log_schema.py) |
Integration | Reads the actual JSONL log files and checks that the right fields exist with reasonable values |
ddca2026-project-online5/
├── README.md # This file: the main project report
├── requirements.txt # Python dependencies
├── .env.example # Template for environment variables
│
├── code/ # All pipeline code
│ ├── README.md
│ ├── config.py # Shared config (tokens, paths, packages)
│ ├── upload_to_dbfs.py # Pushes bronze data to DBFS
│ │
│ ├── ingestion/ # Data ingestion scripts
│ │ ├── github_ingestor.py
│ │ ├── pypi_ingestor.py
│ │ └── stackoverflow_ingestor.py
│ │
│ ├── databricks/ # Databricks notebooks (run in order)
│ │ ├── 01_bronze_to_delta.ipynb
│ │ ├── 01b_bronze_schema_evolution.ipynb
│ │ ├── 02_silver_github.ipynb
│ │ ├── 03_silver_pypi.ipynb
│ │ ├── 04_silver_stackoverflow.ipynb
│ │ ├── 04b_silver_profiling.ipynb
│ │ ├── 05_gold_health_score.ipynb
│ │ └── 06_optimize_vacuum.ipynb
│ │
│ ├── api/ # FastAPI REST API
│ │ ├── main.py
│ │ ├── db.py
│ │ └── models.py
│ │
│ ├── dashboard/ # Streamlit dashboard
│ │ ├── app.py
│ │ ├── api_client.py
│ │ └── pages/
│ │ ├── 1_Package_Detail.py
│ │ ├── 2_Compare.py
│ │ └── 3_Pipeline_Stats.py
│ │
│ ├── contracts/ # YAML data contracts
│ │ ├── package_health_scores.yaml
│ │ ├── package_sentiment.yaml
│ │ └── package_unified_features.yaml
│ │
│ └── orchestration/ # Pipeline scheduling
│ ├── cron_pipeline.py
│ └── databricks_job.py
│
├── data/ # Sample data and logs
│ ├── bronze/ # Raw API responses per source
│ │ ├── github/
│ │ ├── pypi/
│ │ └── stackoverflow/
│ └── logs/ # Ingestor logs (.log + .jsonl)
│
├── docs/ # Extended documentation
│ ├── pipeline_architecture.md
│ └── screenshots/
│
├── test/ # Test suites
│ ├── conftest.py
│ ├── test_score_logic.py
│ ├── test_api_models.py
│ ├── test_api_endpoints.py
│ └── test_log_schema.py
│
├── example/ # Pipeline run evidence / screenshots
│
└── misc/ # Notes and references
- Python 3.12+
- An Azure Databricks workspace (free trial works)
- A GitHub personal access token
- A Stack Exchange API key
git clone <repository-url>
cd ddca2026-project-online5
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
pip install -r requirements.txt
cp .env.example .env
# Fill in your tokens and Databricks credentials in .env# 1. Collect data from all three APIs
python code/ingestion/github_ingestor.py
python code/ingestion/pypi_ingestor.py
python code/ingestion/stackoverflow_ingestor.py
# 2. Push bronze data to Databricks
python code/upload_to_dbfs.py
# 3. Run notebooks 01–06 in order (You can run the notebooks in databricks or establish connection locally)
# 4. Start the API
uvicorn code.api.main:app --port 8000
# 5. Start the dashboard
streamlit run code/dashboard/app.py
# Or just run the orchestrator which does steps 1-3 automatically:
python code/orchestration/cron_pipeline.pypytest test/ -v| What | Technology |
|---|---|
| Data platform | Azure Databricks + Delta Lake |
| Storage | Delta Lake (Parquet + transaction log) |
| Compute | Databricks Connect, PySpark |
| ML | DistilBERT from Hugging Face (distilbert-base-uncased-finetuned-sst-2-english) |
| API | FastAPI + Uvicorn |
| Dashboard | Streamlit + Plotly |
| Contracts | YAML |
| Testing | pytest + httpx |
| Orchestration | Python + Databricks Jobs REST API |


