Data-driven Computing Architectures 2026 - Final Project

AI Tool Usage Declaration:

The AI tools were used for planning, debugging, idea validation, and enhancing the language and structure of our documentation

We used the following AI-based tools during this project:

ChatGPT / Claude: Helped us plan the overall pipeline architecture and brainstorm how to structure the medallion layers. We also used it to debug issues in our ingestion scripts and Spark notebooks when we ran into errors. The architecture diagrams used in the documentation were generated with AI assistance. We also used these LLMs for sparring on ideas e.g. which sub-score weights made sense, discussing trade-offs (like why we skipped Z-ordering) etc. These LLMs also helped us improve and polish the project documentation and README files.

Grammarly: Used it to improve and enhance the project documentation.

OpenLens: Python Package Health Intelligence Pipeline

OpenLens is a data pipeline that collects data from GitHub, PyPI, and Stack Overflow, runs it through a medallion architecture (Bronze -> Silver -> Gold) on Databricks with Delta Lake, and produces a health score for Python packages. We serve the results through a FastAPI backend and a Streamlit dashboard.

We wanted to build this because picking a Python library usually means checking stars on GitHub, maybe downloads on PyPI, or browsing Stack Overflow. But none of that gives you the full picture - a package can have 40k stars and still be unmaintained. So we combined everything into one score.

Team & Contributions

Member	Contribution
Behroz Karim	Implemented Bronze layer, profiling, performance tuning, tests. Documented the project.
Hasnain Ajmal	Implemented data ingestion scripts (GitHub, PyPI, StackOverflow APIs), Databricks workspace setup, Gold layer (health scoring & sentiment aggregation), FastAPI REST endpoints, and Streamlit dashboard with including pipeline stats.
Talha Rizwan	Implemented Silver layer, and Automated Job for ETL in Databricks. Helped a little in Documentation

What We Implemented

Here's what we did and where to find each part.

Documentation & Presentation

Task	Where to find it
Solution description with diagrams, data schemas, pipeline description	This README; docs/pipeline_architecture.md;
Repository organization & documentation	See folder structure below; every directory has its own README
Presentation & language	We used Markdown throughout with diagrams, tables, and screenshots
Scope & originality	Three-source health score; DistilBERT sentiment analysis; automated orchestration pipeline

Data Ingestion

Task	Where to find it
Structured data (batch)	GitHub ingestor and PyPI ingestor pull structured JSON from REST APIs
Unstructured data (batch)	Stack Overflow ingestor fetches raw HTML question/answer bodies; the GitHub ingestor also pulls README Markdown text
Automated batch ingestion	Cron orchestrator runs all three ingestors, uploads to DBFS, and triggers the Databricks job automatically

Data Processing & Cleaning

Task	Where to find it
ETL to clean, transform, and standardize	Silver notebooks: 02_silver_github, 03_silver_pypi, 04_silver_stackoverflow
Deduplication, type validation, basic metadata	Silver notebooks deduplicate on package name; Bronze envelope carries `_ingested_at`, `_source`, `_endpoint` metadata on every record
Advanced/automated metadata (profiling, schema evolution)	04b_silver_profiling profiles all silver tables; 01b_bronze_schema_evolution tracks schema changes across runs

Architecture & Advanced Features

Task	Where to find it
Medallion architecture (Bronze-Silver-Gold)	Pipeline architecture doc; notebooks `01` through `05` implement the three layers
Data lineage	Data Lineage Bronze envelopes have `_source` and `_endpoint` fields on every record; schema evolution snapshots; YAML contracts list `upstream` table dependencies. Databricks also provide data lineage feature.
AI/ML model in the pipeline	05_gold_health_score runs DistilBERT SST-2 sentiment analysis as a Spark UDF
Performance tuning	06_optimize_vacuum runs OPTIMIZE (compaction) and VACUUM (stale file cleanup) on all Delta tables

Visualization & Dashboard

Task	Where to find it
Visualizations and dashboard	Streamlit dashboard with a leaderboard, radar charts, and grouped bar charts
Pipeline statistics	Pipeline Stats page reads the structured JSONL logs and displays ingestion metrics

Data Sharing

Task	Where to find it
External data usage via API	FastAPI REST API with 6 endpoints serving Gold-layer data as JSON over HTTP

Data Product

Task	Where to find it
Data contracts	YAML contracts for `package_health_scores`, `package_sentiment`, and `package_unified_features`
Access data product through an API	FastAPI endpoints
Automated testing of schema/data	test/ contains 4 test suites that cover models, API endpoints, scoring logic, and log schema validation

Logging

Task	Where to find it
Structured logging & error handling	Every ingestor writes a human-readable `.log` file plus a structured `.jsonl` file; see data/logs/ and the Logging section below

How It Works

The Problem

When you pick a Python library you usually check GitHub stars or maybe download numbers. But that only shows part of the picture. A package might have tons of stars but be abandoned, or have low visibility but great community support on Stack Overflow.

We wanted a single score that pulls together all these signals so you can compare packages across multiple dimensions at once.

Tracked Packages

The pipeline currently tracks 8 Python packages:

PyPI Name	GitHub Repository	Stack Overflow Tag
pandas	pandas-dev/pandas	pandas
requests	psf/requests	python-requests
fastapi	tiangolo/fastapi	fastapi
numpy	numpy/numpy	numpy
flask	pallets/flask	flask
django	django/django	django
scikit-learn	scikit-learn/scikit-learn	scikit-learn
torch	pytorch/pytorch	pytorch

To add a new package you just add an entry to TARGET_PACKAGES in the ingestor files.

Pipeline Architecture

High-level data flow:

┌──────────────────────────────────────────────────────────────────────────────┐
│                          OpenLens Data Pipeline                              │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐                                │ 
│  │ GitHub   │  │  PyPI    │  │ Stack        │     DATA SOURCES               │
│  │ REST API │  │ JSON API │  │ Overflow API │                                │
│  └────┬─────┘  └────┬─────┘  └──────┬───────┘                                │
│       │              │               │                                       │
│       ▼              ▼               ▼                                       │
│  ┌─────────────────────────────────────────────┐                             │
│  │           Local Ingestion Layer              │     code/ingestion/        │
│  │  github_ingestor  pypi_ingestor  so_ingestor│                             │
│  └────────────────────┬────────────────────────┘                             │
│                       │                                                      │
│                       ▼                                                      │
│  ┌─────────────────────────────────────────────┐                             │
│  │           data/bronze/ (local JSON)          │     Raw API responses      │
│  └────────────────────┬────────────────────────┘                             │
│                       │  upload_to_dbfs.py                                   │
│                       ▼                                                      │
│  ┌─────────────────────────────────────────────┐                             │
│  │         DBFS  (dbfs:/FileStore/bronze/)      │     Azure Databricks       │
│  └────────────────────┬────────────────────────┘                             │
│                       │                                                      │
│       ┌───────────────┼───────────────┐                                      │
│       ▼               ▼               ▼                                      │
│  ┌─────────┐    ┌──────────┐    ┌──────────┐                                 │
│  │ BRONZE  │    │  SILVER  │    │   GOLD   │     Delta Lake Tables           │
│  │ (raw)   │───▶│ (clean)  │───▶│ (scored)│     ddc_databricks catalog      │
│  └─────────┘    └──────────┘    └────┬─────┘                                 │
│                                      │                                       │
│                       ┌──────────────┼──────────────┐                        │
│                       ▼                             ▼                        │
│              ┌─────────────────┐          ┌──────────────────┐               │
│              │  FastAPI REST   │          │    Streamlit     │               │
│              │  API (6 routes) │◀────────│    Dashboard     │               │
│              └─────────────────┘          └──────────────────┘               │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

More detail in docs/pipeline_architecture.md.

Medallion Architecture

Bronze Layer: Raw Ingestion

Raw API responses are stored as-is in JSON files under data/bronze/. Each record is wrapped in a metadata envelope so we can track where it came from:

{
  "_ingested_at": "2026-03-22T10:00:00Z",
  "_source": "github_rest_api",
  "_pypi_package": "pandas",
  "_endpoint": "https://api.github.com/repos/pandas-dev/pandas",
  "data": { "...raw GitHub API response..." }
}

Those underscore-prefixed fields are our own addition. They're not part of the original API response. They let us trace every single record back to its source, which is essentially our data lineage mechanism. After ingestion, [upload_to_dbfs.py](code/upload_to_dbfs.py) pushes these files to Databricks and [01_bronze_to_delta.ipynb](code/databricks/01_bronze_to_delta.ipynb) converts them into Delta tables.

[01b_bronze_schema_evolution.ipynb](code/databricks/01b_bronze_schema_evolution.ipynb) tracks whether the column structure changes between runs. This matters because external APIs sometimes add or drop fields.

Silver Layer: Cleaned & Enriched

The silver notebooks clean and transform the raw data:

Notebook	Source	What it does
02_silver_github	GitHub	Parses READMEs with BeautifulSoup, deduplicates records, aggregates contributor stats, extracts features like stars, forks, days since last push, README quality metrics
03_silver_pypi	PyPI	Flattens the nested metadata, normalizes download stats, computes download momentum (30 day vs 90 day trend)
04_silver_stackoverflow	Stack Overflow	Strips HTML from Q&A bodies, computes answered rates and average scores, measures community size
04b_silver_profiling	All Silver	Profiles all 16 silver tables automatically: row counts, null rates, value distributions

Gold Layer: Scores & ML

05_gold_health_score.ipynb joins the silver feature tables and produces three gold tables:

gold.package_unified_features - wide table with all GitHub, PyPI, and SO features per package (30+ columns)
gold.package_sentiment - sentiment scores from DistilBERT (SST-2), run as a Spark UDF, with a keyword fallback
gold.package_health_scores - final composite scores and tier assignments

Delta Lake Maintenance

06_optimize_vacuum.ipynb runs after the pipeline finishes. It runs OPTIMIZE (compacts small files) and VACUUM (removes stale files older than 7 days) on every table across all three schemas. It discovers tables dynamically with SHOW TABLES so we don't need to update it when adding new tables.

Health Score Model

Each package gets scored on four dimensions (0-100 each), then combined with weights:

Sub-Score	Weight	Criteria
GitHub	30%	Stars (25%), recency of last push (20%), contributor count (15%), forks (10%), contributor diversity (10%), README quality (10%), archived flag (5%), wiki flag (5%)
PyPI	25%	Monthly downloads (35%), release maturity (25%), download momentum 30d vs 90d (20%), total releases (20%)
Community	25%	SO question volume (25%), answered rate (20%), answer quality (20%), community size (20%), 30-day activity (15%)
Sentiment	20%	DistilBERT compound sentiment across SO questions (30%), SO answers (30%), README text (25%), PyPI description (15%)

Overall score: 0.30 * GitHub + 0.25 * PyPI + 0.25 * Community + 0.20 * Sentiment

Tier assignment:

Tier	Range
A	65 and above
B	52-64.99
C	40-51.99
D	Below 40

There was no science behind choosing these weights since our main focus was to build a data pipeline.

Data Schemas

Bronze Envelope

Every record in Bronze follows this format:

Field	Type	What it is
`_ingested_at`	string (ISO 8601)	When we pulled this data
`_source`	string	Which API it came from (e.g. `github_rest_api`, `pypi_json_api`, `stackexchange_api_v2.3`)
`_pypi_package`	string	The package name
`_endpoint`	string	The exact API URL we called
`data` / `questions` / `answers` / ...	object or array	The raw API response, untouched

Gold Tables

We wrote formal column-level contracts in YAML in [code/contracts/](code/contracts/):

[package_health_scores.yaml](code/contracts/package_health_scores.yaml) - composite scores
[package_sentiment.yaml](code/contracts/package_sentiment.yaml) - per-source sentiment
[package_unified_features.yaml](code/contracts/package_unified_features.yaml) - wide feature table (30+ columns)

`gold.package_health_scores`

Column	Type	Description
`package_name`	string	PyPI package name
`github_score`	double	GitHub sub-score [0-100]
`pypi_score`	double	PyPI sub-score [0-100]
`community_score`	double	SO community sub-score [0-100]
`sentiment_score`	double	Sentiment sub-score [0-100]
`overall_health_score`	double	Weighted composite [0-100]
`health_tier`	string	A, B, C, or D
`scored_at`	timestamp	When this scoring run happened

`gold.package_sentiment`

Column	Type	Description
`package_name`	string	PyPI package name
`so_question_sentiment_avg`	double	Avg sentiment of SO questions [-1, +1]
`so_answer_sentiment_avg`	double	Avg sentiment of SO answers [-1, +1]
`readme_sentiment_compound`	double	README sentiment [-1, +1]
`pypi_desc_sentiment_compound`	double	PyPI description sentiment [-1, +1]
`overall_sentiment`	double	Weighted aggregate [-1, +1]

Data Contracts

We have YAML data contracts in [code/contracts/](code/contracts/) that define the schema, ownership, update frequency, upstream dependencies, and which API endpoints serve each table. Shortened example:

contract:
  name: package_health_scores
  catalog: ddc_databricks
  schema: gold
  table: package_health_scores
  owner: openlens-team
  update_frequency: on-demand (manual pipeline run)
  freshness_sla: 7 days
  columns: [...]
  upstream:
    - ddc_databricks.gold.package_unified_features
    - ddc_databricks.gold.package_sentiment
  served_by:
    - endpoint: "GET /packages"
    - endpoint: "GET /scores/leaderboard"

API Endpoints

The FastAPI app ([code/api/main.py](code/api/main.py)) reads Gold tables from Databricks at startup, caches them as pandas DataFrames, and serves them as JSON. No Databricks calls on each request.

Method	Path	What it returns
`GET`	`/health`	Liveness check with cached package count and timestamp
`GET`	`/packages`	All packages ranked by health score (highest first)
`GET`	`/packages/{name}`	Full health score + sentiment breakdown for one package
`GET`	`/packages/{name}/scores`	Just the health score breakdown
`GET`	`/packages/{name}/sentiment`	Just the sentiment breakdown
`GET`	`/scores/leaderboard`	All packages with full score details, ranked

Response models are defined with Pydantic in [code/api/models.py](code/api/models.py): PackageSummary, HealthScore, SentimentDetail, PackageDetail.

Swagger UI of the APIs can be accessed at https://openlensbackend.onrender.com/docs

Running the API

uvicorn code.api.main:app --reload --port 8000

Dashboard

The Streamlit dashboard ([code/dashboard/app.py](code/dashboard/app.py)) talks to the FastAPI backend and shows Plotly charts. The dashbaord can be accessed at https://openlensapp.streamlit.app/

Page	File	What's on it
Leaderboard	`[app.py](code/dashboard/app.py)`	Tier counts at the top, a grouped bar chart showing all sub-scores, and a ranked table with progress bars
Package Detail	`[1_Package_Detail.py](code/dashboard/pages/1_Package_Detail.py)`	Pick a package and see its radar chart, score cards, and a sentiment breakdown bar chart
Compare	`[2_Compare.py](code/dashboard/pages/2_Compare.py)`	Select two packages for a side-by-side comparison with overlapping radar and a delta table
Pipeline Stats	`[3_Pipeline_Stats.py](code/dashboard/pages/3_Pipeline_Stats.py)`	Reads the JSONL logs directly (no API needed) and shows ingestion run history, stars, downloads, and SO question counts

Running the Dashboard

# Make sure the API is running first, then:
streamlit run code/dashboard/app.py

The dashboard can also be accessed at https://openlensapp.streamlit.app/

Dashboard Screenshots

Leaderboard Page Showing Package Health and Ranking

Grouped Bar Chart of Sub-scores

Package Detail view

Package Comparision

Package Comparision - Radar Overlay

Orchestration

Automated Pipeline

The orchestrator ([code/orchestration/cron_pipeline.py](code/orchestration/cron_pipeline.py)) handles the full pipeline:

Runs all three ingestors (GitHub, PyPI, Stack Overflow) in sequence
Uploads bronze data to DBFS via [upload_to_dbfs.py](code/upload_to_dbfs.py)
Triggers the Databricks Workflows job (runs notebooks 01-06) through the Jobs API
Polls until the job finishes and writes a summary to data/logs/nightly_pipeline_summary.jsonl

The Databricks job interaction is in [DatabricksJobsClient](code/orchestration/databricks_job.py), a wrapper around the Databricks REST API 2.1.

Environment Variables

Set these in a .env file (copy from .env.example):

Variable	What it's for
`GITHUB_TOKEN`	GitHub personal access token
`SO_API_KEY`	Stack Exchange API key for higher quotas
`DATABRICKS_HOST`	Your Databricks workspace URL
`DATABRICKS_TOKEN`	Databricks personal access token
`DATABRICKS_CLUSTER_ID`	The cluster ID that databricks-connect should use

Structured Logging

Each ingestor writes two log files:

File	Format	What it's for
`data/logs/<source>_ingestor.log`	Plain text	Readable log for debugging
`data/logs/<source>_ingestor_structured.jsonl`	JSON Lines	Machine-readable telemetry that feeds the Pipeline Stats dashboard

Example structured log entry:

{
  "timestamp": "2026-03-22T10:00:05Z",
  "event": "repo_metadata",
  "package": "pandas",
  "status": "success",
  "file": "data/bronze/github/repos/pandas.json",
  "bytes": 14823,
  "stars": 43000,
  "forks": 17800,
  "open_issues": 3500
}

Each run ends with a run_complete event that has success/error counts. The Pipeline Stats page reads these logs.

Testing

We have four test files in [test/](test/). You can run them all with:

pytest test/ -v

Test File	Type	What it checks
`[test_score_logic.py](test/test_score_logic.py)`	Unit	The min-max scaling function, that weights sum to 1.0, the overall score formula, and all the tier boundary conditions
`[test_api_models.py](test/test_api_models.py)`	Unit	That our Pydantic models handle required fields, optional fields, type coercion, and JSON round-trips correctly
`[test_api_endpoints.py](test/test_api_endpoints.py)`	Integration	All 6 API routes, but with Databricks completely stubbed out and mock DataFrames injected instead
`[test_log_schema.py](test/test_log_schema.py)`	Integration	Reads the actual JSONL log files and checks that the right fields exist with reasonable values

Test Outputs

Repository Structure

ddca2026-project-online5/
├── README.md                          # This file: the main project report
├── requirements.txt                   # Python dependencies
├── .env.example                       # Template for environment variables
│
├── code/                              # All pipeline code
│   ├── README.md
│   ├── config.py                      # Shared config (tokens, paths, packages)
│   ├── upload_to_dbfs.py              # Pushes bronze data to DBFS
│   │
│   ├── ingestion/                     # Data ingestion scripts
│   │   ├── github_ingestor.py
│   │   ├── pypi_ingestor.py
│   │   └── stackoverflow_ingestor.py
│   │
│   ├── databricks/                    # Databricks notebooks (run in order)
│   │   ├── 01_bronze_to_delta.ipynb
│   │   ├── 01b_bronze_schema_evolution.ipynb
│   │   ├── 02_silver_github.ipynb
│   │   ├── 03_silver_pypi.ipynb
│   │   ├── 04_silver_stackoverflow.ipynb
│   │   ├── 04b_silver_profiling.ipynb
│   │   ├── 05_gold_health_score.ipynb
│   │   └── 06_optimize_vacuum.ipynb
│   │
│   ├── api/                           # FastAPI REST API
│   │   ├── main.py
│   │   ├── db.py
│   │   └── models.py
│   │
│   ├── dashboard/                     # Streamlit dashboard
│   │   ├── app.py
│   │   ├── api_client.py
│   │   └── pages/
│   │       ├── 1_Package_Detail.py
│   │       ├── 2_Compare.py
│   │       └── 3_Pipeline_Stats.py
│   │
│   ├── contracts/                     # YAML data contracts
│   │   ├── package_health_scores.yaml
│   │   ├── package_sentiment.yaml
│   │   └── package_unified_features.yaml
│   │
│   └── orchestration/                 # Pipeline scheduling
│       ├── cron_pipeline.py
│       └── databricks_job.py
│
├── data/                              # Sample data and logs
│   ├── bronze/                        # Raw API responses per source
│   │   ├── github/
│   │   ├── pypi/
│   │   └── stackoverflow/
│   └── logs/                          # Ingestor logs (.log + .jsonl)
│
├── docs/                              # Extended documentation
│   ├── pipeline_architecture.md
│   └── screenshots/
│
├── test/                              # Test suites
│   ├── conftest.py
│   ├── test_score_logic.py
│   ├── test_api_models.py
│   ├── test_api_endpoints.py
│   └── test_log_schema.py
│
├── example/                           # Pipeline run evidence / screenshots
│
└── misc/                              # Notes and references

Getting Started

What you need

Python 3.12+
An Azure Databricks workspace (free trial works)
A GitHub personal access token
A Stack Exchange API key

Setup

git clone <repository-url>
cd ddca2026-project-online5

python -m venv .venv
source .venv/bin/activate      # Linux/Mac
.venv\Scripts\activate         # Windows

pip install -r requirements.txt

cp .env.example .env
# Fill in your tokens and Databricks credentials in .env

Running everything

# 1. Collect data from all three APIs
python code/ingestion/github_ingestor.py
python code/ingestion/pypi_ingestor.py
python code/ingestion/stackoverflow_ingestor.py

# 2. Push bronze data to Databricks
python code/upload_to_dbfs.py

# 3. Run notebooks 01–06 in order (You can run the notebooks in databricks or establish connection locally)

# 4. Start the API
uvicorn code.api.main:app --port 8000

# 5. Start the dashboard
streamlit run code/dashboard/app.py

# Or just run the orchestrator which does steps 1-3 automatically:
python code/orchestration/cron_pipeline.py

Running tests

pytest test/ -v

Tech Stack

What	Technology
Data platform	Azure Databricks + Delta Lake
Storage	Delta Lake (Parquet + transaction log)
Compute	Databricks Connect, PySpark
ML	DistilBERT from Hugging Face (`distilbert-base-uncased-finetuned-sst-2-english`)
API	FastAPI + Uvicorn
Dashboard	Streamlit + Plotly
Contracts	YAML
Testing	pytest + httpx
Orchestration	Python + Databricks Jobs REST API

Name		Name	Last commit message	Last commit date
Latest commit History 138 Commits
code		code
data		data
docs		docs
example		example
misc		misc
test		test
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Data-driven Computing Architectures 2026 - Final Project

AI Tool Usage Declaration:

OpenLens: Python Package Health Intelligence Pipeline

Team & Contributions

What We Implemented

Documentation & Presentation

Data Ingestion

Data Processing & Cleaning

Architecture & Advanced Features

Visualization & Dashboard

Data Sharing

Data Product

Logging

How It Works

The Problem

Tracked Packages

Pipeline Architecture

Medallion Architecture

Bronze Layer: Raw Ingestion

Silver Layer: Cleaned & Enriched

Gold Layer: Scores & ML

Delta Lake Maintenance

Health Score Model

Data Schemas

Bronze Envelope

Gold Tables

gold.package_health_scores

gold.package_sentiment

Data Contracts

API Endpoints

Running the API

Dashboard

Running the Dashboard

Dashboard Screenshots

Orchestration

Automated Pipeline

Environment Variables

Structured Logging

Testing

Repository Structure

Getting Started

What you need

Setup

Running everything

Running tests

Tech Stack

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`gold.package_health_scores`

`gold.package_sentiment`

Packages