Skip to content

BehrozKarim/OpenLens

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

138 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data-driven Computing Architectures 2026 - Final Project

AI Tool Usage Declaration:

The AI tools were used for planning, debugging, idea validation, and enhancing the language and structure of our documentation

We used the following AI-based tools during this project:

ChatGPT / Claude: Helped us plan the overall pipeline architecture and brainstorm how to structure the medallion layers. We also used it to debug issues in our ingestion scripts and Spark notebooks when we ran into errors. The architecture diagrams used in the documentation were generated with AI assistance. We also used these LLMs for sparring on ideas e.g. which sub-score weights made sense, discussing trade-offs (like why we skipped Z-ordering) etc. These LLMs also helped us improve and polish the project documentation and README files.

Grammarly: Used it to improve and enhance the project documentation.


OpenLens: Python Package Health Intelligence Pipeline

OpenLens is a data pipeline that collects data from GitHub, PyPI, and Stack Overflow, runs it through a medallion architecture (Bronze -> Silver -> Gold) on Databricks with Delta Lake, and produces a health score for Python packages. We serve the results through a FastAPI backend and a Streamlit dashboard.

We wanted to build this because picking a Python library usually means checking stars on GitHub, maybe downloads on PyPI, or browsing Stack Overflow. But none of that gives you the full picture - a package can have 40k stars and still be unmaintained. So we combined everything into one score.


Team & Contributions

Member Contribution
Behroz Karim Implemented Bronze layer, profiling, performance tuning, tests. Documented the project.
Hasnain Ajmal Implemented data ingestion scripts (GitHub, PyPI, StackOverflow APIs), Databricks workspace setup, Gold layer (health scoring & sentiment aggregation), FastAPI REST endpoints, and Streamlit dashboard with including pipeline stats.
Talha Rizwan Implemented Silver layer, and Automated Job for ETL in Databricks. Helped a little in Documentation

What We Implemented

Here's what we did and where to find each part.

Documentation & Presentation

Task Where to find it
Solution description with diagrams, data schemas, pipeline description This README; docs/pipeline_architecture.md;
Repository organization & documentation See folder structure below; every directory has its own README
Presentation & language We used Markdown throughout with diagrams, tables, and screenshots
Scope & originality Three-source health score; DistilBERT sentiment analysis; automated orchestration pipeline

Data Ingestion

Task Where to find it
Structured data (batch) GitHub ingestor and PyPI ingestor pull structured JSON from REST APIs
Unstructured data (batch) Stack Overflow ingestor fetches raw HTML question/answer bodies; the GitHub ingestor also pulls README Markdown text
Automated batch ingestion Cron orchestrator runs all three ingestors, uploads to DBFS, and triggers the Databricks job automatically

Data Processing & Cleaning

Task Where to find it
ETL to clean, transform, and standardize Silver notebooks: 02_silver_github, 03_silver_pypi, 04_silver_stackoverflow
Deduplication, type validation, basic metadata Silver notebooks deduplicate on package name; Bronze envelope carries _ingested_at, _source, _endpoint metadata on every record
Advanced/automated metadata (profiling, schema evolution) 04b_silver_profiling profiles all silver tables; 01b_bronze_schema_evolution tracks schema changes across runs

Architecture & Advanced Features

Task Where to find it
Medallion architecture (Bronze-Silver-Gold) Pipeline architecture doc; notebooks 01 through 05 implement the three layers
Data lineage Data Lineage Bronze envelopes have _source and _endpoint fields on every record; schema evolution snapshots; YAML contracts list upstream table dependencies. Databricks also provide data lineage feature.
AI/ML model in the pipeline 05_gold_health_score runs DistilBERT SST-2 sentiment analysis as a Spark UDF
Performance tuning 06_optimize_vacuum runs OPTIMIZE (compaction) and VACUUM (stale file cleanup) on all Delta tables

Visualization & Dashboard

Task Where to find it
Visualizations and dashboard Streamlit dashboard with a leaderboard, radar charts, and grouped bar charts
Pipeline statistics Pipeline Stats page reads the structured JSONL logs and displays ingestion metrics

Data Sharing

Task Where to find it
External data usage via API FastAPI REST API with 6 endpoints serving Gold-layer data as JSON over HTTP

Data Product

Task Where to find it
Data contracts YAML contracts for package_health_scores, package_sentiment, and package_unified_features
Access data product through an API FastAPI endpoints
Automated testing of schema/data test/ contains 4 test suites that cover models, API endpoints, scoring logic, and log schema validation

Logging

Task Where to find it
Structured logging & error handling Every ingestor writes a human-readable .log file plus a structured .jsonl file; see data/logs/ and the Logging section below

How It Works

The Problem

When you pick a Python library you usually check GitHub stars or maybe download numbers. But that only shows part of the picture. A package might have tons of stars but be abandoned, or have low visibility but great community support on Stack Overflow.

We wanted a single score that pulls together all these signals so you can compare packages across multiple dimensions at once.

Tracked Packages

The pipeline currently tracks 8 Python packages:

PyPI Name GitHub Repository Stack Overflow Tag
pandas pandas-dev/pandas pandas
requests psf/requests python-requests
fastapi tiangolo/fastapi fastapi
numpy numpy/numpy numpy
flask pallets/flask flask
django django/django django
scikit-learn scikit-learn/scikit-learn scikit-learn
torch pytorch/pytorch pytorch

To add a new package you just add an entry to TARGET_PACKAGES in the ingestor files.


Pipeline Architecture

High-level data flow:

┌──────────────────────────────────────────────────────────────────────────────┐
│                          OpenLens Data Pipeline                              │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐                                │ 
│  │ GitHub   │  │  PyPI    │  │ Stack        │     DATA SOURCES               │
│  │ REST API │  │ JSON API │  │ Overflow API │                                │
│  └────┬─────┘  └────┬─────┘  └──────┬───────┘                                │
│       │              │               │                                       │
│       ▼              ▼               ▼                                       │
│  ┌─────────────────────────────────────────────┐                             │
│  │           Local Ingestion Layer              │     code/ingestion/        │
│  │  github_ingestor  pypi_ingestor  so_ingestor│                             │
│  └────────────────────┬────────────────────────┘                             │
│                       │                                                      │
│                       ▼                                                      │
│  ┌─────────────────────────────────────────────┐                             │
│  │           data/bronze/ (local JSON)          │     Raw API responses      │
│  └────────────────────┬────────────────────────┘                             │
│                       │  upload_to_dbfs.py                                   │
│                       ▼                                                      │
│  ┌─────────────────────────────────────────────┐                             │
│  │         DBFS  (dbfs:/FileStore/bronze/)      │     Azure Databricks       │
│  └────────────────────┬────────────────────────┘                             │
│                       │                                                      │
│       ┌───────────────┼───────────────┐                                      │
│       ▼               ▼               ▼                                      │
│  ┌─────────┐    ┌──────────┐    ┌──────────┐                                 │
│  │ BRONZE  │    │  SILVER  │    │   GOLD   │     Delta Lake Tables           │
│  │ (raw)   │───▶│ (clean)  │───▶│ (scored)│     ddc_databricks catalog      │
│  └─────────┘    └──────────┘    └────┬─────┘                                 │
│                                      │                                       │
│                       ┌──────────────┼──────────────┐                        │
│                       ▼                             ▼                        │
│              ┌─────────────────┐          ┌──────────────────┐               │
│              │  FastAPI REST   │          │    Streamlit     │               │
│              │  API (6 routes) │◀────────│    Dashboard     │               │
│              └─────────────────┘          └──────────────────┘               │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

More detail in docs/pipeline_architecture.md.


Medallion Architecture

Bronze Layer: Raw Ingestion

Raw API responses are stored as-is in JSON files under data/bronze/. Each record is wrapped in a metadata envelope so we can track where it came from:

{
  "_ingested_at": "2026-03-22T10:00:00Z",
  "_source": "github_rest_api",
  "_pypi_package": "pandas",
  "_endpoint": "https://api.github.com/repos/pandas-dev/pandas",
  "data": { "...raw GitHub API response..." }
}

Those underscore-prefixed fields are our own addition. They're not part of the original API response. They let us trace every single record back to its source, which is essentially our data lineage mechanism. After ingestion, [upload_to_dbfs.py](code/upload_to_dbfs.py) pushes these files to Databricks and [01_bronze_to_delta.ipynb](code/databricks/01_bronze_to_delta.ipynb) converts them into Delta tables.

[01b_bronze_schema_evolution.ipynb](code/databricks/01b_bronze_schema_evolution.ipynb) tracks whether the column structure changes between runs. This matters because external APIs sometimes add or drop fields.

Silver Layer: Cleaned & Enriched

The silver notebooks clean and transform the raw data:

Notebook Source What it does
02_silver_github GitHub Parses READMEs with BeautifulSoup, deduplicates records, aggregates contributor stats, extracts features like stars, forks, days since last push, README quality metrics
03_silver_pypi PyPI Flattens the nested metadata, normalizes download stats, computes download momentum (30 day vs 90 day trend)
04_silver_stackoverflow Stack Overflow Strips HTML from Q&A bodies, computes answered rates and average scores, measures community size
04b_silver_profiling All Silver Profiles all 16 silver tables automatically: row counts, null rates, value distributions

Gold Layer: Scores & ML

05_gold_health_score.ipynb joins the silver feature tables and produces three gold tables:

  1. gold.package_unified_features - wide table with all GitHub, PyPI, and SO features per package (30+ columns)
  2. gold.package_sentiment - sentiment scores from DistilBERT (SST-2), run as a Spark UDF, with a keyword fallback
  3. gold.package_health_scores - final composite scores and tier assignments

Delta Lake Maintenance

06_optimize_vacuum.ipynb runs after the pipeline finishes. It runs OPTIMIZE (compacts small files) and VACUUM (removes stale files older than 7 days) on every table across all three schemas. It discovers tables dynamically with SHOW TABLES so we don't need to update it when adding new tables.


Health Score Model

Each package gets scored on four dimensions (0-100 each), then combined with weights:

Sub-Score Weight Criteria
GitHub 30% Stars (25%), recency of last push (20%), contributor count (15%), forks (10%), contributor diversity (10%), README quality (10%), archived flag (5%), wiki flag (5%)
PyPI 25% Monthly downloads (35%), release maturity (25%), download momentum 30d vs 90d (20%), total releases (20%)
Community 25% SO question volume (25%), answered rate (20%), answer quality (20%), community size (20%), 30-day activity (15%)
Sentiment 20% DistilBERT compound sentiment across SO questions (30%), SO answers (30%), README text (25%), PyPI description (15%)

Overall score: 0.30 * GitHub + 0.25 * PyPI + 0.25 * Community + 0.20 * Sentiment

Tier assignment:

Tier Range
A 65 and above
B 52-64.99
C 40-51.99
D Below 40

There was no science behind choosing these weights since our main focus was to build a data pipeline.


Data Schemas

Bronze Envelope

Every record in Bronze follows this format:

Field Type What it is
_ingested_at string (ISO 8601) When we pulled this data
_source string Which API it came from (e.g. github_rest_api, pypi_json_api, stackexchange_api_v2.3)
_pypi_package string The package name
_endpoint string The exact API URL we called
data / questions / answers / ... object or array The raw API response, untouched

Gold Tables

We wrote formal column-level contracts in YAML in [code/contracts/](code/contracts/):

  • [package_health_scores.yaml](code/contracts/package_health_scores.yaml) - composite scores
  • [package_sentiment.yaml](code/contracts/package_sentiment.yaml) - per-source sentiment
  • [package_unified_features.yaml](code/contracts/package_unified_features.yaml) - wide feature table (30+ columns)

gold.package_health_scores

Column Type Description
package_name string PyPI package name
github_score double GitHub sub-score [0-100]
pypi_score double PyPI sub-score [0-100]
community_score double SO community sub-score [0-100]
sentiment_score double Sentiment sub-score [0-100]
overall_health_score double Weighted composite [0-100]
health_tier string A, B, C, or D
scored_at timestamp When this scoring run happened

gold.package_sentiment

Column Type Description
package_name string PyPI package name
so_question_sentiment_avg double Avg sentiment of SO questions [-1, +1]
so_answer_sentiment_avg double Avg sentiment of SO answers [-1, +1]
readme_sentiment_compound double README sentiment [-1, +1]
pypi_desc_sentiment_compound double PyPI description sentiment [-1, +1]
overall_sentiment double Weighted aggregate [-1, +1]

Data Contracts

We have YAML data contracts in [code/contracts/](code/contracts/) that define the schema, ownership, update frequency, upstream dependencies, and which API endpoints serve each table. Shortened example:

contract:
  name: package_health_scores
  catalog: ddc_databricks
  schema: gold
  table: package_health_scores
  owner: openlens-team
  update_frequency: on-demand (manual pipeline run)
  freshness_sla: 7 days
  columns: [...]
  upstream:
    - ddc_databricks.gold.package_unified_features
    - ddc_databricks.gold.package_sentiment
  served_by:
    - endpoint: "GET /packages"
    - endpoint: "GET /scores/leaderboard"

API Endpoints

The FastAPI app ([code/api/main.py](code/api/main.py)) reads Gold tables from Databricks at startup, caches them as pandas DataFrames, and serves them as JSON. No Databricks calls on each request.

Method Path What it returns
GET /health Liveness check with cached package count and timestamp
GET /packages All packages ranked by health score (highest first)
GET /packages/{name} Full health score + sentiment breakdown for one package
GET /packages/{name}/scores Just the health score breakdown
GET /packages/{name}/sentiment Just the sentiment breakdown
GET /scores/leaderboard All packages with full score details, ranked

Response models are defined with Pydantic in [code/api/models.py](code/api/models.py): PackageSummary, HealthScore, SentimentDetail, PackageDetail.

Swagger UI of the APIs can be accessed at https://openlensbackend.onrender.com/docs

Running the API

uvicorn code.api.main:app --reload --port 8000

Dashboard

The Streamlit dashboard ([code/dashboard/app.py](code/dashboard/app.py)) talks to the FastAPI backend and shows Plotly charts. The dashbaord can be accessed at https://openlensapp.streamlit.app/

Page File What's on it
Leaderboard [app.py](code/dashboard/app.py) Tier counts at the top, a grouped bar chart showing all sub-scores, and a ranked table with progress bars
Package Detail [1_Package_Detail.py](code/dashboard/pages/1_Package_Detail.py) Pick a package and see its radar chart, score cards, and a sentiment breakdown bar chart
Compare [2_Compare.py](code/dashboard/pages/2_Compare.py) Select two packages for a side-by-side comparison with overlapping radar and a delta table
Pipeline Stats [3_Pipeline_Stats.py](code/dashboard/pages/3_Pipeline_Stats.py) Reads the JSONL logs directly (no API needed) and shows ingestion run history, stars, downloads, and SO question counts

Running the Dashboard

# Make sure the API is running first, then:
streamlit run code/dashboard/app.py

The dashboard can also be accessed at https://openlensapp.streamlit.app/

Dashboard Screenshots

Leaderboard Page showing package ranking Leaderboard Page Showing Package Health and Ranking

Page showing package sub-scores Grouped Bar Chart of Sub-scores

Page showing package  details Package Detail view

Compare page with side-by-side comparision two packages Package Comparision

Compare page with side-by-side radar overlay Package Comparision - Radar Overlay


Orchestration

Automated Pipeline

The orchestrator ([code/orchestration/cron_pipeline.py](code/orchestration/cron_pipeline.py)) handles the full pipeline:

  1. Runs all three ingestors (GitHub, PyPI, Stack Overflow) in sequence
  2. Uploads bronze data to DBFS via [upload_to_dbfs.py](code/upload_to_dbfs.py)
  3. Triggers the Databricks Workflows job (runs notebooks 01-06) through the Jobs API
  4. Polls until the job finishes and writes a summary to data/logs/nightly_pipeline_summary.jsonl

The Databricks job interaction is in [DatabricksJobsClient](code/orchestration/databricks_job.py), a wrapper around the Databricks REST API 2.1.

Environment Variables

Set these in a .env file (copy from .env.example):

Variable What it's for
GITHUB_TOKEN GitHub personal access token
SO_API_KEY Stack Exchange API key for higher quotas
DATABRICKS_HOST Your Databricks workspace URL
DATABRICKS_TOKEN Databricks personal access token
DATABRICKS_CLUSTER_ID The cluster ID that databricks-connect should use

Structured Logging

Each ingestor writes two log files:

File Format What it's for
data/logs/<source>_ingestor.log Plain text Readable log for debugging
data/logs/<source>_ingestor_structured.jsonl JSON Lines Machine-readable telemetry that feeds the Pipeline Stats dashboard

Example structured log entry:

{
  "timestamp": "2026-03-22T10:00:05Z",
  "event": "repo_metadata",
  "package": "pandas",
  "status": "success",
  "file": "data/bronze/github/repos/pandas.json",
  "bytes": 14823,
  "stars": 43000,
  "forks": 17800,
  "open_issues": 3500
}

Each run ends with a run_complete event that has success/error counts. The Pipeline Stats page reads these logs.


Testing

We have four test files in [test/](test/). You can run them all with:

pytest test/ -v
Test File Type What it checks
[test_score_logic.py](test/test_score_logic.py) Unit The min-max scaling function, that weights sum to 1.0, the overall score formula, and all the tier boundary conditions
[test_api_models.py](test/test_api_models.py) Unit That our Pydantic models handle required fields, optional fields, type coercion, and JSON round-trips correctly
[test_api_endpoints.py](test/test_api_endpoints.py) Integration All 6 API routes, but with Databricks completely stubbed out and mock DataFrames injected instead
[test_log_schema.py](test/test_log_schema.py) Integration Reads the actual JSONL log files and checks that the right fields exist with reasonable values

Terminal output showing all pytest tests Test Outputs


Repository Structure

ddca2026-project-online5/
├── README.md                          # This file: the main project report
├── requirements.txt                   # Python dependencies
├── .env.example                       # Template for environment variables
│
├── code/                              # All pipeline code
│   ├── README.md
│   ├── config.py                      # Shared config (tokens, paths, packages)
│   ├── upload_to_dbfs.py              # Pushes bronze data to DBFS
│   │
│   ├── ingestion/                     # Data ingestion scripts
│   │   ├── github_ingestor.py
│   │   ├── pypi_ingestor.py
│   │   └── stackoverflow_ingestor.py
│   │
│   ├── databricks/                    # Databricks notebooks (run in order)
│   │   ├── 01_bronze_to_delta.ipynb
│   │   ├── 01b_bronze_schema_evolution.ipynb
│   │   ├── 02_silver_github.ipynb
│   │   ├── 03_silver_pypi.ipynb
│   │   ├── 04_silver_stackoverflow.ipynb
│   │   ├── 04b_silver_profiling.ipynb
│   │   ├── 05_gold_health_score.ipynb
│   │   └── 06_optimize_vacuum.ipynb
│   │
│   ├── api/                           # FastAPI REST API
│   │   ├── main.py
│   │   ├── db.py
│   │   └── models.py
│   │
│   ├── dashboard/                     # Streamlit dashboard
│   │   ├── app.py
│   │   ├── api_client.py
│   │   └── pages/
│   │       ├── 1_Package_Detail.py
│   │       ├── 2_Compare.py
│   │       └── 3_Pipeline_Stats.py
│   │
│   ├── contracts/                     # YAML data contracts
│   │   ├── package_health_scores.yaml
│   │   ├── package_sentiment.yaml
│   │   └── package_unified_features.yaml
│   │
│   └── orchestration/                 # Pipeline scheduling
│       ├── cron_pipeline.py
│       └── databricks_job.py
│
├── data/                              # Sample data and logs
│   ├── bronze/                        # Raw API responses per source
│   │   ├── github/
│   │   ├── pypi/
│   │   └── stackoverflow/
│   └── logs/                          # Ingestor logs (.log + .jsonl)
│
├── docs/                              # Extended documentation
│   ├── pipeline_architecture.md
│   └── screenshots/
│
├── test/                              # Test suites
│   ├── conftest.py
│   ├── test_score_logic.py
│   ├── test_api_models.py
│   ├── test_api_endpoints.py
│   └── test_log_schema.py
│
├── example/                           # Pipeline run evidence / screenshots
│
└── misc/                              # Notes and references

Getting Started

What you need

  • Python 3.12+
  • An Azure Databricks workspace (free trial works)
  • A GitHub personal access token
  • A Stack Exchange API key

Setup

git clone <repository-url>
cd ddca2026-project-online5

python -m venv .venv
source .venv/bin/activate      # Linux/Mac
.venv\Scripts\activate         # Windows

pip install -r requirements.txt

cp .env.example .env
# Fill in your tokens and Databricks credentials in .env

Running everything

# 1. Collect data from all three APIs
python code/ingestion/github_ingestor.py
python code/ingestion/pypi_ingestor.py
python code/ingestion/stackoverflow_ingestor.py

# 2. Push bronze data to Databricks
python code/upload_to_dbfs.py

# 3. Run notebooks 01–06 in order (You can run the notebooks in databricks or establish connection locally)

# 4. Start the API
uvicorn code.api.main:app --port 8000

# 5. Start the dashboard
streamlit run code/dashboard/app.py

# Or just run the orchestrator which does steps 1-3 automatically:
python code/orchestration/cron_pipeline.py

Running tests

pytest test/ -v

Tech Stack

What Technology
Data platform Azure Databricks + Delta Lake
Storage Delta Lake (Parquet + transaction log)
Compute Databricks Connect, PySpark
ML DistilBERT from Hugging Face (distilbert-base-uncased-finetuned-sst-2-english)
API FastAPI + Uvicorn
Dashboard Streamlit + Plotly
Contracts YAML
Testing pytest + httpx
Orchestration Python + Databricks Jobs REST API

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors