Project name
Awareness
Repo URL
https://github.com/nazmiefearmutcu/awareness
Description (description-only, per CONTRIBUTING)
Local-first public-text ingestion engine. Backfills any historical date range from Common Crawl + FineWeb and live-tails RSS / sitemaps / GDELT into Apache Iceberg, queryable from DuckDB. Single process, no cluster required.
Suggested tags
internet, dev.storage (data-pipeline / storage)
Why it fits
- Pure Python (FastAPI + pyiceberg + duckdb + httpx)
- Self-contained application (web UI ships with the FastAPI server at /)
- Long-running (live tail) + one-shot (backfill) modes
- Not a library / not a notebook / not a tutorial
Status
- v0.1.0 released, MIT licensed
- README has a section on the built-in 'Research Workbench' UI, with 4 screenshots
- Architecture diagram in README
Happy to add to projects.yaml directly via PR if preferred — let me know.