[suggestion] Awareness — local-first public-text ingestion engine #225

Description

@nazmiefearmutcu

Project name

Awareness

Repo URL

https://github.com/nazmiefearmutcu/awareness

Description (description-only, per CONTRIBUTING)

Local-first public-text ingestion engine. Backfills any historical date range from Common Crawl + FineWeb and live-tails RSS / sitemaps / GDELT into Apache Iceberg, queryable from DuckDB. Single process, no cluster required.

Suggested tags

internet, dev.storage (data-pipeline / storage)

Why it fits

  • Pure Python (FastAPI + pyiceberg + duckdb + httpx)
  • Self-contained application (web UI ships with the FastAPI server at /)
  • Long-running (live tail) + one-shot (backfill) modes
  • Not a library / not a notebook / not a tutorial

Status

  • v0.1.0 released, MIT licensed
  • README has a section on the built-in 'Research Workbench' UI, with 4 screenshots
  • Architecture diagram in README

Happy to add it to projects.yaml directly via PR if preferred; let me know.
