OEO Data Management

[Badges: codecov | pre-commit.ci | CI | Publish Dataset to R2 | Cleanup Staging Bucket]

This is the official repository for versioned input databases used by the Open Energy Outlook (OEO) initiative. It contains a command-line tool (datamanager) designed to manage these Temoa-compatible SQLite databases using a secure, auditable, and CI/CD-driven workflow.

About the Data

The SQLite databases hosted here are designed to be used as inputs for Temoa, an open-source energy system optimization model. This data is curated and maintained by the Open Energy Outlook (OEO) team. The goal is to provide a transparent, version-controlled, and publicly accessible set of data for energy systems modeling and analysis.

The Core Concept

The system works by treating your Git repository as a source of truth for metadata. The final publication of data is handled by a trusted, automated GitHub Actions workflow after a Pull Request has been reviewed and merged.

This two-phase process ensures security and consistency:

  1. Prepare Phase (Local): A developer prepares a new data version. The large file is uploaded to a temporary staging bucket, and a change to manifest.json is proposed.
  2. Publish Phase (Automated): After the proposal is approved and merged into the main branch, a GitHub Action performs a secure, server-side copy from the staging bucket to the final production bucket, making the data live.
```mermaid
---
config:
  layout: elk
  theme: mc
  look: classic
---
flowchart TD
 subgraph Developer_Machine["Developer Machine"]
        A["datamanager prepare"]
        C["New Git Branch"]
  end
 subgraph GitHub["GitHub"]
        D["Remote Repository"]
        E["Open Pull Request"]
        F["main branch"]
        G["Publish Workflow"]
  end
 subgraph Cloudflare_R2["Cloudflare R2"]
        B["Staging R2 Bucket"]
        H["Production R2 Bucket"]
  end
    A -- Upload GBs --> B
    A -- Commit manifest change --> C
    C -- Push Branch --> D
    D --> E
    E -- Review & Merge --> F
    F -- Triggers --> G
    G -- "Server-Side Copy" --> B
    B -- "...to" --> H
    G -- Finalize manifest --> F
```


Features

  • Multi-Bucket Support: Choose between production (public) and internal (team-only) buckets for different data access levels.
  • CI/CD-Driven Publishing: Data publication is transactional and automated via GitHub Actions after a pull request is merged, preventing inconsistent states.
  • Enhanced Security: Production credentials are never stored on developer machines; they are only used by the trusted GitHub Actions runner.
  • Interactive TUI: Run datamanager with no arguments for a user-friendly, menu-driven interface.
  • Data Lifecycle Management: A full suite of commands for rollback, deletion, and pruning, all gated by the same secure PR workflow.
  • Integrity Verification: All downloaded files are automatically checked against their SHA256 hash from the manifest.
  • Credential Verification: A detailed verify command reports read/write/delete permissions for production, staging, and internal buckets.
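The Integrity Verification feature can be pictured with a plain-shell sketch. The file name and expected hash below are stand-ins; the real tool reads the expected hash for each version from manifest.json:

```shell
# Toy sketch of a manifest-style SHA256 integrity check (names are illustrative)
tmp="$(mktemp -d)"
printf 'demo data' > "$tmp/energy-data.sqlite"                   # stand-in for a downloaded file
expected="$(printf 'demo data' | sha256sum | awk '{print $1}')"  # stand-in for the manifest hash
actual="$(sha256sum "$tmp/energy-data.sqlite" | awk '{print $1}')"
if [ "$actual" = "$expected" ]; then
  echo "integrity OK"
else
  echo "hash mismatch" >&2
  exit 1
fi
```

A mismatch indicates a corrupted or tampered download, which is why the tool performs this check automatically on every pull.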

Bucket Types

The system supports two types of data storage buckets:

Production Bucket (Public)

  • Access Level: Publicly accessible
  • Use Case: Data intended for public consumption
  • Default: Selected automatically when no --bucket option is given
  • Permissions: Requires appropriate public access settings in Cloudflare R2

Internal Bucket (Team-Only)

  • Access Level: Team members only
  • Use Case: Sensitive or internal data not for public consumption
  • Security: Private access, team credentials required
  • Usage: Specify --bucket internal when using commands

Prerequisites

  • Python 3.12+
  • Git
  • sqlite3 command-line tool
  • An active Cloudflare account with three R2 buckets:
    • Production bucket (publicly accessible)
    • Staging bucket (for temporary uploads)
    • Internal bucket (team-only access)
  • To work with the data in this repo, contact the OEO team for access to the R2 buckets.
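A quick sanity check that the local toolchain above is present (a minimal sketch; tool names may differ on your platform):

```shell
# Report which prerequisite tools are on PATH, with their versions
for tool in python3 git sqlite3; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool ($("$tool" --version 2>&1 | head -n 1))"
  else
    echo "missing: $tool"
  fi
done
```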

⚙️ Setup and Installation

  1. Clone the Repository:

    git clone git@github.com:TemoaProject/data.git
    cd data
  2. Install Dependencies: This project uses and recommends uv for fast and reliable dependency management.

    # Create a virtual environment and install dependencies
    uv venv
    source .venv/bin/activate
    uv pip install -e .

    The -e flag installs the package in "editable" mode, so changes to the source code are immediately reflected.

  3. Configure Environment Variables: The tool is configured using a .env file. Create one by copying the example:

    cp .env.example .env

    Now, edit the .env file with your Cloudflare R2 credentials. This file should be in your .gitignore and never committed to the repository.

    .env

    # Get these from your Cloudflare R2 dashboard
    R2_ACCOUNT_ID="your_cloudflare_account_id"
    R2_ACCESS_KEY_ID="your_r2_access_key"
    R2_SECRET_ACCESS_KEY="your_r2_secret_key"
    R2_PRODUCTION_BUCKET="your-production-bucket-name"
    R2_STAGING_BUCKET="your-staging-bucket-name"
    R2_INTERNAL_BUCKET="your-internal-bucket-name"
  4. Verify Configuration: Run the verify command to ensure your credentials and bucket access are correct.

    uv run datamanager verify

    [Screenshot: verify output]
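Because the .env file holds real credentials, it is worth double-checking that Git will never pick it up. A throwaway demonstration (run in a scratch repo, not in this one):

```shell
# Demonstrate that a .gitignore entry keeps .env out of version control
cd "$(mktemp -d)" && git init -q demo && cd demo
grep -qxF '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore
touch .env
git check-ignore -q .env && echo ".env is ignored"
```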

📖 The Data Publishing Workflow

All changes to the data—whether creating, updating, or deleting—follow a strict, safe, and reviewable Git-based workflow.

Step 1: Create a New Branch

Always start by creating a new branch from the latest version of main. This isolates your changes.

git checkout main
git pull
git checkout -b feat/update-energy-data

Step 2: Prepare Your Changes

Use the datamanager tool to stage your changes. The prepare command handles both creating new datasets and updating existing ones.

# Prepare for production bucket (default)
uv run datamanager prepare energy-data.sqlite ./local-files/new-energy.sqlite

# Prepare for internal bucket
uv run datamanager prepare energy-data.sqlite ./local-files/new-energy.sqlite --bucket internal

The tool will guide you through the process. For other maintenance tasks like rollback or delete, use the corresponding command.

Step 3: Commit and Push

Commit the modified manifest.json file to your branch with a descriptive message. This message will become the official description for the new data version.

git add manifest.json
git commit -m "feat: Add 2025 energy data with new technology columns"
git push --set-upstream origin feat/update-energy-data

Step 4: Open a Pull Request

Go to GitHub and open a pull request from your feature branch to main. The diff will clearly show the proposed changes to the manifest for your team to review.

Step 5: Merge and Automate

Once the PR is reviewed, approved, and all status checks pass, merge it. The CI/CD pipeline takes over automatically:

  • It copies the data from the staging bucket to the production bucket.
  • It finalizes the manifest.json with the new commit hash and description.
  • It pushes a final commit back to main.

The new data version is now live and available to all users via datamanager pull.
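Conceptually, the automated publish step boils down to a server-side copy plus a manifest update. A toy model with local directories standing in for R2 buckets (all paths hypothetical):

```shell
# Toy model of the publish phase; directories stand in for R2 buckets
staging="$(mktemp -d)"; production="$(mktemp -d)"
echo 'new data' > "$staging/energy-data.sqlite.v2"   # uploaded during the prepare phase
# In R2 this is a bucket-to-bucket (server-side) copy, so the large
# file never passes through the GitHub Actions runner
cp "$staging/energy-data.sqlite.v2" "$production/"
ls "$production"
```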

🚀 Usage

The primary workflow is now to prepare a dataset, then use standard Git practices to propose the change.

Interactive TUI

For a guided experience, simply run the command with no arguments:

uv run datamanager

This will launch a menu where you can choose your desired action, including the new "Prepare a dataset for release" option.

[Screenshot: interactive TUI]

Command-Line Interface (CLI)

You can also use the command-line interface directly for specific tasks or for scripting purposes. Use the --bucket option to specify whether to work with production or internal data.

[Screenshot: CLI]

Core Commands

prepare

Prepares a dataset for release by uploading it to the staging area and updating the manifest locally. This command intelligently handles both creating new datasets and updating existing ones.

This is the first step of the new workflow.

uv run datamanager prepare <dataset-name.sqlite> <path/to/local/file.sqlite>

After running prepare, follow the on-screen instructions:

  1. git add manifest.json
  2. git commit -m "Your descriptive message"
  3. git push
  4. Open a Pull Request in GitHub.

[Screenshot: prepare command]

list-datasets

Lists all datasets currently tracked in manifest.json.

uv run datamanager list-datasets

[Screenshot: list-datasets command]

pull

Downloads a dataset from the production R2 bucket and verifies its integrity.

# Pull the latest version from production (default)
uv run datamanager pull user-profiles.sqlite

# Pull from internal bucket
uv run datamanager pull user-profiles.sqlite --bucket internal

# Pull a specific version from production
uv run datamanager pull user-profiles.sqlite --version v2

# Pull a specific version from internal bucket
uv run datamanager pull user-profiles.sqlite --version v2 --bucket internal

[Screenshot: pull command]

Maintenance Commands

rollback

Prepares a rollback to a previous stable version by creating a new version entry that points to the old data.

uv run datamanager rollback <dataset-name.sqlite> --to-version v1

delete

Prepares the permanent deletion of an entire dataset and all its versions. Requires strong confirmation.

uv run datamanager delete <dataset-name.sqlite>

prune-versions

Prepares the permanent deletion of old versions of a dataset, keeping a specified number of recent versions.

uv run datamanager prune-versions <dataset-name.sqlite> --keep 5
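The --keep policy can be pictured with a toy example (plain files standing in for object versions; the real command edits manifest.json and removes objects from R2):

```shell
# Toy keep-the-newest-5 policy over fake version files
dir="$(mktemp -d)"
for v in 1 2 3 4 5 6 7; do touch "$dir/data.sqlite.v$v"; done
# Versions sorted oldest-first; everything except the last 5 is pruned
ls "$dir" | sort -V | head -n -5 | while read -r old; do rm "$dir/$old"; done
ls "$dir" | sort -V
```

After the run, only data.sqlite.v3 through data.sqlite.v7 remain.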

verify

Checks R2 credentials and reports granular read/write/delete permissions for production, staging, and internal buckets.

uv run datamanager verify

🧑‍💻 Development and Testing

To contribute to the tool's development:

  1. Install development dependencies using uv pip install -e ".[dev]" (quoted so the shell does not expand the brackets).

  2. Run the test suite using pytest:

    uv run pytest
  3. For code quality checks, run pre-commit:

    uv run pre-commit run --all-files