Commit b7906b1

ci testing
1 parent 3595756 commit b7906b1

4 files changed

Lines changed: 233 additions & 1 deletion

File tree

.github/workflows/ci.yml

Lines changed: 42 additions & 0 deletions
```yaml
name: CI

on:
  push:
    branches: ["master", "main"]
  pull_request:
    branches: ["master", "main"]

jobs:
  test:
    name: Setup and Run Tests
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install project and dependencies
        run: |
          uv sync --all-extras --dev
          uv pip install -e . --no-deps

      - name: Create dummy .env file
        run: |
          echo "R2_ACCOUNT_ID=dummy-id" >> .env
          echo "R2_ACCESS_KEY_ID=dummy-key" >> .env
          echo "R2_SECRET_ACCESS_KEY=dummy-secret" >> .env
          echo "R2_BUCKET=dummy-bucket" >> .env

      - name: Run tests
        run: uv run pytest tests
```

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -16,6 +16,7 @@
 !.python-version
 !netlify.toml
 !manifest.json
+!env.example

 # recursively re-ignore
 __pycache__
```

README.md

Lines changed: 190 additions & 0 deletions
# OEO Data Management

This repository provides a command-line tool, `datamanager`, to manage large, versioned datasets (such as SQLite files) using Git for metadata and Cloudflare R2 for object storage.

This approach avoids the pitfalls of storing large binary files directly in Git (and the costs associated with Git LFS) while still providing a robust, auditable version history for your data assets.

## The Core Concept

The system treats your Git repository as the source of truth for *metadata*, not for the data itself. Large data files are stored in a cost-effective object store (Cloudflare R2), which has the major benefit of zero egress fees for open data projects.

The workflow is as follows:

```mermaid
flowchart TD
    subgraph "Local Machine / CI Runner"
        A[Developer] --> B{datamanager CLI};
        B --> C[Git Repo];
        C --> D[manifest.json];
        C --> E[.diff files];
    end

    subgraph "Cloud"
        F["Remote Git Repo <br/>(GitHub, GitLab)"];
        G[Cloudflare R2 Bucket];
    end

    B -- "1. Uploads new .sqlite file" --> G;
    B -- "2. Calculates hash & diff" --> C;
    C -- "3. git push" --> F;

    style G fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#ccf,stroke:#333,stroke-width:2px
```

A `manifest.json` file in the Git repo acts as a "pointer" system, mapping dataset versions to specific, immutable objects in R2, complete with integrity hashes.
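The manifest's exact schema is not shown in this commit; purely as an illustrative sketch (all field names hypothetical), a pointer entry might look something like:

```json
{
  "datasets": [
    {
      "name": "user-profiles.sqlite",
      "latest_version": "v2",
      "versions": [
        {
          "version": "v2",
          "r2_key": "user-profiles/v2.sqlite",
          "sha256": "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824",
          "size_bytes": 1048576
        }
      ]
    }
  ]
}
```

Because every version maps to an immutable R2 object plus a hash, any checkout of the Git history can reproduce the exact data it referenced.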
## Features

- **Transactional Operations:** Updates and creations are transactional. If an R2 upload or `git push` fails, the operation is automatically rolled back to prevent inconsistent state.
- **Interactive TUI:** Run `datamanager` with no arguments for a user-friendly, menu-driven interface.
- **CLI for Automation:** A full suite of commands for scripting and CI/CD integration.
- **Integrity Verification:** All downloaded files are automatically checked against their SHA-256 hash from the manifest.
- **Small SQL Diffs:** For small changes, human-readable `.diff` files are stored directly in Git for quick review.
- **Credential Verification:** A simple `verify` command to check your R2 configuration.
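The integrity check described above amounts to hashing a downloaded file and comparing the result against the hash recorded in the manifest. A minimal sketch of that logic (illustrative only, not `datamanager`'s actual code):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large SQLite files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise if the downloaded file does not match the hash from the manifest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"integrity check failed for {path}: {actual} != {expected_sha256}")
```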
## Prerequisites

- Python 3.12+
- Git
- The `sqlite3` command-line tool
- An active Cloudflare account with an R2 bucket
- For the data in this repo, contact the OEO team for access to the R2 bucket
## ⚙️ Setup and Installation

1. **Clone the Repository:**

   ```bash
   git clone git@github.com:ParticularlyPythonicBS/oeo_data.git
   cd oeo_data
   ```

2. **Install Dependencies:**
   This project uses and recommends `uv` for fast, reliable dependency management.

   ```bash
   uv sync
   uv pip install -e .
   ```

   The `-e` flag installs the package in "editable" mode, so changes to the source code are immediately reflected.

3. **Configure Environment Variables:**
   The tool is configured via a `.env` file. Create one by copying the example:

   ```bash
   cp .env.example .env
   ```

   Now edit the `.env` file with your Cloudflare R2 credentials. **This file should be in your `.gitignore` and never committed to the repository.**

   **`.env`**

   ```ini
   # Get these from your Cloudflare R2 dashboard
   R2_ACCOUNT_ID="your_cloudflare_account_id"
   R2_ACCESS_KEY_ID="your_r2_access_key"
   R2_SECRET_ACCESS_KEY="your_r2_secret_key"
   R2_BUCKET="your-r2-bucket-name"
   ```

4. **Verify Configuration:**
   Run the `verify` command to ensure your credentials and bucket access are correct.

   ```bash
   datamanager verify
   ```

   ![Verify output](assets/verification.png)
## 🚀 Usage

### Interactive TUI

For a guided experience, simply run the command with no arguments:

```bash
uv run datamanager
```

This will launch a menu where you can choose your desired action.

![Interactive TUI menu](assets/tui.png)

### Command-Line Interface (CLI)

#### `verify`

Checks R2 credentials and bucket access.

```bash
uv run datamanager verify
```

![Verify output](assets/verification.png)

#### `list-datasets`

Lists all datasets currently tracked in `manifest.json`.

```bash
uv run datamanager list-datasets
```

![list-datasets output](assets/list-datasets.png)

#### `create`

Adds a new dataset to be tracked.

```bash
uv run datamanager create <dataset-name.sqlite> <path/to/local/file.sqlite>
```

![create output](assets/creating.png)

#### `update`

Creates a new version of an existing dataset.

```bash
uv run datamanager update <dataset-name.sqlite> <path/to/new/file.sqlite>
```

![Interactive update](assets/interactive_update.png)
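For small updates, the "Small SQL Diffs" feature keeps a human-readable `.diff` in Git. One plausible way to produce such a diff (a sketch under assumptions, not necessarily how `datamanager` does it) is to compare SQL dumps of the old and new database:

```python
import difflib
import sqlite3


def dump_sql(db_path: str) -> list[str]:
    """Dump a database as SQL statements, like `sqlite3 file .dump`."""
    conn = sqlite3.connect(db_path)
    try:
        return list(conn.iterdump())
    finally:
        conn.close()


def sql_diff(old_db: str, new_db: str) -> str:
    """Build a human-readable unified diff between two SQLite files."""
    return "\n".join(
        difflib.unified_diff(
            dump_sql(old_db), dump_sql(new_db),
            fromfile=old_db, tofile=new_db, lineterm="",
        )
    )
```

Dump-based diffs stay small and reviewable as long as only a few rows change, which is exactly the case this feature targets.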
#### `pull`

Downloads a dataset from R2 and verifies its integrity.

```bash
# Pull the latest version
uv run datamanager pull user-profiles.sqlite

# Pull a specific version
uv run datamanager pull user-profiles.sqlite --version v2

# Pull and save to a different path/name
uv run datamanager pull user-profiles.sqlite -o ./downloads/users_v2.sqlite
```

![pull output](assets/pulling.png)

## 🧑‍💻 Development and Testing

To contribute to the tool's development:

1. Install the development dependencies (if any are added to `pyproject.toml`).
2. Run the test suite with `pytest`:

   ```bash
   uv run pytest
   ```

3. For code quality checks, run the pre-commit hooks:

   ```bash
   uv run pre-commit run --all-files
   ```

src/datamanager/config.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -14,7 +14,6 @@ def get_required_env_var(env_dict: dict[str, Optional[str]], var_name: str) -> s


 dotenv_path = find_dotenv()
-print(f"Loading environment variables from: {dotenv_path}")

 env = dotenv_values(dotenv_path)
```
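The diff above shows only `get_required_env_var`'s signature; a minimal implementation consistent with that signature (an illustrative sketch, not the file's actual body) would fail fast on missing credentials:

```python
from typing import Optional


def get_required_env_var(env_dict: dict[str, Optional[str]], var_name: str) -> str:
    """Return the value for var_name, raising if it is absent or empty."""
    value = env_dict.get(var_name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {var_name}")
    return value
```

Failing at import time with a named variable is what makes the dummy `.env` in the CI workflow necessary: the test job supplies placeholder values so configuration loading succeeds.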
