Commit be08036

Add initial R -> Python package rewrite.
0 parents  commit be08036

14 files changed

Lines changed: 1474 additions & 0 deletions

.gitignore

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
1+
# Byte-compiled / optimized / DLL files
2+
__pycache__/
3+
*.py[codz]
4+
*$py.class
5+
6+
# C extensions
7+
*.so
8+
9+
# Distribution / packaging
10+
.Python
11+
build/
12+
develop-eggs/
13+
dist/
14+
downloads/
15+
eggs/
16+
.eggs/
17+
lib/
18+
lib64/
19+
parts/
20+
sdist/
21+
var/
22+
wheels/
23+
share/python-wheels/
24+
*.egg-info/
25+
.installed.cfg
26+
*.egg
27+
MANIFEST
28+
29+
# PyInstaller
30+
# Usually these files are written by a python script from a template
31+
# before PyInstaller builds the exe, so as to inject date/other infos into it.
32+
*.manifest
33+
*.spec
34+
35+
# Installer logs
36+
pip-log.txt
37+
pip-delete-this-directory.txt
38+
39+
# Unit test / coverage reports
40+
htmlcov/
41+
.tox/
42+
.nox/
43+
.coverage
44+
.coverage.*
45+
.cache
46+
nosetests.xml
47+
coverage.xml
48+
*.cover
49+
*.py.cover
50+
.hypothesis/
51+
.pytest_cache/
52+
cover/
53+
54+
# Translations
55+
*.mo
56+
*.pot
57+
58+
# Django stuff:
59+
*.log
60+
local_settings.py
61+
db.sqlite3
62+
db.sqlite3-journal
63+
64+
# Flask stuff:
65+
instance/
66+
.webassets-cache
67+
68+
# Scrapy stuff:
69+
.scrapy
70+
71+
# Sphinx documentation
72+
docs/_build/
73+
74+
# PyBuilder
75+
.pybuilder/
76+
target/
77+
78+
# Jupyter Notebook
79+
.ipynb_checkpoints
80+
81+
# IPython
82+
profile_default/
83+
ipython_config.py
84+
85+
# pyenv
86+
# For a library or package, you might want to ignore these files since the code is
87+
# intended to run in multiple environments; otherwise, check them in:
88+
# .python-version
89+
90+
# pipenv
91+
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92+
# However, in case of collaboration, if having platform-specific dependencies or dependencies
93+
# having no cross-platform support, pipenv may install dependencies that don't work, or not
94+
# install all needed dependencies.
95+
# Pipfile.lock
96+
97+
# UV
98+
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
99+
# This is especially recommended for binary packages to ensure reproducibility, and is more
100+
# commonly ignored for libraries.
101+
# uv.lock
102+
103+
# poetry
104+
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
105+
# This is especially recommended for binary packages to ensure reproducibility, and is more
106+
# commonly ignored for libraries.
107+
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
108+
# poetry.lock
109+
# poetry.toml
110+
111+
# pdm
112+
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
113+
# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
114+
# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
115+
# pdm.lock
116+
# pdm.toml
117+
.pdm-python
118+
.pdm-build/
119+
120+
# pixi
121+
# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
122+
# pixi.lock
123+
# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
124+
# in the .venv directory. It is recommended not to include this directory in version control.
125+
.pixi
126+
127+
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
128+
__pypackages__/
129+
130+
# Celery stuff
131+
celerybeat-schedule
132+
celerybeat.pid
133+
134+
# Redis
135+
*.rdb
136+
*.aof
137+
*.pid
138+
139+
# RabbitMQ
140+
mnesia/
141+
rabbitmq/
142+
rabbitmq-data/
143+
144+
# ActiveMQ
145+
activemq-data/
146+
147+
# SageMath parsed files
148+
*.sage.py
149+
150+
# Environments
151+
.env
152+
.envrc
153+
.venv
154+
env/
155+
venv/
156+
ENV/
157+
env.bak/
158+
venv.bak/
159+
160+
# Spyder project settings
161+
.spyderproject
162+
.spyproject
163+
164+
# Rope project settings
165+
.ropeproject
166+
167+
# mkdocs documentation
168+
/site
169+
170+
# mypy
171+
.mypy_cache/
172+
.dmypy.json
173+
dmypy.json
174+
175+
# Pyre type checker
176+
.pyre/
177+
178+
# pytype static type analyzer
179+
.pytype/
180+
181+
# Cython debug symbols
182+
cython_debug/
183+
184+
# PyCharm
185+
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
186+
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
187+
# and can be added to the global gitignore or merged into this file. For a more nuclear
188+
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
189+
# .idea/
190+
191+
# Abstra
192+
# Abstra is an AI-powered process automation framework.
193+
# Ignore directories containing user credentials, local state, and settings.
194+
# Learn more at https://abstra.io/docs
195+
.abstra/
196+
197+
# Visual Studio Code
198+
# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
199+
# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
200+
# and can be added to the global gitignore or merged into this file. However, if you prefer,
201+
# you could uncomment the following to ignore the entire vscode folder
202+
# .vscode/
203+
204+
# Ruff stuff:
205+
.ruff_cache/
206+
207+
# PyPI configuration file
208+
.pypirc
209+
210+
# Marimo
211+
marimo/_static/
212+
marimo/_lsp/
213+
__marimo__/
214+
215+
# Streamlit
216+
.streamlit/secrets.toml

CLAUDE.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
# Versioning
2+
3+
We are recreating the R package found in `../versioning.R/` as a clean Python package with tests and documentation, so that it is fully ready to be uploaded to PyPI. The package name is "versioning".
4+
5+
The purpose of this package is to parse YAML config files that simplify file reading and writing, with some opinionated package choices for file reading and writing of particular file types. The package is also intended to make it easy to deploy different versions of data pipelines over time.

README.md

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
1+
# versioning
2+
3+
A Python package for YAML-based configuration management in data pipelines, with versioned directory support and automatic file I/O by extension.
4+
5+
## Installation
6+
7+
```bash
8+
pip install versioning
9+
```
10+
11+
Install optional extras for specific file formats:
12+
13+
```bash
14+
pip install versioning[pandas] # CSV, TSV, Excel, Stata
15+
pip install versioning[geo] # Shapefiles, GeoJSON, GeoPackage, etc.
16+
pip install versioning[raster] # GeoTIFF, rasterio formats
17+
pip install versioning[xarray] # NetCDF
18+
pip install versioning[dbfread] # DBF files
19+
pip install versioning[all] # All of the above
20+
```
21+
22+
## Quick Start
23+
24+
### 1. Create a config YAML file
25+
26+
```yaml
27+
# project_config.yaml
28+
project_name: 'my_analysis'
29+
30+
directories:
31+
  raw_data:
32+
    versioned: false
33+
    path: '~/data/raw'
34+
    files:
35+
      input_table: 'records.csv'
36+
37+
  results:
38+
    versioned: true
39+
    path: '~/data/results'
40+
    files:
41+
      output_table: 'processed.csv'
42+
      summary: 'summary.txt'
43+
44+
versions:
45+
  results: 'v1'
46+
```
47+
48+
### 2. Load the config
49+
50+
```python
51+
from versioning import Config
52+
53+
cfg = Config('project_config.yaml')
54+
```
55+
56+
### 3. Access settings
57+
58+
```python
59+
cfg.get('project_name') # 'my_analysis'
60+
cfg.get('versions', 'results') # 'v1'
61+
cfg.get() # full config dict
62+
```
63+
64+
### 4. Build paths
65+
66+
```python
67+
# Non-versioned: returns ~/data/raw
68+
cfg.get_dir_path('raw_data')
69+
70+
# Versioned: returns ~/data/results/v1
71+
cfg.get_dir_path('results')
72+
73+
# With a custom version override
74+
cfg.get_dir_path('results', custom_version='v2')
75+
76+
# Full file path
77+
cfg.get_file_path('raw_data', 'input_table') # ~/data/raw/records.csv
78+
cfg.get_file_path('results', 'output_table') # ~/data/results/v1/processed.csv
79+
```
80+
81+
All path methods return `pathlib.Path` objects.
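
The path rules described in this step can be illustrated with a minimal sketch. It is not the package's implementation: `resolve_dir_path` and `resolve_file_path` are hypothetical helpers that operate on a plain dict shaped like the example YAML, and they assume that `~` is left unexpanded, which the README does not state either way.

```python
from pathlib import Path
from typing import Optional

# Plain-dict stand-in for the parsed project_config.yaml shown earlier.
example_config = {
    "directories": {
        "raw_data": {
            "versioned": False,
            "path": "~/data/raw",
            "files": {"input_table": "records.csv"},
        },
        "results": {
            "versioned": True,
            "path": "~/data/results",
            "files": {"output_table": "processed.csv", "summary": "summary.txt"},
        },
    },
    "versions": {"results": "v1"},
}


def resolve_dir_path(config: dict, dir_name: str,
                     custom_version: Optional[str] = None) -> Path:
    """Hypothetical helper: base path, plus a version folder if versioned."""
    entry = config["directories"][dir_name]
    base = Path(entry["path"])
    if not entry.get("versioned", False):
        return base
    # An explicit override wins over the version recorded in the config.
    if custom_version is not None:
        return base / custom_version
    return base / config["versions"][dir_name]


def resolve_file_path(config: dict, dir_name: str, file_name: str,
                      custom_version: Optional[str] = None) -> Path:
    """Hypothetical helper: directory path joined with the configured file name."""
    directory = resolve_dir_path(config, dir_name, custom_version)
    return directory / config["directories"][dir_name]["files"][file_name]
```

Under these assumptions, `resolve_file_path(example_config, 'results', 'output_table')` yields the same `~/data/results/v1/processed.csv` path that the README shows for `cfg.get_file_path`.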
82+
83+
### 5. Read and write files
84+
85+
```python
86+
import pandas as pd
87+
88+
# Read a file (path resolved from config)
89+
df = cfg.read('raw_data', 'input_table')
90+
91+
# Process data
92+
processed = df.head(10)
93+
94+
# Write results (directory must exist)
95+
cfg.write(processed, 'results', 'output_table')
96+
cfg.write(['Summary: 10 rows written\n'], 'results', 'summary')
97+
98+
# Write the config itself to the results directory
99+
cfg.write_self('results')
100+
```
101+
102+
### 6. Override versions at load time
103+
104+
```python
105+
# Run the same pipeline with a new version
106+
cfg_v2 = Config('project_config.yaml', versions={'results': 'v2'})
107+
cfg_v2.get_dir_path('results') # ~/data/results/v2
108+
```
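
One plausible way to think about the `versions=` override is as a merge applied on top of the `versions` block loaded from YAML. The sketch below only illustrates that idea; `apply_version_overrides` is a hypothetical name, and the real `Config` class may handle overrides differently.

```python
import copy
from typing import Optional


def apply_version_overrides(config: dict,
                            versions: Optional[dict] = None) -> dict:
    """Hypothetical helper: return a copy of config with versions overridden."""
    # Deep copy so the caller's config dict is never mutated.
    merged = copy.deepcopy(config)
    merged.setdefault("versions", {}).update(versions or {})
    return merged


base = {"project_name": "my_analysis", "versions": {"results": "v1"}}
overridden = apply_version_overrides(base, {"results": "v2"})
```

Copying before merging keeps the original settings intact, so two configs for different pipeline versions can coexist in the same session.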
109+
110+
## Standalone autoread / autowrite
111+
112+
```python
113+
from versioning import autoread, autowrite
114+
115+
# Read by extension
116+
df = autoread('data/records.csv')
117+
config = autoread('config.yaml')
118+
lines = autoread('notes.txt')
119+
120+
# Write by extension
121+
autowrite(df, 'output/results.csv')
122+
autowrite({'key': 'value'}, 'output/config.yaml')
123+
autowrite(['line one\n', 'line two\n'], 'output/notes.txt')
124+
```
125+
126+
## Supported File Extensions
127+
128+
| Format | Extensions | Requires |
129+
|--------|-----------|---------|
130+
| CSV / TSV | csv, tsv, gz, bz2 | `pandas` |
131+
| Excel | xls, xlsx | `pandas`, `openpyxl` |
132+
| Stata | dta | `pandas` |
133+
| DBF | dbf | `dbfread` |
134+
| YAML | yaml, yml | *(core)* |
135+
| Text | txt | *(core)* |
136+
| Shapefile / Vector | shp, geojson, gpkg, fgb, gml, kml, and more | `geopandas` |
137+
| Raster | tif, geotiff | `rasterio` |
138+
| NetCDF | nc | `xarray` |
139+
140+
For raster files, `autoread` returns `{"data": np.ndarray, "profile": dict}` and `autowrite` accepts that same structure (or a `(data, profile)` tuple).
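
Accepting both a dict and a `(data, profile)` tuple implies a normalization step before writing. The sketch below shows one way to do it; `normalize_raster_payload` is a hypothetical helper, and nested lists stand in for the NumPy array so the example has no third-party dependencies.

```python
from typing import Any, Tuple


def normalize_raster_payload(obj: Any) -> Tuple[Any, dict]:
    """Hypothetical helper: coerce either accepted form to (data, profile)."""
    if isinstance(obj, dict):
        missing = {"data", "profile"} - set(obj)
        if missing:
            raise ValueError(f"Raster dict is missing keys: {sorted(missing)}")
        return obj["data"], obj["profile"]
    if isinstance(obj, tuple) and len(obj) == 2:
        return obj[0], obj[1]
    raise TypeError("Expected a dict with 'data' and 'profile', or a 2-tuple")


# Stand-in values; a real payload would hold an np.ndarray and a rasterio profile.
pixels = [[0, 1], [2, 3]]
profile = {"count": 1, "width": 2, "height": 2}

from_dict = normalize_raster_payload({"data": pixels, "profile": profile})
from_tuple = normalize_raster_payload((pixels, profile))
```

Both calls produce the same `(data, profile)` pair, so a writer only needs to handle one shape.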
141+
142+
## Example Config File
143+
144+
A bundled example is included with the package:
145+
146+
```python
147+
import importlib.resources as r
148+
from versioning import Config
149+
150+
path = str(r.files("versioning") / "data" / "example_config.yaml")
151+
cfg = Config(path)
152+
```
