1 change: 1 addition & 0 deletions README.md
@@ -3,6 +3,7 @@
## Metaflow API Docs

- [BatchInferencePipeline](docs/metaflow/batch_inference_pipeline.md)
- [create_ownership_registry_view](docs/metaflow/create_ownership_registry_view.md)
- [make_pydantic_parser_fn](docs/metaflow/make_pydantic_parser_fn.md)
- [publish](docs/metaflow/publish.md)
- [publish_pandas](docs/metaflow/publish_pandas.md)
46 changes: 46 additions & 0 deletions docs/metaflow/create_ownership_registry_view.md
@@ -0,0 +1,46 @@
# `create_ownership_registry_view`

Source: `ds_platform_utils.metaflow.registry.create_ownership_registry_view`

Creates (or replaces) the central **table-ownership registry view**,
`PATTERN_DB.DATA_SCIENCE.TABLE_OWNERSHIP_REGISTRY`. The view pivots the object tags
applied by [`publish`](publish.md) / [`publish_pandas`](publish_pandas.md) into one row
per table, exposing `owner`, `team`, `domain`, `project`, `status`, `sla` and `contact`.

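This is a one-time admin helper: run it once to create the view. Because the view is not
materialized, it does not need to be re-run when new tables are tagged.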

## Signature

```python
create_ownership_registry_view(conn: SnowflakeConnection | None = None) -> str
```

| Parameter | Type | Required | Description |
| --------- | ----------------------------- | -------: | ------------------------------------------------------------------------ |
| `conn` | `SnowflakeConnection \| None` | No | Open Snowflake connection. If omitted, one is created via `get_snowflake_connection()`. |

**Returns:** the executed `CREATE OR REPLACE VIEW` SQL string.

## Usage

```python
from ds_platform_utils.metaflow import create_ownership_registry_view

create_ownership_registry_view()
```

Then query it:

```sql
SELECT * FROM PATTERN_DB.DATA_SCIENCE.TABLE_OWNERSHIP_REGISTRY
ORDER BY team, table_name;
```

## Notes

- **No refresh needed.** A view is not materialized: it re-runs its query on every read,
so it always reflects the current contents of its source.
- **~2h lag.** The view reads `SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES`, which itself lags
up to ~2 hours. For the current value of a single table's tag, use
`SYSTEM$GET_TAG('PATTERN_DB.DATA_SCIENCE.TABLE_OWNER', '<table>', 'table')` instead.
- **Adoption-based.** Only tables that have at least one ownership tag appear in the view.
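
The pivot the view performs can be illustrated in plain Python. This is a minimal sketch, not the view's actual SQL: the `(table_name, tag_name, tag_value)` row shape, the `TAG_TO_COLUMN` mapping and the `pivot_tag_references` helper are assumptions made for illustration. It also shows the adoption-based behaviour: a table with no ownership tag never produces a row.

```python
from collections import defaultdict

# Hypothetical mapping from ownership tag name to registry column,
# mirroring the seven columns the view is documented to expose.
TAG_TO_COLUMN = {
    "TABLE_OWNER": "owner",
    "TABLE_TEAM": "team",
    "TABLE_DOMAIN": "domain",
    "TABLE_PROJECT": "project",
    "TABLE_STATUS": "status",
    "TABLE_SLA": "sla",
    "TABLE_CONTACT": "contact",
}


def pivot_tag_references(rows):
    """Collapse (table_name, tag_name, tag_value) rows into one dict per table."""
    by_table = defaultdict(dict)
    for table_name, tag_name, tag_value in rows:
        column = TAG_TO_COLUMN.get(tag_name)
        if column is None:
            continue  # not an ownership tag, so it does not create a row
        by_table[table_name][column] = tag_value

    registry = []
    for table_name in sorted(by_table):
        row = {"table_name": table_name}
        for column in TAG_TO_COLUMN.values():
            # Tags that were never applied (e.g. sla, contact) stay None.
            row[column] = by_table[table_name].get(column)
        registry.append(row)
    return registry
```

Each returned dict corresponds to one row of the registry; unset tags appear as `None`, much as an unset tag would surface as `NULL` in a pivoted view.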
57 changes: 57 additions & 0 deletions docs/metaflow/publish.md
@@ -14,6 +14,7 @@ publish(
ctx: dict[str, Any] | None = None,
warehouse: Literal["XS", "MED", "XL"] | None = None,
use_utc: bool = True,
tags: dict[str, str] | None = None,
) -> None
```

@@ -22,6 +23,8 @@ publish(
- Reads SQL from a string or `.sql` path.
- Runs write/audit/publish operations through Snowflake.
- Adds operation details and table links to the Metaflow card when available.
- **Automatically applies ownership object tags to production tables** (see
[Ownership tags](#ownership-tags) below).

## Parameters

@@ -33,6 +36,7 @@ publish(
| `ctx` | `dict[str, Any] \| None` | No | Optional template substitution context for SQL operations. |
| `warehouse` | `Literal["XS", "MED", "XL"] \| None` | No | Snowflake warehouse override for this operation. Supports `XS`/`MED`/`XL` shortcuts or a full warehouse name. |
| `use_utc` | `bool` | No | If `True`, uses UTC timezone for Snowflake session. |
| `tags` | `dict[str, str] \| None` | No | Overrides for the ownership object tags applied to the published table. See [Ownership tags](#ownership-tags).|

**Returns:** `None`

@@ -47,3 +51,56 @@ publish(
audits=["SELECT COUNT(*) > 0 FROM PATTERN_DB.{{schema}}.{{table_name}}"],
)
```

## Ownership tags

When publishing to **production**, `publish()` automatically applies the table-ownership
object tags from the table-ownership RFC. The seven tags are:

| Tag | Source | Always set? |
| --------------- | ------------------------------------------------------- | --------------- |
| `TABLE_OWNER` | owning-team alias derived from the domain (`ds-<domain>-team`), else `unknown` | yes |
| `TABLE_TEAM` | `data-science` | yes |
| `TABLE_DOMAIN` | `ds.domain` Metaflow tag, else `unknown` | yes |
| `TABLE_PROJECT` | `ds.project` Metaflow tag, else `unknown` | yes |
| `TABLE_STATUS` | `active` (override allows `active`/`development`/`testing`/`deprecated`/`archived`/`retired`) | yes |
| `TABLE_SLA` | override only (`streaming`/`realtime`/`hourly`/`daily`/`weekly`/`monthly`/`quarterly`/`ad_hoc`/`on_demand`) | only if given |
| `TABLE_CONTACT` | override only (Slack channel or email) | only if given |

> **`TABLE_OWNER` is derived from the domain, not the run user.** Owner is resolved by
> priority: (1) an explicit `tags={"owner": ...}` override, else (2) the owning-team alias
> `ds-<domain>-team` when the domain is known (e.g. domain `advertising` →
> `ds-advertising-team`), else (3) `unknown`. We don't use `current.username`, because on
> deployed/Argo runs it resolves to a service identity (`argo-workflows`) rather than a
> person. Pass `tags={"owner": ...}` to set a specific individual or alias.
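
The owner priority described above can be sketched as follows. This is an illustrative helper, not the library's implementation: the name `resolve_owner` and its signature are hypothetical, and treating the literal `unknown` domain as "not known" is an assumption.

```python
def resolve_owner(domain, overrides=None):
    """Resolve TABLE_OWNER by priority: override, then team alias, then 'unknown'."""
    overrides = overrides or {}

    # (1) An explicit override wins; the TABLE_-prefixed key is accepted too.
    explicit = overrides.get("owner") or overrides.get("TABLE_OWNER")
    if explicit:
        return explicit

    # (2) Derive the owning-team alias when the domain is known.
    if domain and domain != "unknown":
        return f"ds-{domain}-team"

    # (3) Nothing to derive the owner from.
    return "unknown"
```

With this sketch, `resolve_owner("advertising")` returns `ds-advertising-team`, and `resolve_owner(None)` returns `unknown`.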

> **`TABLE_DOMAIN` / `TABLE_PROJECT` depend on flow tags.** These are read from the
> `ds.domain` / `ds.project` Metaflow tags. If a flow runs without them, the value falls
> back to the literal string `unknown` and a warning is printed (the same warning used
> for select.dev cost tracking). Make sure your flow carries `--tag "ds.domain:..."` and
> `--tag "ds.project:..."` — these are applied automatically in CI and the standard `poe`
> run commands in the monorepo — or pass `tags={"domain": ..., "project": ...}` explicitly.
> Note: because owner is derived from the domain, a missing domain also means
> `TABLE_OWNER` falls back to `unknown`.
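
The fallback for `ds.domain` / `ds.project` can be sketched like this. It is a minimal illustration under assumptions: the helper name `tag_value` is hypothetical, the flow's tags are taken as a plain iterable of `key:value` strings, and `warnings.warn` stands in for however the library actually prints its warning.

```python
import warnings


def tag_value(flow_tags, key, default="unknown"):
    """Return the value of a 'key:value' Metaflow tag, or the default if absent."""
    prefix = f"{key}:"
    for tag in flow_tags:
        # Require a non-empty value after the colon.
        if tag.startswith(prefix) and len(tag) > len(prefix):
            return tag[len(prefix):]

    # Assumption: a warning is emitted here; the real message may differ.
    warnings.warn(f"Metaflow tag '{key}' not found; falling back to '{default}'.")
    return default
```

For a flow started with `--tag "ds.domain:advertising"`, `tag_value(tags, "ds.domain")` yields `advertising`; without the tag it yields `unknown` and warns.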

Pass `tags=` to override any value. Keys may be `owner`/`team`/`domain`/`project`/
`status`/`sla`/`contact` (optionally `TABLE_`-prefixed):

```python
publish(
table_name="OUT_OF_STOCK_ADS",
query="sql/create_training_data.sql",
tags={"sla": "daily", "contact": "#ds-recsys", "status": "active"},
)
```

Notes:

- Tags are applied **only to production tables** (`DATA_SCIENCE`). Non-prod
(`DATA_SCIENCE_STAGE`) runs apply no tags. The publishing role needs `APPLY` on the tags.
- The tag *definitions* must first be created once by a Snowflake admin in `DATA_SCIENCE`
(the RFC `CREATE TAG` setup). Until then, tagging is **skipped with a warning** — the publish
still succeeds.
- Invalid `status`/`sla` values raise `ValueError` before any data is written.
- Tagged tables surface in the `TABLE_OWNERSHIP_REGISTRY` view (see
`create_ownership_registry_view`).
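
The override handling described in this section (optional `TABLE_` prefix, validated `status`/`sla`) can be sketched as below. This is an illustration only: the helper name `normalize_overrides` is hypothetical, and two behaviours are assumptions rather than documented facts, namely that keys are matched case-insensitively and that an unrecognised key raises `ValueError`.

```python
ALLOWED_KEYS = {"owner", "team", "domain", "project", "status", "sla", "contact"}
ALLOWED_STATUS = {
    "active", "development", "testing", "deprecated", "archived", "retired",
}
ALLOWED_SLA = {
    "streaming", "realtime", "hourly", "daily", "weekly",
    "monthly", "quarterly", "ad_hoc", "on_demand",
}


def normalize_overrides(tags=None):
    """Strip an optional TABLE_ prefix from keys and validate status/sla values."""
    normalized = {}
    for key, value in (tags or {}).items():
        short = key.lower()
        if short.startswith("table_"):
            short = short[len("table_"):]
        if short not in ALLOWED_KEYS:
            # Assumption: unknown keys are rejected rather than ignored.
            raise ValueError(f"Unknown ownership tag key: {key!r}")
        normalized[short] = value

    # Validate up front, so a bad value fails before any data is written.
    if "status" in normalized and normalized["status"] not in ALLOWED_STATUS:
        raise ValueError(f"Invalid status: {normalized['status']!r}")
    if "sla" in normalized and normalized["sla"] not in ALLOWED_SLA:
        raise ValueError(f"Invalid sla: {normalized['sla']!r}")
    return normalized
```

Running the validation before the write step is what lets an invalid value abort the publish without leaving a partially written table.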
28 changes: 28 additions & 0 deletions docs/metaflow/publish_pandas.md
@@ -22,6 +22,7 @@ publish_pandas(
use_utc: bool = True,
use_s3_stage: bool = False,
table_definition: list[tuple[str, str]] | None = None,
tags: dict[str, str] | None = None,
) -> None
```

@@ -30,6 +31,8 @@ publish_pandas(
- Validates DataFrame input.
- Writes directly via `write_pandas` or via S3 stage flow for large data.
- Adds a Snowflake table URL to Metaflow card output.
- **Automatically applies ownership object tags to production tables** (see
[Ownership tags](#ownership-tags) below).

## Parameters

@@ -49,9 +52,34 @@ publish_pandas(
| `use_utc` | `bool` | No | If `True`, uses UTC timezone for Snowflake session. |
| `use_s3_stage` | `bool` | No | If `True`, publishes via S3 stage flow; otherwise uses direct `write_pandas`. |
| `table_definition` | `list[tuple[str, str]] \| None` | No | Optional Snowflake table schema; used by S3 stage flow when table creation is needed. |
| `tags` | `dict[str, str] \| None` | No | Overrides for the ownership object tags applied to the published table. See [Ownership tags](#ownership-tags).|

**Returns:** `None`

## Ownership tags

When publishing to **production**, `publish_pandas()` automatically applies the same
table-ownership object tags as [`publish`](publish.md#ownership-tags): the five always-set
tags (`TABLE_OWNER`, `TABLE_TEAM`, `TABLE_DOMAIN`, `TABLE_PROJECT`, `TABLE_STATUS`) and,
when provided via `tags=`, `TABLE_SLA` and `TABLE_CONTACT`.

```python
publish_pandas(
table_name="MY_TABLE",
df=df,
tags={"sla": "daily", "contact": "#ds-recsys"},
)
```

- Tags are applied **only to production tables** (`DATA_SCIENCE`); non-prod runs apply none.
- `TABLE_DOMAIN` / `TABLE_PROJECT` come from the `ds.domain` / `ds.project` Metaflow tags;
if a flow runs without them they fall back to the literal `unknown` and a warning is
printed. Ensure the flow carries those tags (automatic in CI / standard `poe` commands)
or pass `tags={"domain": ..., "project": ...}`. See [`publish`](publish.md#ownership-tags).
- Tag *definitions* must first be created by a Snowflake admin (RFC `CREATE TAG` setup);
until then tagging is **skipped with a warning** and the publish still succeeds.
- Invalid `status`/`sla` values raise `ValueError` before any data is written.

## Limitations

- When `use_s3_stage=True`, some column data types may not map exactly as expected between pandas/parquet and Snowflake.
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -1,11 +1,12 @@
[project]
name = "ds-platform-utils"
- version = "0.4.2"
+ version = "0.5.0"
description = "Utility library for Pattern Data Science."
readme = "README.md"
authors = [
{ name = "Amit Vikram Raj", email = "amit.raj@pattern.com" },
- { name = "Eric Riddoch", email = "eric.riddoch@pattern.com" }
+ { name = "Eric Riddoch", email = "eric.riddoch@pattern.com" },
+ { name = "Vinay Shende", email = "vinay.shende@pattern.com" }
]
# requires-python = ">=3.7"
dependencies = [