Commit eb47bde

feat(Snowflake): Add Snowflake dialect
beep boop
1 parent 377aa0e commit eb47bde

6 files changed

Lines changed: 1118 additions & 11 deletions

README.md

Lines changed: 38 additions & 10 deletions
@@ -53,7 +53,21 @@ regex flavour can't compile. Callers should fall back to
 ## Schema
 
 Each dialect publishes the table layout it expects via a `schema_ddl`
-constant. For ClickHouse:
+constant. For Snowflake:
+
+```sql
+CREATE TABLE IF NOT EXISTS IDENTITIES (
+    environment_id STRING NOT NULL,
+    id NUMBER NOT NULL,
+    identifier STRING NOT NULL,
+    identity_key STRING NOT NULL,
+    traits VARIANT,
+    PRIMARY KEY (environment_id, id)
+)
+CLUSTER BY (environment_id, id);
+```
+
+For ClickHouse:
 
 ```sql
 CREATE TABLE IF NOT EXISTS IDENTITIES (
@@ -67,11 +81,12 @@ ENGINE = MergeTree()
 ORDER BY (environment_id, id);
 ```
 
-Traits live in a single `JSON` column (CH 24+, GA in 25.x). Each key is
-stored as a typed subcolumn, so trait reads are direct columnar scans
-rather than per-row JSON parses. Trait keys are *data* — new keys appear
-without schema changes — and the translator only sees the abstract path
-extraction.
+Both engines store traits in a single columnar-JSON column —
+Snowflake's `VARIANT` and ClickHouse's `JSON` (24+, GA in 25.x). Each
+key is stored as a typed subcolumn, so trait reads are direct columnar
+scans rather than per-row JSON parses. Trait keys are *data* — new keys
+appear without schema changes — and the translator only sees the
+abstract path extraction.
 
 ClickHouse Cloud requires `SET allow_experimental_json_type = 1` when
 creating a `JSON`-column table (the type is GA on OSS 25.x); the test
@@ -80,7 +95,8 @@ harness applies this setting automatically.
 Programmatic access:
 
 ```python
-from flagsmith_sql_flag_engine.dialects.clickhouse import SCHEMA_DDL
+from flagsmith_sql_flag_engine.dialects.snowflake import SCHEMA_DDL as SNOWFLAKE_DDL
+from flagsmith_sql_flag_engine.dialects.clickhouse import SCHEMA_DDL as CLICKHOUSE_DDL
 ```
 
 ## Engine parity
@@ -95,10 +111,22 @@ To run the engine-parity suite locally:
 
 ```bash
 git submodule update --init # pull engine-test-data
+
+# Snowflake
+export SNOWFLAKE_ACCOUNT=...
+export SNOWFLAKE_USER=...
+export SNOWFLAKE_PRIVATE_KEY_PATH=...
+
+# ClickHouse — bring up the local container the CI workflow also uses
 docker compose up --detach --wait clickhouse
+
 uv run pytest tests/test_engine.py
 ```
 
+Each harness's environment variables are only read at session-create
+time; to run a single dialect's parity, pass e.g. `-k snowflake` or
+`-k clickhouse` and only export that dialect's credentials.
+
 Adding a new dialect's parity coverage is one harness module — see
 `tests/harnesses/` for the shape.
 
@@ -107,9 +135,9 @@ Adding a new dialect's parity coverage is one harness module — see
 The translator is dialect-aware: a `Dialect` protocol abstracts the
 SQL fragments that differ across SQL engines — MD5 hex, hex-to-int
 parsing, prefix-anchored regex, padded-version comparison, type-aware
-trait predicates, regex flavour. Today `ClickHouseDialect` is the only
-implementation; adding another engine such as Snowflake, DuckDB or
-Postgres means writing one class.
+trait predicates, regex flavour. Today `SnowflakeDialect` and
+`ClickHouseDialect` are implemented; adding another engine such as
+DuckDB or Postgres means writing one class.
 
 ## Operator coverage
 
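Note on the "writing one class" claim in the README hunk above: the following is a minimal sketch of what that might look like, not the project's actual interface. The `Dialect` protocol here is a hypothetical three-method subset whose method names are borrowed from `SnowflakeDialect` in this commit; `PostgresSketchDialect` and `hashed_key` are invented purely for illustration.

```python
from typing import Protocol


class Dialect(Protocol):
    """Hypothetical subset of the dialect interface (assumption, not the real protocol)."""

    name: str

    def md5_hex(self, expr: str) -> str: ...

    def cast_string(self, expr: str) -> str: ...

    def mod(self, dividend: str, divisor: str) -> str: ...


class PostgresSketchDialect:
    """Illustrative only; this class is not part of the commit."""

    name = "postgres-sketch"

    def md5_hex(self, expr: str) -> str:
        # Postgres' md5() returns the digest as hex text.
        return f"md5({expr})"

    def cast_string(self, expr: str) -> str:
        return f"CAST({expr} AS TEXT)"

    def mod(self, dividend: str, divisor: str) -> str:
        return f"MOD({dividend}, {divisor})"


def hashed_key(dialect: Dialect, expr: str) -> str:
    # The caller composes fragments without knowing which engine is targeted.
    return dialect.md5_hex(dialect.cast_string(expr))


print(hashed_key(PostgresSketchDialect(), "i.identity_key"))
```

Because `Dialect` is a structural `Protocol`, the sketch class satisfies it without inheriting from anything, which matches how `SnowflakeDialect` in this commit is declared as a plain class.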
pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ dev = [
     "pytest-xdist>=3",
     "mypy>=1.10",
     "prek>=0.3",
+    "snowflake-snowpark-python>=1.20",
     "clickhouse-connect>=0.7",
     "json5>=0.14.0",
     "pytest-cov>=7.1.0",
Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,6 @@
 """Dialect implementations."""
 
 from flagsmith_sql_flag_engine.dialects.clickhouse import ClickHouseDialect
+from flagsmith_sql_flag_engine.dialects.snowflake import SnowflakeDialect
 
-__all__ = ["ClickHouseDialect"]
+__all__ = ["ClickHouseDialect", "SnowflakeDialect"]
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
+"""Snowflake dialect: SQL fragments tailored to Snowflake's function set.
+
+## Expected schema
+
+The translator emits predicates against a single `IDENTITIES` table —
+four typed columns `environment_id`, `id`, `identifier`, `identity_key`,
+plus one `VARIANT` column `traits` holding the identity's full trait
+map. Trait keys are *data* in the VARIANT, not schema columns.
+
+VARIANT was chosen over column-per-trait wide-form because:
+
+- Snowflake caps tables at ~3,000 columns; large trait vocabularies
+  cross that.
+- VARIANT path-extraction is columnar, not a JSON parse per row;
+  perf is within ~30% of typed columns for simple key lookups.
+
+## Notable choices
+
+- `MD5_HEX` returns the 32-char hex digest directly.
+- Hex-to-int parsing uses `TO_NUMBER(SUBSTR(hex, n, 8), 'XXXXXXXX')`,
+  producing a non-negative number that fits Snowflake's 38-digit NUMBER.
+- Anchored regex uses `REGEXP_INSTR(value, pattern) = 1`, equivalent
+  to Python's `re.match` — start-anchored, prefix-allowed, not full-
+  match.
+- n-th digit run uses `REGEXP_SUBSTR(value, '\\\\d+', 1, n)`;
+  Snowflake's occurrence parameter is 1-indexed.
+"""
+
+from flagsmith_sql_flag_engine.utils import re2_safe, string_literal
+
+# Canonical IDENTITIES schema the translator emits against.
+SCHEMA_DDL = """\
+CREATE TABLE IF NOT EXISTS IDENTITIES (
+    -- environment.key from EnvironmentContext; used as the env partition
+    environment_id STRING NOT NULL,
+
+    -- stable per-identity row id
+    id NUMBER NOT NULL,
+
+    -- the identity's external identifier, exposed as $.identity.identifier
+    identifier STRING NOT NULL,
+
+    -- the composite identity key, exposed as $.identity.key
+    identity_key STRING NOT NULL,
+
+    -- the identity's full trait map: {"plan": "growth", "country": "GB", ...}.
+    -- Trait keys are object keys; Snowflake stores VARIANT as columnar-encoded
+    -- JSON-ish so subkey lookups are vectorised and fast. NULL when the
+    -- identity has no traits.
+    traits VARIANT,
+
+    PRIMARY KEY (environment_id, id)
+)
+CLUSTER BY (environment_id, id);
+"""
+
+
+class SnowflakeDialect:
+    name = "snowflake"
+    schema_ddl = SCHEMA_DDL
+
+    # ----- IDENTITIES schema access -----
+
+    def identifier_expr(self, alias: str) -> str:
+        return f"{alias}.identifier"
+
+    def identity_key_expr(self, alias: str) -> str:
+        return f"{alias}.identity_key"
+
+    def trait_path(self, alias: str, trait_key: str) -> str:
+        # Snowflake VARIANT path syntax: `i.traits:"key"`. The key is
+        # double-quoted and any embedded double quotes are doubled per
+        # the SQL standard.
+        escaped = trait_key.replace('"', '""')
+        return f'{alias}.traits:"{escaped}"'
+
+    def trait_eq(self, alias: str, trait_key: str, value: object, negate: bool) -> str:
+        path = self.trait_path(alias, trait_key)
+        str_path = self.cast_string(path)
+        str_value = str(value)
+        str_lit = string_literal(str_value)
+        # Engine bool cast: `lambda v: v not in ("False", "false")`. We compare
+        # against the variant's `::STRING` form 'true'/'false' rather than
+        # invoke `(...)::BOOLEAN` directly — Snowflake's optimiser eagerly
+        # evaluates the BOOLEAN cast even when the IS_BOOLEAN guard would
+        # have short-circuited, and a non-bool variant blows up the query
+        # with `100037: Boolean value 'red' is not recognized`.
+        bool_str_lit = "'false'" if str_value in ("False", "false") else "'true'"
+        # Engine int/float cast: int(v) / float(v); ValueError → no match.
+        try:
+            int_lit: str | None = str(int(str_value))
+        except (ValueError, TypeError):
+            int_lit = None
+        try:
+            float_lit: str | None = repr(float(str_value))
+        except (ValueError, TypeError):
+            float_lit = None
+
+        if not negate:
+            # Fast string compare always present — handles VARCHAR traits and
+            # canonically-stringified INTEGER traits in one cheap branch.
+            clauses = [f"{str_path} = {str_lit}"]
+            clauses.append(f"(IS_BOOLEAN({path}) AND {str_path} = {bool_str_lit})")
+            if float_lit is not None:
+                # Variant float `1.23` stringifies to `'1.230000000000000e+00'`-ish
+                # in Snowflake — direct string compare misses it, so a typed
+                # branch is needed. TRY_TO_DOUBLE on the string form sidesteps
+                # the same eager-eval trap as the bool branch.
+                clauses.append(
+                    f"((IS_DECIMAL({path}) OR IS_DOUBLE({path}))"
+                    f" AND TRY_TO_DOUBLE({str_path}) = {float_lit})"
+                )
+            return "(" + " OR ".join(clauses) + ")"
+
+        # NOT_EQUAL: per-type dispatch — engine returns True only when the
+        # cast succeeded *and* values differ, which an OR-of-positives
+        # can't express without over-matching.
+        no_match = "FALSE"  # engine returns False on cast failure
+        bool_branch = f"{str_path} <> {bool_str_lit}"
+        int_branch = f"({path})::NUMBER <> {int_lit}" if int_lit is not None else no_match
+        float_branch = f"({path})::FLOAT <> {float_lit}" if float_lit is not None else no_match
+        return (
+            f"((TYPEOF({path}) = 'BOOLEAN' AND {bool_branch})"
+            f" OR (TYPEOF({path}) = 'INTEGER' AND {int_branch})"
+            f" OR (TYPEOF({path}) IN ('DECIMAL', 'DOUBLE') AND {float_branch})"
+            f" OR (TYPEOF({path}) NOT IN ('BOOLEAN', 'INTEGER', 'DECIMAL', 'DOUBLE')"
+            f" AND {str_path} <> {str_lit}))"
+        )
+
+    def trait_in(self, alias: str, trait_key: str, items: list[str]) -> str:
+        # Collapsed to a single `TYPEOF` gate around one string IN compare —
+        # Snowflake stringifies INTEGER variants without decimals, so the same
+        # `(path)::STRING IN (...)` works for both VARCHAR and INTEGER. Bool /
+        # float / array traits never match per engine semantics, so they fall
+        # outside the gate.
+        path = self.trait_path(alias, trait_key)
+        str_path = self.cast_string(path)
+        item_lits = ",".join(string_literal(v) for v in items)
+        return f"(TYPEOF({path}) IN ('VARCHAR', 'INTEGER') AND {str_path} IN ({item_lits}))"
+
+    # ----- string operations -----
+
+    def position(self, needle_lit: str, haystack_expr: str) -> str:
+        return f"POSITION({needle_lit}, {haystack_expr}) > 0"
+
+    def lpad(self, expr: str, width: int, pad_lit: str) -> str:
+        return f"LPAD({expr}, {width}, {pad_lit})"
+
+    def coalesce(self, *exprs: str) -> str:
+        return f"COALESCE({', '.join(exprs)})"
+
+    # ----- regex -----
+
+    def regex_supports(self, pattern: str) -> bool:
+        # Snowflake's regex engine is RE2.
+        return re2_safe(pattern)
+
+    @staticmethod
+    def _regex_literal(pattern: str) -> str:
+        # In a single-quoted Snowflake string constant the backslash is an
+        # escape character, so the SQL parser consumes it before the regex
+        # engine sees the pattern: `'\d'` arrives as `d` and matches the
+        # character `d`, not a digit. To get a regex metachar like `\d`,
+        # `\s` or `\w`, we double the backslash so the literal reads `\\d`.
+        # SQL single quotes are escaped by doubling per the SQL standard.
+        doubled = pattern.replace("\\", "\\\\").replace("'", "''")
+        return f"'{doubled}'"
+
+    def regexp_anchored_match(self, value_expr: str, pattern: str) -> str:
+        # REGEXP_INSTR returns 1-indexed position of first match; = 1 means
+        # the match starts at the beginning. Equivalent to re.match.
+        return f"REGEXP_INSTR({value_expr}, {self._regex_literal(pattern)}) = 1"
+
+    def regexp_nth_digit_run(self, value_expr: str, n: int) -> str:
+        # `\d+` finds runs of digits; 4th arg is 1-indexed occurrence number.
+        digit_run = self._regex_literal("\\d+")
+        return f"REGEXP_SUBSTR({value_expr}, {digit_run}, 1, {n})"
+
+    # ----- hashing -----
+
+    def md5_hex(self, expr: str) -> str:
+        return f"MD5_HEX({expr})"
+
+    def parse_hex_chunk(self, hex_expr: str, start: int, length: int = 8) -> str:
+        format_str = "X" * length
+        return f"TO_NUMBER(SUBSTR({hex_expr}, {start}, {length}), '{format_str}')"
+
+    # ----- casts -----
+
+    def cast_string(self, expr: str) -> str:
+        return f"({expr})::STRING"
+
+    def cast_float(self, expr: str) -> str:
+        # TRY_TO_DOUBLE / TRY_TO_NUMBER instead of TRY_CAST: they accept
+        # VARIANT directly, and a non-numeric variant value yields NULL
+        # instead of erroring out the whole query. Engine behaviour for
+        # type-mismatched comparisons is "doesn't match", which NULL
+        # propagation through the predicate gives us.
+        return f"TRY_TO_DOUBLE(({expr})::STRING)"
+
+    def cast_number(self, expr: str) -> str:
+        return f"TRY_TO_NUMBER(({expr})::STRING)"
+
+    # ----- composition -----
+
+    def mod(self, dividend: str, divisor: str) -> str:
+        return f"MOD({dividend}, {divisor})"
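Note on the two escaping helpers in the file above: the sketch below copies the bodies of `trait_path` and `_regex_literal` into free functions so the escaping can be checked without installing the package. The standalone function names are mine, and the sample inputs are made up; the comments show the generated SQL text, not anything Snowflake returns.

```python
def trait_path(alias: str, trait_key: str) -> str:
    # Same logic as SnowflakeDialect.trait_path: double any embedded
    # double quotes, then wrap the key in double quotes.
    escaped = trait_key.replace('"', '""')
    return f'{alias}.traits:"{escaped}"'


def regex_literal(pattern: str) -> str:
    # Same logic as SnowflakeDialect._regex_literal: double backslashes,
    # double single quotes, then wrap in single quotes.
    doubled = pattern.replace("\\", "\\\\").replace("'", "''")
    return f"'{doubled}'"


print(trait_path("i", "plan"))      # i.traits:"plan"
print(trait_path("i", 'say "hi"'))  # i.traits:"say ""hi"""
print(regex_literal(r"\d+"))        # '\\d+'
print(regex_literal("it's"))        # 'it''s'
```

The third line is the case the `_regex_literal` comment is about: the Python pattern holds one backslash, and the emitted SQL literal holds two.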

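Note on `parse_hex_chunk`: the SQL fragment `TO_NUMBER(SUBSTR(hex, start, length), 'XXXXXXXX')` reads a slice of the hex digest as a base-16 integer. A rough Python-side equivalent, useful for reasoning about expected values in a test, might look like the sketch below. `hex_chunk_to_int` is a hypothetical helper, not part of the commit, and the input string is arbitrary.

```python
import hashlib


def hex_chunk_to_int(hex_digest: str, start: int, length: int = 8) -> int:
    # SQL SUBSTR is 1-indexed, so shift to Python's 0-indexed slicing.
    begin = start - 1
    return int(hex_digest[begin : begin + length], 16)


digest = hashlib.md5(b"example-identity").hexdigest()
first_chunk = hex_chunk_to_int(digest, 1)

# An 8-character hex chunk is always in [0, 16**8), i.e. non-negative
# and far below the 38-digit NUMBER limit mentioned in the docstring.
print(len(digest), 0 <= first_chunk < 16**8)
```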