fix: Normalize Cassandra column name casing for SortedFeatureView sch…#360
Conversation
…ema operations - Cassandra/Scylla case-fold unquoted identifiers to lowercase at storage time. The _alter_table method and Go OnlineReadRange path compared feature names case-sensitively against Cassandra metadata, causing "Column already exists" errors on materialization and "Undefined column"errors on reads for any SortedFeatureView with mixed-case feature names. - Add _canonical_column_name (Python) and canonicalColumnName (Go) helpers - Patch _alter_table to use case-insensitive column diff - Patch buildRangeQueryCQL, rangeFilterToCQL, and MapScan loop to use unquoted lowercase identifiers in CQL while preserving original case in API response payloads - Add case-collision validator to SortedFeatureView.ensure_valid() - No DDL changes to existing tables; no modifications to the tall-schema (regular FeatureView) read/write paths
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ureView Replace the O(n²) generator scan with an auxiliary canonical_to_original dict for constant-time collision detection. Also improve the error message to be more directive. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rning in FV layer Backend-specific constraints belong in the backend. SortedFeatureView now emits a logger.warning for case-colliding feature names (they work fine on Valkey/DynamoDB). The Cassandra plugin rejects them with a hard CassandraInvalidConfig error in _create_table and _alter_table, where the constraint actually applies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| // Use unquoted, lowercased identifiers to match the on-disk form written | ||
| // by the Python materializer. Quoting + mixed case would make Cassandra | ||
| // do a case-sensitive lookup that misses the (always-lowercase) stored | ||
| // column. | ||
| columnRefs := make([]string, len(featureNames)) | ||
| for i, name := range featureNames { | ||
| quotedFeatures[i] = fmt.Sprintf(`"%s"`, name) | ||
| columnRefs[i] = canonicalColumnName(name) | ||
| } |
There was a problem hiding this comment.
This makes sense for new tables created with unquoted identifiers, but I’m a little worried about backward compatibility. If any existing deployments have quoted mixed-case column names, forcing lowercase unquoted refs here could miss those columns when doing reads right?
There was a problem hiding this comment.
The previous write function has always written unquoted DDL (see _build_sorted_table_cql and the old _alter_table), so the write path in Cassandra case-folds everything to lowercase on storage. Every existing table should already have lowercase columns, the only path where this is not true is if someone directly calls the Cassandra client and writes to it.
| `SELECT entity_key, event_ts, %s FROM %s WHERE %s%s%s%s`, | ||
| strings.Join(columnRefs, ", "), |
There was a problem hiding this comment.
Same compatibility question here on the SELECT projection: we now always emit lowercase unquoted column names. Can we confirm this is safe for all previously created Cassandra/Scylla tables, including any that may have been created with quoted identifiers?
There was a problem hiding this comment.
Same as above comment, the write path always stored it as lower case column names
| assert.NotContains(t, cql, "ORDER BY", "ORDER BY should be omitted when all SortKeyFilters have Order == nil") | ||
| } | ||
|
|
||
| func TestBuildRangeQueryCQL_UsesLowercaseUnquotedIdentifiers(t *testing.T) { |
There was a problem hiding this comment.
Could we also add coverage for the OnlineReadRange read path (mixed-case feature refs vs lowercase row keys) at cassandraonlinestore.go:954, since that is the actual regression fix?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Cassandra and Scylla case-fold unquoted column identifiers to lowercase at storage time. A feature named pct_filter_bookings_adult_countXdestination_geo_id_2w is stored as pct_filter_bookings_adult_countxdestination_geo_id_2w on disk.
Python materialization path: _alter_table compared desired_cols (original case from FV definition) against existing_cols (lowercase from Cassandra metadata) using a case-sensitive set difference. On every materialization after the initial table creation, it concluded the column was "new" and attempted
ALTER TABLE ADD, which Cassandra rejected because the lowercase column already existed.
Go read path: buildRangeQueryCQL and rangeFilterToCQL wrapped column identifiers in double quotes ("featureX"), making Cassandra perform a case-sensitive lookup against the always-lowercase stored column — resulting in "Undefined column name" errors.
Checks
git commit -s)Testing Strategy
Misc