CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes #4813
Jens-G wants to merge 1 commit into
…ng backslashes

format_value_text in formatting.py doubles backslashes for terminal display (so SELECT output renders them visibly). When used via ExportProcess.format_value for COPY TO, this pre-escaping is applied before csv.writer runs its own backslash escaping (escapechar='\\'), resulting in quadrupled backslashes in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra. This is data corruption that compounds on every round-trip.

The fix adds an escape_backslash parameter (default True, preserving existing terminal display behaviour) and passes escape_backslash=False from the CSV export path in ExportProcess.format_value. The parameter is propagated through format_simple_collection, format_value_list/set/tuple/map, and format_value_utype so that collection types (list<text>, set<text>, map<text,text>, UDTs) are covered as well.

Generated-by: Claude Sonnet 4.6 (Anthropic) with human review and direction

@Jens-G It seems the exporter sends values through the display formatter, which doubles backslashes for human-readable SELECT output, before handing them to the CSV writer. Claude suggests simply not using the display formatter for CSV export of text in copyutil.py. What do you think of that approach?

TBH if it works I'm fine with either approach 👍 🚀

formatted = formatter(val, cqltype=cqltype,
                      encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format,
                      float_precision=cqltype.precision, nullval=self.nullval, quote=False,
                      escape_backslash=False,
Consider an alternative approach where format_value() bypasses display formatting for text.

format_value():
    ...
    if cqltype.type_name in ('text', 'varchar', 'ascii'):
        return val if val.isprintable() else None
                      escape_backslash=False,
                      decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep,
                      boolean_styles=self.boolean_styles)
return formatted

Hi @Jens-G, if you are busy I can add the unit tests for this and you can review them. Just let me know if that works.

+1 LGTM

@Jens-G the PR needs additional work. Based on our discussion of RFC 4180, proper text values don't require formatting. Handling that in format_value() may eliminate the need for a new escape_backslash argument.

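For comparison, the alternative raised in review (skip display formatting for text-like columns during export) might look roughly like the sketch below. `format_for_csv`, `TEXT_TYPES`, and the `display_formatter` callback are hypothetical names for illustration, not the actual copyutil.py code, and the `isprintable()` check from the review suggestion is left out to keep the sketch minimal.

```python
# Type names taken from the review suggestion above.
TEXT_TYPES = ('text', 'varchar', 'ascii')


def format_for_csv(val, type_name, display_formatter):
    """Hypothetical export helper.

    Text-like values are returned untouched so that csv.writer is the only
    layer that escapes backslashes; all other types still go through the
    supplied display formatter.
    """
    if type_name in TEXT_TYPES and isinstance(val, str):
        return val
    return display_formatter(val)


if __name__ == '__main__':
    # str() stands in for the real formatter in this sketch.
    print(repr(format_for_csv('V\\S', 'text', str)))
    print(repr(format_for_csv(42, 'int', str)))
```

One trade-off of this shape is that collections containing text would still reach the display formatter unless they get similar handling, which is the case the escape_backslash parameter covers.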
Summary

COPY TO followed by COPY FROM corrupts text column values that contain backslashes: each round-trip doubles the backslash count. Reported in CASSANDRA-21131.

Before (one round-trip):
- V\S → exported CSV: V\\\\S → re-imported: V\\S ❌
- \"Marianne"\ → re-imported: \\"Marianne"\\ ❌

list<text>, set<text>, map<text,text>, tuples and UDTs with text fields are affected in the same way.

Root Cause

format_value_text in formatting.py doubles backslashes unconditionally. This is intentional for terminal display (SELECT output shows V\\S so the backslash is visible). However, ExportProcess.format_value in copyutil.py calls the same function when writing CSV. The csv.writer (configured with escapechar='\\') then escapes backslashes a second time, quadrupling them in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra.

Fix

Add an escape_backslash parameter (default True, preserving existing terminal display behaviour) to format_value_text, format_simple_collection, and all collection formatters. Pass escape_backslash=False from ExportProcess.format_value so the csv.writer handles all backslash escaping exclusively.

Changed functions:
- format_value_text: new parameter
- format_simple_collection: new parameter, propagated to element format_value calls
- format_value_list, format_value_set, format_value_tuple: new parameter, forwarded to format_simple_collection
- format_value_map: new parameter, propagated through subformat
- format_value_utype: new parameter, propagated through format_field_value
- ExportProcess.format_value in copyutil.py: passes escape_backslash=False

Test Plan

Two standalone Python test scripts (no running Cassandra cluster required) are attached to the JIRA ticket and verify the bug and fix:
- test_cassandra_21131.py: 10 test cases for plain text columns (5/10 pass before the fix → 10/10 after)
- test_cassandra_21131_collections.py: 12 test cases for list/set/map<text> (3/12 before → 12/12 after)

Integration testing against a live cluster with the exact scenario from the bug report (COPY TO → TRUNCATE → COPY FROM → SELECT) is needed before merge.

Notes
- A related issue (UNICODE_CONTROLCHARS_RE converting control chars like \n to repr-notation \\n during CSV export) was discovered and will be tracked in a separate ticket.
- A Generated-by: commit token is included per ASF generative tooling policy. The fix was developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under human review and direction. All code has been verified manually.

🤖 Generated with Claude Code
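
The way the escape_backslash flag is threaded from the collection formatters down to the text formatter, as described in the Fix section, can be illustrated with a much-simplified sketch. The functions `format_text`, `format_list`, and `format_map` below are hypothetical stand-ins with assumed signatures and output shapes; the real formatters in formatting.py take many more arguments.

```python
def format_text(val, escape_backslash=True):
    """Hypothetical text formatter: doubles backslashes only when asked to."""
    return val.replace('\\', '\\\\') if escape_backslash else val


def format_list(vals, escape_backslash=True):
    """Hypothetical list formatter that forwards the flag to every element."""
    inner = ', '.join(
        "'%s'" % format_text(v, escape_backslash=escape_backslash) for v in vals
    )
    return '[' + inner + ']'


def format_map(mapping, escape_backslash=True):
    """Hypothetical map formatter: the flag applies to keys and values alike."""
    inner = ', '.join(
        "'%s': '%s'" % (
            format_text(k, escape_backslash=escape_backslash),
            format_text(v, escape_backslash=escape_backslash),
        )
        for k, v in mapping.items()
    )
    return '{' + inner + '}'


if __name__ == '__main__':
    # Terminal display path: default keeps the doubling.
    print(format_list(['a\\b']))
    # CSV export path: the caller opts out so csv.writer escapes exactly once.
    print(format_list(['a\\b'], escape_backslash=False))
```

Keeping the default at True means existing display callers are unaffected; only the export path has to opt out.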