Commit b0b3ed7
docs: add blob column comment migration guide
Explains how DataJoint 2.0 identifies blob columns via `:<blob>:` comment markers. Without these markers, legacy blobs are treated as raw binary and not deserialized.

Added:
- "Column Comment Format" section explaining the `:type:` format
- Usage examples for check_migration_status(), migrate_columns(), and migrate_blob_columns() from datajoint.migrate
- Expanded "Option B: In-Place Migration" with step-by-step instructions using actual migration functions (backup_schema, migrate_columns, rebuild_lineage, migrate_external, verify_schema_v20)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 88ed2b1 commit b0b3ed7

File tree

1 file changed: +162 −14 lines changed

src/how-to/migrate-to-v20.md

Lines changed: 162 additions & 14 deletions
@@ -118,6 +118,88 @@ These codecs are NEW—there's no legacy equivalent to migrate:
**Learn more:** [Codec API Reference](../reference/specs/codec-api.md) · [Custom Codecs](../explanation/custom-codecs.md)

### Column Comment Format (Critical for Blob Migration)

DataJoint 2.0 stores type information in the SQL column comment using a `:type:` prefix format:

```sql
-- 2.0 column comment format
COMMENT ':<type>:user comment'

-- Examples
COMMENT ':int64:subject identifier'
COMMENT ':<blob>:serialized neural data'
COMMENT ':<blob@store>:large array in object storage'
```

**Why this matters for blob columns:**

In pre-2.0, `longblob` columns automatically deserialized Python objects using DataJoint's binary serialization format. DataJoint 2.0 identifies blob columns by checking for `:<blob>:` in the column comment. **Without this marker, blob columns are treated as raw binary data and will NOT be deserialized.**

| Column Comment | DataJoint 2.0 Behavior |
|----------------|------------------------|
| `:<blob>:neural data` | ✓ Deserializes to Python/NumPy objects |
| `neural data` (no marker) | ✗ Returns raw bytes (no deserialization) |

**Migration requirement:** Existing blob columns need their comments updated to include the `:<blob>:` prefix. This is a metadata-only change—the actual blob data format is unchanged.
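As a quick illustration of the comment grammar, the prefix can be split off with a minimal parser. This is purely illustrative: `parse_column_comment` is a hypothetical helper, not part of the DataJoint API.

```python
import re

def parse_column_comment(comment: str):
    """Split a 2.0-style comment into (type_marker, user_comment).

    Hypothetical helper: returns (None, comment) when no ':type:'
    prefix is present, matching the raw-bytes fallback described above.
    """
    match = re.match(r"^:([^:]+):(.*)$", comment, flags=re.DOTALL)
    if match:
        return match.group(1), match.group(2)
    return None, comment

print(parse_column_comment(":<blob>:neural data"))  # ('<blob>', 'neural data')
print(parse_column_comment("neural data"))          # (None, 'neural data')
```

Note that the absence of a marker is not an error: the comment is simply treated as plain user text, which is exactly why unmigrated blob columns fall back to raw bytes.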
#### Checking Migration Status

```python
from datajoint.migrate import check_migration_status

status = check_migration_status(schema)
print(f"Blob columns: {status['total_blob_columns']}")
print(f"  Migrated: {status['migrated']}")
print(f"  Pending: {status['pending']}")
```

#### Migrating Blob Column Comments

Use `migrate_columns()` to add type markers to all columns (integers, floats, and blobs):

```python
from datajoint.migrate import migrate_columns

# Preview changes (dry run)
result = migrate_columns(schema, dry_run=True)
print(f"Would migrate {len(result['sql_statements'])} columns")
for sql in result['sql_statements']:
    print(f"  {sql}")

# Apply changes
result = migrate_columns(schema, dry_run=False)
print(f"Migrated {result['columns_migrated']} columns")
```

Or use `migrate_blob_columns()` to migrate only blob columns:

```python
from datajoint.migrate import migrate_blob_columns

# Preview
result = migrate_blob_columns(schema, dry_run=True)
print(f"Would migrate {result['needs_migration']} blob columns")

# Apply
result = migrate_blob_columns(schema, dry_run=False)
print(f"Migrated {result['migrated']} blob columns")
```

**What the migration does:**

```sql
-- Before migration
ALTER TABLE `schema`.`table`
  MODIFY COLUMN `data` longblob COMMENT 'neural recording';

-- After migration
ALTER TABLE `schema`.`table`
  MODIFY COLUMN `data` longblob COMMENT ':<blob>:neural recording';
```

The data itself is unchanged—only the comment metadata is updated.
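To make the before/after concrete, the comment rewrite can be sketched as a tiny SQL generator. This is an assumption for illustration only; `blob_comment_migration_sql` is not a real DataJoint function, and the actual migration builds its statements internally.

```python
def blob_comment_migration_sql(database: str, table: str, column: str,
                               user_comment: str) -> str:
    """Build an ALTER statement like the one above (hypothetical helper)."""
    new_comment = user_comment
    # Prepend the marker only if missing, so the rewrite is idempotent.
    if not new_comment.startswith(":<blob>:"):
        new_comment = ":<blob>:" + new_comment
    return (
        f"ALTER TABLE `{database}`.`{table}` "
        f"MODIFY COLUMN `{column}` longblob COMMENT '{new_comment}';"
    )

print(blob_comment_migration_sql("schema", "table", "data", "neural recording"))
```

Because only the comment clause changes, re-running the statement on an already-migrated column is harmless.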
### Unified Stores Configuration

DataJoint 2.0 replaces `external.*` with unified `stores.*` configuration:
@@ -2149,24 +2231,90 @@ DROP DATABASE `my_pipeline_old`;
**Warning:** Modifies production schema directly. Test thoroughly first!

#### Step 1: Backup Production

```python
from datajoint.migrate import backup_schema

result = backup_schema('my_pipeline', 'my_pipeline_backup_20260114')
print(f"Backed up {result['tables_backed_up']} tables, {result['rows_backed_up']} rows")
```

#### Step 2: Add Type Markers to Column Comments

This is the critical step for blob deserialization. Without `:<blob>:` markers, blob columns return raw bytes instead of deserialized Python objects.

```python
from datajoint.migrate import migrate_columns, check_migration_status
import datajoint as dj

schema = dj.Schema('my_pipeline')

# Check current status
status = check_migration_status(schema)
print(f"Blob columns needing migration: {status['pending']}")

# Preview changes
result = migrate_columns(schema, dry_run=True)
print(f"Would update {len(result['sql_statements'])} columns:")
for sql in result['sql_statements'][:5]:  # Show first 5
    print(f"  {sql}")

# Apply changes (updates column comments only, no data changes)
result = migrate_columns(schema, dry_run=False)
print(f"Migrated {result['columns_migrated']} columns")
```

**What this does:** Adds a `:<type>:` prefix to column comments:

- `longblob` → `COMMENT ':<blob>:...'`
- `int unsigned` → `COMMENT ':uint32:...'`
- etc.
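The marker assignment can be sketched as a simple lookup. Only the `longblob` and `int unsigned` entries below come from this guide; the dict and the `add_type_marker` name are illustrative assumptions, not the actual table used by `migrate_columns()`.

```python
# Marker examples taken from the mapping above; the dict is illustrative,
# not the full mapping used by migrate_columns().
TYPE_MARKERS = {
    "longblob": ":<blob>:",
    "int unsigned": ":uint32:",
}

def add_type_marker(sql_type: str, comment: str) -> str:
    """Prepend the marker for a known type; no-op for migrated comments."""
    marker = TYPE_MARKERS.get(sql_type)
    if marker is None or comment.startswith(marker):
        return comment  # unknown type, or already migrated
    return marker + comment

print(add_type_marker("longblob", "neural recording"))  # :<blob>:neural recording
```

The early return is what makes re-running the migration safe: already-marked comments pass through unchanged.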
#### Step 3: Rebuild Lineage Table
```python
from datajoint.migrate import rebuild_lineage

result = rebuild_lineage(schema, dry_run=False)
print(f"Rebuilt lineage: {result['lineage_entries']} entries")
```

#### Step 4: Migrate External Storage (if applicable)

If you use `blob@store`, `attach@store`, or `filepath@store`:

```python
from datajoint.migrate import migrate_external, migrate_filepath

# Preview external blob/attach migration
result = migrate_external(schema, dry_run=True)
print(f"Found {result['columns_found']} external columns")

# Apply migration (adds _v2 columns with JSON metadata)
result = migrate_external(schema, dry_run=False)
print(f"Migrated {result['rows_migrated']} rows")

# Similarly for filepath columns
result = migrate_filepath(schema, dry_run=False)

# After verification, finalize (rename columns)
result = migrate_external(schema, finalize=True)
result = migrate_filepath(schema, finalize=True)
```

#### Step 5: Verify Migration

```python
from datajoint.migrate import verify_schema_v20

result = verify_schema_v20('my_pipeline')
if result['compatible']:
    print("✓ Schema fully migrated to 2.0")
else:
    print("Issues found:")
    for issue in result['issues']:
        print(f"  - {issue}")
```
### Option C: Gradual Migration with Legacy Compatibility
