
Commit 025c196

Merge pull request #678 from datajoint/dev
Add migration utility, docs, and travis updates
2 parents 15b9b61 + 90ed2b0

21 files changed: +593 -198 lines

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,9 @@
 * Accept alias for supported MySQL datatypes (#544) PR #545
 * Support for pandas in `fetch` (#459, #537) PR #534
 * Support for ordering by "KEY" in `fetch` (#541) PR #534
+* Add config to enable python native blobs PR #672, #676
+* Add secure option for external storage (#663) PR #674, #676
+* Add blob migration utility from DJ011 to DJ012 PR #673
 * Improved external storage - a migration script needed from version 0.11 (#467, #475, #480, #497) PR #532
 * Increase default display rows (#523) PR #526
 * Bugfixes (#521, #205, #279, #477, #570, #581, #597, #596, #618, #633, #643, #644, #647, #648, #650, #656)

LNX-docker-compose.yml

Lines changed: 22 additions & 1 deletion
@@ -18,7 +18,7 @@ services:
       - DJ_TEST_HOST=db
       - DJ_TEST_USER=datajoint
       - DJ_TEST_PASSWORD=datajoint
-      - S3_ENDPOINT=minio:9000
+      - S3_ENDPOINT=fakeminio.datajoint.io:9000
       - S3_ACCESS_KEY=datajoint
       - S3_SECRET_KEY=datajoint
       - S3_BUCKET=datajoint-test
@@ -70,5 +70,26 @@ services:
       timeout: 5s
       retries: 60
       interval: 1s
+  fakeminio.datajoint.io:
+    <<: *net
+    image: nginx:alpine
+    environment:
+      - URL=datajoint.io
+      - SUBDOMAINS=fakeminio
+      - MINIO_SERVER=http://minio:9000
+    entrypoint: /entrypoint.sh
+    healthcheck:
+      test: wget --quiet --tries=1 --spider https://fakeminio.datajoint.io:443/minio/health/live || exit 1
+      timeout: 5s
+      retries: 300
+      interval: 1s
+#    ports:
+#      - "9000:9000"
+#      - "443:443"
+    volumes:
+      - ./tests/nginx/base.conf:/base.conf
+      - ./tests/nginx/entrypoint.sh:/entrypoint.sh
+      - ./tests/nginx/fullchain.pem:/certs/fullchain.pem
+      - ./tests/nginx/privkey.pem:/certs/privkey.pem
 networks:
   main:

README.md

Lines changed: 59 additions & 0 deletions
@@ -22,6 +22,65 @@ If you already have an older version of DataJoint installed using `pip`, upgrade
 ```bash
 pip3 install --upgrade datajoint
 ```
+## Python Native Blobs
+
+For the v0.12 release, the variable `enable_python_native_blobs` can be
+safely enabled for improved blob support of python datatypes if the following
+are true:
+
+* This is a new DataJoint installation / pipeline(s)
+* You have not used DataJoint prior to v0.12 with your pipeline(s)
+* You do not share blob data between Python and Matlab
+
+Otherwise, please read the following carefully:
+
+DataJoint v0.12 expands DataJoint's blob serialization mechanism with
+improved support for complex native python datatypes, such as dictionaries
+and lists of strings.
+
+Prior to DataJoint v0.12, certain python native datatypes such as
+dictionaries were 'squashed' into numpy structured arrays when saved into
+blob attributes. This facilitated easier data sharing between Matlab
+and Python for certain record types. However, this created a discrepancy
+between insert and fetch datatypes which could cause problems in other
+portions of users' pipelines.
+
+For v0.12, it was decided to remove the type squashing behavior, instead
+creating a separate storage encoding which improves support for storing
+native python datatypes in blobs without squashing them into numpy
+structured arrays. However, this change creates a compatibility problem
+for pipelines which previously relied on the type squashing behavior,
+since records saved via the old squashing format will continue to fetch
+as structured arrays, whereas new records inserted in DataJoint 0.12 with
+`enable_python_native_blobs` will be returned as the appropriate native
+python type (dict, etc). Read support for python native blobs is also
+not yet implemented in DataJoint for Matlab.
+
+To prevent data from being stored in mixed format within a table across
+upgrades from previous versions of DataJoint, the
+`enable_python_native_blobs` flag was added as a temporary guard measure
+for the 0.12 release. This flag will trigger an exception if any of the
+ambiguous cases are encountered during inserts, in order to allow testing
+and migration of pre-0.12 pipelines to 0.12 in a safe manner.
+
+The exact process to update a specific pipeline will vary depending on
+the situation, but generally the following strategies may apply:
+
+* Altering code to directly store numpy structured arrays or plain
+  multidimensional arrays. This strategy is likely the best one for those
+  tables requiring compatibility with Matlab.
+* Adjusting code to deal with both structured array and native fetched data.
+  In this case, insert logic is not adjusted, but downstream consumers
+  are adjusted to handle records saved under the old and new schemes.
+* Manually converting data using fetch/insert into a fresh schema.
+  In this approach, DataJoint's create_virtual_module functionality would
+  be used in conjunction with a fetch/convert/insert loop to update
+  the data to the new native_blob functionality.
+* Dropping/recomputing imported/computed tables to ensure they are in the
+  new format.
+
+As always, be sure that your data is safely backed up before modifying any
+important DataJoint schema or records.
 
 ## Documentation and Tutorials
 A number of labs are currently adopting DataJoint and we are quickly getting the documentation in shape in February 2017.
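The insert/fetch discrepancy described above can be illustrated without a database. The following is a minimal sketch, not DataJoint's actual serialization code; the record fields and dtypes are made up for illustration:

```python
import numpy as np

# Hypothetical record a user might insert into a blob attribute.
record = {"subject_id": 1, "weight": 22.5}

# Pre-0.12 behavior (sketch): the dict was 'squashed' into a numpy
# structured array, so it fetched back as an array, not a dict.
squashed = np.array(
    [tuple(record.values())],
    dtype=[("subject_id", "int64"), ("weight", "float64")])

# 0.12 with enable_python_native_blobs: the dict round-trips as a dict.
native = dict(record)

assert not isinstance(squashed, dict)   # downstream code sees an array...
assert squashed["weight"][0] == 22.5    # ...indexed by field name, then row
assert native["weight"] == 22.5         # plain dict key access
```

Downstream code written against one of these shapes breaks on the other, which is why the guard flag and migration strategies above exist.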

datajoint/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -41,5 +41,6 @@
 from .attribute_adapter import AttributeAdapter
 from . import errors
 from .errors import DataJointError
+from .migrate import migrate_dj011_external_blob_storage_to_dj012
 
 ERD = Di = Diagram  # Aliases for Diagram

datajoint/external.py

Lines changed: 1 addition & 1 deletion
@@ -316,7 +316,7 @@ def delete(self, *, delete_external_files=None, limit=None, display_progress=True
             raise DataJointError("The delete_external_files argument must be set to either True or False in delete()")
 
         if not delete_external_files:
-            self.unused.delete_quick()
+            self.unused().delete_quick()
         else:
             items = self.unused().fetch_external_paths(limit=limit)
             if display_progress:
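The one-character fix above (`self.unused` → `self.unused()`) is the classic missing-call bug: referencing a method yields the bound method object, not its result. A generic sketch with a hypothetical `Demo` class (not a DataJoint class):

```python
# Sketch of the bug class fixed above: `d.unused` is the bound method
# itself, so chaining off it operates on the wrong object entirely.
class Demo:
    def unused(self):
        return [1, 2, 3]

d = Demo()
assert callable(d.unused)        # a bound method object, not the list
assert d.unused() == [1, 2, 3]   # calling it produces the actual value
```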

datajoint/fetch.py

Lines changed: 4 additions & 4 deletions
@@ -50,7 +50,7 @@ def _get(connection, attr, data, squeeze, download_path):
     adapt = attr.adapter.get if attr.adapter else lambda x: x
 
     if attr.is_filepath:
-        return adapt(extern.download_filepath(uuid.UUID(bytes=data))[0])
+        return str(adapt(extern.download_filepath(uuid.UUID(bytes=data))[0]))
 
     if attr.is_attachment:
         # Steps:
@@ -65,22 +65,22 @@ def _get(connection, attr, data, squeeze, download_path):
         if local_filepath.is_file():
             attachment_checksum = _uuid if attr.is_external else hash.uuid_from_buffer(data)
             if attachment_checksum == hash.uuid_from_file(local_filepath, init_string=attachment_name + '\0'):
-                return adapt(local_filepath)  # checksum passed, no need to download again
+                return str(adapt(local_filepath))  # checksum passed, no need to download again
         # generate the next available alias filename
         for n in itertools.count():
             f = local_filepath.parent / (local_filepath.stem + '_%04x' % n + local_filepath.suffix)
             if not f.is_file():
                 local_filepath = f
                 break
             if attachment_checksum == hash.uuid_from_file(f, init_string=attachment_name + '\0'):
-                return adapt(f)  # checksum passed, no need to download again
+                return str(adapt(f))  # checksum passed, no need to download again
         # Save attachment
         if attr.is_external:
             extern.download_attachment(_uuid, attachment_name, local_filepath)
         else:
             # write from buffer
             safe_write(local_filepath, data.split(b"\0", 1)[1])
-        return adapt(local_filepath)  # download file from remote store
+        return str(adapt(local_filepath))  # download file from remote store
 
     return adapt(uuid.UUID(bytes=data) if attr.uuid else (
         blob.unpack(extern.get(uuid.UUID(bytes=data)) if attr.is_external else data, squeeze=squeeze)
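The change above wraps every returned file path in `str()`, presumably so `fetch` hands back plain strings rather than `pathlib.Path` objects. A platform-independent sketch of the difference (the path value is made up):

```python
from pathlib import PurePosixPath

# A path object, as adapt() might return after a download.
local_filepath = PurePosixPath("/tmp/downloads") / "scan.npy"

assert not isinstance(local_filepath, str)   # a path object, not a string
assert isinstance(str(local_filepath), str)  # str() normalizes it for callers
assert str(local_filepath) == "/tmp/downloads/scan.npy"
```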

datajoint/migrate.py

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
+import datajoint as dj
+from pathlib import Path
+import re
+from .utils import user_choice
+
+
+def migrate_dj011_external_blob_storage_to_dj012(migration_schema, store):
+    """
+    Utility function to migrate external blob data from 0.11 to 0.12.
+    :param migration_schema: string of target schema to be migrated
+    :param store: string of target dj.config['stores'] entry to be migrated
+    """
+    if not isinstance(migration_schema, str):
+        raise ValueError(
+            'Expected type {} for migration_schema, not {}.'.format(
+                str, type(migration_schema)))
+
+    do_migration = False
+    do_migration = user_choice(
+        """
+        Warning: Ensure the following are completed before proceeding.
+        - Appropriate backups have been taken,
+        - Any existing DJ 0.11.X connections are suspended, and
+        - External config has been updated to new dj.config['stores'] structure.
+        Proceed?
+        """, default='no') == 'yes'
+    if do_migration:
+        _migrate_dj011_blob(dj.schema(migration_schema), store)
+        print('Migration completed for schema: {}, store: {}.'.format(
+            migration_schema, store))
+        return
+    print('No migration performed.')
+
+
+def _migrate_dj011_blob(schema, default_store):
+    query = schema.connection.query
+
+    LEGACY_HASH_SIZE = 43
+
+    legacy_external = dj.FreeTable(
+        schema.connection,
+        '`{db}`.`~external`'.format(db=schema.database))
+
+    # get referencing tables
+    refs = query("""
+        SELECT concat('`', table_schema, '`.`', table_name, '`')
+            as referencing_table, column_name, constraint_name
+        FROM information_schema.key_column_usage
+        WHERE referenced_table_name="{tab}" and referenced_table_schema="{db}"
+        """.format(
+            tab=legacy_external.table_name,
+            db=legacy_external.database), as_dict=True).fetchall()
+
+    for ref in refs:
+        # get comment
+        column = query(
+            'SHOW FULL COLUMNS FROM {referencing_table} '
+            'WHERE Field="{column_name}"'.format(
+                **ref), as_dict=True).fetchone()
+
+        store, comment = re.match(
+            r':external(-(?P<store>.+))?:(?P<comment>.*)',
+            column['Comment']).group('store', 'comment')
+
+        # get all the hashes from the reference
+        hashes = {x[0] for x in query(
+            'SELECT `{column_name}` FROM {referencing_table}'.format(
+                **ref))}
+
+        # sanity check: make sure that store suffixes match
+        if store is None:
+            assert all(len(_) == LEGACY_HASH_SIZE for _ in hashes)
+        else:
+            assert all(_[LEGACY_HASH_SIZE:] == store for _ in hashes)
+
+        # create new-style external table
+        ext = schema.external[store or default_store]
+
+        # add the new-style reference field
+        temp_suffix = 'tempsub'
+
+        try:
+            query("""ALTER TABLE {referencing_table}
+                ADD COLUMN `{column_name}_{temp_suffix}` {type} DEFAULT NULL
+                COMMENT ":blob@{store}:{comment}"
+                """.format(
+                    type=dj.declare.UUID_DATA_TYPE,
+                    temp_suffix=temp_suffix,
+                    store=(store or default_store), comment=comment, **ref))
+        except Exception:
+            print('Column already added')
+
+        # Copy references into the new external table
+        # No Windows! Backslashes will cause problems
+
+        contents_hash_function = {
+            'file': lambda ext, relative_path: dj.hash.uuid_from_file(
+                str(Path(ext.spec['location'], relative_path))),
+            's3': lambda ext, relative_path: dj.hash.uuid_from_buffer(
+                ext.s3.get(relative_path))
+        }
+
+        for _hash, size in zip(*legacy_external.fetch('hash', 'size')):
+            if _hash in hashes:
+                relative_path = str(Path(schema.database, _hash).as_posix())
+                uuid = dj.hash.uuid_from_buffer(init_string=relative_path)
+                external_path = ext._make_external_filepath(relative_path)
+                if ext.spec['protocol'] == 's3':
+                    contents_hash = dj.hash.uuid_from_buffer(
+                        ext._download_buffer(external_path))
+                else:
+                    contents_hash = dj.hash.uuid_from_file(external_path)
+                ext.insert1(dict(
+                    filepath=relative_path,
+                    size=size,
+                    contents_hash=contents_hash,
+                    hash=uuid
+                ), skip_duplicates=True)
+
+                query(
+                    'UPDATE {referencing_table} '
+                    'SET `{column_name}_{temp_suffix}`=%s '
+                    'WHERE `{column_name}` = "{_hash}"'
+                    .format(
+                        _hash=_hash,
+                        temp_suffix=temp_suffix, **ref), uuid.bytes)
+
+        # check that all have been copied
+        check = query(
+            'SELECT * FROM {referencing_table} '
+            'WHERE `{column_name}` IS NOT NULL'
+            ' AND `{column_name}_{temp_suffix}` IS NULL'
+            .format(temp_suffix=temp_suffix, **ref)).fetchall()
+
+        assert len(check) == 0, "Some hashes haven't been migrated"
+
+        # drop old foreign key, rename, and create new foreign key
+        query("""
+            ALTER TABLE {referencing_table}
+            DROP FOREIGN KEY `{constraint_name}`,
+            DROP COLUMN `{column_name}`,
+            CHANGE COLUMN `{column_name}_{temp_suffix}` `{column_name}`
+                {type} DEFAULT NULL
+                COMMENT ":blob@{store}:{comment}",
+            ADD FOREIGN KEY (`{column_name}`) REFERENCES {ext_table_name}
+                (`hash`)
+            """.format(
+                temp_suffix=temp_suffix,
+                ext_table_name=ext.full_table_name,
+                type=dj.declare.UUID_DATA_TYPE,
+                store=(store or default_store), comment=comment, **ref))
+
+    # Drop the old external table but make sure it's no longer referenced
+    # get referencing tables
+    refs = query("""
+        SELECT concat('`', table_schema, '`.`', table_name, '`') as
+            referencing_table, column_name, constraint_name
+        FROM information_schema.key_column_usage
+        WHERE referenced_table_name="{tab}" and referenced_table_schema="{db}"
+        """.format(
+            tab=legacy_external.table_name,
+            db=legacy_external.database), as_dict=True).fetchall()
+
+    assert not refs, 'Some references still exist'
+
+    # drop old external table
+    legacy_external.drop_quick()
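The store/comment parsing in `_migrate_dj011_blob` can be exercised in isolation. The regex below is the one from the code above; the sample comment strings are made up:

```python
import re

# Legacy blob columns carried comments like ":external:note" or
# ":external-storename:note"; the migration recovers the store name
# (None means the default store) and the human-readable comment.
pattern = r':external(-(?P<store>.+))?:(?P<comment>.*)'

store, comment = re.match(
    pattern, ':external-raw:my raw data').group('store', 'comment')
assert (store, comment) == ('raw', 'my raw data')

# no store suffix: store is None, so default_store is used downstream
store, comment = re.match(
    pattern, ':external:plain blob').group('store', 'comment')
assert (store, comment) == (None, 'plain blob')
```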

datajoint/version.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-__version__ = "0.12.dev8"
+__version__ = "0.12.dev9"
 
 assert len(__version__) <= 10  # The log table limits version to the 10 characters
Lines changed: 14 additions & 17 deletions
@@ -1,20 +1,17 @@
 .. code-block:: python
 
-    # default external storage
-    dj.config['external'] = dict(
-        protocol='s3',
-        endpoint='https://s3.amazonaws.com',
-        bucket = 'testbucket',
-        location = '/datajoint-projects/myschema',
-        access_key='1234567',
-        secret_key='foaf1234')
+    dj.config['stores'] = {
+        'external': dict(  # 'regular' external storage for this pipeline
+            protocol='s3',
+            endpoint='https://s3.amazonaws.com',
+            bucket='testbucket',
+            location='/datajoint-projects/myschema',
+            access_key='1234567',
+            secret_key='foaf1234'),
+        'external-raw': dict(  # 'raw' storage for this pipeline
+            protocol='file',
+            location='/net/djblobs/myschema')
+    }
+    # external object cache - see fetch operation below for details.
+    dj.config['cache'] = '/net/djcache'
 
-    # raw data storage
-    dj.config['external-raw'] = dict(
-        protocol='file',
-        location='/net/djblobs/myschema')
-
-    # external object cache - see fetch operation below for details.
-    dj.config['cache'] = dict(
-        protocol='file',
-        location='/net/djcache')
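The restructuring in this hunk moves every store under a single `'stores'` key and turns `'cache'` into a plain path. A sketch contrasting the two layouts with plain dicts (values are the doc's own placeholders; no DataJoint connection needed):

```python
# Old (pre-0.12): each store was its own top-level config key.
old_style = {
    'external': dict(protocol='s3', location='/datajoint-projects/myschema'),
    'external-raw': dict(protocol='file', location='/net/djblobs/myschema'),
    'cache': dict(protocol='file', location='/net/djcache'),
}

# New (0.12): all stores nest under 'stores'; 'cache' is just a path.
new_style = {
    'stores': {
        'external': dict(protocol='s3',
                         location='/datajoint-projects/myschema'),
        'external-raw': dict(protocol='file',
                             location='/net/djblobs/myschema'),
    },
    'cache': '/net/djcache',
}

# store names no longer share a namespace with other config settings
assert set(new_style['stores']) == {'external', 'external-raw'}
assert isinstance(new_style['cache'], str)
```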
