
Commit 025c196

Merge pull request #678 from datajoint/dev
Add migration utility, docs, and travis updates
2 parents 15b9b61 + 90ed2b0

21 files changed: +593 -198 lines

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,9 @@
 * Accept alias for supported MySQL datatypes (#544) PR #545
 * Support for pandas in `fetch` (#459, #537) PR #534
 * Support for ordering by "KEY" in `fetch` (#541) PR #534
+* Add config to enable python native blobs PR #672, #676
+* Add secure option for external storage (#663) PR #674, #676
+* Add blob migration utility from DJ011 to DJ012 PR #673
 * Improved external storage - a migration script needed from version 0.11 (#467, #475, #480, #497) PR #532
 * Increase default display rows (#523) PR #526
 * Bugfixes (#521, #205, #279, #477, #570, #581, #597, #596, #618, #633, #643, #644, #647, #648, #650, #656)

LNX-docker-compose.yml

Lines changed: 22 additions & 1 deletion
@@ -18,7 +18,7 @@ services:
       - DJ_TEST_HOST=db
       - DJ_TEST_USER=datajoint
       - DJ_TEST_PASSWORD=datajoint
-      - S3_ENDPOINT=minio:9000
+      - S3_ENDPOINT=fakeminio.datajoint.io:9000
       - S3_ACCESS_KEY=datajoint
       - S3_SECRET_KEY=datajoint
       - S3_BUCKET=datajoint-test
@@ -70,5 +70,26 @@ services:
       timeout: 5s
       retries: 60
       interval: 1s
+  fakeminio.datajoint.io:
+    <<: *net
+    image: nginx:alpine
+    environment:
+      - URL=datajoint.io
+      - SUBDOMAINS=fakeminio
+      - MINIO_SERVER=http://minio:9000
+    entrypoint: /entrypoint.sh
+    healthcheck:
+      test: wget --quiet --tries=1 --spider https://fakeminio.datajoint.io:443/minio/health/live || exit 1
+      timeout: 5s
+      retries: 300
+      interval: 1s
+#    ports:
+#      - "9000:9000"
+#      - "443:443"
+    volumes:
+      - ./tests/nginx/base.conf:/base.conf
+      - ./tests/nginx/entrypoint.sh:/entrypoint.sh
+      - ./tests/nginx/fullchain.pem:/certs/fullchain.pem
+      - ./tests/nginx/privkey.pem:/certs/privkey.pem
 networks:
   main:

README.md

Lines changed: 59 additions & 0 deletions
@@ -22,6 +22,65 @@ If you already have an older version of DataJoint installed using `pip`, upgrade
 ```bash
 pip3 install --upgrade datajoint
 ```
+## Python Native Blobs
+
+For the v0.12 release, the variable `enable_python_native_blobs` can be
+safely enabled for improved blob support of python datatypes if the following
+are true:
+
+* This is a new DataJoint installation / pipeline(s)
+* You have not used DataJoint prior to v0.12 with your pipeline(s)
+* You do not share blob data between Python and Matlab
+
+Otherwise, please read the following carefully:
+
+DataJoint v0.12 expands DataJoint's blob serialization mechanism with
+improved support for complex native python datatypes, such as dictionaries
+and lists of strings.
+
+Prior to DataJoint v0.12, certain python native datatypes such as
+dictionaries were 'squashed' into numpy structured arrays when saved into
+blob attributes. This facilitated easier data sharing between Matlab
+and Python for certain record types. However, this created a discrepancy
+between insert and fetch datatypes which could cause problems in other
+portions of users' pipelines.
+
+For v0.12, it was decided to remove the type squashing behavior, instead
+creating a separate storage encoding which improves support for storing
+native python datatypes in blobs without squashing them into numpy
+structured arrays. However, this change creates a compatibility problem
+for pipelines which previously relied on the type squashing behavior,
+since records saved via the old squashing format will continue to fetch
+as structured arrays, whereas new records inserted in DataJoint 0.12 with
+`enable_python_native_blobs` will be returned as the appropriate native
+python type (dict, etc). Read support for python native blobs is also
+not yet implemented in DataJoint for Matlab.
+
+To prevent data from being stored in mixed format within a table across
+upgrades from previous versions of DataJoint, the
+`enable_python_native_blobs` flag was added as a temporary guard measure
+for the 0.12 release. This flag will trigger an exception if any of the
+ambiguous cases are encountered during inserts, in order to allow testing
+and migration of pre-0.12 pipelines to 0.12 in a safe manner.
+
+The exact process to update a specific pipeline will vary depending on
+the situation, but generally the following strategies may apply:
+
+* Altering code to directly store numpy structured arrays or plain
+  multidimensional arrays. This strategy is likely the best one for those
+  tables requiring compatibility with Matlab.
+* Adjusting code to deal with both structured array and native fetched data.
+  In this case, insert logic is not adjusted, but downstream consumers
+  are adjusted to handle records saved under the old and new schemes.
+* Manually converting data using fetch/insert into a fresh schema.
+  In this approach, DataJoint's create_virtual_module functionality would
+  be used in conjunction with a fetch/convert/insert loop to update
+  the data to the new native_blob functionality.
+* Dropping/recomputing imported/computed tables to ensure they are in the
+  new format.
+
+As always, be sure that your data is safely backed up before modifying any
+important DataJoint schema or records.
 
 ## Documentation and Tutorials
 A number of labs are currently adopting DataJoint and we are quickly getting the documentation in shape in February 2017.
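The insert/fetch discrepancy described above can be illustrated without a database. The following is a minimal sketch, not DataJoint's actual serialization code; the record fields and dtypes are made up for illustration:

```python
import numpy as np

# Hypothetical record a user might insert into a blob attribute.
record = {"subject_id": 1, "weight": 22.5}

# Pre-0.12 behavior (sketch): the dict was 'squashed' into a numpy
# structured array, so it fetched back as an array, not a dict.
squashed = np.array(
    [tuple(record.values())],
    dtype=[("subject_id", "int64"), ("weight", "float64")])

# 0.12 with enable_python_native_blobs: the dict round-trips as a dict.
native = dict(record)

assert not isinstance(squashed, dict)   # downstream code sees an array...
assert squashed["weight"][0] == 22.5    # ...indexed by field name, then row
assert native["weight"] == 22.5         # plain dict key access
```

Downstream code written against one of these shapes breaks on the other, which is why the guard flag and migration strategies above exist.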

datajoint/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -41,5 +41,6 @@
 from .attribute_adapter import AttributeAdapter
 from . import errors
 from .errors import DataJointError
+from .migrate import migrate_dj011_external_blob_storage_to_dj012
 
 ERD = Di = Diagram  # Aliases for Diagram

datajoint/external.py

Lines changed: 1 addition & 1 deletion
@@ -316,7 +316,7 @@ def delete(self, *, delete_external_files=None, limit=None, display_progress=True
             raise DataJointError("The delete_external_files argument must be set to either True or False in delete()")
 
         if not delete_external_files:
-            self.unused.delete_quick()
+            self.unused().delete_quick()
         else:
             items = self.unused().fetch_external_paths(limit=limit)
             if display_progress:
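The one-character fix above (`self.unused` → `self.unused()`) is the classic missing-call bug: referencing a method yields the bound method object, not its result. A generic sketch with a hypothetical `Demo` class (not a DataJoint class):

```python
# Sketch of the bug class fixed above: `d.unused` is the bound method
# itself, so chaining off it operates on the wrong object entirely.
class Demo:
    def unused(self):
        return [1, 2, 3]

d = Demo()
assert callable(d.unused)        # a bound method object, not the list
assert d.unused() == [1, 2, 3]   # calling it produces the actual value
```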

datajoint/fetch.py

Lines changed: 4 additions & 4 deletions
@@ -50,7 +50,7 @@ def _get(connection, attr, data, squeeze, download_path):
     adapt = attr.adapter.get if attr.adapter else lambda x: x
 
     if attr.is_filepath:
-        return adapt(extern.download_filepath(uuid.UUID(bytes=data))[0])
+        return str(adapt(extern.download_filepath(uuid.UUID(bytes=data))[0]))
 
     if attr.is_attachment:
         # Steps:
@@ -65,22 +65,22 @@ def _get(connection, attr, data, squeeze, download_path):
         if local_filepath.is_file():
             attachment_checksum = _uuid if attr.is_external else hash.uuid_from_buffer(data)
             if attachment_checksum == hash.uuid_from_file(local_filepath, init_string=attachment_name + '\0'):
-                return adapt(local_filepath)  # checksum passed, no need to download again
+                return str(adapt(local_filepath))  # checksum passed, no need to download again
         # generate the next available alias filename
         for n in itertools.count():
             f = local_filepath.parent / (local_filepath.stem + '_%04x' % n + local_filepath.suffix)
             if not f.is_file():
                 local_filepath = f
                 break
             if attachment_checksum == hash.uuid_from_file(f, init_string=attachment_name + '\0'):
-                return adapt(f)  # checksum passed, no need to download again
+                return str(adapt(f))  # checksum passed, no need to download again
         # Save attachment
         if attr.is_external:
             extern.download_attachment(_uuid, attachment_name, local_filepath)
         else:
             # write from buffer
             safe_write(local_filepath, data.split(b"\0", 1)[1])
-        return adapt(local_filepath)  # download file from remote store
+        return str(adapt(local_filepath))  # download file from remote store
 
     return adapt(uuid.UUID(bytes=data) if attr.uuid else (
         blob.unpack(extern.get(uuid.UUID(bytes=data)) if attr.is_external else data, squeeze=squeeze)
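The change above wraps every returned file path in `str()`, presumably so `fetch` hands back plain strings rather than `pathlib.Path` objects. A platform-independent sketch of the difference (the path value is made up):

```python
from pathlib import PurePosixPath

# A path object, as adapt() might return after a download.
local_filepath = PurePosixPath("/tmp/downloads") / "scan.npy"

assert not isinstance(local_filepath, str)   # a path object, not a string
assert isinstance(str(local_filepath), str)  # str() normalizes it for callers
assert str(local_filepath) == "/tmp/downloads/scan.npy"
```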

datajoint/migrate.py

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
+import datajoint as dj
+from pathlib import Path
+import re
+from .utils import user_choice
+
+
+def migrate_dj011_external_blob_storage_to_dj012(migration_schema, store):
+    """
+    Utility function to migrate external blob data from 0.11 to 0.12.
+    :param migration_schema: string of target schema to be migrated
+    :param store: string of target dj.config['stores'] entry to be migrated
+    """
+    if not isinstance(migration_schema, str):
+        raise ValueError(
+            'Expected type {} for migration_schema, not {}.'.format(
+                str, type(migration_schema)))
+
+    do_migration = False
+    do_migration = user_choice(
+        """
+        Warning: Ensure the following are completed before proceeding.
+        - Appropriate backups have been taken,
+        - Any existing DJ 0.11.X connections are suspended, and
+        - External config has been updated to new dj.config['stores'] structure.
+        Proceed?
+        """, default='no') == 'yes'
+    if do_migration:
+        _migrate_dj011_blob(dj.schema(migration_schema), store)
+        print('Migration completed for schema: {}, store: {}.'.format(
+            migration_schema, store))
+        return
+    print('No migration performed.')
+
+
+def _migrate_dj011_blob(schema, default_store):
+    query = schema.connection.query
+
+    LEGACY_HASH_SIZE = 43
+
+    legacy_external = dj.FreeTable(
+        schema.connection,
+        '`{db}`.`~external`'.format(db=schema.database))
+
+    # get referencing tables
+    refs = query("""
+        SELECT concat('`', table_schema, '`.`', table_name, '`')
+            as referencing_table, column_name, constraint_name
+        FROM information_schema.key_column_usage
+        WHERE referenced_table_name="{tab}" and referenced_table_schema="{db}"
+        """.format(
+            tab=legacy_external.table_name,
+            db=legacy_external.database), as_dict=True).fetchall()
+
+    for ref in refs:
+        # get comment
+        column = query(
+            'SHOW FULL COLUMNS FROM {referencing_table} '
+            'WHERE Field="{column_name}"'.format(
+                **ref), as_dict=True).fetchone()
+
+        store, comment = re.match(
+            r':external(-(?P<store>.+))?:(?P<comment>.*)',
+            column['Comment']).group('store', 'comment')
+
+        # get all the hashes from the reference
+        hashes = {x[0] for x in query(
+            'SELECT `{column_name}` FROM {referencing_table}'.format(
+                **ref))}
+
+        # sanity check: make sure that store suffixes match
+        if store is None:
+            assert all(len(_) == LEGACY_HASH_SIZE for _ in hashes)
+        else:
+            assert all(_[LEGACY_HASH_SIZE:] == store for _ in hashes)
+
+        # create new-style external table
+        ext = schema.external[store or default_store]
+
+        # add the new-style reference field
+        temp_suffix = 'tempsub'
+
+        try:
+            query("""ALTER TABLE {referencing_table}
+                ADD COLUMN `{column_name}_{temp_suffix}` {type} DEFAULT NULL
+                COMMENT ":blob@{store}:{comment}"
+                """.format(
+                    type=dj.declare.UUID_DATA_TYPE,
+                    temp_suffix=temp_suffix,
+                    store=(store or default_store), comment=comment, **ref))
+        except Exception:
+            print('Column already added')
+
+        # Copy references into the new external table
+        # No Windows! Backslashes will cause problems
+
+        contents_hash_function = {
+            'file': lambda ext, relative_path: dj.hash.uuid_from_file(
+                str(Path(ext.spec['location'], relative_path))),
+            's3': lambda ext, relative_path: dj.hash.uuid_from_buffer(
+                ext.s3.get(relative_path))
+        }
+
+        for _hash, size in zip(*legacy_external.fetch('hash', 'size')):
+            if _hash in hashes:
+                relative_path = str(Path(schema.database, _hash).as_posix())
+                uuid = dj.hash.uuid_from_buffer(init_string=relative_path)
+                external_path = ext._make_external_filepath(relative_path)
+                if ext.spec['protocol'] == 's3':
+                    contents_hash = dj.hash.uuid_from_buffer(
+                        ext._download_buffer(external_path))
+                else:
+                    contents_hash = dj.hash.uuid_from_file(external_path)
+                ext.insert1(dict(
+                    filepath=relative_path,
+                    size=size,
+                    contents_hash=contents_hash,
+                    hash=uuid
+                ), skip_duplicates=True)
+
+                query(
+                    'UPDATE {referencing_table} '
+                    'SET `{column_name}_{temp_suffix}`=%s '
+                    'WHERE `{column_name}` = "{_hash}"'
+                    .format(
+                        _hash=_hash,
+                        temp_suffix=temp_suffix, **ref), uuid.bytes)
+
+        # check that all have been copied
+        check = query(
+            'SELECT * FROM {referencing_table} '
+            'WHERE `{column_name}` IS NOT NULL'
+            ' AND `{column_name}_{temp_suffix}` IS NULL'
+            .format(temp_suffix=temp_suffix, **ref)).fetchall()
+
+        assert len(check) == 0, "Some hashes haven't been migrated"
+
+        # drop old foreign key, rename, and create new foreign key
+        query("""
+            ALTER TABLE {referencing_table}
+            DROP FOREIGN KEY `{constraint_name}`,
+            DROP COLUMN `{column_name}`,
+            CHANGE COLUMN `{column_name}_{temp_suffix}` `{column_name}`
+                {type} DEFAULT NULL
+                COMMENT ":blob@{store}:{comment}",
+            ADD FOREIGN KEY (`{column_name}`) REFERENCES {ext_table_name}
+                (`hash`)
+            """.format(
+                temp_suffix=temp_suffix,
+                ext_table_name=ext.full_table_name,
+                type=dj.declare.UUID_DATA_TYPE,
+                store=(store or default_store), comment=comment, **ref))
+
+    # Drop the old external table but make sure it's no longer referenced
+    # get referencing tables
+    refs = query("""
+        SELECT concat('`', table_schema, '`.`', table_name, '`') as
+            referencing_table, column_name, constraint_name
+        FROM information_schema.key_column_usage
+        WHERE referenced_table_name="{tab}" and referenced_table_schema="{db}"
+        """.format(
+            tab=legacy_external.table_name,
+            db=legacy_external.database), as_dict=True).fetchall()
+
+    assert not refs, 'Some references still exist'
+
+    # drop old external table
+    legacy_external.drop_quick()
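The store/comment parsing in `_migrate_dj011_blob` can be exercised in isolation. The regex below is the one from the code above; the sample comment strings are made up:

```python
import re

# Legacy blob columns carried comments like ":external:note" or
# ":external-storename:note"; the migration recovers the store name
# (None means the default store) and the human-readable comment.
pattern = r':external(-(?P<store>.+))?:(?P<comment>.*)'

store, comment = re.match(
    pattern, ':external-raw:my raw data').group('store', 'comment')
assert (store, comment) == ('raw', 'my raw data')

# no store suffix: store is None, so default_store is used downstream
store, comment = re.match(
    pattern, ':external:plain blob').group('store', 'comment')
assert (store, comment) == (None, 'plain blob')
```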

datajoint/version.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-__version__ = "0.12.dev8"
+__version__ = "0.12.dev9"
 
 assert len(__version__) <= 10  # The log table limits version to the 10 characters
Lines changed: 14 additions & 17 deletions
@@ -1,20 +1,17 @@
 .. code-block:: python
 
-    # default external storage
-    dj.config['external'] = dict(
-        protocol='s3',
-        endpoint='https://s3.amazonaws.com',
-        bucket = 'testbucket',
-        location = '/datajoint-projects/myschema',
-        access_key='1234567',
-        secret_key='foaf1234')
+    dj.config['stores'] = {
+        'external': dict(  # 'regular' external storage for this pipeline
+            protocol='s3',
+            endpoint='https://s3.amazonaws.com',
+            bucket='testbucket',
+            location='/datajoint-projects/myschema',
+            access_key='1234567',
+            secret_key='foaf1234'),
+        'external-raw': dict(  # 'raw' storage for this pipeline
+            protocol='file',
+            location='/net/djblobs/myschema')
+    }
+    # external object cache - see fetch operation below for details.
+    dj.config['cache'] = '/net/djcache'
 
-    # raw data storage
-    dj.config['external-raw'] = dict(
-        protocol='file',
-        location='/net/djblobs/myschema')
-
-    # external object cache - see fetch operation below for details.
-    dj.config['cache'] = dict(
-        protocol='file',
-        location='/net/djcache')
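The restructuring in this hunk moves every store under a single `'stores'` key and turns `'cache'` into a plain path. A sketch contrasting the two layouts with plain dicts (values are the doc's own placeholders; no DataJoint connection needed):

```python
# Old (pre-0.12): each store was its own top-level config key.
old_style = {
    'external': dict(protocol='s3', location='/datajoint-projects/myschema'),
    'external-raw': dict(protocol='file', location='/net/djblobs/myschema'),
    'cache': dict(protocol='file', location='/net/djcache'),
}

# New (0.12): all stores nest under 'stores'; 'cache' is just a path.
new_style = {
    'stores': {
        'external': dict(protocol='s3',
                         location='/datajoint-projects/myschema'),
        'external-raw': dict(protocol='file',
                             location='/net/djblobs/myschema'),
    },
    'cache': '/net/djcache',
}

# store names no longer share a namespace with other config settings
assert set(new_style['stores']) == {'external', 'external-raw'}
assert isinstance(new_style['cache'], str)
```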
