feat: add tests and refactor codec by d-v-b · Pull Request #1 · datajoint/dj-zarr-codecs

d-v-b · 2026-01-23T15:25:50Z

This PR adds a test suite that currently contains two tests that check for the round-tripability of numpy arrays and zarr arrays.

I also refactored the ZarrCodec (renamed in this PR to ZarrArrayCodec because it does not accept zarr groups).

the codec version was removed. I also removed the procedure for storing datajoint-specific metadata in the Zarr array's attributes, as there is no clean way to avoid exposing this to the user when they read the array back. Instead, information about the state of the software at insert time is added to a JSON column in the database called "provenance", where it cannot collide with user attributes.
the codec supports reading zarr arrays and numpy arrays alike

d-v-b · 2026-01-23T15:32:28Z

+        if not isinstance(value, np.ndarray | zarr.Array):
+            raise TypeError(
+                f"<zarr> requires a Numpy array or Zarr array, got {type(value).__name__}"


this raises a TypeError instead of a DataJointError, because the DataJointError does not convey to callers that the problem is related to values being the wrong type.

d-v-b · 2026-01-23T15:33:06Z

        DataJointError
            If encoding fails.
        """
-        try:


this is no longer wrapped in a try... except block (a bad LLM habit)

d-v-b · 2026-01-23T15:33:54Z

+            "provenance": {
+                "datajoint-python": dj.__version__,
+                "dj-zarr-codecs": __version__,
+            },


rows in this table have a provenance column that stores JSON documents conveying the state of the software that inserted the column.

d-v-b · 2026-01-23T15:34:31Z

-            # Check codec version for backward compatibility
-            # Priority: Zarr attrs > DB metadata > default "1.0"
-            version = z.attrs.get(
-                "codec_version", stored.get("codec_version", "1.0")
-            )


version guards are removed because it isn't clear what "1.0" vs "2.0" would mean here

d-v-b · 2026-01-23T15:36:02Z

@@ -0,0 +1,113 @@
+import logging


notice how much boilerplate is required just to write simple tests against datajoint. This could be made MUCH simpler by using sqlite as an optional datajoint backend. sqlite doesn't require docker, or any other external process, and so many of these fixtures could be removed.

d-v-b · 2026-01-23T15:36:52Z

+    })
+
+    # Fetch returns the array
+    fetched = (Table & {'id': 1}).fetch1('data')


it would be great to figure out how to make operations like this type-safe.

ideally, we would want Table.__rmultiply__ or whatever the dunder method is called to be defined relative to a TypedDict that knows the schema for Table. Similarly for fetch1

You mean the restriction operator? & calls .restrict. We can specify the type there.

d-v-b · 2026-01-23T15:39:45Z

+        definition = """
+        id : int
+        ---
+        data : <zarr@store>


IMO the indirection here behind "store" is unfortunate. "store" should be an actual URI, so people reading this code know exactly where data will be saved.

d-v-b · 2026-01-23T15:40:27Z

 build-backend = "setuptools.build_meta"

+[tool.uv.sources]
+datajoint = {git = "https://github.com/datajoint/datajoint-python.git"}


this will need to be changed when 2.0 lands on pypi (maybe it's already there)

Yes, coming soon but not yet.

d-v-b · 2026-01-23T15:40:45Z


 [tool.pytest.ini_options]
 testpaths = ["tests"]
-addopts = "-v --cov=dj_zarr_codecs --cov-report=term-missing"


removed these because they generated a lot of noise

dimitri-yatsenko · 2026-01-23T16:30:35Z

Thoughts on Testing Infrastructure

Thanks for the thoughtful comments, Davis. The SQLite suggestion sparked some useful discussion about testing patterns for DataJoint ecosystem packages.

SQLite as Test Backend

What would work:

Tables, PKs, FKs (with PRAGMA foreign_keys = ON)
Cascading deletes, transactions, indexes
Basic queries, blobs

What would be challenging:

Enums → CHECK constraints
Decimal precision → stored as REAL (float64)
Date/datetime → stored as TEXT
Schema introspection → sqlite_master vs information_schema

What would NOT work:

Stored procedures (jobs/distributed processing relies on these)
Row-level locking for concurrent access

Estimated effort: 2-3 weeks for basic adapter, more for edge cases.

Alternative: Better Testing Patterns

Rather than (or in addition to) SQLite, we could create a datajoint-testing helper package:

from datajoint_testing import temp_schema

@pytest.fixture
def schema(dj_connection):
    with temp_schema('test_zarr') as schema:
        yield schema
    # Auto-cleanup on success AND failure

def test_codec(schema):
    @schema
    class Data(dj.Manual):
        definition = "..."

Or use testcontainers for ephemeral Docker databases:

from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def postgres():
    with PostgresContainer("postgres:15") as pg:
        dj.config['database.host'] = pg.get_container_host_ip()
        dj.config['database.port'] = pg.get_exposed_port(5432)
        yield pg

Principles for Integration Tests

Isolation - Each test gets its own schema, no shared state, automatic cleanup
Speed - Reuse connections (session-scoped), minimal data, test logic not volume
Reproducibility - Fixed seeds, pinned versions, CI matches local
Real database behavior - SQLite may hide bugs (type coercion, locking semantics)

On Type Safety

The Table * restriction and fetch1() type safety question is a real pain point. Options:

TypedDict for known schemas (but schemas are defined at runtime in definition strings)
Code generation from table definitions
Pydantic models (significant API change)

This would require substantial API work but could be valuable for IDE support and static analysis.

On Store URIs

The named store indirection ("store") does trade explicitness for flexibility (dev vs prod configs). Could potentially support both: <blob@s3://bucket/path> for explicit URIs and <blob@storename> for configured stores.

Summary: SQLite could work as a "lite mode" for fast feedback during development, but shouldn't replace real database tests. A testing helper package might give better ROI for the ecosystem.

dimitri-yatsenko · 2026-01-23T16:32:39Z

Correction on Store URIs:

I should clarify my earlier comment about stores. The named store indirection is intentional and important - it's not just about configuration flexibility.

DataJoint stores are carefully guarded because DataJoint manages:

Data integrity - ensuring consistency between relational metadata and stored objects
Transaction safety - coordinating database and object storage operations
Lifecycle management - garbage collection of orphaned objects, cleanup on delete

The <blob@storename> syntax conveys that this is a managed store under DataJoint's control, not an arbitrary location. Allowing raw URIs would bypass these guarantees and potentially lead to orphaned data, integrity violations, or objects that escape garbage collection.

So the current design is principled, not just convenient.

d-v-b · 2026-01-23T16:48:55Z

datajoint doesn't control the storage backend -- it just does IO against it, like any other client. Meanwhile, the distinction between data saved to local storage vs s3 is massive when it comes to making performance-related decisions (or cost-related decisions). IMO the real location where data is being stored should be exposed very obviously to consumers so they can make informed cost + perf decisions.

dimitri-yatsenko · 2026-01-23T16:55:56Z

RE: Testing setup — I just added a testing best practices guide to the documentation in PR #137:

Testing DataJoint Packages

It covers:

Pytest fixtures for connection and schema management (session-scoped connections, function-scoped schemas with UUID names)
Docker Compose setup for test databases
GitHub Actions CI configuration
Patterns for testing computed tables, FK constraints, cascading deletes, and blob storage

This should help standardize integration testing across packages that depend on DataJoint.

d-v-b · 2026-01-23T16:57:23Z

as for the type safety thing, I don't think python is there yet in terms of the utility types you would need for this, unfortunrtely

dimitri-yatsenko · 2026-01-23T17:30:52Z

datajoint doesn't control the storage backend -- it just does IO against it, like any other client. Meanwhile, the distinction between data saved to local storage vs s3 is massive when it comes to making performance-related decisions (or cost-related decisions). IMO the real location where data is being stored should be exposed very obviously to consumers so they can make informed cost + perf decisions.

This is true for the datajoint-python client: the client API does not control the store. However, the overall DataJoint Platform treats the relational database and integrated object stores as a single system with joint access management and data integrity, so the stores are tightly controlled--arbitrary URLs are not supported.

d-v-b added 10 commits January 20, 2026 17:11

cleanup

815f16d

tests

20231fb

test for numpy and zarr array roundtrip

ccf829d

move schema fixture

f6be988

rename to ZarrArrayCodec

5e4aa33

provenance column

cf58710

add .python-version

d7d7eda

add github actions workflow for running tests

aa5f2f3

add github actions workflow for running tests

862f208

allow .python-version

176c12a

d-v-b commented Jan 23, 2026

View reviewed changes

d-v-b requested a review from dimitri-yatsenko January 23, 2026 15:40

dimitri-yatsenko approved these changes Jan 23, 2026

View reviewed changes

dimitri-yatsenko merged commit c0ca91a into datajoint:main Jan 23, 2026

Uh oh!

Conversation

d-v-b commented Jan 23, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dimitri-yatsenko commented Jan 23, 2026

Thoughts on Testing Infrastructure

SQLite as Test Backend

Alternative: Better Testing Patterns

Principles for Integration Tests

On Type Safety

On Store URIs

Uh oh!

dimitri-yatsenko commented Jan 23, 2026

Uh oh!

d-v-b commented Jan 23, 2026

Uh oh!

dimitri-yatsenko commented Jan 23, 2026

Uh oh!

d-v-b commented Jan 23, 2026

Uh oh!

dimitri-yatsenko commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dimitri-yatsenko commented Jan 23, 2026 •

edited

Loading