Skip to content

feat: add tests and refactor codec#1

Merged
dimitri-yatsenko merged 10 commits into
datajoint:mainfrom
d-v-b:feat/tests
Jan 23, 2026
Merged

feat: add tests and refactor codec#1
dimitri-yatsenko merged 10 commits into
datajoint:mainfrom
d-v-b:feat/tests

Conversation

@d-v-b

@d-v-b d-v-b commented Jan 23, 2026

Copy link
Copy Markdown
Contributor

This PR adds a test suite that currently contains two tests that check for the round-tripability of numpy arrays and zarr arrays.

I also refactored the ZarrCodec (renamed in this PR to ZarrArrayCodec because it does not accept zarr groups).

  • the codec version was removed. I also removed the procedure for storing datajoint-specific metadata in the Zarr array's attributes, as there is no clean way to avoid exposing this to the user when they read the array back. Instead, information about the state of the software at insert time is added to a JSON column in the database called "provenance", where it cannot collide with user attributes.
  • the codec supports reading zarr arrays and numpy arrays alike

Comment on lines +89 to +91
if not isinstance(value, np.ndarray | zarr.Array):
raise TypeError(
f"<zarr> requires a Numpy array or Zarr array, got {type(value).__name__}"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this raises a TypeError instead of a DataJointError, because the DataJointError does not convey to callers that the problem is related to values being the wrong type.

DataJointError
If encoding fails.
"""
try:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer wrapped in a try... except block (a bad LLM habit)

Comment on lines +149 to +152
"provenance": {
"datajoint-python": dj.__version__,
"dj-zarr-codecs": __version__,
},

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rows in this table have a provenance column that stores JSON documents conveying the state of the software that inserted the column.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent.

Comment on lines -200 to -204
# Check codec version for backward compatibility
# Priority: Zarr attrs > DB metadata > default "1.0"
version = z.attrs.get(
"codec_version", stored.get("codec_version", "1.0")
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

version guards are removed because it isn't clear what "1.0" vs "2.0" would mean here

Comment thread tests/conftest.py
@@ -0,0 +1,113 @@
import logging

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notice how much boilerplate is required just to write simple tests against datajoint. This could be made MUCH simpler by using sqlite as an optional datajoint backend. sqlite doesn't require docker, or any other external process, and so many of these fixtures could be removed.

Comment thread tests/test_codec.py
})

# Fetch returns the array
fetched = (Table & {'id': 1}).fetch1('data')

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great to figure out how to make operations like this type-safe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally, we would want Table.__rmultiply__ or whatever the dunder method is called to be defined relative to a TypedDict that knows the schema for Table. Similarly for fetch1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the restriction operator? & calls .restrict. We can specify the type there.

Comment thread tests/test_codec.py
definition = """
id : int
---
data : <zarr@store>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the indirection here behind "store" is unfortunate. "store" should be an actual URI, so people reading this code know exactly where data will be saved.

Comment thread pyproject.toml
build-backend = "setuptools.build_meta"

[tool.uv.sources]
datajoint = {git = "https://github.com/datajoint/datajoint-python.git"}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need to be changed when 2.0 lands on pypi (maybe it's already there)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, coming soon but not yet.

Comment thread pyproject.toml

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --cov=dj_zarr_codecs --cov-report=term-missing"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed these because they generated a lot of noise

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Thoughts on Testing Infrastructure

Thanks for the thoughtful comments, Davis. The SQLite suggestion sparked some useful discussion about testing patterns for DataJoint ecosystem packages.

SQLite as Test Backend

What would work:

  • Tables, PKs, FKs (with PRAGMA foreign_keys = ON)
  • Cascading deletes, transactions, indexes
  • Basic queries, blobs

What would be challenging:

  • Enums → CHECK constraints
  • Decimal precision → stored as REAL (float64)
  • Date/datetime → stored as TEXT
  • Schema introspection → sqlite_master vs information_schema

What would NOT work:

  • Stored procedures (jobs/distributed processing relies on these)
  • Row-level locking for concurrent access

Estimated effort: 2-3 weeks for basic adapter, more for edge cases.

Alternative: Better Testing Patterns

Rather than (or in addition to) SQLite, we could create a datajoint-testing helper package:

from datajoint_testing import temp_schema

@pytest.fixture
def schema(dj_connection):
    with temp_schema('test_zarr') as schema:
        yield schema
    # Auto-cleanup on success AND failure

def test_codec(schema):
    @schema
    class Data(dj.Manual):
        definition = "..."

Or use testcontainers for ephemeral Docker databases:

from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def postgres():
    with PostgresContainer("postgres:15") as pg:
        dj.config['database.host'] = pg.get_container_host_ip()
        dj.config['database.port'] = pg.get_exposed_port(5432)
        yield pg

Principles for Integration Tests

  1. Isolation - Each test gets its own schema, no shared state, automatic cleanup
  2. Speed - Reuse connections (session-scoped), minimal data, test logic not volume
  3. Reproducibility - Fixed seeds, pinned versions, CI matches local
  4. Real database behavior - SQLite may hide bugs (type coercion, locking semantics)

On Type Safety

The Table * restriction and fetch1() type safety question is a real pain point. Options:

  • TypedDict for known schemas (but schemas are defined at runtime in definition strings)
  • Code generation from table definitions
  • Pydantic models (significant API change)

This would require substantial API work but could be valuable for IDE support and static analysis.

On Store URIs

The named store indirection ("store") does trade explicitness for flexibility (dev vs prod configs). Could potentially support both: <blob@s3://bucket/path> for explicit URIs and <blob@storename> for configured stores.


Summary: SQLite could work as a "lite mode" for fast feedback during development, but shouldn't replace real database tests. A testing helper package might give better ROI for the ecosystem.

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Correction on Store URIs:

I should clarify my earlier comment about stores. The named store indirection is intentional and important - it's not just about configuration flexibility.

DataJoint stores are carefully guarded because DataJoint manages:

  • Data integrity - ensuring consistency between relational metadata and stored objects
  • Transaction safety - coordinating database and object storage operations
  • Lifecycle management - garbage collection of orphaned objects, cleanup on delete

The <blob@storename> syntax conveys that this is a managed store under DataJoint's control, not an arbitrary location. Allowing raw URIs would bypass these guarantees and potentially lead to orphaned data, integrity violations, or objects that escape garbage collection.

So the current design is principled, not just convenient.

@d-v-b

d-v-b commented Jan 23, 2026

Copy link
Copy Markdown
Contributor Author

datajoint doesn't control the storage backend -- it just does IO against it, like any other client. Meanwhile, the distinction between data saved to local storage vs s3 is massive when it comes to making performance-related decisions (or cost-related decisions). IMO the real location where data is being stored should be exposed very obviously to consumers so they can make informed cost + perf decisions.

@dimitri-yatsenko

Copy link
Copy Markdown
Member

RE: Testing setup — I just added a testing best practices guide to the documentation in PR #137:

Testing DataJoint Packages

It covers:

  • Pytest fixtures for connection and schema management (session-scoped connections, function-scoped schemas with UUID names)
  • Docker Compose setup for test databases
  • GitHub Actions CI configuration
  • Patterns for testing computed tables, FK constraints, cascading deletes, and blob storage

This should help standardize integration testing across packages that depend on DataJoint.

@d-v-b

d-v-b commented Jan 23, 2026

Copy link
Copy Markdown
Contributor Author

as for the type safety thing, I don't think python is there yet in terms of the utility types you would need for this, unfortunrtely

@dimitri-yatsenko

dimitri-yatsenko commented Jan 23, 2026

Copy link
Copy Markdown
Member

datajoint doesn't control the storage backend -- it just does IO against it, like any other client. Meanwhile, the distinction between data saved to local storage vs s3 is massive when it comes to making performance-related decisions (or cost-related decisions). IMO the real location where data is being stored should be exposed very obviously to consumers so they can make informed cost + perf decisions.

This is true for the datajoint-python client: the client API does not control the store. However, the overall DataJoint Platform treats the relational database and integrated object stores as a single system with joint access management and data integrity, so the stores are tightly controlled--arbitrary URLs are not supported.

@dimitri-yatsenko dimitri-yatsenko merged commit c0ca91a into datajoint:main Jan 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants