feat: add tests and refactor codec#1
Conversation
| if not isinstance(value, np.ndarray | zarr.Array): | ||
| raise TypeError( | ||
| f"<zarr> requires a Numpy array or Zarr array, got {type(value).__name__}" |
There was a problem hiding this comment.
this raises a TypeError instead of a DataJointError, because the DataJointError does not convey to callers that the problem is related to values being the wrong type.
| DataJointError | ||
| If encoding fails. | ||
| """ | ||
| try: |
There was a problem hiding this comment.
this is no longer wrapped in a try... except block (a bad LLM habit)
| "provenance": { | ||
| "datajoint-python": dj.__version__, | ||
| "dj-zarr-codecs": __version__, | ||
| }, |
There was a problem hiding this comment.
rows in this table have a provenance column that stores JSON documents conveying the state of the software that inserted the column.
| # Check codec version for backward compatibility | ||
| # Priority: Zarr attrs > DB metadata > default "1.0" | ||
| version = z.attrs.get( | ||
| "codec_version", stored.get("codec_version", "1.0") | ||
| ) |
There was a problem hiding this comment.
version guards are removed because it isn't clear what "1.0" vs "2.0" would mean here
| @@ -0,0 +1,113 @@ | |||
| import logging | |||
There was a problem hiding this comment.
notice how much boilerplate is required just to write simple tests against datajoint. This could be made MUCH simpler by using sqlite as an optional datajoint backend. sqlite doesn't require docker, or any other external process, and so many of these fixtures could be removed.
| }) | ||
|
|
||
| # Fetch returns the array | ||
| fetched = (Table & {'id': 1}).fetch1('data') |
There was a problem hiding this comment.
it would be great to figure out how to make operations like this type-safe.
There was a problem hiding this comment.
ideally, we would want Table.__rmultiply__ or whatever the dunder method is called to be defined relative to a TypedDict that knows the schema for Table. Similarly for fetch1
There was a problem hiding this comment.
You mean the restriction operator? & calls .restrict. We can specify the type there.
| definition = """ | ||
| id : int | ||
| --- | ||
| data : <zarr@store> |
There was a problem hiding this comment.
IMO the indirection here behind "store" is unfortunate. "store" should be an actual URI, so people reading this code know exactly where data will be saved.
| build-backend = "setuptools.build_meta" | ||
|
|
||
| [tool.uv.sources] | ||
| datajoint = {git = "https://github.com/datajoint/datajoint-python.git"} |
There was a problem hiding this comment.
this will need to be changed when 2.0 lands on pypi (maybe it's already there)
There was a problem hiding this comment.
Yes, coming soon but not yet.
|
|
||
| [tool.pytest.ini_options] | ||
| testpaths = ["tests"] | ||
| addopts = "-v --cov=dj_zarr_codecs --cov-report=term-missing" |
There was a problem hiding this comment.
removed these because they generated a lot of noise
Thoughts on Testing InfrastructureThanks for the thoughtful comments, Davis. The SQLite suggestion sparked some useful discussion about testing patterns for DataJoint ecosystem packages. SQLite as Test BackendWhat would work:
What would be challenging:
What would NOT work:
Estimated effort: 2-3 weeks for basic adapter, more for edge cases. Alternative: Better Testing PatternsRather than (or in addition to) SQLite, we could create a from datajoint_testing import temp_schema
@pytest.fixture
def schema(dj_connection):
with temp_schema('test_zarr') as schema:
yield schema
# Auto-cleanup on success AND failure
def test_codec(schema):
@schema
class Data(dj.Manual):
definition = "..."Or use testcontainers for ephemeral Docker databases: from testcontainers.postgres import PostgresContainer
@pytest.fixture(scope="session")
def postgres():
with PostgresContainer("postgres:15") as pg:
dj.config['database.host'] = pg.get_container_host_ip()
dj.config['database.port'] = pg.get_exposed_port(5432)
yield pgPrinciples for Integration Tests
On Type SafetyThe
This would require substantial API work but could be valuable for IDE support and static analysis. On Store URIsThe named store indirection ( Summary: SQLite could work as a "lite mode" for fast feedback during development, but shouldn't replace real database tests. A testing helper package might give better ROI for the ecosystem. |
|
Correction on Store URIs: I should clarify my earlier comment about stores. The named store indirection is intentional and important - it's not just about configuration flexibility. DataJoint stores are carefully guarded because DataJoint manages:
The So the current design is principled, not just convenient. |
|
datajoint doesn't control the storage backend -- it just does IO against it, like any other client. Meanwhile, the distinction between data saved to local storage vs s3 is massive when it comes to making performance-related decisions (or cost-related decisions). IMO the real location where data is being stored should be exposed very obviously to consumers so they can make informed cost + perf decisions. |
|
RE: Testing setup — I just added a testing best practices guide to the documentation in PR #137: It covers:
This should help standardize integration testing across packages that depend on DataJoint. |
|
as for the type safety thing, I don't think python is there yet in terms of the utility types you would need for this, unfortunrtely |
This is true for the datajoint-python client: the client API does not control the store. However, the overall DataJoint Platform treats the relational database and integrated object stores as a single system with joint access management and data integrity, so the stores are tightly controlled--arbitrary URLs are not supported. |
This PR adds a test suite that currently contains two tests that check for the round-tripability of numpy arrays and zarr arrays.
I also refactored the
ZarrCodec(renamed in this PR toZarrArrayCodecbecause it does not accept zarr groups)."provenance", where it cannot collide with user attributes.