feat: AnnData.unwriteable based on AnnData._reduce + iter_outer + refactorings of other relevant functions#2372
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2372 +/- ##
==========================================
- Coverage 87.35% 85.45% -1.91%
==========================================
Files 49 49
Lines 7694 7720 +26
==========================================
- Hits 6721 6597 -124
- Misses 973 1123 +150
|
AnnData.can_write based on AnnData.foldAnnData.can_write based on AnnData.fold + refactored AnnData.__sizeof__
|
Just looking for a first-pass vibe check on this! |
flying-sheep
left a comment
There was a problem hiding this comment.
Very exciting! I think this mainly needs some serious documenting and examples, not much else.
For examples, maybe first something that just handles one kind of accessor/reference first to show how one can short-circuit by returning the accumulator early.
Regarding naming: the canonical name in the stdlib is reduce, it even used to be in builtins, but maybe the scientific python world deviates for no good reason again. It’s also fair if you have other objections to reduce!
| (is_ad_ref := isinstance(ref_acc, AdRef)) | ||
| or ( | ||
| is_ref_acc := isinstance( | ||
| ref_acc, LayerAcc | MultiAcc | GraphAcc | ||
| ) | ||
| ) |
There was a problem hiding this comment.
this is
isinstance(ref_acc, AdRef | RefAcc) and not isinstance(ref_acc, MetaAcc)because you want to handle each homogeneous array.
I think there are some ways to make this more ergonomic, e.g.:
- we can add more marker superclasses that handle this use case, or
- we could change the implementation of fold/reduce so it descends into things and returning or raising some marker would prevent descent. that might help handling things as early as possible and therefore having less complicated logic.
|
Hi @ilan-gold, interesting direction! The HTML repr PR (#2236) intersects with HTML repr (#2236): traversal overlapThe SECTION_ORDER = ("X", "obs", "var", "uns", "obsm", "varm", "layers", "obsp", "varp", "raw")
for section in SECTION_ORDER:
parts.append(_render_section(adata, section, context))This is structurally very similar to
On the positive side, the per-column iteration of obs/var via The repr would naturally use DFS-pre order (render section header first, then its entries), so it's good that option exists. If Serialization checks:
|
Great catch! I think we currently don't have a path accessor for it, so I think that's the same thing as
Phil brought this up in a call on Friday as well, not sure (again) what is to be done
I'm not sure I follow this line of thinking fully - why can't you distinguish between these two now? If you are doing
That could make sense. Once you merge this into your PR we can refactor this if needed + work out more kinks |
|
|
||
| sizes = {} | ||
| attrs = ["X", "_obs", "_var"] | ||
| attrs_multi = ["_uns", "_obsm", "_varm", "varp", "_obsp", "_layers"] |
There was a problem hiding this comment.
FWIW this never handled raw anyway
AnnData.can_write based on AnnData.{reduce,iter} + refactorings of other relevant functionsAnnData.can_write based on AnnData._reduce + iter_outer + refactorings of other relevant functions
|
So I've made |
flying-sheep
left a comment
There was a problem hiding this comment.
Some nice gains, great!
I wonder if a more granular can_write would be helpful for people that shows them which arrays/columns block the writeability.
| AnnData object with n_obs × n_vars = 700 × 765 | ||
| obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain' | ||
| var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable' | ||
| uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups' |
There was a problem hiding this comment.
It got moved to the end, absorbed by obsp: ...
There was a problem hiding this comment.
hmm, maybe replace obsp: ... with ... then to make that clear
That would be nice but feels downstream of "we might need better ref/acc objects to describe the whole anndata" which I think we have realized in the course of this PR is too big of a lift ATM. I think keeping the |
There was a problem hiding this comment.
Looks great!
I’m ill, so I won’t check for myself, but is can_write complete? Like will it actually handle all non-bug cases where an anndata object fails to write, like a nested anndata object in uns that contains something unwriteable? Or some unhandled dtype in a categorical column or so?
If yes, I just have one ask: I think we should add a note or so to can_write that the return type isn’t guaranteed to stay the same, but the coercion to bool is stable: bool(adata.can_write(...) / if adata.can_write(...):. That way we can make it return a structure later that has info about unwriteable bits or so.
I wonder if this is too strong. For example: from anndata.acc import A
assert bool(A.obs)We would then need special classes whose |
yeah, the only useful alternative could be renaming it to do you have a different idea? |
|
No and I that's even better given the current state of things. You would want to use this in a guard clause anyway so if adata.unwriteable():is nicer than if not adata.can_write() |
flying-sheep
left a comment
There was a problem hiding this comment.
Yay! Thank you for your patience!
AnnData.can_write based on AnnData._reduce + iter_outer + refactorings of other relevant functionsAnnData.unwriteable based on AnnData._reduce + iter_outer + refactorings of other relevant functions
Use anndata.utils.iter_outer from scverse#2372 as the canonical source for standard section iteration in the HTML repr. Drop the redundant SECTION_ORDER tuple; display order now follows iter_outer. - _render_all_sections iterates (name, elem) pairs from iter_outer and passes elem to downstream renderers, avoiding a second getattr (which would trigger another file open/close cycle on backed AnnData). - _render_dataframe_section, _render_mapping_section, _render_uns_section, _render_raw_section now take the elem directly. - _detect_unknown_sections and _get_custom_sections_by_position use a local STANDARD_SECTIONS frozenset for name-only membership checks so they don't pay iter_outer's per-yield I/O just to get names. - Drop unused adata param from _render_uns_entry. Display order changes to X, obs, var, obsm, varm, obsp, varp, layers, uns, raw (uns moves from position 4 to position 9).
maybe TODOs:
fold?reduce?write_{zarr,h5ad}to prevent partial stores from being written?We introduce
AnnData.foldandAnnData.can_writefor "crawling" anAnnDataand accumulating, implementing acan_writeto tell if anAnnDatacan be written to a given format.I'm also open to
reduceas the name instead offold.For #2236, I think we could implement this more complexly by building on
foldand accumulating asetofAdRef(or similar) objects pointing at what can't be written. Open to implement this! I reimplemented__sizeof__to stress-test using reference-accessors and it went ok, but it's not amazingly smooth (likeunsand trying to figure out if you are in a "top-level" elem likeobsorvarm)If zarr-developers/zarr-python#3613 ever happens (I hope so) or we do #2238, this will be crucial to making things smoother with hdf5 for preventing incompatibilities.
backedinternal checking #2369 and moving towards something raised in feat: Add HTML representation #2236