VCF-103 Shared Pointer Pool#878
Merged
alancleary merged 4 commits intomainfrom Mar 31, 2026
Merged
Conversation
This determines how the class reuses records, i.e. either manually or automatically. In manual mode, records must be explicitly returned to the class for reuse, otherwise the record's shared_ptr will destroy it when the use counter reaches 0. In automatic mode, the VCFV4 class keeps a copy of the record's shared_ptr and automatically reuses it when the use counter reaches 1.
XanthosXanthopoulos
approved these changes
Mar 31, 2026
alancleary
added a commit
that referenced
this pull request
Apr 13, 2026
* Rename tiledb.cc -> tiledb.libtiledb (#875) * Update github actions (#876) * Improvements to stats run-time performance (#877) * Added total_size() methods to allele count and variant stats classes These methods return the total size of the classes' buffers in bytes. * Improved AlleleCount process and flush run-time This was done by using an unordered set for tracking sample names. * Improved SampleStats process and flush run-time This was done by using an unordered map for saving sample stats, minimizing map lookups via memoization, and replacing a nested map with structs. * Improved VariantStats process and flush run-time This was done by reusing vectors for (missing)GT values, using an unordered set for tracking sample names, minimizing map lookups via memoization, and creating separate codepaths for updating v2 and v3 arrays, the prior of which can be done much more efficiently via appending instead of using the v3 insertion strategy. * Improvements to Python Tests (#879) * Added conftest.py Shared fixtures and helpers have been moved out of the monolithic test file into conftest.py so pytest can discover them across future test modules * Broke up test_ingest_with_stats_v3 into focused tests The ~400-line monolithic test covered 8 distinct behaviours across two APIs (read_variant_stats and read_allele_count). The test is now split into 15 tests and the repeated skipif condition was encapsulated into a shared module-level _skip_if_no_bcftools marker. Lastly, the equivalent read_variant_stats, read_variant_stats_arrow, and read_allele_count error checks were consolidated into loops for brevity. * Replace manual use_arrow loops with pytest.mark.parametrize Replacing the loops gives Pandas and Arrow their own named test paths, meaning failures are now reported distinctly. * Split monolithic test file into focused modules Additionally, _skip_if_no_bcftools was renamed to skip_if_no_bcftools and moved to conftest.py alongside the other shared helpers. * Add tests to cover gaps in testing of create_dataset() parameters The previously untested parameters include extra_attrs, tile_capacity, anchor_gap, allow_duplicates=False, checksum_type="md5", and variant_stats_version=2. Also, error-path tests were added for invalid checksum_type, supplying both extra_attrs and vcf_attrs, and calling create_dataset() on an already-existing dataset. * Add tests to cover gaps in testing of ingest_samples() parameters The previously untested parameters and execution paths include sample_uris=None early return, scratch_space_path/scratch_space_size must be provided together, invalid contig_mode, contigs_to_keep_separate/contigs_to_allow_merging must be lists, resume=True, sample_batch_size, and memory/thread tuning parameters. * Add tests to cover gaps in testing of ReadConfig parameters The previously untested parameters include sort_regions, buffer_percentage, tiledb_tile_cache_percentage, and tiledb_config supplied as both a dict and a tiledb.Config object. * Add tests to cover gaps in testing of conttinue_read() and continue_read_arrow() parameters Specifically, the release_buffers=False case is now tested for both methods. * Add tests for delete_samples() edge cases and consolidate delete tests The previously untested delete_samples() paths include empty list is a noop, None raises TypeError, and a non-existent sample name raises RuntimeError. * Add tests to cover gaps in testing of the static Dataset.delete() method The previously untested Dataset.delete() parameters and error paths include testing that the config parameter is accepted and does not prevent deletion of a local dataset and that deleting a non-existent URI raises TileDBError. * Add test for Dataset constructor invalid mode Tests that constructing a Dataset with an unrecognised mode string raises an exception with a descriptive message. * Add tests for config_logging() Test all six valid log levels via parametrization, that an invalid level raises with a descriptive message, and that the log_file parameter is accepted. * Split Dataset-level tests into test_dataset.py and added version/schema_version tests * Add tests for Dataset.sample_count() Test that sample_count() returns the correct value for v3 and v4 datasets and is consistent with len(samples()), and that calling it on a write-mode dataset raises. * Add tests for Dataset.tiledb_stats() Test that tiledb_stats() returns a non-empty valid JSON string in both read and write mode. Add a skipped test documenting a bug where the stats-not-enabled guard never fires because get_tiledb_stats_enabled is referenced without () making it always truthy. * Add tests for Dataset.export() Test all previously uncovered export() parameters including samples, regions, output_format, merge, output_path, samples_file, bed_file, and skip_check_samples. It also checks that exporting in write-mode raises and tests the merge=True without output_path error path. * Add tests for deprecated TileDBVCFDataset class Test that constructing TileDBVCFDataset emits a DeprecationWarning with the expected message, and that the instance remains fully functional as a Dataset after construction. * Fix DeprecationWarnings from deprecated region parameter in tests Replaced incidental uses of the deprecated positional "region" parameter with "regions=[...]" and wrapped intentional uses in pytest.warns(). * Test that delete_samples() raises when the dataset is open in read mode * Test that total_memory_percentage is accepted by ingest_samples() * Test read_iter() with samples_file and bed_file parameters * Test sample_qc() samples and config parameters * Test error paths for read_allele_frequency() Specifically, invalid region format and empty region are now tested. * Updated tests to ensure read/read_arrow/read_iter parameter parity Specifically, the samples, samples_file, bed_file, set_af_filter, and scan_all_samples parameters are now all exercised across all three methods. * Test attributes() invalid attr_type raises * Test that set_af_filter and scan_all_samples parameters can be combined in read() and read_arrow() * Add missing Python test docstrings and standardize their style across all tests * Removed extraneous comments from Python tests * Removed the TileDB TBB threads test The test was always marked to always be skipped and TBB threads has been long deprecated in TileDB. * VCF-103 Shared Pointer Pool (#878) * Encapsulated the shared_ptr pool code in a SharedPtrPool class * Added a sharing mode to the VCFV4 class This determines how the class reuses records, i.e. either manually or automatically. In manual mode, records must be explicitly returned to the class for reuse, otherwise the record's shared_ptr will destroy it when the use counter reaches 0. In automatic mode, the VCFV4 class keeps a copy of the record's shared_ptr and automatically reuses it when the use counter reaches 1. * Updated the VCFV4 class to extend the SharedPtrPool class * Improve export run-time performance by caching sample headers (#881) * Updated TileDBVCFDataset::SampleHeaders to support caching parsed, sample-specific headers This functionality is implemented via TileDBVCFDataset::SampleHeaders::get_sample_header_shared(), which returns a shared_ptr with the option for SampleHeaders to cache a copy of the shared_ptr. * Updated Reader::report_cell() to cache headers when exporting records This significantly reduces run-time by reducing time spent parsing header strings and (de)allocating header objects. * Allow users to skip updating aggregate stats when deleting samples (#880) * Add skip_delete_sample() static method to AlleleCount and VariantStats classes This can be used to note that the allele count and variant stats arrays weren't updated when a sample was deleted from the dataset. It notes this by adding skipped samples to the stats arrays' metadata as unique keys - one per sample. the sample to the arrays' "skipped_delete_samples" metadata field, which can be used to update the stats at a later time as long as the dataset hasn't been consolidated and vacuumed. * Added a skip_aggregate_stats parameter to TileDBVCFDataset::delete_samples() When this boolean is true, the skip_delete_sample() static method of the AlleleCount and VariantStats classes is called for each sample being deleted, instead of updating these stats arrays using the DeleteExporter. * Updated the Writer class to support a skip_aggregate_stats ingestion parameter This parameter is passed to TileDBVCFDataset::delete_samples() when Writer::delete_samples() is called. A unit test was added to test the skip aggregate stats functionality in the TileDBVCFDataset class via the Writer class's skip_aggregate_stats ingestion parameter. * Added a --skip-aggregate-stats flag to the delete CLI command This allows users to use the skip aggregate stats functionality now available via the TileDBVCFDataset and Writer classes. CLI tests were added to cover this new flag. * Added a tiledb_vcf_writer_set_skip_aggregate_stats hook to the C API This allows the Writer's skip_aggregate_stats ingestion parameter to be set via the C API. * Added a skip_aggregate_stats parameter to Dataset.delete_samples() in the Python API This uses the tiledb_vcf_writer_set_skip_aggregate_stats hook in the C API to skip updating allele count and variant stats when deleting samples. Python tests were added to cover this new parameter. * Support pybind11 3.0: replace deprecated py::module with py::module_ (#883) * Pin pybind11 < 3.0 when building TileDB-Py 0.36.0 in nightly CI (#885) TileDB-Py 0.36.0 (used with the release-2.30 configuration) is incompatible with pybind11 3.0+, which introduced stricter lambda return type deduction that breaks domain.cc. The main configuration is unaffected since TileDB-Py main already has pybind11 3.0 support. * Update TileDB and TileDB-Py Versions (#886) * Bumped TileDB version to 2.30.1 * Bumped TileDB-Py version to 0.36.1 --------- Co-authored-by: Julia Dark <24235303+jp-dark@users.noreply.github.com> Co-authored-by: Agis Kounelis <36283973+kounelisagis@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
VCF-103 implements a new parallelization strategy for the ingestion code path. One of the optimizations implemented is a shared pointer pool that allows
shared_ptr<T>instances to be kept in a pool for reuse with the API supporting both automatic and manual management of the pointers. This PR cherry picks those changes and the VCFV4 class's implementation and use of the pool to reduce the scope of the VCF-103 PR.