Skip to content

VCF-103 Shared Pointer Pool#878

Merged
alancleary merged 4 commits intomainfrom
alancleray/VCF-103-shared-ptr-pool
Mar 31, 2026
Merged

VCF-103 Shared Pointer Pool#878
alancleary merged 4 commits intomainfrom
alancleray/VCF-103-shared-ptr-pool

Conversation

@alancleary
Copy link
Copy Markdown
Member

VCF-103 implements a new parallelization strategy for the ingestion code path. One of the optimizations implemented is a shared pointer pool that allows shared_ptr<T> instances to be kept in a pool for reuse with the API supporting both automatic and manual management of the pointers. This PR cherry picks those changes and the VCFV4 class's implementation and use of the pool to reduce the scope of the VCF-103 PR.

This determines how the class reuses records, i.e. either manually or automatically. In manual mode, records must be explicitly returned to the class for reuse, otherwise the record's shared_ptr will destroy it when the use counter reaches 0. In automatic mode, the VCFV4 class keeps a copy of the record's shared_ptr and automatically reuses it when the use counter reaches 1.
@alancleary alancleary merged commit 9783d70 into main Mar 31, 2026
11 checks passed
@alancleary alancleary deleted the alancleray/VCF-103-shared-ptr-pool branch March 31, 2026 14:46
alancleary added a commit that referenced this pull request Apr 13, 2026
* Rename tiledb.cc -> tiledb.libtiledb (#875)

* Update github actions (#876)

* Improvements to stats run-time performance (#877)

* Added total_size() methods to allele count and variant stats classes

These methods return the total size of the classes' buffers in bytes.

* Improved AlleleCount process and flush run-time

This was done by using an unordered set for tracking sample names.

* Improved SampleStats process and flush run-time

This was done by using an unordered map for saving sample stats, minimizing map lookups via memoization, and replacing a nested map with structs.

* Improved VariantStats process and flush run-time

This was done by reusing vectors for (missing)GT values, using an unordered set for tracking sample names, minimizing map lookups via memoization, and creating separate codepaths for updating v2 and v3 arrays, the prior of which can be done much more efficiently via appending instead of using the v3 insertion strategy.

* Improvements to Python Tests (#879)

* Added conftest.py

Shared fixtures and helpers have been moved out of the monolithic test file into conftest.py so pytest can discover them across future test modules

* Broke up test_ingest_with_stats_v3 into focused tests

The ~400-line monolithic test covered 8 distinct behaviours across two APIs (read_variant_stats and read_allele_count). The test is now split into 15 tests and the repeated skipif condition was encapsulated into a shared module-level _skip_if_no_bcftools marker. Lastly, the equivalent read_variant_stats, read_variant_stats_arrow, and read_allele_count error checks were consolidated into loops for brevity.

* Replace manual use_arrow loops with pytest.mark.parametrize

Replacing the loops gives Pandas and Arrow their own named test paths, meaning failures are now reported distinctly.

* Split monolithic test file into focused modules

Additionally, _skip_if_no_bcftools was renamed to skip_if_no_bcftools and moved to conftest.py alongside the other shared helpers.

* Add tests to cover gaps in testing of create_dataset() parameters

The previously untested parameters include extra_attrs, tile_capacity, anchor_gap, allow_duplicates=False, checksum_type="md5", and variant_stats_version=2. Also, error-path tests were added for invalid checksum_type, supplying both extra_attrs and vcf_attrs, and calling create_dataset() on an already-existing dataset.

* Add tests to cover gaps in testing of ingest_samples() parameters

The previously untested parameters and execution paths include sample_uris=None early return, scratch_space_path/scratch_space_size must be provided together, invalid contig_mode, contigs_to_keep_separate/contigs_to_allow_merging must be lists, resume=True, sample_batch_size, and memory/thread tuning parameters.

* Add tests to cover gaps in testing of ReadConfig parameters

The previously untested parameters include sort_regions, buffer_percentage, tiledb_tile_cache_percentage, and tiledb_config supplied as both a dict and a tiledb.Config object.

* Add tests to cover gaps in testing of conttinue_read() and continue_read_arrow() parameters

Specifically, the release_buffers=False case is now tested for both methods.

* Add tests for delete_samples() edge cases and consolidate delete tests

The previously untested delete_samples() paths include empty list is a noop, None raises TypeError, and a non-existent sample name raises RuntimeError.

* Add tests to cover gaps in testing of the static Dataset.delete() method

The previously untested Dataset.delete() parameters and error paths include testing that the config parameter is accepted and does not prevent deletion of a local dataset and that deleting a non-existent URI raises TileDBError.

* Add test for Dataset constructor invalid mode

Tests that constructing a Dataset with an unrecognised mode string raises an exception with a descriptive message.

* Add tests for config_logging()

Test all six valid log levels via parametrization, that an invalid level raises with a descriptive message, and that the log_file parameter is accepted.

* Split Dataset-level tests into test_dataset.py and added version/schema_version tests

* Add tests for Dataset.sample_count()

Test that sample_count() returns the correct value for v3 and v4 datasets and is consistent with len(samples()), and that calling it on a write-mode dataset raises.

* Add tests for Dataset.tiledb_stats()

Test that tiledb_stats() returns a non-empty valid JSON string in both read and write mode. Add a skipped test documenting a bug where the stats-not-enabled guard never fires because get_tiledb_stats_enabled is referenced without () making it always truthy.

* Add tests for Dataset.export()

Test all previously uncovered export() parameters including samples, regions, output_format, merge, output_path, samples_file, bed_file, and skip_check_samples. It also checks that exporting in write-mode raises and tests the merge=True without output_path error path.

* Add tests for deprecated TileDBVCFDataset class

Test that constructing TileDBVCFDataset emits a DeprecationWarning with the expected message, and that the instance remains fully functional as a Dataset after construction.

* Fix DeprecationWarnings from deprecated region parameter in tests

Replaced incidental uses of the deprecated positional "region" parameter with "regions=[...]" and wrapped intentional uses in pytest.warns().

* Test that delete_samples() raises when the dataset is open in read mode

* Test that total_memory_percentage is accepted by ingest_samples()

* Test read_iter() with samples_file and bed_file parameters

* Test sample_qc() samples and config parameters

* Test error paths for read_allele_frequency()

Specifically, invalid region format and empty region are now tested.

* Updated tests to ensure read/read_arrow/read_iter parameter parity

Specifically, the samples, samples_file, bed_file, set_af_filter, and scan_all_samples parameters are now all exercised across all three methods.

* Test attributes() invalid attr_type raises

* Test that set_af_filter and scan_all_samples parameters can be combined in read() and read_arrow()

* Add missing Python test docstrings and standardize their style across all tests

* Removed extraneous comments from Python tests

* Removed the TileDB TBB threads test

The test was always marked to always be skipped and TBB threads has been long deprecated in TileDB.

* VCF-103 Shared Pointer Pool (#878)

* Encapsulated the shared_ptr pool code in a SharedPtrPool class

* Added a sharing mode to the VCFV4 class

This determines how the class reuses records, i.e. either manually or automatically. In manual mode, records must be explicitly returned to the class for reuse, otherwise the record's shared_ptr will destroy it when the use counter reaches 0. In automatic mode, the VCFV4 class keeps a copy of the record's shared_ptr and automatically reuses it when the use counter reaches 1.

* Updated the VCFV4 class to extend the SharedPtrPool class

* Improve export run-time performance by caching sample headers (#881)

* Updated TileDBVCFDataset::SampleHeaders to support caching parsed, sample-specific headers

This functionality is implemented via TileDBVCFDataset::SampleHeaders::get_sample_header_shared(), which returns a shared_ptr with the option for SampleHeaders to cache a copy of the shared_ptr.

* Updated Reader::report_cell() to cache headers when exporting records

This significantly reduces run-time by reducing time spent parsing header strings and (de)allocating header objects.

* Allow users to skip updating aggregate stats when deleting samples (#880)

* Add skip_delete_sample() static method to AlleleCount and VariantStats classes

This can be used to note that the allele count and variant stats arrays weren't updated when a sample was deleted from the dataset. It notes this by adding skipped samples to the stats arrays' metadata as unique keys - one per sample.

 the sample to the arrays' "skipped_delete_samples" metadata field, which can be used to update the stats at a later time as long as the dataset hasn't been consolidated and vacuumed.

* Added a skip_aggregate_stats parameter to TileDBVCFDataset::delete_samples()

When this boolean is true, the skip_delete_sample() static method of the AlleleCount and VariantStats classes is called for each sample being deleted, instead of updating these stats arrays using the DeleteExporter.

* Updated the Writer class to support a skip_aggregate_stats ingestion parameter

This parameter is passed to TileDBVCFDataset::delete_samples() when Writer::delete_samples() is called. A unit test was added to test the skip aggregate stats functionality in the TileDBVCFDataset class via the Writer class's skip_aggregate_stats ingestion parameter.

* Added a --skip-aggregate-stats flag to the delete CLI command

This allows users to use the skip aggregate stats functionality now available via the TileDBVCFDataset and Writer classes. CLI tests were added to cover this new flag.

* Added a tiledb_vcf_writer_set_skip_aggregate_stats hook to the C API

This allows the Writer's skip_aggregate_stats ingestion parameter to be set via the C API.

* Added a skip_aggregate_stats parameter to Dataset.delete_samples() in the Python API

This uses the tiledb_vcf_writer_set_skip_aggregate_stats hook in the C API to skip updating allele count and variant stats when deleting samples. Python tests were added to cover this new parameter.

* Support pybind11 3.0: replace deprecated py::module with py::module_ (#883)

* Pin pybind11 < 3.0 when building TileDB-Py 0.36.0 in nightly CI (#885)

TileDB-Py 0.36.0 (used with the release-2.30 configuration) is incompatible with pybind11 3.0+, which introduced stricter lambda return type deduction that breaks domain.cc. The main configuration is unaffected since TileDB-Py main already has pybind11 3.0 support.

* Update TileDB and TileDB-Py Versions (#886)

* Bumped TileDB version to 2.30.1

* Bumped TileDB-Py version to 0.36.1

---------

Co-authored-by: Julia Dark <24235303+jp-dark@users.noreply.github.com>
Co-authored-by: Agis Kounelis <36283973+kounelisagis@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants