Commit 27ae521
Raw counts support, backed-mode differential expression, and auto-detection of gene symbols (#67)
* Update version to 0.18.0 and enhance raw counts handling in save_features_matrix
- Bump package version to 0.18.0.
- Introduce _resolve_raw_counts method in CyteType to improve raw counts extraction from AnnData.
- Add _is_integer_valued utility function to check if matrices contain integer values.
- Update save_features_matrix to handle raw counts and include them in the output HDF5 file.
- Enhance tests to cover new raw counts functionality and integer value checks.
* Add artifact paths for vars.h5 and obs.duckdb, enhance artifact building and uploading
- Introduced vars_h5_path and obs_duckdb_path parameters in CyteType for customizable artifact paths.
- Implemented caching of raw counts and improved error handling during artifact creation.
- Updated _upload_artifacts method to handle pre-built artifacts and log errors appropriately.
- Modified integration tests to accommodate new parameters and ensure proper artifact cleanup.
* Refactor artifact cleanup in CyteType and update tests
- Replaced the static method _cleanup_artifact_files with an instance method cleanup to manage artifact file deletion after run completion.
- Removed the cleanup_artifacts parameter from run method, simplifying the interface.
- Updated integration tests to verify that cleanup correctly deletes artifact files and clears associated paths.
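The cleanup behaviour described above (delete artifact files, then clear the stored paths) might look roughly like the following. This is a minimal sketch, not the actual CyteType code: the `ArtifactHolder` class is hypothetical, and only the attribute names `vars_h5_path` and `obs_duckdb_path` and the method name `cleanup` come from this commit message.

```python
import os


class ArtifactHolder:
    """Hypothetical stand-in for the object that owns the artifact paths."""

    def __init__(self, vars_h5_path=None, obs_duckdb_path=None):
        self.vars_h5_path = vars_h5_path
        self.obs_duckdb_path = obs_duckdb_path

    def cleanup(self):
        """Delete artifact files if present and reset their paths to None."""
        for attr in ("vars_h5_path", "obs_duckdb_path"):
            path = getattr(self, attr)
            if path is None:
                continue
            try:
                os.remove(path)
            except FileNotFoundError:
                # Already gone; cleanup should be safe to call repeatedly.
                pass
            setattr(self, attr, None)
```

Making `cleanup` an instance method lets the caller decide when artifacts are removed, which is presumably why the `cleanup_artifacts` flag on `run` was no longer needed.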
* Add rank_genes_groups_backed function and update exports
- Introduced rank_genes_groups_backed in marker_detection.py for memory-efficient gene ranking on backed AnnData objects.
- Updated __init__.py files to include rank_genes_groups_backed in the public API of cytetype and preprocessing modules.
- Refactored code for improved readability in main.py, enhancing the formatting of artifact cleanup logic.
* Enhance gene symbol handling in CyteType
- Introduced resolve_gene_symbols_column function to auto-detect gene symbols in AnnData, improving flexibility in gene symbol management.
- Updated gene_symbols_column type to accept None, allowing for better handling of cases where gene symbols are not explicitly provided.
- Refactored aggregate_expression_percentages and extract_marker_genes functions to accommodate the new gene symbol resolution logic.
- Enhanced validation in _validate_gene_symbols_column to provide clearer warnings about potential gene ID misclassifications.
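One way to auto-detect a gene symbol column is to score each candidate by how many of its values look like identifiers rather than symbols. The sketch below illustrates that idea only; the function name `guess_gene_symbols_column`, the candidate column names, the ID pattern, and the 0.5 threshold are all assumptions and do not reflect the heuristics used by `resolve_gene_symbols_column`.

```python
import re

# Assumption: Ensembl-style gene IDs or purely numeric IDs count as "ID-like".
ID_LIKE = re.compile(r"^(ENS[A-Z]*G\d{11}(\.\d+)?|\d+)$")


def id_like_fraction(values):
    """Fraction of values that look like gene IDs rather than symbols."""
    values = [str(v) for v in values]
    if not values:
        return 1.0
    return sum(bool(ID_LIKE.match(v)) for v in values) / len(values)


def guess_gene_symbols_column(
    var_columns,
    index_values,
    candidates=("gene_symbols", "gene_symbol", "symbol", "gene_name"),
    max_id_fraction=0.5,
):
    """Return a column name, or None when the index itself holds symbols.

    var_columns: mapping of column name -> list of values (hypothetical
    stand-in for the per-gene annotation table).
    """
    for name in candidates:
        if name in var_columns:
            if id_like_fraction(var_columns[name]) <= max_id_fraction:
                return name
    if id_like_fraction(index_values) <= max_id_fraction:
        return None
    raise ValueError("No column that looks like gene symbols was found")
```

Returning `None` for "use the index" matches the commit's note that `gene_symbols_column` now accepts `None`, though the exact semantics in the package may differ.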
* Update batch size for expression percentage calculations and refactor aggregation logic
- Increased the default batch size for calculating expression percentages from 2000 to 5000 to optimize memory usage.
- Refactored the aggregate_expression_percentages function to utilize a single-pass row-batched accumulation method for improved performance.
- Introduced a new _accumulate_group_stats function to streamline the computation of per-group statistics, enhancing efficiency for large datasets.
- Updated related documentation to reflect changes in parameters and processing logic.
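The single-pass, row-batched accumulation described above can be sketched for a dense matrix as follows. This is an illustration under assumptions, not the code in this commit: the function names, the `group_codes` integer encoding, and the return shapes are hypothetical; only the default batch size of 5000 comes from the message.

```python
import numpy as np


def accumulate_group_stats(X, group_codes, n_groups, batch_size=5000):
    """Count, per group and gene, how many cells have expression > 0.

    X is a (cells x genes) array; group_codes holds one integer group
    index per cell. Rows are visited once, in batches of batch_size.
    """
    X = np.asarray(X)
    group_codes = np.asarray(group_codes, dtype=np.int64)
    n_cells, n_genes = X.shape
    expressed = np.zeros((n_groups, n_genes), dtype=np.int64)
    sizes = np.zeros(n_groups, dtype=np.int64)
    for start in range(0, n_cells, batch_size):
        stop = min(start + batch_size, n_cells)
        codes = group_codes[start:stop]
        batch = (X[start:stop] > 0).astype(np.int64)
        # Unbuffered add so repeated group codes within a batch all count.
        np.add.at(expressed, codes, batch)
        sizes += np.bincount(codes, minlength=n_groups)
    return expressed, sizes


def expression_percentages(X, group_codes, n_groups, batch_size=5000):
    """Percentage of cells per group expressing each gene."""
    expressed, sizes = accumulate_group_stats(X, group_codes, n_groups, batch_size)
    # Guard against empty groups to avoid division by zero.
    return 100.0 * expressed / np.maximum(sizes, 1)[:, None]
```

Because only one batch of rows is materialised at a time, peak memory scales with `batch_size` times the number of genes rather than with the full matrix.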
* Refactor logging and enhance progress reporting in CyteType
- Removed unnecessary logging statements for calculating expression percentages and extracting visualization coordinates to streamline output.
- Updated logging message for saving obs.duckdb artifact for clarity.
- Integrated progress reporting using tqdm for batch processing in save_features_matrix and extract_visualization_coordinates functions.
- Improved handling of warnings during batch processing to suppress FutureWarnings from tqdm.
- Adjusted progress descriptions for better user feedback during long-running operations.
* Add WRITE_MEM_BUDGET constant and enhance logging in CyteType
- Introduced WRITE_MEM_BUDGET constant in config.py to define memory budget for writing artifacts.
- Updated logging messages in main.py for clarity during artifact saving processes.
- Enhanced progress reporting in artifact writing functions to improve user feedback.
- Refactored warning handling to suppress FutureWarnings from tqdm during batch processing.
- Added new functions in artifacts.py for improved handling of sparse matrix writing and progress tracking.
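A memory budget constant is typically turned into a batch size by dividing it by the estimated size of one row. The sketch below shows that arithmetic only. The value assigned to `WRITE_MEM_BUDGET`, the helper names, and the dense per-row size estimate are assumptions; the commit message does not state how the constant is actually used.

```python
# Assumption: budget expressed in bytes; the real value is not given here.
WRITE_MEM_BUDGET = 512 * 1024 * 1024


def rows_per_batch(n_cols, itemsize, budget=WRITE_MEM_BUDGET):
    """Largest number of dense rows that fits in the budget (at least 1)."""
    bytes_per_row = max(1, n_cols * itemsize)
    return max(1, budget // bytes_per_row)


def batch_ranges(n_rows, batch_rows):
    """Yield (start, stop) row ranges covering 0..n_rows in order."""
    for start in range(0, n_rows, batch_rows):
        yield start, min(start + batch_rows, n_rows)
```

For example, with 20,000 float32 columns a row takes 80,000 bytes, so a 512 MiB budget allows 6,710 rows per batch.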
* Enhance file upload functionality and error handling in CyteType
- Increased maximum upload size for vars_h5 from 10GB to 50GB to accommodate larger datasets.
- Introduced a new ClientDisconnectedError exception to handle client disconnection scenarios.
- Improved progress reporting during file uploads by integrating tqdm for better user feedback.
- Refactored upload logic to ensure consistent progress updates and error handling across different upload scenarios.
* Add subsampling functionality to preprocessing module
- Introduced a new `subsample_by_group` function in `subsampling.py` to limit the number of cells per group in an AnnData object.
- Updated `__init__.py` to include `subsample_by_group` in the public API of the preprocessing module.
- Enhanced error handling to check for the existence of the specified group key in the AnnData object.
- Added logging to report the results of the subsampling process.
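Capping the number of cells per group comes down to grouping row indices by label and sampling from any group that exceeds the cap. The following is a minimal sketch of that logic on plain Python data, assuming a hypothetical `obs` mapping of column name to values; it is not the `subsample_by_group` implementation, which operates on AnnData objects.

```python
import random
from collections import defaultdict


def subsample_indices_by_group(obs, group_key, max_per_group, seed=0):
    """Return sorted row indices with at most max_per_group rows per group."""
    if group_key not in obs:
        raise KeyError(f"Group key {group_key!r} not found in obs")
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, label in enumerate(obs[group_key]):
        by_group[label].append(idx)
    kept = []
    for indices in by_group.values():
        if len(indices) > max_per_group:
            # Sample without replacement from oversized groups only.
            indices = rng.sample(indices, max_per_group)
        kept.extend(indices)
    # Sorting keeps the original row order in the subsampled result.
    return sorted(kept)
```

The returned indices could then be used to slice the full object once, rather than building and concatenating one subset per group.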
* Refactor subsampling functionality and improve logging in preprocessing module
- Enhanced the `subsample_by_group` function to optimize performance and memory usage during subsampling.
- Improved logging to provide clearer insights into the subsampling process and results.
- Updated error handling to ensure robustness when dealing with edge cases in AnnData objects.
- Refactored related tests to validate the new subsampling logic and logging enhancements.
* formatted
* Update subsampling logic to merge subsets by taking the first occurrence in the preprocessing module
- Modified the `subsample_by_group` function to use `merge="first"` when concatenating subsampled subsets, ensuring that the first occurrence of each observation is retained.
- This change enhances the subsampling process by providing a more consistent output when merging groups.
* Enhance gene name processing in preprocessing module
- Added `clean_gene_names` function to extract gene symbols from composite gene names, improving the handling of gene identifiers.
- Updated `extract_marker_genes` to utilize `clean_gene_names` for better gene name management.
- Integrated `clean_gene_names` into the `CyteType` class for consistent gene name processing across the module.
- Enhanced logging to provide insights when composite gene values are cleaned.
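To illustrate what extracting a symbol from a composite gene name can involve, the sketch below splits on a delimiter and drops the component that looks like an Ensembl ID. The composite format (for example `ENSG00000141510_TP53`), the delimiters, and the ID pattern are assumptions; the rules used by `clean_gene_names` are not spelled out in this commit message.

```python
import re

# Assumption: the ID half of a composite name is an Ensembl gene ID.
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def clean_gene_name(name, delimiters="_|"):
    """Return the symbol part of an 'ID<delim>symbol' style name."""
    parts = [p for p in re.split("[" + re.escape(delimiters) + "]", name) if p]
    if len(parts) < 2:
        return name
    symbols = [p for p in parts if not ENSEMBL_ID.match(p)]
    if not symbols or len(symbols) == len(parts):
        # No ID component recognised, or nothing but IDs: leave untouched.
        return name
    return symbols[0]


def clean_gene_names_sketch(names):
    """Apply clean_gene_name to every entry of a list."""
    return [clean_gene_name(n) for n in names]
```

Hyphens are deliberately not treated as delimiters here, since many real symbols (such as `HLA-DRA`) contain them.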
* Optimize group statistics accumulation for sparse matrices in marker detection
- Enhanced the `_accumulate_group_stats` function to handle both sparse and dense matrix inputs efficiently.
- Implemented conditional logic to process sparse matrices using CSR format, improving memory usage and performance.
- Maintained existing functionality for dense matrices, ensuring compatibility with previous implementations.
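For CSR input, per-group counts can be accumulated from the stored entries alone, without densifying any rows. The sketch below works directly on the three CSR arrays (the same `indptr`, `indices`, and `data` layout that `scipy.sparse.csr_matrix` exposes) so it needs only NumPy. The function name and signature are hypothetical and this is not the `_accumulate_group_stats` code.

```python
import numpy as np


def accumulate_expressed_csr(indptr, indices, data, group_codes, n_groups, n_genes):
    """Count cells with expression > 0 per (group, gene) from CSR arrays."""
    indptr = np.asarray(indptr, dtype=np.int64)
    indices = np.asarray(indices, dtype=np.int64)
    data = np.asarray(data)
    group_codes = np.asarray(group_codes, dtype=np.int64)
    expressed = np.zeros((n_groups, n_genes), dtype=np.int64)
    # Recover the row index of every stored entry from the row pointers.
    n_rows = len(indptr) - 1
    row_of_entry = np.repeat(np.arange(n_rows), np.diff(indptr))
    # Explicitly stored zeros must not count as expressed.
    nonzero = data > 0
    groups = group_codes[row_of_entry[nonzero]]
    genes = indices[nonzero]
    # Unbuffered add handles repeated (group, gene) pairs correctly.
    np.add.at(expressed, (groups, genes), 1)
    return expressed
```

The work here is proportional to the number of stored entries rather than to cells times genes, which is where the memory and speed benefit for sparse data comes from.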
* Increase default timeout for file uploads in CyteType
- Updated the timeout settings in both `main.py` and `client.py` from 30 seconds to 60 seconds to allow for longer upload durations, improving reliability for larger files.
* formatted
* Refactor subsampling logic in `_is_integer_valued` function to improve row selection
- Updated the logic to select rows for sampling based on the number of rows in the input matrix.
- Implemented random sampling when the number of rows exceeds the specified sample size, ensuring a more representative subset.
- Maintained functionality for cases where the number of rows is less than or equal to the sample size.
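The row-sampling behaviour described above can be sketched as follows: sample rows at random only when the matrix has more rows than the sample size, otherwise check every row. This is an illustrative version for dense input; the function name without the leading underscore, the `sample_size` default, and the `seed` parameter are assumptions rather than the signature of `_is_integer_valued`.

```python
import numpy as np


def is_integer_valued(matrix, sample_size=1000, seed=0):
    """Return True if the sampled rows contain only whole numbers."""
    matrix = np.asarray(matrix)
    n_rows = matrix.shape[0]
    if n_rows > sample_size:
        # Random rows give a more representative check than the first N.
        rng = np.random.default_rng(seed)
        rows = np.sort(rng.choice(n_rows, size=sample_size, replace=False))
    else:
        rows = np.arange(n_rows)
    sample = matrix[rows]
    # A value is whole when its remainder modulo 1 is exactly zero.
    return bool(np.all(np.mod(sample, 1) == 0))
```

Because only a sample is inspected, a `True` result on a large matrix is a strong hint that the data are raw counts, not a guarantee.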
* Update public API in `__init__.py` to include new plotting and subsampling functions
- Added `marker_dotplot` and `subsample_by_group` to the `__all__` list, making them accessible for import.
- This change enhances the module's functionality by exposing additional features for users.

1 parent 9a8fcf4 commit 27ae521
16 files changed
Lines changed: 1530 additions & 319 deletions