Commit 27ae521

Raw counts support, backed-mode differential expression, and auto-detection of gene symbols (#67)
* Update version to 0.18.0 and enhance raw counts handling in save_features_matrix
  - Bump package version to 0.18.0.
  - Introduce _resolve_raw_counts method in CyteType to improve raw counts extraction from AnnData.
  - Add _is_integer_valued utility function to check if matrices contain integer values.
  - Update save_features_matrix to handle raw counts and include them in the output HDF5 file.
  - Enhance tests to cover new raw counts functionality and integer value checks.
* Add artifact paths for vars.h5 and obs.duckdb, enhance artifact building and uploading
  - Introduced vars_h5_path and obs_duckdb_path parameters in CyteType for customizable artifact paths.
  - Implemented caching of raw counts and improved error handling during artifact creation.
  - Updated _upload_artifacts method to handle pre-built artifacts and log errors appropriately.
  - Modified integration tests to accommodate new parameters and ensure proper artifact cleanup.
* Refactor artifact cleanup in CyteType and update tests
  - Replaced the static method _cleanup_artifact_files with an instance method cleanup to manage artifact file deletion after run completion.
  - Removed the cleanup_artifacts parameter from run method, simplifying the interface.
  - Updated integration tests to verify that cleanup correctly deletes artifact files and clears associated paths.
* Add rank_genes_groups_backed function and update exports
  - Introduced rank_genes_groups_backed in marker_detection.py for memory-efficient gene ranking on backed AnnData objects.
  - Updated __init__.py files to include rank_genes_groups_backed in the public API of cytetype and preprocessing modules.
  - Refactored code for improved readability in main.py, enhancing the formatting of artifact cleanup logic.
* Enhance gene symbol handling in CyteType
  - Introduced resolve_gene_symbols_column function to auto-detect gene symbols in AnnData, improving flexibility in gene symbol management.
  - Updated gene_symbols_column type to accept None, allowing for better handling of cases where gene symbols are not explicitly provided.
  - Refactored aggregate_expression_percentages and extract_marker_genes functions to accommodate the new gene symbol resolution logic.
  - Enhanced validation in _validate_gene_symbols_column to provide clearer warnings about potential gene ID misclassifications.
* Update batch size for expression percentage calculations and refactor aggregation logic
  - Increased the default batch size for calculating expression percentages from 2000 to 5000 to optimize memory usage.
  - Refactored the aggregate_expression_percentages function to utilize a single-pass row-batched accumulation method for improved performance.
  - Introduced a new _accumulate_group_stats function to streamline the computation of per-group statistics, enhancing efficiency for large datasets.
  - Updated related documentation to reflect changes in parameters and processing logic.
* Refactor logging and enhance progress reporting in CyteType
  - Removed unnecessary logging statements for calculating expression percentages and extracting visualization coordinates to streamline output.
  - Updated logging message for saving obs.duckdb artifact for clarity.
  - Integrated progress reporting using tqdm for batch processing in save_features_matrix and extract_visualization_coordinates functions.
  - Improved handling of warnings during batch processing to suppress FutureWarnings from tqdm.
  - Adjusted progress descriptions for better user feedback during long-running operations.
* Add WRITE_MEM_BUDGET constant and enhance logging in CyteType
  - Introduced WRITE_MEM_BUDGET constant in config.py to define memory budget for writing artifacts.
  - Updated logging messages in main.py for clarity during artifact saving processes.
  - Enhanced progress reporting in artifact writing functions to improve user feedback.
  - Refactored warning handling to suppress FutureWarnings from tqdm during batch processing.
  - Added new functions in artifacts.py for improved handling of sparse matrix writing and progress tracking.
* Enhance file upload functionality and error handling in CyteType
  - Increased maximum upload size for vars_h5 from 10GB to 50GB to accommodate larger datasets.
  - Introduced a new ClientDisconnectedError exception to handle client disconnection scenarios.
  - Improved progress reporting during file uploads by integrating tqdm for better user feedback.
  - Refactored upload logic to ensure consistent progress updates and error handling across different upload scenarios.
* Add subsampling functionality to preprocessing module
  - Introduced a new `subsample_by_group` function in `subsampling.py` to limit the number of cells per group in an AnnData object.
  - Updated `__init__.py` to include `subsample_by_group` in the public API of the preprocessing module.
  - Enhanced error handling to check for the existence of the specified group key in the AnnData object.
  - Added logging to report the results of the subsampling process.
* Refactor subsampling functionality and improve logging in preprocessing module
  - Enhanced the `subsample_by_group` function to optimize performance and memory usage during subsampling.
  - Improved logging to provide clearer insights into the subsampling process and results.
  - Updated error handling to ensure robustness when dealing with edge cases in AnnData objects.
  - Refactored related tests to validate the new subsampling logic and logging enhancements.
* formatted
* Update subsampling logic to merge subsets by taking the first occurrence in the preprocessing module
  - Modified the `subsample_by_group` function to use `merge="first"` when concatenating subsampled subsets, ensuring that the first occurrence of each observation is retained.
  - This change enhances the subsampling process by providing a more consistent output when merging groups.
* Enhance gene name processing in preprocessing module
  - Added `clean_gene_names` function to extract gene symbols from composite gene names, improving the handling of gene identifiers.
  - Updated `extract_marker_genes` to utilize `clean_gene_names` for better gene name management.
  - Integrated `clean_gene_names` into the `CyteType` class for consistent gene name processing across the module.
  - Enhanced logging to provide insights when composite gene values are cleaned.
* Optimize group statistics accumulation for sparse matrices in marker detection
  - Enhanced the `_accumulate_group_stats` function to handle both sparse and dense matrix inputs efficiently.
  - Implemented conditional logic to process sparse matrices using CSR format, improving memory usage and performance.
  - Maintained existing functionality for dense matrices, ensuring compatibility with previous implementations.
* Increase default timeout for file uploads in CyteType
  - Updated the timeout settings in both `main.py` and `client.py` from 30 seconds to 60 seconds to allow for longer upload durations, improving reliability for larger files.
* formatted
* Refactor subsampling logic in `_is_integer_valued` function to improve row selection
  - Updated the logic to select rows for sampling based on the number of rows in the input matrix.
  - Implemented random sampling when the number of rows exceeds the specified sample size, ensuring a more representative subset.
  - Maintained functionality for cases where the number of rows is less than or equal to the sample size.
* Update public API in `__init__.py` to include new plotting and subsampling functions
  - Added `marker_dotplot` and `subsample_by_group` to the `__all__` list, making them accessible for import.
  - This change enhances the module's functionality by exposing additional features for users.
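The `_is_integer_valued` entries above describe checking every row of a small matrix and only a random sample of rows from a large one. The body of that function is not part of the files shown on this page, so the following is only a rough sketch of the idea using plain Python lists in place of an AnnData matrix; the name `looks_integer_valued`, the `sample_size` default, and the `seed` argument are hypothetical.

```python
import random


def looks_integer_valued(rows, sample_size=1000, seed=0):
    """Return True if every value in a row sample is a whole number.

    Hypothetical stand-in: `rows` is a list of numeric rows, not an AnnData matrix.
    """
    if len(rows) > sample_size:
        # More rows than the sample size: check a random subset only.
        picked = random.Random(seed).sample(range(len(rows)), sample_size)
    else:
        # Few enough rows to check all of them.
        picked = range(len(rows))
    # float.is_integer() is False for fractional values, NaN, and infinities.
    return all(float(v).is_integer() for i in picked for v in rows[i])
```

Because only a sample is inspected, a large matrix with a few fractional rows can still pass; the check is a heuristic for telling raw counts from normalized values, not a proof.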
1 parent 9a8fcf4 commit 27ae521

16 files changed

Lines changed: 1530 additions & 319 deletions

cytetype/__init__.py

Lines changed: 5 additions & 2 deletions
@@ -1,11 +1,14 @@
-__version__ = "0.17.0"
+__version__ = "0.18.0"
 
 import requests
 
 from .config import logger
 from .main import CyteType
+from .plotting import marker_dotplot
+from .preprocessing.marker_detection import rank_genes_groups_backed
+from .preprocessing.subsampling import subsample_by_group
 
-__all__ = ["CyteType"]
+__all__ = ["CyteType", "marker_dotplot", "rank_genes_groups_backed", "subsample_by_group"]
 
 _PYPI_JSON_URL = "https://pypi.org/pypi/cytetype/json"
 
cytetype/api/client.py

Lines changed: 48 additions & 22 deletions
@@ -14,19 +14,32 @@
 
 MAX_UPLOAD_BYTES: dict[UploadFileKind, int] = {
     "obs_duckdb": 100 * 1024 * 1024,  # 100MB
-    "vars_h5": 10 * 1024 * 1024 * 1024,  # 10GB
+    "vars_h5": 50 * 1024 * 1024 * 1024,  # 50GB
 }
 
 _CHUNK_RETRY_DELAYS = (1, 5, 20)
 _RETRYABLE_API_ERROR_CODES = frozenset({"INTERNAL_ERROR", "HTTP_ERROR"})
 
 
+def _try_import_tqdm() -> type | None:
+    try:
+        import warnings
+
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            from tqdm.auto import tqdm
+
+        return tqdm  # type: ignore[no-any-return]
+    except ImportError:
+        return None
+
+
 def _upload_file(
     base_url: str,
     auth_token: str | None,
     file_kind: UploadFileKind,
     file_path: str,
-    timeout: float | tuple[float, float] = (30.0, 3600.0),
+    timeout: float | tuple[float, float] = (60.0, 3600.0),
     max_workers: int = 4,
 ) -> UploadResponse:
     path_obj = Path(file_path)
@@ -62,6 +75,12 @@ def _upload_file(
     # Memory is bounded to ~max_workers × chunk_size because each thread
     # reads its chunk on demand via seek+read.
     _tls = threading.local()
+    tqdm_cls = _try_import_tqdm()
+    pbar = (
+        tqdm_cls(total=n_chunks, desc="Uploading", unit="chunk")
+        if tqdm_cls is not None and n_chunks > 0
+        else None
+    )
     _progress_lock = threading.Lock()
     _chunks_done = [0]
 
@@ -82,15 +101,18 @@ def _upload_chunk(chunk_idx: int) -> None:
                     data=chunk_data,
                     timeout=timeout,
                 )
-                with _progress_lock:
-                    _chunks_done[0] += 1
-                    done = _chunks_done[0]
-                    pct = 100 * done / n_chunks
-                    print(
-                        f"\r Uploading: {done}/{n_chunks} chunks ({pct:.0f}%)",
-                        end="",
-                        flush=True,
-                    )
+                if pbar is not None:
+                    pbar.update(1)
+                else:
+                    with _progress_lock:
+                        _chunks_done[0] += 1
+                        done = _chunks_done[0]
+                        pct = 100 * done / n_chunks
+                        print(
+                            f"\r Uploading: {done}/{n_chunks} chunks ({pct:.0f}%)",
+                            end="",
+                            flush=True,
+                        )
                 return
             except (NetworkError, TimeoutError) as exc:
                 last_exc = exc
@@ -103,13 +125,9 @@ def _upload_chunk(chunk_idx: int) -> None:
             if attempt < len(_CHUNK_RETRY_DELAYS):
                 delay = _CHUNK_RETRY_DELAYS[attempt]
                 logger.warning(
-                    "Chunk %d/%d upload failed (attempt %d/%d), retrying in %ds: %s",
-                    chunk_idx + 1,
-                    n_chunks,
-                    attempt + 1,
-                    1 + len(_CHUNK_RETRY_DELAYS),
-                    delay,
-                    last_exc,
+                    f"Chunk {chunk_idx + 1}/{n_chunks} upload failed "
+                    f"(attempt {attempt + 1}/{1 + len(_CHUNK_RETRY_DELAYS)}), "
+                    f"retrying in {delay}s: {last_exc}"
                 )
                 time.sleep(delay)
 
@@ -120,9 +138,17 @@ def _upload_chunk(chunk_idx: int) -> None:
     try:
         with ThreadPoolExecutor(max_workers=effective_workers) as pool:
             list(pool.map(_upload_chunk, range(n_chunks)))
-        print(f"\r \033[92m✓\033[0m Uploaded {n_chunks}/{n_chunks} chunks (100%)")
+        if pbar is not None:
+            pbar.close()
+        else:
+            print(
+                f"\r \033[92m✓\033[0m Uploaded {n_chunks}/{n_chunks} chunks (100%)"
+            )
     except BaseException:
-        print()  # ensure newline on failure
+        if pbar is not None:
+            pbar.close()
+        else:
+            print()
         raise
 
     # Step 3 – Complete upload (returns same UploadResponse shape as before)
@@ -136,7 +162,7 @@ def upload_obs_duckdb(
     base_url: str,
     auth_token: str | None,
     file_path: str,
-    timeout: float | tuple[float, float] = (30.0, 3600.0),
+    timeout: float | tuple[float, float] = (60.0, 3600.0),
     max_workers: int = 4,
 ) -> UploadResponse:
     return _upload_file(
@@ -153,7 +179,7 @@ def upload_vars_h5(
     base_url: str,
     auth_token: str | None,
     file_path: str,
-    timeout: float | tuple[float, float] = (30.0, 3600.0),
+    timeout: float | tuple[float, float] = (60.0, 3600.0),
     max_workers: int = 4,
 ) -> UploadResponse:
     return _upload_file(
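
The client.py changes combine three ideas: an optional tqdm import, a fixed schedule of retry delays per chunk, and a lock-protected counter as the fallback progress display. The sketch below shows how these could fit together outside the package. It is not the code from `_upload_file`: `upload_all`, `send_chunk`, and the injectable `sleep` argument are hypothetical names, and `OSError` stands in for the package's own `NetworkError` and `TimeoutError`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

RETRY_DELAYS = (1, 5, 20)  # seconds to wait after the 1st, 2nd, 3rd failure


def try_import_tqdm():
    """Return the tqdm class if it is installed, else None."""
    try:
        from tqdm.auto import tqdm
        return tqdm
    except ImportError:
        return None


def upload_all(n_chunks, send_chunk, delays=RETRY_DELAYS, max_workers=4,
               sleep=time.sleep):
    """Send chunks 0..n_chunks-1 in parallel, retrying each failed chunk.

    `send_chunk(idx)` is a hypothetical callable that raises OSError on failure.
    """
    tqdm_cls = try_import_tqdm()
    pbar = None
    if tqdm_cls is not None and n_chunks > 0:
        pbar = tqdm_cls(total=n_chunks, desc="Uploading", unit="chunk")
    lock = threading.Lock()
    done = [0]

    def upload_one(idx):
        # One initial attempt plus one retry per entry in `delays`.
        for attempt in range(1 + len(delays)):
            try:
                send_chunk(idx)
            except OSError:
                if attempt == len(delays):
                    raise  # retries exhausted
                sleep(delays[attempt])
                continue
            with lock:
                done[0] += 1
                if pbar is not None:
                    pbar.update(1)
                else:
                    print(f"\r  Uploading: {done[0]}/{n_chunks} chunks",
                          end="", flush=True)
            return

    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(upload_one, range(n_chunks)))
    finally:
        # Close the bar, or end the "\r" line, on success and on failure.
        if pbar is not None:
            pbar.close()
        else:
            print()
    return done[0]
```

Passing `sleep` as a parameter is a choice made for this sketch so the retry path can be exercised without waiting; the diff itself calls `time.sleep(delay)` directly.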

cytetype/api/exceptions.py

Lines changed: 7 additions & 0 deletions
@@ -56,6 +56,12 @@ class LLMValidationError(APIError):
     pass
 
 
+class ClientDisconnectedError(APIError):
+    """Server detected client disconnection mid-request - CLIENT_DISCONNECTED (HTTP 499)."""
+
+    pass
+
+
 # Client-side errors with default messages
 class TimeoutError(CyteTypeError):
     """Client-side timeout waiting for results."""
@@ -87,6 +93,7 @@ def __init__(
     "JOB_NOT_FOUND": JobNotFoundError,
     "JOB_FAILED": JobFailedError,
     "LLM_VALIDATION_FAILED": LLMValidationError,
+    "CLIENT_DISCONNECTED": ClientDisconnectedError,
     "JOB_PROCESSING": APIError,  # Generic - expected during polling
     "JOB_NOT_COMPLETED": APIError,  # Generic
     "HTTP_ERROR": APIError,  # Generic
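
The second hunk registers the new exception in a table keyed by server error code. How the package turns a code into a raised exception is not visible in this diff; the sketch below shows one plausible lookup with stand-in classes. `ERROR_CLASSES`, `exception_for`, the `error_code` attribute, and the fallback to `APIError` for unknown codes are all assumptions made for illustration.

```python
class APIError(Exception):
    """Base class for server-reported errors (stand-in for the package's own)."""

    def __init__(self, message, error_code=None):
        super().__init__(message)
        self.error_code = error_code


class ClientDisconnectedError(APIError):
    """Server saw the client disconnect mid-request."""


# Hypothetical registry mirroring the shape of the mapping in the diff.
ERROR_CLASSES = {
    "CLIENT_DISCONNECTED": ClientDisconnectedError,
    "HTTP_ERROR": APIError,
}


def exception_for(error_code, message):
    """Build the exception instance registered for `error_code`.

    Unknown codes fall back to the generic APIError (an assumption of this sketch).
    """
    cls = ERROR_CLASSES.get(error_code, APIError)
    return cls(message, error_code=error_code)
```

With a dedicated subclass, calling code can catch `ClientDisconnectedError` separately from other `APIError` cases, for example to decide whether a request is worth resending.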

cytetype/config.py

Lines changed: 2 additions & 0 deletions
@@ -24,3 +24,5 @@ def _log_format(record: Record) -> str:
     level="INFO",
     format=_log_format,
 )
+
+WRITE_MEM_BUDGET: int = 4 * 1024 * 1024 * 1024  # 4 GB
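
The commit message says this constant defines a memory budget for writing artifacts, but the code that consumes it is not among the files shown here. One common way to use such a budget is to derive a row-batch size from it; the sketch below illustrates that arithmetic only. `rows_per_batch` and its parameters are hypothetical and may not match how cytetype actually applies the constant.

```python
WRITE_MEM_BUDGET = 4 * 1024 * 1024 * 1024  # same value as the constant in the diff


def rows_per_batch(n_cols, bytes_per_value=4, budget=WRITE_MEM_BUDGET):
    """Largest number of dense rows whose values fit in `budget` bytes (at least 1)."""
    # Guard against zero-width input so the division below is always defined.
    bytes_per_row = max(1, n_cols) * bytes_per_value
    return max(1, budget // bytes_per_row)
```

For a matrix with 20,000 columns of 4-byte values, each dense row takes 80,000 bytes, so `rows_per_batch(20000)` returns 53687 rows per batch under the 4 GB budget.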
