From 3b509852b8279b378fc6d362b9f3484a5f974376 Mon Sep 17 00:00:00 2001
From: Brae Bigge
Date: Fri, 22 May 2026 13:58:26 -0700
Subject: [PATCH 1/2] fixed warnings from reproducibility audit

---
 README.md | 4 +++-
 setup.py  | 3 +--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index aea3028..54ef473 100644
--- a/README.md
+++ b/README.md
@@ -74,6 +74,8 @@ The pipeline supports two modes: **Search** and **Cluster**. Both modes are impl
 
 In this mode, the pipeline starts with a set of input proteins of interest in PDB and FASTA format and performs broad BLAST and Foldseek searches to identify hits. The pipeline aggregates all hits, downloads PDBs, and builds a map.
 
+Note: Search mode depends on live external APIs (Foldseek, BLAST, UniProt, and AlphaFold database). These services may change, experience downtime, or enforce rate limits, so results may not be fully reproducible across runs separated by significant time.
+
 ![search-mode-rulegraph](rulegraph-search-mode.png)
 
 #### Inputs
@@ -135,7 +137,7 @@ In this mode, the pipeline starts with a folder containing PDBs of interest and
   - `output`: directory where all pipeline outputs are placed.
   - `analysis_name`: nickname for the analysis, appended to important output files.
   - `features_file`: path to features file (described below).
-  - (Optional) `keyids`: a list of one or more key `protid` corresponding to the proteins to highlight in the output plots (similar to how the input proteins are highlighted in 'search' mode). Note: if not provided, the output directory `key_protid_tmscore_results` will be empty, as will the `protein_features/key_protid_tmscore_features.tsv` file.
+  - (Optional) `keyids`: a list of one or more key `protid` corresponding to the proteins to highlight in the output plots (similar to how the input proteins are highlighted in 'search' mode). Note: if not provided, the output directory `key_protid_tmscore_results` will be empty, as will the `protein_features/key_protid_tmscore_features.tsv` file. If `keyids` is provided, the pipeline will make Foldseek API calls for each key protein, so an internet connection is still required even in cluster mode.
   - See [`config.yml`](config.yml) for additional parameters.
 - Features file with protein metadata.
   - Usually, we call this file `uniprot_features.tsv` but you can use any name.
diff --git a/setup.py b/setup.py
index 539f9ac..24559a6 100644
--- a/setup.py
+++ b/setup.py
@@ -2,7 +2,7 @@
 
 setup(
     name="ProteinCartography",
-    url="https://github.com/Arcadia-Science/ProteinCartography-private",
+    url="https://github.com/Arcadia-Science/ProteinCartography",
     author="Dennis Sun",
     author_email="dennis.sun@arcadiascience.com",
     packages=["ProteinCartography"],
@@ -16,7 +16,6 @@
         "ProteinCartography/esmfold_apiquery.py",
         "ProteinCartography/extract_blast_hits.py",
         "ProteinCartography/extract_foldseek_hits.py",
-        "ProteinCartography/extract_input_protein_distances.py",
         "ProteinCartography/fetch_accession.py",
         "ProteinCartography/foldseek_apiquery.py",
         "ProteinCartography/foldseek_clustering.py",

From 5a6b330d0964331feeb35fe677dcb11c199a6adb Mon Sep 17 00:00:00 2001
From: Brae Bigge
Date: Wed, 27 May 2026 10:23:34 -0700
Subject: [PATCH 2/2] pin python version in analysis.yml to fix env rebuild issue

---
 envs/analysis.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/envs/analysis.yml b/envs/analysis.yml
index 7d76efd..7eb5a8d 100644
--- a/envs/analysis.yml
+++ b/envs/analysis.yml
@@ -3,6 +3,7 @@ channels:
   - bioconda
   - defaults
 dependencies:
+  - python=3.9
   - leidenalg=0.9.1
   - scanpy=1.9.3
   - scikit-learn=1.2.2