|
1 | | - # GEMCAT: Gene Expression-based Metabolite Centrality Analyses Tool |
2 | | -A computational toolbox associated with the manuscript entitled _GEMCAT — A new algorithm for gene expression-based prediction of metabolic alterations_. |
3 | | -Cite using: https://doi.org/10.1093/nargab/lqaf003 |
| 1 | +# GEMCAT: Gene Expression-based Metabolite Centrality Analyses Tool |
4 | 2 |
|
5 | | -Note: We are still refining the tool. Particularly, GEMCAT does not yet provide guidance for significance of predicted changes or any other measure of prediction quality. We suggest filtering the predictions for consistency. We do not recommend pre-filtering the transcriptomics and proteomics data based on significance as this is affecting the network coverage which might negatively impact the prediction quality as genes/proteins not present in the dataset should be unchanged. |
| 3 | +GEMCAT is a computational toolbox designed to predict metabolic alterations based on gene expression data. It's the |
| 4 | +accompanying software for our manuscript, "_GEMCAT — A new algorithm for gene expression-based prediction of metabolic alterations_." |
6 | 5 |
|
7 | | -## Compatibility |
8 | | -We tested the package for compatibility with Python >= 3.10 on Ubuntu and Windows. |
9 | | - |
10 | | -## Installation |
11 | | -Install from pip: |
12 | | - |
13 | | -```pip install gemcat``` |
14 | | - |
15 | | -Or clone the repository and install GEMCAT from there using: |
| 6 | +## Quick Links |
| 7 | +* **How to Cite:** [https://doi.org/10.1093/nargab/lqaf003](https://doi.org/10.1093/nargab/lqaf003) |
| 8 | +* **PyPI:** [https://pypi.org/project/gemcat/](https://pypi.org/project/gemcat/) |
| 9 | +* **Source Code (GitHub):** [https://github.com/MolecularBioinformatics/GEMCAT](https://github.com/MolecularBioinformatics/GEMCAT) |
16 | 10 |
|
17 | | -```pip install .``` |
| 11 | +## Important Considerations |
18 | 12 |
|
| 13 | +* **Prediction Quality:** GEMCAT is still under refinement. It doesn't yet provide guidance on the |
| 14 | + statistical significance of predicted changes or any other measure of prediction quality. We |
| 15 | + recommend **filtering predictions for consistency** based on your domain knowledge. |
| 16 | +* **Data Pre-filtering:** We **don't recommend pre-filtering transcriptomics and proteomics data based |
| 17 | + on significance**. This can negatively impact network coverage, as genes/proteins not present in the |
| 18 | + filtered dataset are implicitly considered "unchanged" by GEMCAT. |
| 19 | +* **Graphical User Interface (GUI):** We are actively developing a user-friendly GUI for GEMCAT, |
| 20 | + which will be released soon. Stay tuned for updates on our GitHub repository and PyPI page! A **development version** |
| 21 | + of the GUI is currently hosted in a private repository; if you're interested in gaining early access, |
| 22 | + please contact **suraj.sharma@uib.no**. |
19 | 23 |
|
20 | | -## Usage |
| 24 | +--- |
21 | 25 |
|
22 | | -### Standard workflow from the Command-Line Interface (CLI) |
| 26 | +## Compatibility |
| 27 | +GEMCAT has been tested and is compatible with **Python >= 3.10** on Ubuntu and Windows operating systems. |
23 | 28 |
|
24 | | -Use a single file containing per-gene fold-changes to calculate the resulting differential centralities: |
25 | | -``` gemcat <./expression_file.csv> <./model_file.xml> -e <column_name> -o <result_file.csv>``` |
26 | | -Make sure the .csv file is either comma- or tab-delimited. |
27 | | -`column_name` is the name of the column in the file containing the fold-change. |
| 29 | +## Installation |
| 30 | +You can install GEMCAT in two ways: |
28 | 31 |
|
29 | | -Alternatively, use two files (or one file) with expression values for condition and baseline: |
30 | | -``` gemcat <./condition_file.csv> <./model_file.xml> -e <condition_column_name> -b <./baseline_file> -c <baseline_column_name> -o <result_file.csv>``` |
| 32 | +1. **Using pip (recommended):** |
| 33 | + ```bash |
| 34 | + pip install gemcat |
| 35 | + ``` |
| 36 | +2. **From source (for developers or specific versions):** |
| 37 | + First, clone the repository, then install: |
| 38 | + ```bash |
| 39 | + git clone https://github.com/MolecularBioinformatics/GEMCAT.git |
| 40 | + cd GEMCAT |
| 41 | + pip install . |
| 42 | + ``` |
| 43 | +--- |
31 | 44 |
|
32 | | -If you do not have a model file ready, some models can be automatically accessed using their names: |
33 | | -``` gemcat ./expression_file.csv <model_name> -e column_name -o <result_file.csv>``` |
| 45 | +## How to Use GEMCAT |
34 | 46 |
|
35 | | -Model names currently supported are: |
36 | | -- ```recon3d```: [Recon3D](http://bigg.ucsd.edu/models/Recon3D) |
37 | | -- ```ratgem```: [Rat-GEM](https://github.com/SysBioChalmers/Rat-GEM) |
| 47 | +GEMCAT offers both a Python API for flexible, programmatic access and a command-line interface (CLI) for straightforward, scriptable use. |
38 | 48 |
|
| 49 | +### Python Workflow with CobraPy |
39 | 50 |
|
40 | | -Currently, GEMCAT supports models in SBML, JSON, and MAT formats. |
| 51 | +For more control and integration into existing Python projects, use the `workflow_standard` function: |
41 | 52 |
|
42 | | -Important points to remember: |
43 | | -Your gene or protein identifiers should be the first column of the expression file. |
44 | | -Make sure the gene or protein identifiers in your expression data file exactly match those in the model. |
45 | | -A results list of all 1.0 is a sure sign of no identifier matching. |
| 53 | +```python |
| 54 | +import gemcat as gc |
| 55 | +import cobra # Assuming cobrapy is installed for model handling |
| 56 | +import pandas as pd # For pd.Series |
46 | 57 |
|
47 | | -Positional arguments: |
48 | | -- expression file path |
49 | | -- model file path |
| 58 | +# Example usage (replace with your actual data and model) |
| 59 | +# Make sure your mapped_genes_baseline and mapped_genes_comparison are pandas Series |
| 60 | +# with gene/protein identifiers as the index. |
50 | 61 |
|
51 | | -All parameters: |
52 | | -`-e --expressioncolumn` name of column containing condition expression data |
53 | | -`-b BASELINE, --baseline` file containing baseline expression data |
54 | | -`-c BASELINECOLUMN, --baselinecolumn` name of column containing baseline expression data |
55 | | -`-v VERBOSE, --verbose` enables verbose output |
56 | | -`-o OUTFILE, --outfile` write output to this file |
57 | | -`-l LOGFILE, --logfile` write logs to this file |
| 62 | +# Example: Load a CobraPy model |
| 63 | +# model = cobra.io.read_sbml_model("your_model.xml") |
58 | 64 |
|
| 65 | +# Example: Create dummy mapped gene series |
| 66 | +# mapped_genes_baseline = pd.Series([10, 20, 30], index=['geneA', 'geneB', 'geneC']) |
| 67 | +# mapped_genes_comparison = pd.Series([15, 25, 35], index=['geneA', 'geneB', 'geneC']) |
59 | 68 |
|
60 | | -### Standard workflow in Python using a CobraPy model |
61 | | -``` |
62 | | -import gemcat as gc |
63 | 69 | results = gc.workflows.workflow_standard( |
64 | | - cobra_model: cobra.Model, |
65 | | - mapped_genes_baseline: pd.Series, |
66 | | - mapped_genes_comparison: pd.Series, |
67 | | - adjacency = gc.adjacency_transformation.ATPureAdjacency, |
68 | | - ranking = gc.ranking.PagerankNX, |
69 | | - gene_fill = 1.0 |
| 70 | + cobra_model=your_cobra_model, # Your loaded cobra.Model object |
| 71 | + mapped_genes_baseline=your_baseline_series, # pd.Series of baseline expression |
| 72 | + mapped_genes_comparison=your_comparison_series, # pd.Series of comparison expression |
| 73 | + adjacency=gc.adjacency_transformation.ATPureAdjacency, # Optional: Customize adjacency method |
| 74 | + ranking=gc.ranking.PagerankNX, # Optional: Customize ranking algorithm |
| 75 | + gene_fill=1.0 # Value to fill for genes not present in mapped_genes_comparison |
70 | 76 | ) |
71 | | -``` |
72 | | -This will return the changes in centrality relative to the baseline in a Pandas Series. |
73 | | -When using fold-changes as the mapped expression, use a vector of all ones as a comparison. |
74 | | - |
75 | | -## Modularity and Configuration |
76 | | -GEMCAT is modular, and its central components can easily be swapped out or appended by other components |
77 | | -adhering to the specifications laid out in the module base classes (primarily adjacency transformation, expression integration, and ranking components). |
78 | | -All classes inheriting from the abstract base classes laid out in the modules are exchangeable. |
79 | | - |
80 | | -## Core modules |
81 | | -### Model |
82 | | -The core of the package is the GEMCAT model structure that contains the model data, integrates the workflow, and calculates the results. |
83 | | -### adjacency_transformation |
84 | | -Different approaches can be used to calculate adjacency in the networks. |
85 | | -We offer alternatives and a platform to create custom algorithms for the model. |
86 | | -### expression |
87 | | -Module covering the mapping of gene values onto reactions in the model via gene product rules. |
88 | | -Providing different algorithms along with a platform to create alternatives. |
89 | | -### ranking |
90 | | -Module providing ranking algorithms for the models along with a platform to include custom algorithms. |
91 | | -### workflows |
92 | | -The workflow module contains example workflows. |
93 | | -To customize the workflow to your needs simply copy the provided functions and switch out the desired steps. |
94 | | -### cli |
95 | | -Command-line interface for GEMCAT. |
96 | | -### io |
97 | | -Input and output functions that create GEMCAT models from different sources. |
98 | | -### utils |
99 | | -Contains common utility functions used throughout the package. |
100 | | -### verification |
101 | | -Functions to verify data integrity. |
102 | | -### model_manager |
103 | | -Functionality for automatic downloading, storing, and retrieving of common models. |
104 | 77 |
|
| 78 | +print(results) |
| 79 | +``` |
| 80 | +This function returns the changes in centrality relative to the baseline as a pandas Series. If you're |
| 81 | +using fold-changes as your `mapped_genes_comparison`, provide a vector of all 1.0s as `mapped_genes_baseline`. |
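
As a minimal sketch of the fold-change case described above, the snippet below builds the two `pd.Series` inputs from a small in-memory CSV. The column names (`gene_id`, `fold_change`) and the gene identifiers are made up for illustration, and the snippet stops before calling GEMCAT itself.

```python
import io

import pandas as pd

# Hypothetical fold-change table; identifiers sit in the first column,
# as the CLI also expects. Column names are illustrative assumptions.
csv_text = "gene_id,fold_change\ngeneA,2.0\ngeneB,0.5\ngeneC,1.0\n"

# Use the first column (the identifiers) as the index of the table.
table = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Series of fold-changes, indexed by gene/protein identifier.
fold_changes = table["fold_change"].astype(float)

# Matching vector of ones with exactly the same index.
baseline_ones = pd.Series(1.0, index=fold_changes.index)

print(fold_changes)
print(baseline_ones)
```

Under the paragraph's convention, `fold_changes` would be passed as `mapped_genes_comparison` and `baseline_ones` as `mapped_genes_baseline`.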
| 82 | +
|
| 83 | +For further examples using genome-scale metabolic networks from two different organisms, refer to: |
| 84 | +[An engineered human cell line with a functional deletion of the mitochondrial NAD transporter](https://github.com/MolecularBioinformatics/prm_manuscript/blob/main/jupyter_notebooks/pr_SLC25A51ko.ipynb), |
| 85 | +[Patients with inflammatory bowel disease](https://github.com/MolecularBioinformatics/prm_manuscript/blob/main/jupyter_notebooks/pr_UC.ipynb), and |
| 86 | +[Training-induced metabolic changes in rats](https://github.com/MolecularBioinformatics/prm_manuscript/blob/main/jupyter_notebooks/pr_rats.ipynb). |
| 87 | +
|
| 88 | +### Command-Line Interface (CLI) |
| 89 | +
|
| 90 | +The CLI allows you to calculate differential centralities using gene expression data. |
| 91 | +
|
| 92 | +**Key Requirements for Input Files:** |
| 93 | +
|
| 94 | +* Your gene or protein identifiers **must be in the first column** of your expression file. |
| 95 | +* These identifiers **must exactly match** those in your metabolic model. If you see a results list of all 1.0, it's |
| 96 | + a strong indicator of an identifier mismatch. |
| 97 | +* Expression `.csv` files can be either comma- or tab-delimited. |
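
Because an identifier mismatch only shows up as a result list of all 1.0, it can help to check the overlap before running GEMCAT. The following is a rough sketch using only the standard library; the helper names (`read_identifiers`, `match_report`), the example identifiers, and the CSV contents are hypothetical and not part of GEMCAT. With a loaded CobraPy model, the model-side identifiers could be collected as `[gene.id for gene in model.genes]`.

```python
import csv
import io


def read_identifiers(handle, delimiter=","):
    """Return the first-column identifiers of a delimited file, skipping the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header row
    return [row[0] for row in reader if row]


def match_report(expression_ids, model_gene_ids):
    """Summarize how many expression identifiers also occur in the model."""
    expression_ids = set(expression_ids)
    model_gene_ids = set(model_gene_ids)
    matched = expression_ids & model_gene_ids
    return {
        "matched": len(matched),
        "unmatched_expression": sorted(expression_ids - model_gene_ids),
        "model_coverage": len(matched) / len(model_gene_ids) if model_gene_ids else 0.0,
    }


# Illustrative data only: one identifier uses a different naming scheme.
example_csv = io.StringIO("gene_id,fold_change\n1591.1,2.0\n1594.1,0.5\nENSG000001,1.2\n")
expression_ids = read_identifiers(example_csv)
model_gene_ids = ["1591.1", "1594.1", "3242.1", "4967.1"]

report = match_report(expression_ids, model_gene_ids)
print(report)
```

A `matched` count of zero would point to the mismatch described above. For a tab-delimited file, pass `delimiter="\t"`.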
| 98 | + |
| 99 | +**Common Workflows:** |
| 100 | + |
| 101 | +1. **Using a single file with pre-calculated fold-changes:** |
| 102 | + ```bash |
| 103 | + gemcat <expression_file.csv> <model_file.xml> -e <column_name> -o <result_file.csv> |
| 104 | + ``` |
| 105 | + * `<expression_file.csv>`: Path to your input file. |
| 106 | + * `<model_file.xml>`: Path to your metabolic model file (SBML, JSON, or MAT format). |
| 107 | + * `<column_name>`: The name of the column in your CSV containing the fold-change values. |
| 108 | + * `<result_file.csv>`: The desired output file path. |
| 109 | + |
| 110 | +2. **Using two files (or one) with condition and baseline expression values:** |
| 111 | + ```bash |
| 112 | + gemcat <condition_file.csv> <model_file.xml> -e <condition_column_name> -b <baseline_file.csv> -c <baseline_column_name> -o <result_file.csv> |
| 113 | + ``` |
| 114 | + * `<condition_file.csv>`: Path to the file with expression values for your experimental condition. |
| 115 | + * `<baseline_file.csv>`: Path to the file with baseline expression values. This can be the same file as the condition file when both columns are stored in one file; the second positional argument is always the model. |
| 116 | + * `<condition_column_name>`: Name of the column with condition expression data. |
| 117 | + * `<baseline_column_name>`: Name of the column with baseline expression data. |
| 118 | + |
| 119 | +3. **Using built-in models:** |
| 120 | + If you don't have a model file, GEMCAT can automatically access some common models by name: |
| 121 | + ```bash |
| 122 | + gemcat <expression_file.csv> <model_name> -e <column_name> -o <result_file.csv> |
| 123 | + ``` |
| 124 | + Currently supported model names: |
| 125 | + * `recon3d`: [Recon3D](http://bigg.ucsd.edu/models/Recon3D) |
| 126 | + * `ratgem`: [Rat-GEM](https://github.com/SysBioChalmers/Rat-GEM) |
| 127 | +
|
| 128 | +**All CLI Parameters:** |
| 129 | +
|
| 130 | +* **Positional Arguments:** |
| 131 | + * `expression_file_path`: Path to your expression data file. |
| 132 | + * `model_file_path`: Path to your metabolic model file (or model name). |
| 133 | +* **Optional Arguments:** |
| 134 | + * `-e --expressioncolumn`: Name of the column containing condition expression data (required for expression files). |
| 135 | + * `-b BASELINE, --baseline`: Path to the file containing baseline expression data. |
| 136 | + * `-c BASELINECOLUMN, --baselinecolumn`: Name of the column containing baseline expression data. |
| 137 | + * `-o OUTFILE, --outfile`: Path to write the output results. |
| 138 | + * `-v VERBOSE, --verbose`: Enables verbose output for detailed execution information. |
| 139 | + * `-l LOGFILE, --logfile`: Path to write logs. |
| 140 | + |
| 141 | +--- |
| 142 | +
|
| 143 | +## Modularity and Customization |
| 144 | +
|
| 145 | +GEMCAT is designed with a modular architecture, allowing you to easily swap out or append central components |
| 146 | +to customize its behavior. This is achieved by adhering to specifications laid out in the module base classes, particularly for: |
| 147 | +
|
| 148 | +* **Adjacency Transformation:** Defines how network adjacencies are calculated. |
| 149 | +* **Expression Integration:** Handles mapping gene expression values onto reactions. |
| 150 | +* **Ranking Components:** Implements different centrality ranking algorithms. |
| 151 | +
|
| 152 | +Any class inheriting from the abstract base classes in these modules can be exchanged. |
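
The swap-in pattern can be illustrated with a generic sketch. The class and function names below (`RankingBase`, `InDegreeRanking`, `OutDegreeRanking`, `run_workflow`) are hypothetical and do not reproduce GEMCAT's actual base classes or method signatures; consult the module base classes for the real interface. As in the `workflow_standard` call, the component is passed as a class and instantiated inside the workflow.

```python
from abc import ABC, abstractmethod


class RankingBase(ABC):
    """Hypothetical abstract base: any ranking component must implement rank()."""

    @abstractmethod
    def rank(self, adjacency):
        """Return one score per node for a square adjacency matrix (list of lists)."""


class InDegreeRanking(RankingBase):
    """Scores each node by the sum of its incoming edge weights (column sums)."""

    def rank(self, adjacency):
        n = len(adjacency)
        return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


class OutDegreeRanking(RankingBase):
    """Scores each node by the sum of its outgoing edge weights (row sums)."""

    def rank(self, adjacency):
        return [sum(row) for row in adjacency]


def run_workflow(adjacency, ranking):
    """Toy workflow: instantiate whichever ranking class was passed and apply it."""
    return ranking().rank(adjacency)


# Small directed graph: 0 -> 1, 0 -> 2, 1 -> 2
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]

in_scores = run_workflow(adjacency, InDegreeRanking)
out_scores = run_workflow(adjacency, OutDegreeRanking)
print(in_scores, out_scores)
```

Because both classes satisfy the same abstract interface, the workflow function does not change when one is exchanged for the other.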
| 153 | +
|
| 154 | +--- |
| 155 | +
|
| 156 | +## Core Modules Overview |
| 157 | +
|
| 158 | +* **`model`**: The central GEMCAT model structure, responsible for integrating workflows and calculating results. |
| 159 | +* **`adjacency_transformation`**: Provides various approaches for calculating network adjacency and a platform for custom algorithms. |
| 160 | +* **`expression`**: Manages the mapping of gene values onto reactions in the model via gene product rules, offering different algorithms along with a platform to create alternatives. |
| 161 | +* **`ranking`**: Offers various ranking algorithms for the models along with a platform to include custom algorithms. |
| 162 | +* **`workflows`**: Contains example workflows. To customize the workflow to your needs simply copy the provided functions and switch out the desired steps. |
| 163 | +* **`cli`**: Command-line interface for GEMCAT. |
| 164 | +* **`io`**: Input and output functions that create GEMCAT models from different sources. |
| 165 | +* **`utils`**: Contains common utility functions used throughout the package. |
| 166 | +* **`verification`**: Functions to verify data integrity. |
| 167 | +* **`model_manager`**: Functionality for automatic downloading, storing, and retrieving of common models. |
| 168 | +
|
| 169 | +--- |
105 | 170 |
|
106 | 171 | ## Development |
107 | | -You can run all local tests with `pytest .`. Default behavior is to also run integration tests, which takes time. |
108 | | -You can exclude slow running tests by using `pytest . -m "not slow"`. |
109 | | -These slow running tests are integration tests with _real world data_ and will take 10-30s each according to your hardware. |
110 | 172 |
|
111 | | -To run tests, make sure you have [git lfs](https://git-lfs.com/) installed and all the Tests are running. |
112 | | -Make sure to run `isort` and `black` to have properly formatted code. |
| 173 | +If you're contributing to GEMCAT: |
| 174 | + |
| 175 | +* **Running Tests:** |
| 176 | + * Run all local tests with `pytest .`. |
| 177 | + * You can exclude slow-running tests by using `pytest . -m "not slow"`. These slow-running tests are |
| 178 | + integration tests with *real-world data* and will take 10-30 seconds each depending on your hardware. |
| 179 | +* **Prerequisites:** Ensure you have [git lfs](https://git-lfs.com/) installed for tests that rely on large files. |
| 180 | +* **Code Formatting:** Before committing, make sure your code is properly formatted using `isort` and `black`. |
| 181 | +* **CI Pipeline:** The GitHub CI pipeline automatically checks for `isort`, `black`, and `pytest` compliance. |
| 182 | + |
| 183 | +--- |
| 184 | + |
| 185 | +## Contact and Support |
| 186 | + |
| 187 | +For questions, bug reports, or support, please open an issue on the |
| 188 | +[GitHub Issues page](https://github.com/MolecularBioinformatics/GEMCAT/issues). We will do our best to respond promptly. |
| 189 | + |
| 190 | +For direct inquiries about the **development version of the GEMCAT GUI** or other specific questions, you can also contact: |
113 | 191 |
|
114 | | -The CI pipeline in GitHub will check with isort, black, and pytest. |
| 192 | +* **Suraj Sharma:** suraj.sharma@uib.no |