Commit d6bd722

add README_TOO.md to graph_inputs/ for more agent context
1 parent 66fd6f0 commit d6bd722

1 file changed

Lines changed: 151 additions & 0 deletions

@@ -0,0 +1,151 @@
1+
# Graph Inputs Workflow
2+
3+
## What This Workflow Does
4+
5+
This workflow discovers microbial coalescence community matrices and taxonomy tables, validates their structure, builds a bipartite sample-taxon backbone, fits stratum-specific SpiecEasi taxon-taxon networks, and writes graph-ready CSV files for downstream graph neural network workflows.
6+
7+
## Inputs Detected
8+
9+
Community matrices are searched for in `workflows/input/` first, with the repository root and its immediate subdirectories as a fallback. The following files were detected:
10+
- `Bacteria_inoculation_experiment_0Burn_W1_donor-community.csv`
11+
- `Bacteria_inoculation_experiment_0Burn_W1_final-community.csv`
12+
- `Bacteria_inoculation_experiment_0Burn_W1_resident-community.csv`
13+
- `Bacteria_inoculation_experiment_0Burn_W2_donor-community.csv`
14+
- `Bacteria_inoculation_experiment_0Burn_W2_final-community.csv`
15+
- `Bacteria_inoculation_experiment_0Burn_W2_resident-community.csv`
16+
- `Bacteria_inoculation_experiment_1Burn_W3_donor-community.csv`
17+
- `Bacteria_inoculation_experiment_1Burn_W3_final-community.csv`
18+
- `Bacteria_inoculation_experiment_1Burn_W3_resident-community.csv`
19+
- `Bacteria_inoculation_experiment_1Burn_W4_donor-community.csv`
20+
- `Bacteria_inoculation_experiment_1Burn_W4_final-community.csv`
21+
- `Bacteria_inoculation_experiment_1Burn_W4_resident-community.csv`
22+
- `Bacteria_inoculation_experiment_3Burn_W5_donor-community.csv`
23+
- `Bacteria_inoculation_experiment_3Burn_W5_final-community.csv`
24+
- `Bacteria_inoculation_experiment_3Burn_W5_resident-community.csv`
25+
- `Bacteria_inoculation_experiment_3Burn_W6_donor-community.csv`
26+
- `Bacteria_inoculation_experiment_3Burn_W6_final-community.csv`
27+
- `Bacteria_inoculation_experiment_3Burn_W6_resident-community.csv`
28+
- `Fungi_inoculation_experiment_0Burn_W1_donor-community.csv`
29+
- `Fungi_inoculation_experiment_0Burn_W1_final-community.csv`
30+
- `Fungi_inoculation_experiment_0Burn_W1_resident-community.csv`
31+
- `Fungi_inoculation_experiment_0Burn_W2_donor-community.csv`
32+
- `Fungi_inoculation_experiment_0Burn_W2_final-community.csv`
33+
- `Fungi_inoculation_experiment_0Burn_W2_resident-community.csv`
34+
- `Fungi_inoculation_experiment_1Burn_W3_donor-community.csv`
35+
- `Fungi_inoculation_experiment_1Burn_W3_final-community.csv`
36+
- `Fungi_inoculation_experiment_1Burn_W3_resident-community.csv`
37+
- `Fungi_inoculation_experiment_1Burn_W4_donor-community.csv`
38+
- `Fungi_inoculation_experiment_1Burn_W4_final-community.csv`
39+
- `Fungi_inoculation_experiment_1Burn_W4_resident-community.csv`
40+
- `Fungi_inoculation_experiment_3Burn_W5_donor-community.csv`
41+
- `Fungi_inoculation_experiment_3Burn_W5_final-community.csv`
42+
- `Fungi_inoculation_experiment_3Burn_W5_resident-community.csv`
43+
- `Fungi_inoculation_experiment_3Burn_W6_donor-community.csv`
44+
- `Fungi_inoculation_experiment_3Burn_W6_final-community.csv`
45+
- `Fungi_inoculation_experiment_3Burn_W6_resident-community.csv`
46+
47+
Taxonomy tables used:
48+
- `Bacteria`: `Bacteria_inoculation_experiment_taxonomy_table.csv`
49+
- `Fungi`: `Fungi_inoculation_experiment_taxonomy_table.csv`
50+
51+
## Filename Metadata Parsing
52+
53+
Community filenames were parsed with the pattern `<kingdom>_inoculation_experiment_<donor_id>_<community_type>-community.csv`.
54+
The workflow extracts three fields from each filename:
55+
- `kingdom`: `Bacteria` or `Fungi`
56+
- `donor_id`: the middle filename token such as `0Burn_W1`
57+
- `community_type`: `donor`, `resident`, or `final`
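
The parsing rule above can be sketched with a regular expression. This is an illustrative Python sketch, not the code in `build_graph_inputs.R`; the names `FILENAME_RE` and `parse_community_filename` are hypothetical, and the expression assumes that `donor_id` is everything between the fixed prefix and the final `_<community_type>` token.

```python
import re

# Hypothetical pattern mirroring the documented filename convention.
FILENAME_RE = re.compile(
    r"^(?P<kingdom>Bacteria|Fungi)_inoculation_experiment_"
    r"(?P<donor_id>.+)_(?P<community_type>donor|resident|final)-community\.csv$"
)


def parse_community_filename(filename):
    """Return a dict with kingdom, donor_id, and community_type, or None if no match."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_community_filename(
        "Bacteria_inoculation_experiment_0Burn_W1_donor-community.csv"
    ))
```

Files that do not follow the convention, such as the taxonomy tables, return `None` and can be handled separately.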
58+
59+
## Sample-Taxon Bipartite Backbone
60+
61+
Each community matrix is treated as samples in rows and taxa/features in columns.
62+
The first column is used as `sample_id`; if it is not already named `sample_id`, the workflow renames it and records a warning in the summary outputs.
63+
Only nonzero abundance entries are written to the bipartite edge table.
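
As a rough illustration of the wide-to-long conversion described here, the following Python sketch turns a samples-by-taxa matrix into nonzero edges. It is not the workflow's R implementation; the function name `matrix_to_edges` and the output keys `taxon_id` and `abundance` are assumptions made for the example.

```python
import csv
import io


def matrix_to_edges(csv_text):
    """Convert a samples-by-taxa CSV into a list of nonzero sample-taxon edges."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    taxa = header[1:]  # the first column holds the sample identifier
    edges = []
    for row in reader:
        sample_id = row[0]
        for taxon, value in zip(taxa, row[1:]):
            abundance = float(value)
            if abundance != 0:  # zero entries produce no edge
                edges.append(
                    {"sample_id": sample_id, "taxon_id": taxon, "abundance": abundance}
                )
    return edges


# Small made-up matrix used only to exercise the function.
example = "sample_id,ASV1,ASV2,ASV3\nS1,0.2,0,0.8\nS2,0,0,1.0\n"

if __name__ == "__main__":
    for edge in matrix_to_edges(example):
        print(edge)
```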
64+
65+
## Taxon Naming
66+
67+
Taxon node metadata comes from the kingdom-specific taxonomy tables when available.
68+
Scientific names are constructed from `Genus` and `Species` only.
69+
- `name = "Genus Species"` when both are present
70+
- `name = "Genus"` when only genus is present
71+
- `name = ""` when genus is missing
72+
Higher taxonomy is retained only as optional metadata fields, not as the primary name.
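
The three naming rules can be written as a small function. This Python sketch is illustrative only; how missing values are encoded in the taxonomy tables is not stated here, so treating `None`, empty strings, and the literal `NA` as missing is an assumption.

```python
def scientific_name(genus, species):
    """Build a taxon name from Genus and Species following the three rules above."""

    def clean(value):
        # Assumption: None, empty strings, and the literal "NA" all mean "missing".
        text = (value or "").strip()
        return "" if text == "NA" else text

    genus, species = clean(genus), clean(species)
    if not genus:
        return ""  # genus missing: no name, even if species is present
    if not species:
        return genus  # genus only
    return f"{genus} {species}"


if __name__ == "__main__":
    print(scientific_name("Bacillus", "subtilis"))
    print(scientific_name("Bacillus", None))
```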
73+
74+
## Coalescence Triplets
75+
76+
The workflow also writes `coalescence_triplets.csv`, which defines the supervised prediction units for this workshop dataset.
77+
Resident and final communities are paired when they share the same `sample_id` within a given `kingdom` and `donor_id`.
78+
Donor communities are represented at the pooled donor-source level: all resident/final samples with the same `donor_id` share the same donor-source input.
79+
Accordingly, `donor_sample_id` is `NA`, `donor_source_id` stores the donor treatment/source, and `donor_is_pooled` is `TRUE`.
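
A minimal Python sketch of the pairing logic follows. It is not the workflow's R code: the input record layout and the `resident_sample_id` and `final_sample_id` keys are hypothetical, and setting `donor_source_id` equal to `donor_id` is an assumption. Only `donor_sample_id`, `donor_source_id`, and `donor_is_pooled` are names taken from the description above.

```python
def build_triplets(samples):
    """Pair resident and final samples that share sample_id within kingdom and donor_id."""
    seen = {}
    for s in samples:
        if s["community_type"] in ("resident", "final"):
            key = (s["kingdom"], s["donor_id"], s["sample_id"])
            seen.setdefault(key, set()).add(s["community_type"])

    triplets = []
    for (kingdom, donor_id, sample_id), types in sorted(seen.items()):
        if types == {"resident", "final"}:  # both halves of the pair must exist
            triplets.append({
                "kingdom": kingdom,
                "donor_id": donor_id,
                "resident_sample_id": sample_id,  # hypothetical column name
                "final_sample_id": sample_id,     # hypothetical column name
                "donor_sample_id": None,          # NA: donors are pooled
                "donor_source_id": donor_id,      # assumption: source equals donor_id
                "donor_is_pooled": True,
            })
    return triplets


if __name__ == "__main__":
    demo = [
        {"kingdom": "Bacteria", "donor_id": "0Burn_W1", "sample_id": "S1", "community_type": "resident"},
        {"kingdom": "Bacteria", "donor_id": "0Burn_W1", "sample_id": "S1", "community_type": "final"},
        {"kingdom": "Bacteria", "donor_id": "0Burn_W1", "sample_id": "S2", "community_type": "resident"},
    ]
    print(build_triplets(demo))
```

In this sketch a resident sample without a matching final sample (or the reverse) yields no triplet.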
80+
81+
## SpiecEasi Networks
82+
83+
SpiecEasi is run separately for each `kingdom x donor_id x community_type` stratum so that bacteria and fungi are not pooled, donor/resident/final communities are not pooled, and donor IDs are not pooled.
84+
The configured SpiecEasi method is `mb` with selection criterion `bstars`.
85+
When a stratum retains more taxa than the configured cap (`max_taxa_for_spieceasi`) allows, taxa are ranked by prevalence and then by mean abundance, and only the top subset is passed to SpiecEasi. The cap affects only the inferred taxon-taxon layer; the full bipartite backbone is preserved. In this run the cap was `Inf`, so no taxa were dropped by it.
86+
For `method = mb`, SpiecEasi uses neighborhood selection. The resulting taxon-taxon edges should be interpreted as inferred association structure, not direct ecological interactions or strict partial correlations.
87+
For `method = glasso`, the selected inverse covariance structure can be converted to partial-correlation-like associations. This script stores edge weights as absolute association strength and `sign` as the association direction from the selected SpiecEasi matrix.
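
The taxon cap described above (rank by prevalence, then by mean abundance, then keep the top subset) can be sketched as follows. This is an illustrative Python version, not the R implementation; the function name `top_taxa`, the dict-of-lists input, and the alphabetical tie-break are assumptions.

```python
def top_taxa(matrix, max_taxa=float("inf")):
    """Rank taxa by prevalence, then mean abundance, and keep at most max_taxa.

    matrix maps each taxon to its abundances across the samples of one stratum.
    """

    def rank_key(taxon):
        values = matrix[taxon]
        prevalence = sum(v > 0 for v in values) / len(values)
        mean_abundance = sum(values) / len(values)
        # Negate so that higher prevalence and abundance sort first;
        # the taxon name breaks exact ties deterministically.
        return (-prevalence, -mean_abundance, taxon)

    ranked = sorted(matrix, key=rank_key)
    if len(ranked) <= max_taxa:  # always true when max_taxa is infinite
        return ranked
    return ranked[: int(max_taxa)]


if __name__ == "__main__":
    demo = {"A": [1, 0, 0], "B": [1, 1, 0], "C": [5, 5, 0], "D": [0, 0, 0]}
    print(top_taxa(demo, max_taxa=2))
```

With an infinite cap every taxon is returned, so only the ordering changes.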
88+
89+
Workflow parameters:
90+
- `min_prevalence = 0.05`
91+
- `min_taxa = 10`
92+
- `scale_to_counts = TRUE`
93+
- `scale_factor = 10000`
94+
- `max_taxa_for_spieceasi = Inf`
95+
- `spieceasi_time_limit_seconds = Inf`
96+
- `spieceasi_method = "mb"`
97+
- `spieceasi_sel_criterion = "bstars"`
98+
- `spieceasi_ncores = 63`
99+
- `nlambda = 30`
100+
- `lambda_min_ratio = 0.01`
101+
- `rep_num = 20`
102+
- `random_seed = 1`
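
One plausible reading of how `min_prevalence`, `min_taxa`, `scale_to_counts`, and `scale_factor` interact is sketched below in Python. The exact semantics in `build_graph_inputs.R` are not documented here, so the `>=` comparison, skipping a stratum when fewer than `min_taxa` taxa remain, and rounding to integer pseudo-counts are all assumptions; `prepare_stratum` is a hypothetical name.

```python
def prepare_stratum(matrix, min_prevalence=0.05, min_taxa=10,
                    scale_to_counts=True, scale_factor=10000):
    """Filter taxa by prevalence and optionally scale relative abundances to counts.

    matrix maps each taxon to its relative abundances, one value per sample.
    Returns the prepared matrix, or None when too few taxa remain (stratum skipped).
    """
    kept = {}
    for taxon, values in matrix.items():
        prevalence = sum(v > 0 for v in values) / len(values)
        if prevalence >= min_prevalence:  # assumption: threshold is inclusive
            kept[taxon] = values

    if len(kept) < min_taxa:  # assumption: such strata are skipped
        return None

    if scale_to_counts:
        # Assumption: pseudo-counts are rounded products of abundance and scale_factor.
        kept = {t: [round(v * scale_factor) for v in vals] for t, vals in kept.items()}
    return kept


if __name__ == "__main__":
    demo = {"A": [0.5, 0.5], "B": [0.0, 0.0], "C": [0.5, 0.0012]}
    print(prepare_stratum(demo, min_taxa=2))
```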
103+
104+
## Output Files
105+
106+
- `combined_sample_taxon_edges.csv`: nonzero sample-taxon bipartite edges with abundance and experimental context.
107+
- `nodes_samples.csv`: unique sample nodes with kingdom, donor ID, and community type.
108+
- `coalescence_triplets.csv`: supervised coalescence units linking resident and final samples by shared sample ID within kingdom and donor source; donor input is represented by pooled donor source ID.
109+
- `nodes_taxa.csv`: unique taxon nodes with names and taxonomy metadata.
110+
- `taxon_taxon_spieceasi_edges.csv`: undirected SpiecEasi taxon-taxon edges with inferred association weights and signs.
111+
- `graph_edges_multirelational.csv`: flat edge file combining sample-taxon and taxon-taxon relations.
112+
- `spieceasi_run_summary.csv`: one row per stratum describing network status and skip/failure messages.
113+
114+
## Rerun
115+
116+
Run the workflow from the repository root with:
117+
118+
```bash
119+
Rscript workflows/code/build_graph_inputs.R
120+
```
121+
122+
## GNN Use
123+
124+
The bipartite backbone captures observed experimental composition data, while the SpiecEasi layer adds inferred ecological association structure. The combined multirelational edge file can be used as a starting point for heterogeneous or relational GNN pipelines that link samples, taxa, and inferred taxon-taxon associations.
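
As a starting point for such pipelines, the flat edge file can be converted into integer node indices and per-relation edge lists, which is the form most GNN libraries expect. The Python sketch below assumes hypothetical column names `source`, `target`, and `relation`; the actual header of `graph_edges_multirelational.csv` should be checked before use.

```python
import csv
import io
from collections import defaultdict


def index_edges(csv_text):
    """Map node identifiers to integers and group edges by relation type."""
    node_index = {}
    edges_by_relation = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # setdefault assigns the next free integer the first time a node is seen.
        src = node_index.setdefault(row["source"], len(node_index))
        dst = node_index.setdefault(row["target"], len(node_index))
        edges_by_relation[row["relation"]].append((src, dst))
    return node_index, dict(edges_by_relation)


if __name__ == "__main__":
    demo = (
        "source,target,relation\n"
        "S1,ASV1,sample_taxon\n"
        "S1,ASV2,sample_taxon\n"
    )
    nodes, edges = index_edges(demo)
    print(nodes)
    print(edges)
```

A heterogeneous model would typically keep a separate index per node type (samples and taxa); a single shared index is used here only to keep the sketch short.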
125+
126+
## Caveats
127+
128+
- Co-occurrence edges are inferred statistical associations, not measured direct interactions.
129+
- Positive co-occurrence does not necessarily mean cooperation.
130+
- Negative co-occurrence does not necessarily mean inhibition.
131+
- Associations can reflect shared niches, environmental filtering, compositional effects, or indirect interactions.
132+
- Prevalence filtering (`min_prevalence`) determines which taxa enter SpiecEasi and therefore affects network density.
133+
- Converting relative abundances to pseudo-counts (`scale_to_counts`, `scale_factor`) is a modeling choice.
134+
- The bipartite backbone is the experimental data representation; the SpiecEasi graph is an inferred ecological association layer.
135+
- With `method = mb`, edge weights are association-strength summaries from neighborhood selection and should not be described as strict partial correlations.
136+
- Donor-community SpiecEasi strata may be intentionally skipped if repeated donor profiles have no sample-to-sample variation; donor composition is still represented in the sample-taxon backbone.
137+
138+
## Run Summary
139+
140+
- Sample-taxon edges: `943,126`
141+
- Sample nodes: `288`
142+
- Coalescence triplets: `288`
143+
- Taxon nodes: `19,523`
144+
- SpiecEasi edges: `0`
145+
- Successful SpiecEasi strata: `0`
146+
- Skipped SpiecEasi strata: `12`
147+
- Failed SpiecEasi strata: `24`
148+
149+
## Notes
150+
151+
- All 36 SpiecEasi strata were either skipped (12) or failed (24), so `graph_edges_multirelational.csv` contains sample-taxon edges only.
