This README briefly summarises each step of the pipeline, together with the code and commands used to run it. Although some steps (e.g., annotation and OrthoFinder) can be carried out simultaneously, the folders are sorted in the order in which we computed them. Dependencies and software versions are listed in the README file of each folder.
This procedure was run for each eToLDB version (eToLDBA, eToLDBB, and eToLDBC). These three proteome databases represent a balanced sampling of 100 pan-eukaryotic proteomes.
To remove low-complexity and highly redundant proteins, we developed a pipeline that “cleans” the raw proteomes in two steps. First, it detects low-complexity regions using seg and removes the proteins with a low-complexity region covering more than 25% of the full protein length. Second, it clusters the proteomes using mmseqs easy-linclust at a minimum sequence identity of 0.8, a minimum coverage of 0.5 and coverage mode 2 (CD-HIT-like) (i.e., --min-seq-id 0.8 -c 0.5 --cov-mode 2). For both the raw and the clean version of each proteome, we computed the BUSCO scores and basic statistics using seqkit stats. Based on the effect of the cleaning steps, we selected a final set of “clean” genomes.
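The 25% low-complexity filter can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes that low-complexity residues have already been masked with a lowercase `x` (for example, in a masked output from seg), and that all masked residues of a protein are counted together. The function names are hypothetical.

```python
from typing import Dict, List


def low_complexity_fraction(masked_seq: str, mask_char: str = "x") -> float:
    """Return the fraction of residues masked as low complexity."""
    if not masked_seq:
        return 0.0
    return masked_seq.count(mask_char) / len(masked_seq)


def filter_proteins(masked: Dict[str, str], max_fraction: float = 0.25) -> List[str]:
    """Return the IDs of proteins whose masked fraction is at most max_fraction.

    Proteins with MORE than max_fraction masked residues are discarded,
    so a protein at exactly the threshold is kept.
    """
    return [
        protein_id
        for protein_id, seq in masked.items()
        if low_complexity_fraction(seq) <= max_fraction
    ]
```

Calling `filter_proteins` on a dictionary of masked sequences keyed by protein ID returns the IDs to carry forward to the clustering step.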
To reduce the redundancy and size of the prokaryotic database, we used a pangenome approach. We first selected one proteome per genus based on completeness. We then concatenated all the proteomes from the same order and clustered them using mmseqs easy-linclust with the parameters --min-seq-id 0.3 -c 0.5 --cov-mode 0 --cluster-mode 2 -e 0.001. Finally, we kept the representative sequences of clusters containing at least 15% of the genera in that order, resulting in a pangenome at the cloud level.
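The selection of representative sequences by genus representation might look like the sketch below. It assumes the clustering result is available as (representative, member) pairs, as in the two-column cluster TSV written by mmseqs, and that a mapping from sequence ID to genus exists for one order. The mapping and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def select_representatives(
    pairs: Iterable[Tuple[str, str]],
    seq_to_genus: Dict[str, str],
    min_fraction: float = 0.15,
) -> List[str]:
    """Keep representatives of clusters covering >= min_fraction of the genera.

    pairs: (representative_id, member_id) tuples for a single order.
    seq_to_genus: hypothetical mapping from sequence ID to its genus.
    """
    total_genera = len(set(seq_to_genus.values()))
    if total_genera == 0:
        return []

    # Collect the distinct genera found in each cluster.
    genera_per_cluster: Dict[str, Set[str]] = defaultdict(set)
    for representative, member in pairs:
        genera_per_cluster[representative].add(seq_to_genus[member])

    return sorted(
        representative
        for representative, genera in genera_per_cluster.items()
        if len(genera) / total_genera >= min_fraction
    )
```

Counting distinct genera rather than sequences keeps a cluster from passing the threshold only because one genus contributes many paralogs.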
To annotate the KOs of the clean proteins obtained in the previous step, we ran KOfamScan and filtered the annotations to obtain a KO for each protein. We also performed HMM searches against the Clusters of Orthologous Groups (COG) database and filtered the HMM results to annotate each protein with its COG and COG category.
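This README does not state the exact filtering rule used to obtain a KO for each protein. One common choice, shown here purely as an assumption, is to keep the highest-scoring hit among those that reach the KO-specific score threshold. The input tuples and the function name are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Each hit is (protein_id, ko_id, score, threshold); this layout is an assumption.
Hit = Tuple[str, str, float, float]


def best_ko_per_protein(hits: Iterable[Hit]) -> Dict[str, str]:
    """Assign each protein the KO of its best hit that reaches the threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for protein_id, ko_id, score, threshold in hits:
        if score < threshold:
            continue  # hit does not reach the KO-specific threshold
        if protein_id not in best or score > best[protein_id][1]:
            best[protein_id] = (ko_id, score)
    return {protein_id: ko_id for protein_id, (ko_id, _) in best.items()}
```

Proteins with no hit above threshold are left out of the returned dictionary, so they remain unannotated.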
We ran OrthoFinder with an inflation parameter of 1.5 (the default value) and filtered the results to keep the orthogroups whose species representation met the minimum LECA criterion. We then aligned the sequences and removed those that generated too many gaps, to obtain a more compact and contiguous alignment.
This procedure is run twice: first with the clean LECA-OGs from the OrthoFinder results, and again after the first phylogenetic inference, in which we detect the mLECA-OGs.
From the refined alignment of the LECA-OGs (and of the mLECA-OGs in the second round), we built an HMM profile, which we used to search for homologs in the database of prokaryotic and viral pangenomes. This set of sequences was then filtered and submitted to a phylogenetic reconstruction pipeline. Using the resulting trees, we detected the LECA nodes and their sisters to establish the origins of the LECA-OGs (and mLECA-OGs).
Once we had computed and analysed the trees, we functionally annotated each mLECA-OG (a gene inferred in LECA), obtained the acquisition modules (the genes inferred to have been acquired during the same period), computed the taxonomic bootstraps, obtained the LECA proteomes, and produced visualisations of the origins of the genes and of the metabolism. We also computed a functional summary of the genes in LECA.
For the inferred modules, we computed the posterior distribution of the mode of the normalised stem length distribution. This distribution is then used in the rest of the analyses to compute the pairwise probability of overlapping HGTs.
To get the LECA features, we submitted the LECA metaproteome to KEGG Mapper, downloaded the text output, and parsed it with a set of in-house scripts that compute the frequency of each donor's contribution to LECA in each feature.
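The frequency computation could be sketched as below. This is not the in-house code: it assumes the KEGG Mapper text output has already been parsed into a mapping from feature to KOs, and that each KO has been assigned to a donor from the phylogenetic analysis. Both mappings and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def donor_frequencies(
    feature_to_kos: Dict[str, List[str]],
    ko_to_donor: Dict[str, str],
) -> Dict[str, Dict[str, float]]:
    """For each feature, return the fraction of its KOs contributed by each donor.

    KOs without an inferred donor are ignored (an assumption of this sketch).
    """
    frequencies: Dict[str, Dict[str, float]] = {}
    for feature, kos in feature_to_kos.items():
        counts = Counter(ko_to_donor[ko] for ko in kos if ko in ko_to_donor)
        total = sum(counts.values())
        frequencies[feature] = (
            {donor: n / total for donor, n in counts.items()} if total else {}
        )
    return frequencies
```

Features for which no KO has an inferred donor map to an empty dictionary rather than raising a division error.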
To infer the ancestral features of the LECA contributors (donors), we first selected those genomes sharing at least 50% of the KOs that we inferred the donors to have transferred. In these genomes, we analysed the presence and absence of specific modules and KOs linked to features that are important for the problem of eukaryogenesis. We manually selected a list of features and then, by merging the annotations of the genomes with their metabolic reconstructions, obtained a list with the pervasiveness of each feature. For KOs, the pervasiveness is the proportion of genomes where the KO is present; for metabolic modules, it is the proportion of genomes where the module is present times the stepwise completeness.
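The two pervasiveness measures can be illustrated with the sketch below. The module formula follows one possible reading of the definition: the proportion of genomes with the module present, multiplied by the mean stepwise completeness among those genomes. That reading, the input structures, and the function names are assumptions.

```python
from typing import Dict, Set


def ko_pervasiveness(genome_kos: Dict[str, Set[str]], ko: str) -> float:
    """Proportion of genomes in which the KO is present."""
    if not genome_kos:
        return 0.0
    return sum(ko in kos for kos in genome_kos.values()) / len(genome_kos)


def module_pervasiveness(completeness: Dict[str, float]) -> float:
    """Proportion of genomes with the module times its mean stepwise completeness.

    completeness maps genome ID to the stepwise completeness of one module
    (between 0 and 1); a value of 0 is treated as absence (an assumption).
    """
    if not completeness:
        return 0.0
    present = [value for value in completeness.values() if value > 0]
    if not present:
        return 0.0
    proportion = len(present) / len(completeness)
    mean_completeness = sum(present) / len(present)
    return proportion * mean_completeness
```

Under this reading, a module that is complete in half of the genomes scores the same as one that is half complete in all of them.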
We collected genomes of free-living unicellular eukaryotes with specific metabolic strategies (phagotrophs: FLUPs, osmotrophs: FLUOs, and autotrophs: FLUAs). For each genome, we annotated the KOs and COGs of the proteins using the annotation pipeline and computed the percentage of KOs shared with LECA, as well as the distribution of COG categories.
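Both summary statistics reduce to simple set and counting operations, sketched below. The percentage is taken relative to the KOs of the genome, which is an assumption, since it could also be expressed relative to the LECA KO set. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def percent_kos_shared(genome_kos: Set[str], leca_kos: Set[str]) -> float:
    """Percentage of the genome's KOs that are also inferred in LECA."""
    if not genome_kos:
        return 0.0
    return 100.0 * len(genome_kos & leca_kos) / len(genome_kos)


def cog_category_distribution(categories: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each COG category among the annotated proteins."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}
```

Running both functions per genome gives values that can be compared directly across the FLUP, FLUO, and FLUA groups.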