(**
---
title: Newick
category: BioParsers
categoryindex: 3
index: 5
---
*)
(*** hide ***)
(*** condition: prepare ***)
#r "nuget: FSharpAux, 1.1.0"
#r "nuget: FSharpAux.IO, 1.1.0"
#r "nuget: FSharp.Stats, 0.4.3"
#r "nuget: Plotly.NET, 2.0.0-preview.18"
#r "../src/BioFSharp/bin/Release/netstandard2.0/BioFSharp.dll"
#r "../src/BioFSharp.IO/bin/Release/netstandard2.0/BioFSharp.IO.dll"
#r "../src/BioFSharp.BioContainers/bin/Release/netstandard2.0/BioFSharp.BioContainers.dll"
#r "../src/BioFSharp.ML/bin/Release/netstandard2.0/BioFSharp.ML.dll"
#r "../src/BioFSharp.Stats/bin/Release/netstandard2.0/BioFSharp.Stats.dll"
(*** condition: ipynb ***)
#if IPYNB
#r "nuget: FSharpAux, 1.1.0"
#r "nuget: FSharpAux.IO, 1.1.0"
#r "nuget: FSharp.Stats, 0.4.3"
#r "nuget: Plotly.NET, 2.0.0-preview.18"
#r "nuget: Plotly.NET.Interactive, 2.0.0-preview.18"
#r "nuget: BioFSharp, {{fsdocs-package-version}}"
#r "nuget: BioFSharp.IO, {{fsdocs-package-version}}"
#r "nuget: BioFSharp.BioContainers, {{fsdocs-package-version}}"
#r "nuget: BioFSharp.ML, {{fsdocs-package-version}}"
#r "nuget: BioFSharp.Stats, {{fsdocs-package-version}}"
#endif // IPYNB
(**
# Newick parsing
[Binder](https://mybinder.org/v2/gh/CSBiology/BioFSharp/gh-pages?filepath={{fsdocs-source-basename}}.ipynb) 
[Script]({{root}}{{fsdocs-source-basename}}.fsx) 
[Notebook]({{root}}{{fsdocs-source-basename}}.ipynb)
*Summary:* This example shows how to parse Newick formatted files with BioFSharp
## Newick format
The Newick format is a simple, strictly symbol-based format for representing phylogenetic trees. It is the standard tree format used by the Clustal tool.
In general, internal nodes (nodes with at least one descendant) are opened with `(` and closed with `)`. The child nodes within an internal node are separated by `,`. Every node can, but does not have to, be followed by information about its name and its distance from its parent. Name and distance are separated by `:`. Every complete tree ends with a `;`.
These key characters must not be used within names or distances. Apart from this restriction, the format allows a wide range of trees, as the following examples show (taken from Wikipedia):
* `(,,(,));`: no nodes are named
* `(A,B,(C,D));`: leaf nodes are named
* `(A,B,(C,D)E)F;`: all nodes are named
* `(:0.1,:0.2,(:0.3,:0.4):0.5);`: all but the root node have a distance to their parent
* `(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;`: all nodes have a distance to their parent
* `(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);`: distances and leaf names (popular)
* `(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;`: distances and all names
* `((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;`: a tree rooted on a leaf node (rare)
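The rules above can be sketched as a small recursive-descent parser. This is only an illustration of the format, not the parser BioFSharp uses: the `SimpleNode` type and the `parseNewick` function are hypothetical names introduced here, distances are kept as raw strings, and whitespace, quoted labels and comments are assumed not to occur.

```fsharp
// Hypothetical, simplified tree type for illustration only (not BioFSharp's PhylTree).
type SimpleNode =
    { Name: string
      Distance: string // raw text, empty if the node has no distance
      Children: SimpleNode list }

// Parses a single Newick string into a SimpleNode.
// Assumes well-formed input without whitespace, quoting or comments.
let parseNewick (input: string) : SimpleNode =
    let isKeyChar c = c = '(' || c = ')' || c = ',' || c = ':' || c = ';'
    // Reads characters from position i up to the next key character.
    let readToken (i: int) =
        let mutable j = i
        while j < input.Length && not (isKeyChar input.[j]) do
            j <- j + 1
        input.Substring(i, j - i), j
    // Parses one node starting at position i; returns the node and the next position.
    let rec parseNode (i: int) : SimpleNode * int =
        let children, afterChildren =
            if i < input.Length && input.[i] = '(' then parseChildren (i + 1) []
            else ([], i)
        let name, afterName = readToken afterChildren
        let distance, afterDistance =
            if afterName < input.Length && input.[afterName] = ':' then readToken (afterName + 1)
            else ("", afterName)
        { Name = name; Distance = distance; Children = children }, afterDistance
    // Parses comma-separated siblings up to the closing parenthesis.
    and parseChildren (i: int) (acc: SimpleNode list) : SimpleNode list * int =
        let node, next = parseNode i
        if next >= input.Length then failwith "unexpected end of input"
        match input.[next] with
        | ',' -> parseChildren (next + 1) (node :: acc)
        | ')' -> List.rev (node :: acc), next + 1
        | c -> failwithf "unexpected character '%c' at position %i" c next
    let root, next = parseNode 0
    if next >= input.Length || input.[next] <> ';' then failwith "tree must end with ';'"
    root

// The root is named "F" and has three children; the last one is "E" with distance "0.5".
let sketchTree = parseNewick "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"
```

Keeping the distance as a string mirrors the idea that the conversion to a numeric type is a separate step that can be chosen by the caller.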
The parser integrated in BioFSharp tries to cover this wide range of possible input data by using a very generic tree representation and by letting users parse it the way they want through a mapping function supplied as additional input. Working with the parsed trees requires some familiarity with this tree implementation, so it is recommended to have a look at `PhylTree` (the API reference can be found [here](reference/biofsharp-phyltree.html)).
## Reading Newick files
To read a Newick file, use the function `ofFile` in `BioFSharp.IO.Newick`. It takes a mapping function of type `string -> 'Distance` and a path of type `string`. This means that names are always parsed as strings, while distances can be converted to any type. Its usage is demonstrated below.
Let's say we want to parse the following tree (which, by the way, is output generated by Clustal Omega):
```text
(
(
Mozarella:0.44256,
Brie:-0.01399)
:0.01161,
Cheddar:-0.00714,
(
Emmentaler:0.00113,
Gouda:-0.00113)
:0.00714);
```
After wrapping one's head around this symbol salad, one might notice a few things about this tree:
* Only the leaf nodes are named
* Every node but the top node has a distance in the form of a float number
* All subtrees below the top node are binary (the top node itself has three children)
Here is the same tree in the form of a cladogram:

## Parsing Newick files
Now comes the parsing. As already mentioned, to bring the distances into a more useful form, we have to define a mapping function. In this case it's easy: we only need a function that transforms a string to a float *if possible*. It is important to note that the parser treats all nodes in the same way, so the mapping function is also applied to nodes without a distance. As not all nodes in the above tree have distances, that case has to be covered too, which is done in two different ways in this example:
*)
open BioFSharp.IO
let fileDir = __SOURCE_DIRECTORY__ + "/data/"
//Maps a string to a float if possible. If it is not, it just returns 0 instead
let floatMapping (distance: string) =
    try (float distance) with | _ -> 0.
//Maps a string to an optional float, depending on whether it can be converted or not.
let floatMappingOptional (distance: string) =
    match System.Double.TryParse distance with
    | true, v -> Some v
    | false, _ -> None
//As path we use an example file included in BioFSharp which contains the tree shown above
let path = fileDir + "treeExample.txt"
Newick.ofFile floatMapping path
(***include-it***)
(**
As written above, having an idea about the `PhylTree` in BioFSharp makes understanding this result much easier. Its API reference can be found [here](reference/biofsharp-phyltree.html).
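One caveat regarding `floatMappingOptional`: `System.Double.TryParse` with a single string argument parses according to the current culture, so on a system whose culture uses a decimal comma, a distance such as `0.44256` may not be read as intended. The following is a minimal sketch of a culture-invariant variant. The name `floatMappingInvariant` is made up for this example and is not part of BioFSharp.

```fsharp
open System
open System.Globalization

// Hypothetical helper (not part of BioFSharp): parses a distance with the
// invariant culture, so "." is always treated as the decimal separator.
let floatMappingInvariant (distance: string) : float option =
    match Double.TryParse(distance, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, v -> Some v
    | false, _ -> None

let parsedDistance = floatMappingInvariant "-0.01399" // Some -0.01399
let missingDistance = floatMappingInvariant ""        // None
```

Since it has the same type as `floatMappingOptional`, it could be passed to `Newick.ofFile` in its place.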
## Writing Newick files
To write a Newick file, the function `toFile` is used. Besides the output path and the tree, it takes an additional function, `nodeConverter`. It is intended to split the node information into node name and distance, as the node info is not always stored as a simple tuple and may first have to be converted into one. Both name and distance have to be returned as strings so that they can be written to the file.
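To make the role of such a node converter more tangible, here is a minimal sketch of a serialiser that works on a made-up tree type. It is not the implementation of `toFile`: `SketchTree`, `toNewickString` and `sketchConverter` are hypothetical names, and the sketch builds a string instead of writing a file.

```fsharp
// Hypothetical generic tree type for illustration only (not BioFSharp's PhylTree).
type SketchTree<'T> =
    | SketchNode of 'T * SketchTree<'T> list

// Serialises a tree to a Newick string. nodeConverter turns the node
// information into a (name, distance) pair of strings.
let toNewickString (nodeConverter: 'T -> string * string) (tree: SketchTree<'T>) : string =
    let rec write (SketchNode (info, children)) =
        let name, distance = nodeConverter info
        let childPart =
            match children with
            | [] -> ""
            | _ -> "(" + (children |> List.map write |> String.concat ",") + ")"
        // The ':' separator is only written when a distance is present.
        let distancePart = if distance = "" then "" else ":" + distance
        childPart + name + distancePart
    write tree + ";"

// Node information stored as (name, optional distance).
let sketchLeaves = [ SketchNode (("A", Some 0.1), []); SketchNode (("B", Some 0.2), []) ]
let sketchExample = SketchNode (("", None), sketchLeaves)

let sketchConverter (name: string, distance: float option) =
    name, (match distance with Some d -> string d | None -> "")

let sketchOutput = toNewickString sketchConverter sketchExample // "(A:0.1,B:0.2);"
```

Because the converter decides how the node information is turned into strings, the same serialiser can be reused for trees whose nodes carry other kinds of data.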
Let's see this in action. As an example, we want to rebuild the original tree file we read above. Keep one thing in mind though: we want the output to look exactly like the original. If you have read the tree with the mapping function that returns zero when the distance cannot be converted to float, then the conversion back to string has to cover this case, or the zero will end up in the file. This is not really a safe approach though, as real zeros might get lost. Therefore, we will read and write the distance as an option and use the `floatMappingOptional` function defined above:
*)
//Tree with distance value wrapped as option
let myOptionalTree = Newick.ofFile floatMappingOptional path
//Converts node information to a writeable name and distance
let converter (n,d) =
    n,
    match d with
    | Some fl -> string fl
    | None -> ""
//Write our original tree to the output path
(*** do-not-eval ***)
Newick.toFile converter (fileDir + "outputTree.txt") myOptionalTree
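(**
As a closing illustration of why the option-based mapping is the safer choice, the following sketch compares both mapping functions on a missing distance and on an explicit zero. The two functions are repeated from the parsing section so that the snippet is self-contained; the names `sameWithFallback` and `sameWithOption` are only used here.

```fsharp
// Same definitions as in the parsing section, repeated for self-containment.
let floatMapping (distance: string) =
    try (float distance) with | _ -> 0.

let floatMappingOptional (distance: string) =
    match System.Double.TryParse distance with
    | true, v -> Some v
    | false, _ -> None

// A missing distance and an explicit zero collapse to the same value ...
let sameWithFallback = floatMapping "" = floatMapping "0" // true
// ... while the option-based mapping keeps them apart.
let sameWithOption = floatMappingOptional "" = floatMappingOptional "0" // false
```
*)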