docs/src/examples/gnn.md
10 additions & 3 deletions
@@ -31,7 +31,10 @@ Furthermore, let's assume that each vertex is described by three features stored
31
31
X = ArrayNode(randn(Float32, 3, 10))
32
32
```
33
33
34
-
We use [`ScatteredBags`](@ref) from `Mill` to encode the neighbors of each vertex. In other words, each vertex is described by a bag of its neighbors. This information is conveniently stored in the `fadjlist` field of `g`, so the bags can be constructed as:
34
+
We use [`ScatteredBags`](@ref) from [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) to encode
35
+
neighbors of each vertex. In other words, each vertex is described by a bag of its neighbors. This
36
+
information is conveniently stored in the `fadjlist` field of `g`, so the bags can be constructed
37
+
as:
35
38
36
39
```@repl gnn
37
40
b = ScatteredBags(g.fadjlist)
@@ -83,7 +86,9 @@ end
83
86
nothing # hide
84
87
```
85
88
86
-
As is the case with the whole of `Mill`, this graph neural network is properly integrated with the [`Flux.jl`](https://fluxml.ai) ecosystem and supports automatic differentiation:
89
+
As is the case with the whole of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), this graph
90
+
neural network is properly integrated with the [`Flux.jl`](https://fluxml.ai) ecosystem and supports
91
+
automatic differentiation:
87
92
88
93
```@example gnn
89
94
zd = 4
@@ -100,6 +105,8 @@ gnn(g, X, 5)
100
105
gradient(m -> m(g, X, 5) |> sum, gnn)
101
106
```
102
107
103
-
The above implementation is surprisingly general, as it supports an arbitrarily rich description of vertices. For simplicity, we used only vectors in `X`, however, any `Mill` hierarchy is applicable.
108
+
The above implementation is surprisingly general, as it supports an arbitrarily rich description of
109
+
vertices. For simplicity, we used only vectors in `X`, however, any
110
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) hierarchy is applicable.
104
111
105
112
To put different weights on edges, one can use [Weighted aggregation](@ref).
docs/src/examples/jsons.md
7 additions & 1 deletion
@@ -11,5 +11,11 @@
11
11
12
12
# Processing JSONs
13
13
14
-
Processing JSONs is actually one of the main motivations for building [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). As a matter of fact, with `Mill` one is now able to process a set of valid JSON documents that follow the same meta schema. [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) is a library that helps with inferring the schema and other steps in the pipeline. For some examples, please refer to its [documentation](https://CTUAvastLab.github.io/JsonGrinder.jl/stable).
14
+
Processing JSONs is actually one of the main motivations for building
15
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). As a matter of fact, with
16
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) one is now able to process a set of valid JSON
17
+
documents that follow the same meta schema.
18
+
[`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) is a library that helps with
19
+
inferring the schema and other steps in the pipeline. For some examples, please refer to its
docs/src/examples/musk/musk.md
3 additions & 3 deletions
@@ -26,7 +26,7 @@ nothing #hide
26
26
27
27
### Loading the data
28
28
29
-
Now we load the dataset and transform it into a `Mill` structure. The `musk.jld2` file contains...
29
+
Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
30
30
* a matrix with features, each column is one instance:
31
31
32
32
````@example musk
@@ -64,7 +64,7 @@ y_oh = onehotbatch(y, 1:2)
64
64
65
65
### Model construction
66
66
67
-
Once the data are in `Mill` internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
67
+
Once the data are in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
68
68
69
69
````@example musk
70
70
model = BagModel(
@@ -84,7 +84,7 @@ model(ds)
84
84
85
85
### Training
86
86
87
-
Since `Mill` is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
87
+
Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
docs/src/examples/musk/musk_literate.jl
3 additions & 3 deletions
@@ -21,7 +21,7 @@ using Random; Random.seed!(42);
21
21
22
22
# ### Loading the data
23
23
24
-
# Now we load the dataset and transform it into a `Mill` structure. The `musk.jld2` file contains...
24
+
# Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
25
25
# * a matrix with features, each column is one instance:
26
26
fMat = load("musk.jld2", "fMat")
27
27
# * the ids of samples (*bags* in MIL terminology) specifying which bag each instance (column in `fMat`) belongs to:
@@ -42,7 +42,7 @@ y_oh = onehotbatch(y, 1:2)
42
42
43
43
# ### Model construction
44
44
45
-
# Once the data are in `Mill` internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
45
+
# Once the data are in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
46
46
model = BagModel(
47
47
Dense(166, 50, Flux.tanh),
48
48
SegmentedMeanMax(50),
@@ -56,7 +56,7 @@ model(ds)
56
56
57
57
# ### Training
58
58
59
-
# Since `Mill` is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
59
+
# Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
With different values of ``r``, LSE behaves differently and in fact both max and mean operators are limiting cases of LSE. If ``r`` is very small, the output approaches simple mean, and on the other hand, if ``r`` is a large number, LSE becomes a smooth approximation of the max function. Naively implementing the definition above may lead to numerical instabilities, however, the `Mill` implementation is numerically stable.
90
+
With different values of ``r``, LSE behaves differently and in fact both max and mean operators are
91
+
limiting cases of LSE. If ``r`` is very small, the output approaches simple mean, and on the other
92
+
hand, if ``r`` is a large number, LSE becomes a smooth approximation of the max function. Naively
93
+
implementing the definition above may lead to numerical instabilities, however, the
94
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation is numerically stable.
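
The limiting behaviour and the stability concern described in this paragraph can be illustrated with a minimal sketch in plain Julia. It assumes the usual definition LSE(x; r) = (1/r) log((1/k) Σ exp(r x_i)), which is not shown in this excerpt, and uses a hypothetical helper `stable_lse`. It is not the Mill.jl implementation, only one common way to avoid overflow.

```julia
# Hypothetical helper for illustration only; not part of Mill.jl.
# Assumes LSE(x; r) = (1/r) * log(mean(exp.(r .* x))) for nonzero r.
function stable_lse(x::AbstractVector{<:Real}, r::Real)
    z = r .* x
    m = maximum(z)   # shift so that the largest exponent becomes 0
    return (m + log(sum(exp.(z .- m)) / length(z))) / r
end

x = [1.0, 2.0, 3.0]
stable_lse(x, 1e-4)        # close to the mean of x
stable_lse(x, 100.0)       # close to the maximum of x
stable_lse([1000.0], 1.0)  # stays finite, whereas a naive exp(1000.0) overflows to Inf
```

Subtracting the maximum before exponentiating leaves the result unchanged mathematically, but keeps every exponent at or below zero.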
90
95
91
96
```@repl aggregation
92
97
a_lse = SegmentedLSE(d)
@@ -101,7 +106,7 @@ a_lse(X, bags)
101
106
a_{\operatorname{pnorm}}(\{x_1, \ldots, x_k\}; p, c) = \left(\frac{1}{k} \sum_{i = 1}^{k} \vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
102
107
```
103
108
104
-
Again, the `Mill` implementation is stable.
109
+
Again, the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation is numerically stable.
105
110
106
111
```@repl aggregation
107
112
a_pnorm = SegmentedPNorm(d)
@@ -119,7 +124,8 @@ a = AggregationStack(a_mean, a_max)
119
124
a(X, bags)
120
125
```
121
126
122
-
For the most common combinations, `Mill` provides some convenience definitions:
127
+
For the most common combinations, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) provides some
a_{\operatorname{pnorm}}(\{x_i, w_i\}_{i=1}^k; p, c) = \left(\frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i\cdot\vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
139
145
```
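
As a rough illustration of the weighted formula above, the following plain-Julia sketch evaluates it for a single bag of scalar instances. The helper name `weighted_pnorm` is hypothetical, and the code mirrors the formula directly rather than reproducing how Mill.jl computes it.

```julia
# Hypothetical helper for illustration only; not part of Mill.jl.
# Evaluates ((1 / sum(w)) * sum(w_i * |x_i - c|^p))^(1/p) for one bag.
function weighted_pnorm(x::AbstractVector{<:Real}, w::AbstractVector{<:Real}, p::Real, c::Real)
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    return (sum(w .* abs.(x .- c) .^ p) / sum(w))^(1 / p)
end

x = [3.0, 4.0]
weighted_pnorm(x, [1.0, 1.0], 2, 0.0)  # uniform weights: the sum is divided by k = 2
weighted_pnorm(x, [1.0, 0.0], 2, 0.0)  # a zero weight removes the second instance
```

With all weights equal, the normalizing sum reduces to the bag size and the expression coincides with the unweighted p-norm aggregation.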
140
146
141
-
This is done in `Mill` by passing an additional parameter:
147
+
This is done in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) by passing an additional
148
+
parameter:
142
149
143
150
```@repl aggregation
144
151
w = Float32.([1.0, 0.2, 0.8, 0.5])
@@ -173,7 +180,9 @@ Otherwise, [`WeightedBagNode`](@ref) behaves exactly like the standard [`BagNode
173
180
174
181
For some problems, it may be beneficial to use the size of the bag directly and feed it to subsequent layers. To do this, wrap an instance of [`AbstractAggregation`](@ref) or [`AggregationStack`](@ref) in the [`BagCount`](@ref) type.
175
182
176
-
In the aggregation phase, bag count appends one more element which stores the bag size to the output after all operators are applied. Furthermore, `Mill` performs a mapping ``x \mapsto \log(x) + 1`` on top of that:
183
+
In the aggregation phase, bag count appends one more element which stores the bag size to the output
184
+
after all operators are applied. Furthermore, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
185
+
performs a mapping ``x \mapsto \log(x) + 1`` on top of that:
docs/src/manual/custom.md
20 additions & 10 deletions
@@ -5,12 +5,18 @@ using Flux
5
5
6
6
## Custom nodes
7
7
8
-
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) data nodes are lightweight wrappers around data, such as `Array`, `DataFrame`, and others. It is of course possible to define custom data (and model) nodes. A useful abstraction for implementing custom data nodes suitable for most cases is [`LazyNode`](@ref), which you can easily use to extend the functionality of `Mill`.
8
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) data nodes are lightweight wrappers around data,
9
+
such as `Array`, `DataFrame`, and others. It is of course possible to define custom data (and
10
+
model) nodes. A useful abstraction for implementing custom data nodes suitable for most cases is
11
+
[`LazyNode`](@ref), which you can easily use to extend the functionality of
@@ -20,10 +26,11 @@ Entirely new type is not needed, because we can dispatch on the first type param
20
26
`:Path` "tag" in this case defines a special kind of [`LazyNode`](@ref). Consequently, we can define
21
27
multiple variations of custom [`LazyNode`](@ref) without any conflicts in dispatch.
22
28
23
-
As a next step, we extend the [`Mill.unpack2mill`](@ref) function, which always takes one [`LazyNode`](@ref)
24
-
and produces an arbitrary `Mill` structure. We will represent individual file and directory names (as obtained
25
-
by `splitpath`) using an [`NGramMatrix`](@ref) representation and, for simplicity, the whole path as
26
-
a bag of individual names:
29
+
As a next step, we extend the [`Mill.unpack2mill`](@ref) function, which always takes one
30
+
[`LazyNode`](@ref) and produces an arbitrary [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
31
+
structure. We will represent individual file and directory names (as obtained by `splitpath`) using
32
+
an [`NGramMatrix`](@ref) representation and, for simplicity, the whole path as a bag of individual
33
+
names:
27
34
28
35
```@example custom
29
36
function Mill.unpack2mill(ds::LazyNode{:Path})
@@ -69,10 +76,13 @@ pm(ds)
69
76
The solution using [`LazyNode`](@ref) is sufficient in most scenarios. For other cases, it is recommended to equip custom nodes with the following functionality:
70
77
71
78
* allow nesting (if needed)
72
-
* implement [`Mill.subset`](@ref) and optionally `Base.getindex` to obtain subsets of observations. `Mill` already defines [`Mill.subset`](@ref) for common datatypes, which can be used.
79
+
* implement [`Mill.subset`](@ref) and optionally `Base.getindex` to obtain subsets of observations.
80
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) already defines [`Mill.subset`](@ref) for
81
+
common datatypes, which can be used.
73
82
* allow concatenation of nodes with [`catobs`](@ref). Optionally, implement `reduce(catobs, ...)` as well to avoid excessive compilation if the number of arguments varies a lot
74
-
* define a specialized method for `MLUtils.numobs`, which we can however import directly from `Mill`.
75
-
* register the custom node with [HierarchicalUtils.jl](@ref) to obtain pretty printing, iterators and other functionality
83
+
* define a specialized method for `MLUtils.numobs`, which we can however import directly from
docs/src/manual/leaf_data.md
8 additions & 2 deletions
@@ -59,7 +59,9 @@ hosts = [
59
59
]
60
60
```
61
61
62
-
`Mill` offers `n`gram histogram-based representation for strings. To get started, we pass the vector of strings into the constructor of [`NGramMatrix`](@ref):
for strings. To get started, we pass the vector of strings into the constructor of
64
+
[`NGramMatrix`](@ref):
63
65
64
66
```@repl leafs
65
67
hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
@@ -139,4 +141,8 @@ gradient(m -> sum(m(ds)), m)
139
141
!!! ukn "Numerical features"
140
142
To put all numerical features into one [`ArrayNode`](@ref) is a design choice. We could as well introduce more keys in the final [`ProductNode`](@ref). The model treats these two cases slightly differently (see [Nodes](@ref) section).
141
143
142
-
This dummy example illustrates the versatility of `Mill`. With little to no preprocessing we are able to process complex hierarchical structures and avoid manually designing feature extraction procedures. For a more involved study on processing Internet traffic with `Mill`, see for example [Pevny2020](@cite).
144
+
This dummy example illustrates the versatility of
145
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). With little to no preprocessing we are able to
146
+
process complex hierarchical structures and avoid manually designing feature extraction procedures.
147
+
For a more involved study on processing Internet traffic with
148
+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), see for example [Pevny2020](@cite).
docs/src/manual/missing.md
8 additions & 3 deletions
@@ -17,7 +17,10 @@ and many other possible reasons. At the same time, it is wasteful to throw away
17
17
2. Empty bags with no instances in a [`BagNode`](@ref)
18
18
3. An entire key missing in a [`ProductNode`](@ref)
19
19
20
-
At the moment, `Mill` is capable of handling the first two cases. The solution always involves an additional vector of parameters (denoted always by `ψ`) that are used during the model evaluation to substitute the missing values. Parameters `ψ` can be either fixed or learned during training. Everything is done automatically.
20
+
At the moment, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is capable of handling the first
21
+
two cases. The solution always involves an additional vector of parameters (denoted always by `ψ`)
22
+
that are used during the model evaluation to substitute the missing values. Parameters `ψ` can be
23
+
either fixed or learned during training. Everything is done automatically.
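
The substitution idea can be sketched in plain Julia, independently of Mill.jl. The helper `impute` below is hypothetical and assumes one substitute value per feature (row) stored in a vector `ψ`; in the library, the substitution happens inside the model and `ψ` may be trained, which this sketch does not attempt to show.

```julia
# Hypothetical sketch for illustration only; not Mill.jl's internals.
# ψ[i] replaces every missing entry in row (feature) i of X.
impute(X::AbstractMatrix, ψ::AbstractVector) = coalesce.(X, ψ)

X = [1.0 missing; missing 4.0]  # features × observations
ψ = [10.0, 20.0]
impute(X, ψ)                    # observed entries are kept, missing ones come from ψ
```

Broadcasting `coalesce` over the matrix and the column vector `ψ` pairs each row with its own substitute value.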
21
24
22
25
## Empty bags
23
26
@@ -99,7 +102,8 @@ Storing missing strings in [`NGramMatrix`](@ref) is straightforward: