Commit 57c65bf

docs improvements
1 parent 8c04db2 commit 57c65bf

18 files changed

Lines changed: 118 additions & 60 deletions

docs/src/examples/gnn.md

Lines changed: 10 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -31,7 +31,10 @@ Furthermore, let's assume that each vertex is described by three features stored
3131
X = ArrayNode(randn(Float32, 3, 10))
3232
```
3333

34-
We use [`ScatteredBags`](@ref) from `Mill` to encode neighbors of each vertex. In other words, each vertex is described by a bag of its neighbors. This information is conveniently stored in `fadjlist` field of `g`, therefore the bags can be constructed as:
34+
We use [`ScatteredBags`](@ref) from [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) to encode
35+
neighbors of each vertex. In other words, each vertex is described by a bag of its neighbors. This
36+
information is conveniently stored in `fadjlist` field of `g`, therefore the bags can be constructed
37+
as:
3538

3639
```@repl gnn
3740
b = ScatteredBags(g.fadjlist)
@@ -83,7 +86,9 @@ end
8386
nothing # hide
8487
```
8588

86-
As is the case with the whole `Mill`, even this graph neural network is properly integrated with the [`Flux.jl`](https://fluxml.ai) ecosystem and supports automatic differentiation:
89+
As is the case with the whole [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), even this graph
90+
neural network is properly integrated with the [`Flux.jl`](https://fluxml.ai) ecosystem and supports
91+
automatic differentiation:
8792

8893
```@example gnn
8994
zd = 4
@@ -100,6 +105,8 @@ gnn(g, X, 5)
100105
gradient(m -> m(g, X, 5) |> sum, gnn)
101106
```
102107

103-
The above implementation is surprisingly general, as it supports an arbitrarily rich description of vertices. For simplicity, we used only vectors in `X`; however, any `Mill` hierarchy is applicable.
108+
The above implementation is surprisingly general, as it supports an arbitrarily rich description of
109+
vertices. For simplicity, we used only vectors in `X`; however, any
110+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) hierarchy is applicable.
104111

105112
To put different weights on edges, one can use [Weighted aggregation](@ref).
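
As a rough illustration of the bag-of-neighbors encoding used in `gnn.md`, the sketch below averages the feature columns of each vertex's neighbors in plain Julia. The adjacency list `fadjlist`, the feature matrix `X`, and the helper `neighbor_mean` are hypothetical stand-ins and not part of `Mill.jl`. The sketch also assumes every vertex has at least one neighbor.

```julia
using Statistics

# Hypothetical toy graph: neighbors of each of 4 vertices (stand-in for g.fadjlist)
fadjlist = [[2, 3], [1], [1, 4], [3]]

# Hypothetical features: 2 features per vertex, one column per vertex
X = Float32[1 2 3 4; 10 20 30 40]

# Treat each vertex as a bag of its neighbors and aggregate the bag with a mean
neighbor_mean(X, adj) = reduce(hcat, [mean(X[:, ns]; dims=2) for ns in adj])

H = neighbor_mean(X, fadjlist)  # one aggregated column per vertex
```

A mean is only one possible choice here; any aggregation over the neighbor columns would fit the same pattern.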

docs/src/examples/jsons.md

Lines changed: 7 additions & 1 deletion
@@ -11,5 +11,11 @@
1111

1212
# Processing JSONs
1313

14-
Processing JSONs is actually one of the main motivations for building [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). As a matter of fact, with `Mill` one is now able to process a set of valid JSON documents that follow the same meta schema. [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) is a library that helps with inferring the schema and other steps in the pipeline. For some examples, please refer to its [documentation](https://CTUAvastLab.github.io/JsonGrinder.jl/stable).
14+
Processing JSONs is actually one of the main motivations for building
15+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). As a matter of fact, with
16+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) one is now able to process a set of valid JSON
17+
documents that follow the same meta schema.
18+
[`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) is a library that helps with
19+
inferring the schema and other steps in the pipeline. For some examples, please refer to its
20+
[documentation](https://CTUAvastLab.github.io/JsonGrinder.jl/stable).
1521

docs/src/examples/musk/musk.md

Lines changed: 3 additions & 3 deletions
@@ -26,7 +26,7 @@ nothing #hide
2626

2727
### Loading the data
2828

29-
Now we load the dataset and transform it into a `Mill` structure. The `musk.jld2` file contains...
29+
Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
3030
* a matrix with features, each column is one instance:
3131

3232
````@example musk
@@ -64,7 +64,7 @@ y_oh = onehotbatch(y, 1:2)
6464

6565
### Model construction
6666

67-
Once the data are in `Mill` internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
67+
Once the data are in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
6868

6969
````@example musk
7070
model = BagModel(
@@ -84,7 +84,7 @@ model(ds)
8484

8585
### Training
8686

87-
Since `Mill` is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
87+
Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
8888

8989
````@example musk
9090
opt_state = Flux.setup(Adam(), model);

docs/src/examples/musk/musk_literate.jl

Lines changed: 3 additions & 3 deletions
@@ -21,7 +21,7 @@ using Random; Random.seed!(42);
2121

2222
# ### Loading the data
2323

24-
# Now we load the dataset and transform it into a `Mill` structure. The `musk.jld2` file contains...
24+
# Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
2525
# * a matrix with features, each column is one instance:
2626
fMat = load("musk.jld2", "fMat")
2727
# * the ids of samples (*bags* in MIL terminology) specifying which bag each instance (column in `fMat`) belongs to:
@@ -42,7 +42,7 @@ y_oh = onehotbatch(y, 1:2)
4242

4343
# ### Model construction
4444

45-
# Once the data are in `Mill` internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
45+
# Once the data are in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
4646
model = BagModel(
4747
Dense(166, 50, Flux.tanh),
4848
SegmentedMeanMax(50),
@@ -56,7 +56,7 @@ model(ds)
5656

5757
# ### Training
5858

59-
# Since `Mill` is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
59+
# Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:
6060

6161
opt_state = Flux.setup(Adam(), model);
6262

docs/src/index.md

Lines changed: 4 additions & 3 deletions
@@ -26,9 +26,10 @@ Julia v1.9 or later is required.
2626

2727
For the quickest start, see the [Musk](@ref) example.
2828

29-
* [Motivation](@ref): a brief introduction into the philosophy of `Mill`
30-
* [Manual](@ref Nodes): a brief tutorial into `Mill`
31-
* [Examples](@ref Musk): some examples of `Mill` use
29+
* [Motivation](@ref): a brief introduction into the philosophy of
30+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
31+
* [Manual](@ref Nodes): a brief tutorial into [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
32+
* [Examples](@ref Musk): some examples of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) use
3233
* [External tools](@ref HierarchicalUtils.jl): examples of integration with other packages
3334
* [Public API](@ref Aggregation): extensive API reference
3435
* [References](@ref): related literature

docs/src/manual/aggregation.md

Lines changed: 15 additions & 6 deletions
@@ -26,7 +26,8 @@ Different choice of operator, or their combinations, are suitable for different
2626
a_{\max}(\{x_1, \ldots, x_k\}) = \max_{i = 1, \ldots, k} x_i
2727
```
2828

29-
where ``\{x_1, \ldots, x_k\}`` are all instances of the given bag. In `Mill`, the operator is constructed this way:
29+
where ``\{x_1, \ldots, x_k\}`` are all instances of the given bag. In
30+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), the operator is constructed this way:
3031

3132
```@repl aggregation
3233
a_max = SegmentedMax(d)
@@ -86,7 +87,11 @@ Whereas non-parametric aggregations do not use any parameter, parametric aggrega
8687
a_{\operatorname{lse}}(\{x_1, \ldots, x_k\}; r) = \frac{1}{r}\log \left(\frac{1}{k} \sum_{i = 1}^{k} \exp({r\cdot x_i})\right)
8788
```
8889

89-
With different values of ``r``, LSE behaves differently and in fact both max and mean operators are limiting cases of LSE. If ``r`` is very small, the output approaches simple mean, and on the other hand, if ``r`` is a large number, LSE becomes a smooth approximation of the max function. Naively implementing the definition above may lead to numerical instabilities, however, the `Mill` implementation is numerically stable.
90+
With different values of ``r``, LSE behaves differently and in fact both max and mean operators are
91+
limiting cases of LSE. If ``r`` is very small, the output approaches simple mean, and on the other
92+
hand, if ``r`` is a large number, LSE becomes a smooth approximation of the max function. Naively
93+
implementing the definition above may lead to numerical instabilities, however, the
94+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation is numerically stable.
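
To make the stability remark concrete, here is a minimal sketch of one common way to evaluate the LSE formula above without overflow, by shifting with the per-feature maximum before exponentiating. The function `lse_aggregate` is a hypothetical helper written for this note. It treats the columns of `X` as the instances of a single bag and is not claimed to be how `SegmentedLSE` is implemented.

```julia
# Stable evaluation of (1/r) * log((1/k) * sum_i exp(r * x_i)) for one bag.
# Rows of X are features, columns are the k instances of the bag; r must be nonzero.
function lse_aggregate(X::AbstractMatrix, r::Real)
    Z = r .* X
    m = maximum(Z; dims=2)   # per-feature shift, so every exponent is <= 0
    (m .+ log.(sum(exp.(Z .- m); dims=2) ./ size(X, 2))) ./ r
end

X = [1.0 2.0 3.0]                    # one feature, three instances
near_max  = lse_aggregate(X, 1000.0) # large r: close to the maximum
near_mean = lse_aggregate(X, 1e-6)   # small r: close to the mean
```

Evaluating the definition directly with `r = 1000.0` would compute `exp(3000.0)`, which overflows to `Inf`; the shifted form avoids that.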
9095

9196
```@repl aggregation
9297
a_lse = SegmentedLSE(d)
@@ -101,7 +106,7 @@ a_lse(X, bags)
101106
a_{\operatorname{pnorm}}(\{x_1, \ldots, x_k\}; p, c) = \left(\frac{1}{k} \sum_{i = 1}^{k} \vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
102107
```
103108

104-
Again, the `Mill` implementation is stable.
109+
Again, the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation is numerically stable.
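
One way such stability can be obtained for the p-norm formula above is to rescale by the largest deviation before raising to the power `p`. The sketch below assumes a single bag whose instances are the columns of `X`. `pnorm_aggregate` is a hypothetical helper, not the `SegmentedPNorm` implementation.

```julia
# Stable evaluation of ((1/k) * sum_i |x_i - c|^p)^(1/p) for one bag.
# Rows of X are features, columns are instances; c holds one center per feature.
function pnorm_aggregate(X::AbstractMatrix, p::Real, c::AbstractVector)
    D = abs.(X .- c)                        # |x_i - c| per feature
    m = maximum(D; dims=2)                  # largest deviation per feature
    s = ifelse.(m .> 0, m, one(eltype(D)))  # avoid dividing by zero
    s .* (sum((D ./ s) .^ p; dims=2) ./ size(X, 2)) .^ (1 / p)
end

small = pnorm_aggregate([3.0 4.0], 2, [0.0])
large = pnorm_aggregate([1e200 1e200], 2, [0.0])  # squaring directly would overflow
```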
105110

106111
```@repl aggregation
107112
a_pnorm = SegmentedPNorm(d)
@@ -119,7 +124,8 @@ a = AggregationStack(a_mean, a_max)
119124
a(X, bags)
120125
```
121126

122-
For the most common combinations, `Mill` provides some convenience definitions:
127+
For the most common combinations, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) provides some
128+
convenience definitions:
123129

124130
```@repl aggregation
125131
SegmentedMeanMax(d)
@@ -138,7 +144,8 @@ a_{\operatorname{mean}}(\{(x_i, w_i)\}_{i=1}^k) = \frac{1}{\sum_{i=1}^k w_i} \su
138144
a_{\operatorname{pnorm}}(\{x_i, w_i\}_{i=1}^k; p, c) = \left(\frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i\cdot\vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
139145
```
140146

141-
This is done in `Mill` by passing an additional parameter:
147+
This is done in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) by passing an additional
148+
parameter:
142149

143150
```@repl aggregation
144151
w = Float32.([1.0, 0.2, 0.8, 0.5])
@@ -173,7 +180,9 @@ Otherwise, [`WeightedBagNode`](@ref) behaves exactly like the standard [`BagNode
173180

174181
For some problems, it may be beneficial to use the size of the bag directly and feed it to subsequent layers. To do this, wrap an instance of [`AbstractAggregation`](@ref) or [`AggregationStack`](@ref) in the [`BagCount`](@ref) type.
175182

176-
In the aggregation phase, bag count appends one more element which stores the bag size to the output after all operators are applied. Furthermore, `Mill` performs a mapping ``x \mapsto \log(x) + 1`` on top of that:
183+
In the aggregation phase, bag count appends one more element which stores the bag size to the output
184+
after all operators are applied. Furthermore, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
185+
performs a mapping ``x \mapsto \log(x) + 1`` on top of that:
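
The effect of the bag count can be sketched in plain Julia as follows. The data `X`, the ranges in `bags`, and the mean aggregation are illustrative assumptions, and the sketch assumes non-empty bags. It mimics the described behavior rather than calling `BagCount` itself.

```julia
using Statistics

X = Float32[1 2 3 4; 5 6 7 8]   # 2 features, 4 instances
bags = [1:1, 2:4]               # hypothetical bags: instance ranges per bag

# Aggregate each bag (a mean here), then append log(bag size) + 1 as an extra row
agg = reduce(hcat, [mean(X[:, b]; dims=2) for b in bags])
counts = reshape(Float32[log(length(b)) + 1 for b in bags], 1, :)
out = vcat(agg, counts)         # 3 rows: two aggregated features plus the count
```

With this mapping, a bag holding a single instance contributes exactly `1` in the extra row.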
177186

178187
```@repl aggregation
179188
a_mean_bc = BagCount(a_mean)

docs/src/manual/custom.md

Lines changed: 20 additions & 10 deletions
@@ -5,12 +5,18 @@ using Flux
55

66
## Custom nodes
77

8-
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) data nodes are lightweight wrappers around data, such as `Array`, `DataFrame`, and others. It is of course possible to define custom data (and model) nodes. A useful abstraction for implementing custom data nodes suitable for most cases is [`LazyNode`](@ref), which you can easily use to extend the functionality of `Mill`.
8+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) data nodes are lightweight wrappers around data,
9+
such as `Array`, `DataFrame`, and others. It is of course possible to define custom data (and
10+
model) nodes. A useful abstraction for implementing custom data nodes suitable for most cases is
11+
[`LazyNode`](@ref), which you can easily use to extend the functionality of
12+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl).
913

1014
### Unix path example
1115

12-
Let's define a custom node type for representing path names in Unix and one custom model type for processing it. [`LazyNode`](@ref)
13-
serves as a boilerplate for simple extension of the `Mill` ecosystem. We start by defining an example of such a node:
16+
Let's define a custom node type for representing path names in Unix and one custom model type for
17+
processing it. [`LazyNode`](@ref) serves as a boilerplate for simple extension of the
18+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) ecosystem. We start by defining an example of
19+
such a node:
1420

1521
```@repl custom
1622
ds = LazyNode{:Path}(["/var/lib/blob_files/myfile.blob"])
@@ -20,10 +26,11 @@ Entirely new type is not needed, because we can dispatch on the first type param
2026
`:Path` "tag" in this case defines a special kind of [`LazyNode`](@ref). Consequently, we can define
2127
multiple variations of custom [`LazyNode`](@ref) without any conflicts in dispatch.
2228

23-
As a next step, we extend the [`Mill.unpack2mill`](@ref) function, which always takes one [`LazyNode`](@ref)
24-
and produces an arbitrary `Mill` structure. We will represent individual file and directory names (as obtained
25-
by `splitpath`) using an [`NGramMatrix`](@ref) representation and, for simplicity, the whole path as
26-
a bag of individual names:
29+
As a next step, we extend the [`Mill.unpack2mill`](@ref) function, which always takes one
30+
[`LazyNode`](@ref) and produces an arbitrary [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
31+
structure. We will represent individual file and directory names (as obtained by `splitpath`) using
32+
an [`NGramMatrix`](@ref) representation and, for simplicity, the whole path as a bag of individual
33+
names:
2734

2835
```@example custom
2936
function Mill.unpack2mill(ds::LazyNode{:Path})
@@ -69,10 +76,13 @@ pm(ds)
6976
The solution using [`LazyNode`](@ref) is sufficient in most scenarios. For other cases, it is recommended to equip custom nodes with the following functionality:
7077

7178
* allow nesting (if needed)
72-
* implement [`Mill.subset`](@ref) and optionally `Base.getindex` to obtain subsets of observations. `Mill` already defines [`Mill.subset`](@ref) for common datatypes, which can be used.
79+
* implement [`Mill.subset`](@ref) and optionally `Base.getindex` to obtain subsets of observations.
80+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) already defines [`Mill.subset`](@ref) for
81+
common datatypes, which can be used.
7382
* allow concatenation of nodes with [`catobs`](@ref). Optionally, implement `reduce(catobs, ...)` as well to avoid excessive compilation if the number of arguments varies a lot
74-
* define a specialized method for `MLUtils.numobs`, which we can however import directly from `Mill`.
75-
* register the custom node with [HierarchicalUtils.jl](@ref) to obtain pretty printing, iterators and other functionality
83+
* define a specialized method for `MLUtils.numobs`, which we can however import directly from
84+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl).
85+
* register the custom node with [`HierarchicalUtils.jl`](@ref) to obtain pretty printing, iterators and other functionality
7686

7787
Here is an example of a custom node with the same functionality as in the [Unix path example](@ref)
7888
section:

docs/src/manual/leaf_data.md

Lines changed: 8 additions & 2 deletions
@@ -59,7 +59,9 @@ hosts = [
5959
]
6060
```
6161

62-
`Mill` offers `n`gram histogram-based representation for strings. To get started, we pass the vector of strings into the constructor of [`NGramMatrix`](@ref):
62+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) offers `n`gram histogram-based representation
63+
for strings. To get started, we pass the vector of strings into the constructor of
64+
[`NGramMatrix`](@ref):
6365

6466
```@repl leafs
6567
hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
@@ -139,4 +141,8 @@ gradient(m -> sum(m(ds)), m)
139141
!!! ukn "Numerical features"
140142
To put all numerical features into one [`ArrayNode`](@ref) is a design choice. We could as well introduce more keys in the final [`ProductNode`](@ref). The model treats these two cases slightly differently (see [Nodes](@ref) section).
141143

142-
This dummy example illustrates the versatility of `Mill`. With little to no preprocessing we are able to process complex hierarchical structures and avoid manually designing feature extraction procedures. For a more involved study on processing Internet traffic with `Mill`, see for example [Pevny2020](@cite).
144+
This dummy example illustrates the versatility of
145+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). With little to no preprocessing we are able to
146+
process complex hierarchical structures and avoid manually designing feature extraction procedures.
147+
For a more involved study on processing Internet traffic with
148+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), see for example [Pevny2020](@cite).

docs/src/manual/missing.md

Lines changed: 8 additions & 3 deletions
@@ -17,7 +17,10 @@ and many other possible reasons. At the same time, it is wasteful to throw away
1717
2. Empty bags with no instances in a [`BagNode`](@ref)
1818
3. An entire key missing in a [`ProductNode`](@ref)
1919

20-
At the moment, `Mill` is capable of handling the first two cases. The solution always involves an additional vector of parameters (denoted always by `ψ`) that are used during the model evaluation to substitute the missing values. Parameters `ψ` can be either fixed or learned during training. Everything is done automatically.
20+
At the moment, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is capable of handling the first
21+
two cases. The solution always involves an additional vector of parameters (denoted always by `ψ`)
22+
that are used during the model evaluation to substitute the missing values. Parameters `ψ` can be
23+
either fixed or learned during training. Everything is done automatically.
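
The substitution idea can be illustrated with `Base.coalesce`. In this sketch the matrix `X` and the vector `ψ` are made-up values, and using one `ψ` entry per feature (row) is an assumption made for the illustration. In `Mill.jl` the imputation happens inside the model, and `ψ` may be learned during training.

```julia
# Hypothetical data: 2 features (rows), 2 observations (columns), some values missing
X = [1.0f0 missing; missing 4.0f0]

# Hypothetical imputation parameters, one per feature
ψ = Float32[0.5, -0.5]

# Broadcasting pairs ψ[i] with row i, so each missing entry in row i becomes ψ[i]
X_imputed = coalesce.(X, ψ)
```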
2124

2225
## Empty bags
2326

@@ -99,7 +102,8 @@ Storing missing strings in [`NGramMatrix`](@ref) is straightforward:
99102
missing_ngrams = NGramMatrix(["foo", missing, "bar"], 3, 256, 5)
100103
```
101104

102-
When some values of categorical variables are missing, `Mill` defines a new type for representation:
105+
When some values of categorical variables are missing,
106+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) defines a new type for representation:
103107

104108
```@repl missing
105109
missing_categorical = maybehotbatch([missing, 2, missing], 1:5)
@@ -187,7 +191,8 @@ Here, `[pre_imputing]Dense` and `[post_imputing]Dense` are standard dense layers
187191
dense = m.ms[1].m; typeof(dense.weight)
188192
```
189193

190-
Inside `Mill` we add a special definition `Base.show` for these types for compact printing.
194+
Inside [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) we add a special definition `Base.show`
195+
for these types for compact printing.
191196

192197
The [`reflectinmodel`](@ref) method uses types to determine whether imputing is needed or not. Compare the following:
193198

docs/src/manual/more_on_nodes.md

Lines changed: 8 additions & 5 deletions
@@ -56,14 +56,15 @@ ds = BagNode(ProductNode((BagNode(randn(Float32, 4, 10),
5656
[1:1, 2:3, 4:5])
5757
```
5858

59-
When data and model trees become complex, `Mill` limits the printing. To inspect the whole tree, use
60-
`printtree`:
59+
When data and model trees become complex, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) limits
60+
the printing. To inspect the whole tree, use `printtree`:
6161

6262
```@repl more_on_nodes
6363
printtree(ds)
6464
```
6565

66-
Instead of defining a model manually, we can also make use of [Model reflection](@ref), another `Mill` functionality, which simplifies model creation:
66+
Instead of defining a model manually, we can also make use of [Model reflection](@ref), another
67+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) functionality, which simplifies model creation:
6768

6869
```@repl more_on_nodes
6970
m = reflectinmodel(ds, d -> Dense(d, 2), SegmentedMean)
@@ -72,7 +73,8 @@ m(ds)
7273

7374
## Node conveniences
7475

75-
To make the handling of data and model hierarchies easier, `Mill` provides several tools. Let's set up some data:
76+
To make the handling of data and model hierarchies easier,
77+
[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) provides several tools. Let's set up some data:
7678

7779
```@repl more_on_nodes
7880
AN = ArrayNode(Float32.([1 2 3 4; 5 6 7 8]))
@@ -95,7 +97,8 @@ numobs(PN)
9597

9698
### Indexing and Slicing
9799

98-
Indexing in [`Mill`] operates **on the level of observations**:
100+
Indexing in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) operates **on the level of
101+
observations**:
99102

100103
```@repl more_on_nodes
101104
AN[1]
