0.5.0 release (#171)

YuhanLiin · web-flow · commit d2d0e67e624f · 2021-10-20T23:17:41.000Z
* Retroactively update changelog for 0.4.0

* Update changelog for 0.5.0

* Bump all Cargo.toml versions to 0.5.0

* Update contribute doc

* Add new release webpage and rename 0.4.0 release page name

* Add winequality example and fix log-sum-exp clipping
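
The log-sum-exp clipping mentioned in the last item can be illustrated outside of `ndarray`. The following is only a minimal sketch on a plain `f64` slice, not the crate's implementation: the function name and the slice-based signature are assumptions made for illustration, while the `1e-15` floor is taken from the diff. The clamp presumably keeps `ln` from returning negative infinity should the reduced sum underflow to zero.

```rust
/// Numerically stable log-sum-exp over a slice (illustrative, not linfa's API).
fn log_sum_exp(xs: &[f64]) -> f64 {
    // Subtract the maximum so that the largest exponent is exp(0) = 1.
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let reduced: f64 = xs.iter().map(|x| (x - max).exp()).sum();
    // Clamp the sum to a small positive floor before taking the logarithm,
    // mirroring `e.max(F::cast(1e-15)).ln() + max` from the diff.
    reduced.max(1e-15).ln() + max
}
```

Calling `log_sum_exp(&[1000.0, 1000.0])` stays finite, whereas a naive `exp` followed by `ln` would overflow for inputs of that size.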
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,41 @@
+Version 0.5.0 - 2021-10-20
+========================
+
+New Algorithms
+-----------
+
+ * Nearest neighbour algorithms and traits have been added as `linfa-nn` by [@YuhanLiin]
+ * OPTICS has been added to `linfa-clustering` by [@xd009642]
+ * Multinomial logistic regression has been added to `linfa-logistic` by [@YuhanLiin]
+
+Changes
+-----------
+ * use least squares solver from `ndarray-linalg` in `linfa-linear` (3dc9cb0)
+ * optimized DBSCAN by replacing linear range query implementation with KD-tree (44f91d0)
+ * allow distance metrics other than Euclidean to be used for KMeans (4e58d8d)
+ * enable models to write prediction results into existing memory without allocating (37bc25b)
+ * bumped `ndarray` version to 0.15 and reduced duplicated dependencies (603f821)
+ * introduce `ParamGuard` trait to algorithm parameter sets to enable both explicit and implicit parameter checking (01f912a)
+ * replace uses of HNSW with `linfa-nn` (208a762)
+
+Version 0.4.0 - 2021-04-28
+========================
+
+New Algorithms
+-----------
+
+ * Partial Least Squares Regression has been added as `linfa-pls` by [@relf]
+ * Barnes-Hut t-SNE wrapper has been added as `linfa-tsne` by [@frjnn]
+ * Count-vectorizer and TF-IDF normalization have been added as `linfa-preprocessing` by [@Sauro98]
+ * Platt scaling has been added to `linfa-svm` by [@bytesnake]
+ * Incremental KMeans and KMeans++ and KMeans|| initialization methods added to `linfa-clustering` by [@YuhanLiin]
+
+Changes
+-----------
+ * bumped `ndarray` version to 0.14 (8276bdc)
+ * change trait signature of `linfa::Fit` to return `Result` (a5a479f)
+ * add `cross_validate` to perform K-folding (a5a479f)
+
 Version 0.3.1 - 2021-03-11
 ========================
 
diff --git a/CONTRIBUTE.md b/CONTRIBUTE.md
@@ -20,22 +20,25 @@ where the type of the input dataset is `&Dataset<Kernel<F>, Array1<bool>>`. It p
 
 The [Predict](src/traits.rs) trait has its own section later in this document, while for an example of a `Transformer` please look into the [linfa-kernel](linfa-kernel/src/lib.rs) implementation.
 
-## Parameters and builder
+## Parameters and checking
 
 An algorithm has a number of hyperparameters, describing how it operates. This section describes how the algorithm's structs should be organized in order to conform with other implementations. 
 
-Imagine we have an implementation of `MyAlg`, there should a separate struct called `MyAlgParams`. The method `MyAlg::params(..) -> MyAlgParams` constructs a parameter set with default parameters and optionally required arguments (for example the number of clusters). If no parameters are required, then `std::default::Default` can be implemented as well:
+Sometimes only an algorithm's parameters must be checked for validity. As such, Linfa makes a distinction between checked and unchecked parameters. Unchecked parameters can be converted into checked parameters if the values are valid, and only checked parameters can be used to run the algorithm.
+
+For an implementation `MyAlg`, there should be two separate structs: `MyAlgValidParams`, which holds the checked parameters, and `MyAlgParams`, which holds the unchecked parameters. The method `MyAlg::params(..) -> MyAlgParams` constructs a parameter set with default parameters and, optionally, required arguments (for example the number of clusters). `MyAlgValidParams` should be a struct that contains all the hyperparameters as fields, and `MyAlgParams` should just be a newtype that wraps `MyAlgValidParams`.
 ```rust
-impl Default for MyAlgParams {
-    fn default() -> MyAlgParams {
-        MyAlg::params()
-    }
+struct MyAlgValidParams {
+    eps: f32,
+    backwards: bool,
 }
+
+struct MyAlgParams(MyAlgValidParams);
 ```
 
-The `MyAlgParams` should implement the Consuming Builder pattern, explained in the [Rust Book](https://doc.rust-lang.org/1.0.0/style/ownership/builders.html). Each hyperparameter gets a single field in the struct, as well as a method to modify it. Sometimes a random number generator is used in the training process. Then two separate methods should take a seed or a random number generator. With the seed a default RNG is initialized, for example [Isaac64](https://docs.rs/rand_isaac/0.2.0/rand_isaac/isaac64/index.html).
+`MyAlgParams` should implement the Consuming Builder pattern, explained in the [Rust Book](https://doc.rust-lang.org/1.0.0/style/ownership/builders.html). Each hyperparameter gets a method to modify it. `MyAlgParams` should also implement the `ParamGuard` trait, which facilitates parameter checking. The associated type `ParamGuard::Checked` should be `MyAlgValidParams` and the `check_ref()` method should contain the parameter checking logic, while `check()` simply calls `check_ref()` before unwrapping the inner `MyAlgValidParams`.
 
-With a constructed set of parameters, the `MyAlgParams::fit(..) -> Result<MyAlg>` executes the learning process and returns a learned state. If one of the parameters is invalid (for example out of a required range), then an `Error::InvalidState` should be returned. For transformers there is only `MyAlg`, and no `MyAlgParams`, because there is no hidden state to be learned.
+With a checked set of parameters, `MyAlgValidParams::fit(..) -> Result<MyAlg>` executes the learning process and returns a learned state. Due to blanket impls on `ParamGuard`, it's also possible to call `fit()` or `transform()` directly on `MyAlgParams`, which performs the parameter checking before the learning process.
 
 Following this convention, the pattern can be used by the user like this:
 ```rust
@@ -45,6 +48,11 @@ MyAlg::params()
     ...
     .fit(&dataset)?;
 ```
+or, if the checking is done explicitly:
+```rust
+let params = MyAlg::params().check()?;
+params.fit(&dataset)?;
+```
 
 ## Generic float types
 
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa"
-version = "0.4.0"
+version = "0.5.0"
 authors = [
     "Luca Palmieri <rust@lpalmieri.com>",
     "Lorenz Schmidt <bytesnake@mailbox.org>",
diff --git a/algorithms/linfa-bayes/Cargo.toml b/algorithms/linfa-bayes/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-bayes"
-version = "0.4.0"
+version = "0.5.0"
 authors = ["VasanthakumarV <vasanth260m12@gmail.com>"]
 description = "Collection of Naive Bayes Algorithms"
 edition = "2018"
@@ -15,8 +15,8 @@ ndarray = { version = "0.15" , features = ["blas", "approx"]}
 ndarray-stats = "0.5"
 thiserror = "1.0"
 
-linfa = { version = "0.4.0", path = "../.." }
+linfa = { version = "0.5.0", path = "../.." }
 
 [dev-dependencies]
 approx = "0.4"
-linfa-datasets = { version = "0.4.0", path = "../../datasets", features = ["winequality"] }
+linfa-datasets = { version = "0.5.0", path = "../../datasets", features = ["winequality"] }
diff --git a/algorithms/linfa-clustering/Cargo.toml b/algorithms/linfa-clustering/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-clustering"
-version = "0.4.0"
+version = "0.5.0"
 edition = "2018"
 authors = [
     "Luca Palmieri <rust@lpalmieri.com>",
@@ -37,8 +37,8 @@ rand_isaac = "0.3"
 space = "0.12"
 thiserror = "1.0"
 partitions = "0.2.4"
-linfa = { version = "0.4.0", path = "../..", features = ["ndarray-linalg"] }
-linfa-nn = { version = "0.1.0", path = "../linfa-nn" }
+linfa = { version = "0.5.0", path = "../..", features = ["ndarray-linalg"] }
+linfa-nn = { version = "0.5.0", path = "../linfa-nn" }
 noisy_float = "0.2.0"
 
 [dev-dependencies]
diff --git a/algorithms/linfa-elasticnet/Cargo.toml b/algorithms/linfa-elasticnet/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-elasticnet"
-version = "0.4.0"
+version = "0.5.0"
 authors = [
     "Paul Körbitz / Google <koerbitz@google.com>",
     "Lorenz Schmidt <bytesnake@mailbox.org>"
@@ -35,9 +35,9 @@ num-traits = "0.2"
 approx = "0.4"
 thiserror = "1.0"
 
-linfa = { version = "0.4.0", path = "../.." }
+linfa = { version = "0.5.0", path = "../.." }
 
 [dev-dependencies]
-linfa-datasets = { version = "0.4.0", path = "../../datasets", features = ["diabetes"] }
+linfa-datasets = { version = "0.5.0", path = "../../datasets", features = ["diabetes"] }
 ndarray-rand = "0.14"
 rand_isaac = "0.3"
diff --git a/algorithms/linfa-hierarchical/Cargo.toml b/algorithms/linfa-hierarchical/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-hierarchical"
-version = "0.4.0"
+version = "0.5.0"
 authors = ["Lorenz Schmidt <lorenz.schmidt@mailbox.org>"]
 edition = "2018"
 
@@ -18,10 +18,10 @@ ndarray = { version = "0.15", default-features = false }
 kodama = "0.2"
 thiserror = "=1.0.25"
 
-linfa = { version = "0.4.0", path = "../.." }
-linfa-kernel = { version = "0.4.0", path = "../linfa-kernel" }
+linfa = { version = "0.5.0", path = "../.." }
+linfa-kernel = { version = "0.5.0", path = "../linfa-kernel" }
 
 [dev-dependencies]
 rand = "0.8"
 ndarray-rand = "0.14"
-linfa-datasets = { version = "0.4.0", path = "../../datasets", features = ["iris"] }
+linfa-datasets = { version = "0.5.0", path = "../../datasets", features = ["iris"] }
diff --git a/algorithms/linfa-ica/Cargo.toml b/algorithms/linfa-ica/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-ica"
-version = "0.4.0"
+version = "0.5.0"
 authors = ["VasanthakumarV <vasanth260m12@gmail.com>"]
 description = "A collection of Independent Component Analysis (ICA) algorithms"
 edition = "2018"
@@ -32,7 +32,7 @@ num-traits = "0.2"
 rand_isaac = "0.3"
 thiserror = "1.0"
 
-linfa = { version = "0.4.0", path = "../..", features = ["ndarray-linalg"] }
+linfa = { version = "0.5.0", path = "../..", features = ["ndarray-linalg"] }
 
 [dev-dependencies]
 ndarray-npy = { version = "0.8", default-features = false }
diff --git a/algorithms/linfa-kernel/Cargo.toml b/algorithms/linfa-kernel/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-kernel"
-version = "0.4.0"
+version = "0.5.0"
 authors = ["Lorenz Schmidt <bytesnake@mailbox.org>"]
 description = "Kernel methods for non-linear algorithms"
 edition = "2018"
@@ -28,5 +28,5 @@ ndarray = "0.15"
 num-traits = "0.2"
 sprs = { version="0.11.0", default-features = false }
 
-linfa = { version = "0.4.0", path = "../.." }
-linfa-nn = { version = "0.1.0", path = "../linfa-nn" }
+linfa = { version = "0.5.0", path = "../.." }
+linfa-nn = { version = "0.5.0", path = "../linfa-nn" }
diff --git a/algorithms/linfa-linear/Cargo.toml b/algorithms/linfa-linear/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-linear"
-version = "0.4.0"
+version = "0.5.0"
 authors = [
     "Paul Körbitz / Google <koerbitz@google.com>",
     "VasanthakumarV <vasanth260m12@gmail.com>"
@@ -25,8 +25,8 @@ argmin = { version = "0.4.6", features = ["ndarrayl"] }
 serde = { version = "1.0", default-features = false, features = ["derive"] }
 thiserror = "1.0"
 
-linfa = { version = "0.4.0", path = "../..", features=["serde"] }
+linfa = { version = "0.5.0", path = "../..", features=["serde"] }
 
 [dev-dependencies]
-linfa-datasets = { version = "0.4.0", path = "../../datasets", features = ["diabetes"] }
+linfa-datasets = { version = "0.5.0", path = "../../datasets", features = ["diabetes"] }
 approx = "0.4"
diff --git a/algorithms/linfa-logistic/Cargo.toml b/algorithms/linfa-logistic/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-logistic"
-version = "0.4.0"
+version = "0.5.0"
 authors = ["Paul Körbitz / Google <koerbitz@google.com>"]
 
 description = "A Machine Learning framework for Rust"
@@ -22,8 +22,8 @@ argmin = { version = "0.4.6", features = ["ndarrayl"] }
 serde = "1.0"
 thiserror = "1.0"
 
-linfa = { version = "0.4.0", path = "../..", features=["serde"] }
+linfa = { version = "0.5.0", path = "../..", features=["serde"] }
 
 [dev-dependencies]
 approx = "0.4"
-linfa-datasets = { version = "0.4.0", path = "../../datasets", features = ["winequality"] }
+linfa-datasets = { version = "0.5.0", path = "../../datasets", features = ["winequality"] }
diff --git a/algorithms/linfa-logistic/README.md b/algorithms/linfa-logistic/README.md
@@ -5,13 +5,19 @@
 `linfa-logistic` is a crate in the [`linfa`](https://crates.io/crates/linfa) ecosystem, an effort to create a toolkit for classical Machine Learning implemented in pure Rust, akin to Python's `scikit-learn`.
 
 ## Current state
-`linfa-logistic` provides a pure Rust implementation of a two class logistic regression model. 
+`linfa-logistic` provides pure Rust implementations of two-class and multinomial logistic regression models.
 
 ## Examples
-There is an usage example in the `examples/` directory. To run, use:
+There are usage examples in the `examples/` directory.
 
+To run the two-class example, use:
 ```bash
-$ cargo run --example winequality
+$ cargo run --example winequality --features linfa/<blas-library>
+```
+
+To run the multinomial example, use:
+```bash
+$ cargo run --example winequality_multi --features linfa/<blas-library>
 ```
 
 ## License
diff --git a/algorithms/linfa-logistic/examples/winequality_multi.rs b/algorithms/linfa-logistic/examples/winequality_multi.rs
@@ -0,0 +1,36 @@
+use linfa::prelude::*;
+use linfa_logistic::MultiLogisticRegression;
+
+use std::error::Error;
+
+fn main() -> Result<(), Box<dyn Error>> {
+    let (train, valid) = linfa_datasets::winequality().split_with_ratio(0.9);
+
+    println!(
+        "Fit Multinomial Logistic Regression classifier with #{} training points",
+        train.nsamples()
+    );
+
+    // fit a multinomial logistic regression model with 50 max iterations
+    let model = MultiLogisticRegression::default()
+        .max_iterations(50)
+        .fit(&train)
+        .unwrap();
+
+    // predict and map targets
+    let pred = model.predict(&valid);
+
+    // create a confusion matrix
+    let cm = pred.confusion_matrix(&valid).unwrap();
+
+    // Print the confusion matrix. For a multinomial model this is a table with one row and one
+    // column per class: the diagonal holds the correctly classified samples, and the
+    // off-diagonal entries are misclassifications
+    println!("{:?}", cm);
+
+    // Calculate the accuracy and the Matthews correlation coefficient (correlation between
+    // predictions and targets)
+    println!("accuracy {}, MCC {}", cm.accuracy(), cm.mcc());
+
+    Ok(())
+}
diff --git a/algorithms/linfa-logistic/src/lib.rs b/algorithms/linfa-logistic/src/lib.rs
@@ -404,7 +404,7 @@ fn log_sum_exp<F: linfa::Float, A: Data<Elem = F>>(
     // Computes `max + ln(exp(x1-max) + exp(x2-max) + exp(x3-max) + ...)`, which is equal to the
     // log_sum_exp formula
     let reduced = m.fold_axis(axis, F::zero(), |acc, elem| *acc + (*elem - max).exp());
-    reduced.mapv_into(|e| e.ln() + max)
+    reduced.mapv_into(|e| e.max(F::cast(1e-15)).ln() + max)
 }
 
 /// Computes `exp(n - max) / sum(exp(n- max))`, which is a numerically stable version of softmax
diff --git a/algorithms/linfa-nn/Cargo.toml b/algorithms/linfa-nn/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-nn"
-version = "0.1.0"
+version = "0.5.0"
 authors = ["YuhanLiin <yuhanliin+github@protonmail.com>"]
 edition = "2018"
 description = "A collection of nearest neighbour algorithms"
@@ -33,7 +33,7 @@ thiserror = "1.0"
 
 kdtree = "0.6.0"
 
-linfa = { version = "0.4.0", path = "../.." }
+linfa = { version = "0.5.0", path = "../.." }
 
 [dev-dependencies]
 approx = "0.4"
diff --git a/algorithms/linfa-pls/Cargo.toml b/algorithms/linfa-pls/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-pls"
-version = "0.4.0"
+version = "0.5.0"
 edition = "2018"
 authors = ["relf <remi.lafage@onera.fr>"]
 description = "Partial Least Squares family methods"
@@ -32,9 +32,9 @@ rand_isaac = "0.3"
 num-traits = "0.2"
 paste = "1.0"
 thiserror = "1.0"
-linfa = { version = "0.4.0", path = "../..", features = ["ndarray-linalg"] }
+linfa = { version = "0.5.0", path = "../..", features = ["ndarray-linalg"] }
 
 [dev-dependencies]
-linfa-datasets = { version = "0.4.0", path = "../../datasets", features = ["linnerud"] }
+linfa-datasets = { version = "0.5.0", path = "../../datasets", features = ["linnerud"] }
 rand_isaac = "0.3"
 approx = "0.4"
diff --git a/algorithms/linfa-preprocessing/Cargo.toml b/algorithms/linfa-preprocessing/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "linfa-preprocessing"
-version = "0.4.0"
+version = "0.5.0"
 authors = ["Sauro98 <ivadonadi98@gmail.com>"]
 
 description = "A Machine Learning framework for Rust"
@@ -17,7 +17,7 @@ categories = ["algorithms", "mathematics", "science"]
 
 [dependencies]
 
-linfa = { version = "0.4.0", path = "../..", features = ["ndarray-linalg"] }
+linfa = { version = "0.5.0", path = "../..", features = ["ndarray-linalg"] }
 ndarray = { version = "0.15", default-features = false, features = ["approx", "blas"] }
 ndarray-linalg = { version = "0.14" }
 ndarray-stats = "0.5"
@@ -30,8 +30,8 @@ encoding = "0.2"
 sprs =  { version="0.11.0", default-features = false }
 
 [dev-dependencies]
-linfa-datasets = { version = "0.4.0", path = "../../datasets", features = ["diabetes", "winequality"] }
-linfa-bayes = { version = "0.4.0", path = "../linfa-bayes" }
+linfa-datasets = { version = "0.5.0", path = "../../datasets", features = ["diabetes", "winequality"] }
+linfa-bayes = { version = "0.5.0", path = "../linfa-bayes" }
 iai = "0.1" 
 curl = "0.4.35"
 flate2 = "1.0.20"
diff --git a/algorithms/linfa-reduction/Cargo.toml b/algorithms/linfa-reduction/Cargo.toml
diff --git a/algorithms/linfa-svm/Cargo.toml b/algorithms/linfa-svm/Cargo.toml
diff --git a/algorithms/linfa-trees/Cargo.toml b/algorithms/linfa-trees/Cargo.toml
diff --git a/algorithms/linfa-tsne/Cargo.toml b/algorithms/linfa-tsne/Cargo.toml
diff --git a/datasets/Cargo.toml b/datasets/Cargo.toml
diff --git a/docs/website/content/news/release_040/index.md b/docs/website/content/news/release_040/index.md
diff --git a/docs/website/content/news/release_040/tsne.png b/docs/website/content/news/release_040/tsne.png
diff --git a/docs/website/content/news/release_050.md b/docs/website/content/news/release_050.md