Rust-Data-Science
diff --git a/‎README.md‎
Lines changed: 0 additions & 93 deletions b/‎README.md‎
diff --git a/‎README.rst‎
Lines changed: 119 additions & 0 deletions b/‎README.rst‎
diff --git a/‎benchmark.md‎
Lines changed: 0 additions & 67 deletions b/‎benchmark.md‎
diff --git a/‎benchmark.rst‎
Lines changed: 80 additions & 0 deletions b/‎benchmark.rst‎
diff --git a/‎develop.md‎
Lines changed: 0 additions & 21 deletions b/‎develop.md‎
@@ -0,0 +1,119 @@
+ulist
+=====
+
+|PyPI| |License| |CI| |doc| |publish| |code style| |Coverage|
+
+`Documentation <https://tushushu.github.io/ulist/>`__ \| `Source
+Code <https://github.com/tushushu/ulist>`__
+
+What
+~~~~
+
+| Ulist is an ultra-fast list/array data structure written in Rust with
+  Python bindings. It aims to be the fundamental package for processing
+  and computing on 1-D lists/arrays in Python.
+| It provides:
+
+-  an efficient, flexible and expressive 1-D list/array object;
+-  broadcasting methods;
+-  a SQL-like and method-chaining programming experience;
+
+Performance
+~~~~~~~~~~~
+
+Ulist is extremely fast, even when compared with libraries like Numpy.
+It is
+
+- more efficient on ``string`` and ``boolean`` arrays,
+- about as efficient on ``integer`` arrays,
+- and a bit slower on ``floating`` arrays.
+
+Being faster than Numpy is not the goal of this project, because the two libraries serve different purposes. Ulist targets general-purpose use rather than only data science/machine learning/AI; for example, linear algebra computation is not provided. If you are curious about the performance, see the `benchmarking results <https://github.com/tushushu/ulist/blob/main/benchmark.md>`__.
+
+Requirements
+~~~~~~~~~~~~
+
+-  Python: 3.7+
+-  OS: Linux, macOS and Windows
+
+Installation
+~~~~~~~~~~~~
+
+Run ``pip install ulist``
+
+Examples
+~~~~~~~~
+
+Count the number of items in bins.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Given an array ``arr``, count the number of items in bins [0, 3), [3, 6), [6, 9) and [9, +inf). The ``result`` is a Python dictionary with bin names as keys and numbers as values.
+
+.. code:: python
+
+   >>> import ulist as ul
+
+   >>> arr = ul.arange(12)
+   >>> arr
+   UltraFastList([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
+
+   >>> result = arr.case(default='[9, +inf)')\
+   ...             .when(lambda x: x < 3, then='[0, 3)')\
+   ...             .when(lambda x: x < 6, then='[3, 6)')\
+   ...             .when(lambda x: x < 9, then='[6, 9)')\
+   ...             .end()\
+   ...             .counter()
+   >>> result
+   {'[3, 6)': 3, '[9, +inf)': 3, '[6, 9)': 3, '[0, 3)': 3}
+
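For readers new to the ``case``/``when`` chain, the same bucketing can be sketched in plain Python. This is only an illustration of the semantics visible in the output above (conditions are checked in order, the first match wins, and ``default`` covers the rest); it is not how ulist implements it, and ``bin_label`` is a hypothetical helper name.

```python
from collections import Counter


def bin_label(x):
    """Return the bin name for x; the first matching condition wins."""
    if x < 3:
        return '[0, 3)'
    if x < 6:
        return '[3, 6)'
    if x < 9:
        return '[6, 9)'
    return '[9, +inf)'  # plays the role of `default`


arr = list(range(12))
# Label every item, then count how often each label occurs.
result = dict(Counter(bin_label(x) for x in arr))
print(result)
```

Each of the four bins receives three of the twelve items, matching the ``result`` shown above.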
+Dot product.
+^^^^^^^^^^^^
+
+Given two 1-D arrays, calculate their dot product.
+
+.. code:: python
+
+   >>> import ulist as ul
+
+   >>> arr = ul.from_seq(range(1, 4), dtype='float')
+   >>> arr
+   UltraFastList([1.0, 2.0, 3.0])
+
+   >>> result = arr.mul(arr).sum()
+   >>> result
+   14.0
+
+Rate of adults.
+^^^^^^^^^^^^^^^
+
+Given the ages of people as ``arr``, and assuming adults are aged 18 or above, clean the data by removing abnormal values and then calculate the rate of adults.
+
+.. code:: python
+
+   >>> import ulist as ul
+
+   >>> arr = ul.from_seq([-1, 10, 15, 20, 30, 50, 70, 80, 100, 200], dtype='int')
+   >>> result = arr.where(lambda x: (x >= 0) & (x < 120))\
+   ...             .apply(lambda x: x >= 18)\
+   ...             .mean()
+   >>> result
+   0.75
+
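To see where ``0.75`` comes from, the same three steps can be traced in plain Python. This is a sketch of the arithmetic only, not of ulist internals; the intermediate names ``valid`` and ``is_adult`` are introduced here for illustration.

```python
ages = [-1, 10, 15, 20, 30, 50, 70, 80, 100, 200]

# Step 1 (where): keep only plausible ages, dropping -1 and 200.
valid = [x for x in ages if 0 <= x < 120]

# Step 2 (apply): turn each age into a boolean "is adult" flag.
is_adult = [x >= 18 for x in valid]

# Step 3 (mean): True counts as 1, so the mean is the share of adults.
rate = sum(is_adult) / len(is_adult)
print(valid)
print(rate)
```

Eight ages survive the filter and six of them are 18 or above, so the rate is 6 / 8.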
+Contribute
+~~~~~~~~~~
+
+All contributions are welcome. See the `Developer Guide <https://github.com/tushushu/ulist/blob/main/develop.md>`__.
+
+.. |PyPI| image:: https://badge.fury.io/py/ulist.svg
+   :target: https://pypi.org/project/ulist/
+.. |License| image:: https://img.shields.io/github/license/tushushu/ulist
+   :target: https://github.com/tushushu/ulist/blob/main/LICENSE
+.. |CI| image:: https://github.com/tushushu/ulist/actions/workflows/main.yml/badge.svg?branch=0.9.0
+   :target: https://github.com/tushushu/ulist/actions/workflows/main.yml
+.. |doc| image:: https://github.com/tushushu/ulist/actions/workflows/sphinx.yml/badge.svg?branch=0.9.0
+   :target: https://github.com/tushushu/ulist/actions/workflows/sphinx.yml
+.. |publish| image:: https://github.com/tushushu/ulist/actions/workflows/publish.yml/badge.svg?branch=0.9.0
+   :target: https://github.com/tushushu/ulist/actions/workflows/publish.yml
+.. |code style| image:: https://img.shields.io/badge/style-flake8-blue
+   :target: https://github.com/PyCQA/flake8
+.. |Coverage| image:: https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/tushushu/3a76a8f4c0d25c24b840fe66a3cf44c1/raw/metacov.json
+   :target: https://github.com/tushushu/ulist/actions/workflows/coverage.yml
@@ -0,0 +1,80 @@
+How do we benchmark?
+~~~~~~~~~~~~~~~~~~~~
+
+This benchmarking task is run by GitHub Actions on ubuntu-latest. This document is updated every time a new version is released.
+
+For each dtype (``int``, ``float``, ``str`` and ``bool``), several sub-tasks compare the performance of ``ulist`` and ``numpy``. Each sub-task has 5 rounds with different array sizes and numbers of runs:
+
+- XS - array size 100, run 100K times;
+- S - array size 1K, run 100K times;
+- M - array size 10K, run 10K times;
+- L - array size 100K, run 1K times;
+- XL - array size 1M, run 100 times.
+
+The result of each round and the average over all rounds are both recorded.
+
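The round scheme above can be sketched with the standard ``timeit`` module. This is a minimal illustration under stated assumptions, not the project's actual benchmark script: ``speed_ratio``, ``run_sub_task`` and ``loop_sum`` are hypothetical names, and the built-in ``sum`` and a hand-written loop stand in for ``ulist`` and ``numpy``.

```python
import timeit

# (array size, number of runs) for each round, as listed above.
ROUNDS = {
    "XS": (100, 100_000),
    "S": (1_000, 100_000),
    "M": (10_000, 10_000),
    "L": (100_000, 1_000),
    "XL": (1_000_000, 100),
}


def speed_ratio(candidate, baseline, size, runs):
    """Return baseline time / candidate time (above 1.0: candidate is faster)."""
    data = list(range(size))
    t_candidate = timeit.timeit(lambda: candidate(data), number=runs)
    t_baseline = timeit.timeit(lambda: baseline(data), number=runs)
    return t_baseline / t_candidate


def run_sub_task(candidate, baseline, rounds=ROUNDS):
    """Score every round, then add the mean of the round scores."""
    scores = {
        name: speed_ratio(candidate, baseline, size, runs)
        for name, (size, runs) in rounds.items()
    }
    average = sum(scores.values()) / len(scores)
    return {**scores, "Average": average}


def loop_sum(values):
    """Hypothetical slow baseline: sum with an explicit Python loop."""
    total = 0
    for v in values:
        total += v
    return total


# Scaled-down rounds so the demo finishes quickly.
demo = run_sub_task(sum, loop_sum, rounds={"XS": (100, 200), "S": (1_000, 50)})
print({name: f"{score:.1f}x" for name, score in demo.items()})
```

Passing the full ``ROUNDS`` table reproduces the five-round layout, at the cost of a much longer run time.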
+What does the result mean?
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The benchmark scores are displayed as a table similar to the one below:
+
+======== ===== ==== ==== ==== ==== ==== =======
+Item     Dtype XS   S    M    L    XL   Average
+======== ===== ==== ==== ==== ==== ==== =======
+AddOne   int   0.9x 1.0x 1.0x 1.0x 1.1x 1.0x
+ArraySum int   4.8x 6.2x 7.4x 6.4x 7.3x 6.4x
+EqualOne int   1.3x 1.3x 1.0x 0.9x 0.8x 1.1x
+======== ===== ==== ==== ==== ==== ==== =======
+
+- Item - The task used to compare performance.
+- Dtype - The array element type.
+
+Take the third row as an example: for the task ``EqualOne`` with
+``dtype=int``, ``ulist`` is on average 1.1 times as fast as ``numpy``.
+
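The ``Average`` column in the sample table is consistent with a plain arithmetic mean of the five round scores, rounded to one decimal. The sketch below checks this for the sample rows; ``average_score`` and ``fmt`` are hypothetical helper names, and the benchmark script may well compute the average from unrounded values.

```python
def average_score(round_scores):
    """Arithmetic mean of the per-round speed ratios."""
    return sum(round_scores) / len(round_scores)


def fmt(score):
    """Format a ratio the way the table shows it, e.g. 1.1x."""
    return f"{score:.1f}x"


# Round scores (XS, S, M, L, XL) copied from the sample table above.
sample = {
    "AddOne": [0.9, 1.0, 1.0, 1.0, 1.1],
    "ArraySum": [4.8, 6.2, 7.4, 6.4, 7.3],
    "EqualOne": [1.3, 1.3, 1.0, 0.9, 0.8],
}

for item, scores in sample.items():
    print(item, fmt(average_score(scores)))
```

For ``EqualOne`` the mean is 5.3 / 5 = 1.06, which is displayed as ``1.1x``.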
+Benchmark score
+~~~~~~~~~~~~~~~
+
+| Info:
+
+----
+
+| Date: 2022-02-26 10:38:31
+| System OS: Linux
+| CPU: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+| Python version: 3.10.2
+| Ulist version: 0.8.0
+| Numpy version: 1.22.0
+
+----
+
+Result:
+
+============ ====== ===== ===== ===== ===== ===== ======= ======
+Item         Dtype  XS    S     M     L     XL    Average Faster
+============ ====== ===== ===== ===== ===== ===== ======= ======
+AddOne       int    0.9x  1.0x  1.0x  1.0x  1.1x  1.0x    N
+ArraySum     int    6.0x  7.0x  8.4x  5.5x  7.0x  6.8x    Y
+CountElems   int    9.7x  1.7x  0.9x  0.8x  0.9x  2.8x    Y
+EqualOne     int    1.4x  1.4x  1.4x  0.9x  0.8x  1.2x    Y
+Max          int    4.4x  3.7x  3.2x  3.0x  3.2x  3.5x    Y
+MulTwo       int    1.0x  1.0x  0.8x  0.8x  0.8x  0.9x    N
+UniqueElem   int    2.7x  0.5x  0.4x  0.3x  0.3x  0.8x    N
+Sort         int    0.8x  0.6x  0.9x  0.9x  0.9x  0.8x    N
+AddOne       float  1.0x  1.2x  1.2x  1.1x  1.1x  1.1x    Y
+ArraySum     float  4.0x  2.0x  0.7x  0.4x  0.4x  1.5x    Y
+LessThanOne  float  1.2x  1.2x  0.9x  0.8x  1.0x  1.0x    N
+Max          float  2.9x  1.1x  0.2x  0.1x  0.1x  0.9x    N
+MulTwo       float  1.0x  1.0x  1.1x  1.0x  1.0x  1.0x    N
+Sort         float  0.9x  0.6x  0.7x  0.7x  0.7x  0.7x    N
+AllIsTrue    bool   5.5x  3.4x  1.2x  0.7x  0.6x  2.3x    Y
+AndOp        bool   0.5x  0.8x  1.3x  4.3x  3.8x  2.1x    Y
+AnyIsTrue    bool   5.4x  3.4x  1.2x  0.7x  0.6x  2.3x    Y
+NotOp        bool   0.6x  0.9x  1.5x  4.9x  4.5x  2.5x    Y
+OrOp         bool   0.5x  0.8x  1.4x  3.6x  3.4x  1.9x    Y
+ContainsElem string 16.4x 20.1x 20.6x 20.7x 20.3x 19.6x   Y
+CountElems   string 4.7x  1.7x  1.5x  1.9x  2.1x  2.4x    Y
+EqualFoo     string 1.2x  2.8x  3.6x  3.9x  2.5x  2.8x    Y
+============ ====== ===== ===== ===== ===== ===== ======= ======
+
+Ulist is faster than Numpy in 14 of 22 tasks!
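
The tally can be rechecked from the ``Average`` column. The rule used here, that a task counts as faster when its displayed average is above ``1.0x``, is an assumption inferred from the table (every ``1.0x`` row is marked ``N`` and every ``1.1x`` row ``Y``), not a documented definition.

```python
# Displayed averages copied from the result table, in row order.
averages = [
    1.0, 6.8, 2.8, 1.2, 3.5, 0.9, 0.8, 0.8,  # int
    1.1, 1.5, 1.0, 0.9, 1.0, 0.7,            # float
    2.3, 2.1, 2.3, 2.5, 1.9,                 # bool
    19.6, 2.4, 2.8,                          # string
]

# Assumed rule: "Y" when the displayed average exceeds 1.0x.
faster = ["Y" if avg > 1.0 else "N" for avg in averages]
n_faster = faster.count("Y")
print(f"{n_faster} of {len(averages)} tasks are faster")
```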