|
1 | | -# DiDa.jl |
| 1 | +# <img src="docs/src/assets/logo.svg" alt="DiDa.jl logo" height="32px"> DiDa.jl |
2 | 2 |
|
3 | | -Simple Distributed Data manipulation and processing routines for Julia. |
| 3 | + |
| 4 | +| Build status | Documentation | |
| 5 | +|:---:|:---:| |
| 6 | +|  | [stable](https://lcsb-biocore.github.io/DiDa.jl/stable/) [dev](https://lcsb-biocore.github.io/DiDa.jl/dev/) | |
| 7 | + |
| 8 | +Simple distributed data manipulation and processing routines for Julia. |
4 | 9 |
|
5 | 10 | This was originally developed for |
6 | | -[GigaSOM.jl](https://github.com/LCSB-BioCore/GigaSOM.jl), this package contains |
7 | | -the separated-out lightweight distributed-processing framework that can be used |
8 | | -with GigaSOM. |
| 11 | +[`GigaSOM.jl`](https://github.com/LCSB-BioCore/GigaSOM.jl); the DiDa.jl package |
| 12 | +contains the separated-out lightweight distributed-processing framework that |
| 13 | +was used in `GigaSOM.jl`. |
| 14 | + |
| 15 | +## Why? |
| 16 | + |
| 17 | +DiDa.jl provides a very simple, imperative and straightforward way to move your |
| 18 | +data around a cluster of Julia processes created by the |
| 19 | +[`Distributed`](https://docs.julialang.org/en/v1/stdlib/Distributed/) package, |
| 20 | +and run computation on the distributed data pieces. The main aim of the package |
| 21 | +is to avoid anything complicated: the first version used in |
| 22 | +[GigaSOM](https://github.com/LCSB-BioCore/GigaSOM.jl) had just under 500 lines |
| 23 | +of relatively straightforward code (including the doc-comments). |
| 24 | + |
| 25 | +Compared to the plain `Distributed` API, you get more straightforward data |
| 26 | +manipulation primitives, some extra control over the precise place where code |
| 27 | +is executed, and a few high-level functions. These include a distributed |
| 28 | +version of `mapreduce`, a simpler work-alike of the |
| 29 | +[DistributedArrays](https://github.com/JuliaParallel/DistributedArrays.jl) |
| 30 | +functionality, and easy-to-use distributed dataset saving and loading. |
| 31 | + |
| 32 | +Most importantly, the package is motivated by the belief that |
| 33 | +distributed processing should be simple and accessible. |
| 34 | + |
| 35 | +## Brief how-to |
| 36 | + |
| 37 | +The package provides a few very basic primitives that lightly wrap the |
| 38 | +`Distributed` package functions `remotecall` and `fetch`. The most basic one is |
| 39 | +`save_at`, which takes a worker ID, variable name and variable content, and |
| 40 | +saves the content to the variable on the selected worker. `get_from` works the |
| 41 | +same way, but takes the data back from the worker. |
| 42 | + |
| 43 | +You can thus send a random array to one of the distributed workers: |
| 44 | + |
| 45 | +```julia |
| 46 | +julia> using Distributed, DiDa |
| 47 | + |
| 48 | +julia> addprocs(2) |
| 49 | +2-element Array{Int64,1}: |
| 50 | + 2 |
| 51 | + 3 |
| 52 | + |
| 53 | +julia> @everywhere using DiDa |
| 54 | + |
| 55 | +julia> save_at(2, :x, randn(10,10)) |
| 56 | +Future(2, 1, 4, nothing) |
| 57 | +``` |
| 58 | + |
| 59 | +The `Future` returned from `save_at` is the normal Julia future from |
| 60 | +`Distributed`; you can even `fetch` it to wait until the operation is really |
| 61 | +done on the other side. Fetching the data is done the same way: |
| 62 | + |
| 63 | +```julia |
| 64 | +julia> get_from(2,:x) |
| 65 | +Future(2, 1, 15, nothing) |
| 66 | + |
| 67 | +julia> get_val_from(2,:x) # auto-fetch()ing variant |
| 68 | +10×10 Array{Float64,2}: |
| 69 | + -0.850788 0.946637 1.78006 … |
| 70 | + -0.49596 0.497829 -2.03013 |
| 71 | + … |
| 72 | +``` |
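
Since the primitives are described as light wrappers around `remotecall` and `fetch`, the following sketch shows how a similar pair could be built from the plain `Distributed` API. The names `my_save_at` and `my_get_from` are hypothetical stand-ins made up for this illustration; this is not claimed to be the package's actual implementation.

```julia
using Distributed

addprocs(1)
w = first(workers())

# Hypothetical stand-in for `save_at`: assign `val` to the global variable
# `name` in module Main on the chosen worker. Returns a Future.
my_save_at(worker, name::Symbol, val) =
    remotecall(() -> Base.eval(Main, :(begin
        $name = $val
        nothing   # avoid sending the stored value back on fetch
    end)), worker)

# Hypothetical stand-in for `get_from`: evaluate the symbol (or any quoted
# expression) in Main on the worker. Returns a Future holding the result.
my_get_from(worker, expr) =
    remotecall(() -> Base.eval(Main, expr), worker)

fetch(my_save_at(w, :x, [1, 2, 3]))   # wait until the value is stored
x_back = fetch(my_get_from(w, :x))    # bring the stored value back
```

Evaluating in `Main` on the worker is what makes the variable visible to later remote calls; the trailing `nothing` keeps the stored value from being shipped back when the future is fetched.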
| 73 | + |
| 74 | +All commands support full quoting, which allows you to easily distinguish |
| 75 | +between code parts that are executed locally and remotely: |
| 76 | + |
| 77 | +```julia |
| 78 | +julia> save_at(3, :x, randn(1000,1000)) # generates a matrix locally and sends it to the remote worker |
| 79 | + |
| 80 | +julia> save_at(3, :x, :(randn(1000,1000))) # generates a matrix right on the remote worker and saves it there |
| 81 | + |
| 82 | +julia> get_val_from(3, :x) # retrieves the generated matrix and fetches it |
| 83 | +… |
| 84 | + |
| 85 | +julia> get_val_from(3, :(randn(1000,1000))) # generates the matrix on the worker and fetches the data |
| 86 | +… |
| 87 | +``` |
| 88 | + |
| 89 | +Notably, this is different from the approach taken by `DistributedArrays` and |
| 90 | +similar packages -- all data manipulation is explicit, and any data type is |
| 91 | +supported as long as it can be moved among workers by the `Distributed` |
| 92 | +package. This helps with various highly non-array-ish data, such as large text |
| 93 | +corpora and graphs. |
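
As a small illustration of non-array data, the sketch below stores a vector of strings on a worker and queries it remotely, using only `save_at` and `get_val_from` as shown above. It assumes DiDa is installed and loadable on all processes; the variable name `corpus` and its contents are made up for the example.

```julia
using Distributed

addprocs(1)
@everywhere using DiDa   # assumption: DiDa is installed for every process

w = first(workers())

# A tiny stand-in for a text corpus (illustrative data only)
corpus = ["the quick brown fox", "jumps over", "the lazy dog"]

# Send the documents to the worker and wait until they are stored
fetch(save_at(w, :corpus, corpus))

# Quoted expressions are evaluated on the worker; only the results travel back
n_docs = get_val_from(w, :(length(corpus)))
n_words = get_val_from(w, :(sum(length(split(doc)) for doc in corpus)))
```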
| 94 | + |
| 95 | +There are various goodies for easy work with matrix-style data, namely |
| 96 | +scattering, gathering and running distributed algorithms: |
| 97 | + |
| 98 | +```julia |
| 99 | +julia> x = randn(1000,3) |
| 100 | +1000×3 Array{Float64,2}: |
| 101 | + -0.992481 0.551064 1.67424 |
| 102 | + -0.751304 -0.845055 0.105311 |
| 103 | + -0.712687 0.165619 -0.469055 |
| 104 | + ⋮ |
| 105 | + |
| 106 | +julia> dataset = scatter_array(:myDataset, x, workers()) # sends slices of the array to workers |
| 107 | +Dinfo(:myDataset, [2, 3]) # a helper for holding the variable name and the used workers together |
| 108 | + |
| 109 | +julia> get_val_from(3, :(size(myDataset))) |
| 110 | +(500, 3) # there's really only half of the data |
| 111 | + |
| 112 | +julia> dmapreduce(dataset, sum, +) # MapReduce-style sum of all data |
| 113 | +-51.64369103751014 |
| 114 | + |
| 115 | +julia> dstat(dataset, [1,2,3]) # get means and sdevs in individual columns |
| 116 | +([-0.030724038974465212, 0.007300925745200863, -0.028220577808245786], |
| 117 | + [0.9917470012495775, 0.9975120525455358, 1.000243845434252]) |
| 118 | + |
| 119 | +julia> dmedian(dataset, [1,2,3]) # distributed iterative median in columns |
| 120 | +3-element Array{Float64,1}: |
| 121 | + 0.004742259615849834 |
| 122 | + 0.039043266340824986 |
| 123 | + -0.05367799062404967 |
| 124 | + |
| 125 | +julia> dtransform(dataset, x -> 2 .^ x) # exponentiate all data (medians should now be around 1) |
| 126 | +Dinfo(:myDataset, [2, 3]) |
| 127 | + |
| 128 | +julia> gather_array(dataset) # download the data from workers to a single array |
| 129 | +1000×3 Array{Float64,2}: |
| 130 | + 0.502613 1.46517 3.1915 |
| 131 | + 0.594066 0.55669 1.07573 |
| 132 | + 0.610183 1.12165 0.722438 |
| 133 | + ⋮ |
| 134 | +``` |
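
To make the MapReduce-style sum above more concrete, here is a rough sketch of the same idea using only the plain `Distributed` API. The row-splitting scheme and the variable names are assumptions chosen for illustration; this is not how `scatter_array` or `dmapreduce` are necessarily implemented.

```julia
using Distributed

addprocs(2)
ws = workers()

x = randn(1000, 3)

# Split the rows into one contiguous block per worker (assumed scheme)
n = size(x, 1)
bounds = round.(Int, range(0, n; length = length(ws) + 1))
slices = [x[bounds[i]+1:bounds[i+1], :] for i in 1:length(ws)]

# Map: each worker sums its own slice. Reduce: add the partial sums locally.
futures = [remotecall(sum, w, s) for (w, s) in zip(ws, slices)]
total = mapreduce(fetch, +, futures)
```

Unlike the dataset-based functions above, this sketch ships each slice along with the call instead of keeping it stored on the worker, so it only illustrates the map and reduce steps.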
| 135 | + |
| 136 | +## What does the name `DiDa` mean? |
| 137 | + |
| 138 | +**Di**stributed **Da**ta. |
9 | 139 |
|
| 140 | +There is no consensus on how to pronounce the abbreviation. |