@@ -118,20 +118,25 @@ remotesom train \
118118 -x 5 -y 5 -d " <DIMENSION D>" \
119119 -T out-test-topology.json -o out-test-som.json \
120120 -s 5 -s 4 -s 3 -s 2 -s 1 -s 0.5 \
121- -D mydata.bin -n " <DATA SIZE N> "
121+ -D mydata.bin
122122```
123123
124124This will output the trained SOM in ` out-test-som.json ` . The JSON file will
125125contain an array of arrays of numbers; each of the arrays represents one
126126feature vector that corresponds to one SOM centroid. The SOM topology is saved
127- in ` out-test-topoology .json ` as a matrix of squared distances of the SOM nodes
127+ in ` out-test-topology .json ` as a matrix of squared distances of the SOM nodes
128128in the map space (i.e., * not* centroids in data space), the JSON contains an
129129array of arrays of numbers, forming the columns of the all-to-all distance
130130matrix.
131131
132- It is advisable for all data hosts to locally verify that their data is in a
133- good shape to train at least a local SOM before starting the federated
134- training.
132+ The command infers the number of data points from the size of `mydata.bin`
133+ (divided by the dimension given in `-d` and by the size of a single `float32`). To
134+ skip this inference step, you can also specify the exact data size using
135+ option `-n`, such as `-n 10000`.
136+
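The size inference described above can be reproduced by hand to cross-check a value for `-n`. The following is a minimal sketch, not necessarily how `remotesom` computes it internally: it assumes the data file holds nothing but raw 4-byte `float32` values, and the demo file, the dimension of 4, and all variable names are hypothetical. For the 160-byte demo file it prints 10.

```sh
# Hypothetical demo values; substitute your own data file and dimension.
dim=4
datafile=$(mktemp)

# Write 10 points x 4 dimensions x 4 bytes (float32) = 160 zero bytes,
# only so that this sketch can run without a real dataset.
dd if=/dev/zero of="$datafile" bs=160 count=1 2>/dev/null

# File size in bytes; the arithmetic expansion strips any padding from wc.
bytes=$(( $(wc -c < "$datafile") ))
point_bytes=$(( dim * 4 ))

if [ $(( bytes % point_bytes )) -ne 0 ]; then
  # A remainder suggests a wrong dimension or a truncated file.
  echo "size of $datafile is not a multiple of $point_bytes bytes" >&2
  n=
else
  n=$(( bytes / point_bytes ))
  echo "$n"   # candidate value for -n
fi

rm -f "$datafile"
```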
137+ Before starting the federated training, it is advisable for all data hosts to
138+ locally verify that their data is in good shape by training and examining
139+ a local SOM.
135140
136141### Prepare and exchange the keys (on all nodes)
137142
@@ -196,14 +201,14 @@ Refer to documentation of `ssh` for details.
196201
197202Each data host starts their own server by pointing it to the appropriate
198203cryptography keys and the local data source:
199-
200204``` sh
201205remotesom server \
202- -D mydata.bin -n " <DATA SIZE N> " - d " <DIMENSION D>" \
206+ -D mydata.bin -d " <DIMENSION D>" \
203207 -c server-cert.pem -k server-key.pem -a client-cert.pem
204208```
205- (The data size and dimension must be filled in, depending on the dataset. See
206- ` remotesom server --help ` for all network&security parameters.)
209+ (The data dimension `-d` must be filled in, depending on the dataset;
210+ optionally, it is also advisable to specify the data point count using `-n`. For
211+ details about network & security parameters, see `remotesom server --help`.)
207212
208213### Run the training (on coordinator)
209214
@@ -212,17 +217,19 @@ Once data hosts are ready, the coordinator runs several epochs of SOM training:
212217remotesom train-client \
213218 -x 10 -y 10 -d " <DIMENSION D>" \
214219 -T out-topology.json -o out-som.json \
215- -s 10 -s 9 -s 8 -s 7 -s 6 -s 5 -s 4 -s 3 -s 2 -s 1 -s 0.5 \ # add more training epochs as needed
220+ -s 10 -s 9 -s 8 -s 7 -s 6 -s 5 -s 4 -s 3 -s 2 -s 1 -s 0.5 \
216221 -c client-cert.pem -k client-key.pem \
217222 connect datahost.uni1.example.org -a server-cert1.pem \
218223 connect hpc.uni2.example.org -a server-cert2.pem \
219224 ... # more data host connections
220225```
221- (The coordinator must fill in the data dimension D , and server certificates and
226+ (The coordinator must fill in the data dimension `-d`, and the server certificates and
222227hostnames (and possibly other parameters) of all data hosts. See `remotesom
223228train-client connect --help` for all connection&security parameters.)
224229
225- If everything runs well, the trained 10×10 SOM will appear in ` som.json ` .
230+ If everything runs well, the trained 10×10 SOM will appear in ` out-som.json ` .
231+ For realistic data and larger SOMs, the number of training epochs will
232+ typically need to be adjusted by adding more ` -s ` options.
226233
227234### Compute per-cluster statistics (locally, on data nodes)
228235
@@ -234,8 +241,7 @@ command:
234241
235242``` sh
236243remotesom stats \
237- -i out-test-som.json \
238- -D mydata.bin -n " <DATA SIZE N>" \
244+ -i out-test-som.json -D mydata.bin \
239245 --out-means means.json \
240246 --out-variances variances.json \
241247 --out-counts counts.json \
@@ -368,8 +374,7 @@ focus on, you can use `remotesom subset` to cut out these clusters of the data:
368374
369375``` sh
370376remotesom subset \
371- -i out-test-som.json \
372- -D mydata.bin -n " <DATA SIZE N>" \
377+ -i out-test-som.json -D mydata.bin \
373378 -s 0 -s 1 -s 5 -s 10 \
374379 -O mysubset.bin
375380```
@@ -380,8 +385,8 @@ in the array stored in the SOM file (`out-test-som.json` in this case). Note
380385the clusters are numbered from zero!)
381386
382387After finishing, the ` subset ` command prints out the total number of points
383- that are included in the subset. You can use that in subsequent analysis as the
384- new data size for the option ` -n ` :
388+ that are included in the subset. If desired, you can use that in subsequent
389+ analysis as the new data size for the option ` -n ` :
385390``` sh
386391remotesom train \
387392 [...]
@@ -453,15 +458,16 @@ limitations that are still present:
453458 impossible.
454459- ** Internal data structures** : Some parts of the algorithms still need to
455460 materialize data-size-dependent arrays; most notably the array of "SOM
456- centroid indexes that are closest to all points" is cached in the local
457- median computation to save computation time. This array usually requires N×8
458- bytes. With commonly available 16GB of memory, you can still process a hefty
459- dataset of around 2 billion data points on a single host.
461+ centroid indexes that are closest to all points" is cached in several
462+ algorithms (notably, the local median computation and subsetting). This array
463+ usually requires N×8 bytes. With commonly available 16GB of memory, you can
464+ still process a hefty dataset of around 2 billion data points on a single
465+ host.
460466
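The memory figure for the cached index array can be checked with simple arithmetic. This is a rough sketch that takes the 8 bytes per data point from the text above as given, counts 16 GB as 16×10^9 bytes, ignores every other allocation, and needs a shell with 64-bit integer arithmetic; the variable names are illustrative only.

```sh
# Assumption from the text above: the cached index array takes about
# 8 bytes per data point.
bytes_per_point=8

# Memory needed for a given number of points, in bytes.
n=2000000000
needed_bytes=$(( n * bytes_per_point ))
echo "$needed_bytes"

# Largest point count whose index array fits into a 16 GB budget.
budget_bytes=$(( 16 * 1000000000 ))
max_points=$(( budget_bytes / bytes_per_point ))
echo "$max_points"
```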
461467If you run out of memory because of the data size, you can split the dataset
462468into several data hosts without any impact on the result. In turn, this enables
463- horizontal scalability --- the speed & amount of data processed at once is only
464- limited by the amount of computers you can attach to the analysis.
469+ horizontal scalability: the speed and total amount of data processed at once
470+ are limited only by the number of computers you can attach to the analysis.
465471
466472##### SOM size limits
467473