Commit 2ad0196

allow assuming the input filesize (-n is now optional)
1 parent fff279a commit 2ad0196

4 files changed

Lines changed: 58 additions & 30 deletions

Numeric/RemoteSOM/IO.hs

Lines changed: 8 additions & 0 deletions
@@ -57,6 +57,14 @@ writeArrayStorable a fp = do
   bracket (openFile fp WriteMode) hClose $ \h ->
     SV.hPut h . SV.pack $ A.toList a

+getFileEntryCount ::
+     (A.Shape sh, Storable a) => sh -> a -> FilePath -> IO (Maybe Int)
+getFileEntryCount sh a fp = do
+  sz <- fromInteger <$> bracket (openFile fp ReadMode) (hClose) hFileSize
+  case divMod sz $ shapeSize' sh * sizeOf a of
+    (n, 0) -> pure (Just n)
+    _ -> pure Nothing
+
 withMmapArray ::
      forall sh a r.
      ( A.Shape sh
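
The size check added in `getFileEntryCount` can be sketched without Accelerate, using only `base`. This is a minimal illustration, not the project's code: the names `entryCount` and `guessEntryCount` are hypothetical, the record shape is reduced to a plain `Int` dimension, and the guard against a zero-sized record is an addition that the committed function does not have (there, a zero dimension would make `divMod` fail with a division by zero).

```haskell
import Foreign.Storable (Storable, sizeOf)
import System.IO (IOMode(ReadMode), hFileSize, withFile)

-- | Number of whole records in a file of the given byte size, or Nothing
-- when the size is not an exact multiple of the record size.
-- The second argument is only a type proxy; 'sizeOf' never evaluates it.
entryCount :: Storable a => Int -> a -> Integer -> Maybe Int
entryCount dim proxy fileBytes
  | recordBytes <= 0 = Nothing -- assumption: reject empty records up front
  | r == 0 = Just (fromInteger n)
  | otherwise = Nothing
  where
    recordBytes = toInteger dim * toInteger (sizeOf proxy)
    (n, r) = fileBytes `divMod` recordBytes

-- | Read the file size and apply the same check.
guessEntryCount :: Storable a => Int -> a -> FilePath -> IO (Maybe Int)
guessEntryCount dim proxy fp =
  entryCount dim proxy <$> withFile fp ReadMode hFileSize
```

For example, with 3-dimensional `Float` records (12 bytes each), `entryCount 3 (0 :: Float) 120` gives `Just 10`, while a 121-byte file gives `Nothing` because of the trailing partial record.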

README.md

Lines changed: 30 additions & 24 deletions
@@ -118,20 +118,25 @@ remotesom train \
   -x 5 -y 5 -d "<DIMENSION D>" \
   -T out-test-topology.json -o out-test-som.json \
   -s 5 -s 4 -s 3 -s 2 -s 1 -s 0.5 \
-  -D mydata.bin -n "<DATA SIZE N>"
+  -D mydata.bin
 ```

 This will output the trained SOM in `out-test-som.json`. The JSON file will
 contain an array of arrays of numbers; each of the arrays represents one
 feature vector that corresponds to one SOM centroid. The SOM topology is saved
-in `out-test-topoology.json` as a matrix of squared distances of the SOM nodes
+in `out-test-topology.json` as a matrix of squared distances of the SOM nodes
 in the map space (i.e., *not* centroids in data space), the JSON contains an
 array of arrays of numbers, forming the columns of the all-to-all distance
 matrix.

-It is advisable for all data hosts to locally verify that their data is in a
-good shape to train at least a local SOM before starting the federated
-training.
+The command infers the number of data points from the size of `mydata.bin`
+(divided by the dimension given in `-d` and the size of a single `float32`). To
+avoid the extra "guessing" step, you can also specify the exact data size using
+the option `-n`, such as `-n 10000`.
+
+Before starting the federated training, it is advisable for all data hosts to
+locally verify that their data is in a good shape by training and examining
+a local SOM.

 ### Prepare and exchange the keys (on all nodes)

@@ -196,14 +201,14 @@ Refer to documentation of `ssh` for details.

 Each data host starts their own server by pointing it to the appropriate
 cryptography keys and the local data source:
-
 ```sh
 remotesom server \
-  -D mydata.bin -n "<DATA SIZE N>" -d "<DIMENSION D>" \
+  -D mydata.bin -d "<DIMENSION D>" \
   -c server-cert.pem -k server-key.pem -a client-cert.pem
 ```
-(The data size and dimension must be filled in, depending on the dataset. See
-`remotesom server --help` for all network&security parameters.)
+(The data dimension `-d` must be filled in, depending on the dataset;
+optionally, it is also advisable to specify the datapoint count using `-n`. For
+details about network&security parameters, see `remotesom server --help`.)

 ### Run the training (on coordinator)

@@ -212,17 +217,19 @@ Once data hosts are ready, the coordinator runs several epochs of SOM training:
 remotesom train-client \
   -x 10 -y 10 -d "<DIMENSION D>" \
   -T out-topology.json -o out-som.json \
-  -s 10 -s 9 -s 8 -s 7 -s 6 -s 5 -s 4 -s 3 -s 2 -s 1 -s 0.5 \ # add more training epochs as needed
+  -s 10 -s 9 -s 8 -s 7 -s 6 -s 5 -s 4 -s 3 -s 2 -s 1 -s 0.5 \
   -c client-cert.pem -k client-key.pem \
   connect datahost.uni1.example.org -a server-cert1.pem \
   connect hpc.uni2.example.org -a server-cert2.pem \
   ... # more data host connections
 ```
-(The coordinator must fill in the data dimension D, and server certificates and
+(The coordinator must fill in the data dimension `-d`, and the server certificates and
 hostnames (and possibly other parameters) of all data hosts. See `remotesom
 train-client connect --help` for all connection&security parameters.)

-If everything runs well, the trained 10×10 SOM will appear in `som.json`.
+If everything runs well, the trained 10×10 SOM will appear in `out-som.json`.
+For realistic data and larger SOMs, the number of training epochs will
+typically need to be adjusted by adding more `-s` options.

 ### Compute per-cluster statistics (locally, on data nodes)

@@ -234,8 +241,7 @@ command:

 ```sh
 remotesom stats \
-  -i out-test-som.json \
-  -D mydata.bin -n "<DATA SIZE N>" \
+  -i out-test-som.json -D mydata.bin \
   --out-means means.json \
   --out-variances variances.json \
   --out-counts counts.json \
@@ -368,8 +374,7 @@ focus on, you can use `remotesom subset` to cut out these clusters of the data:

 ```sh
 remotesom subset \
-  -i out-test-som.json \
-  -D mydata.bin -n "<DATA SIZE N>" \
+  -i out-test-som.json -D mydata.bin \
   -s 0 -s 1 -s 5 -s 10 \
   -O mysubset.bin
 ```
@@ -380,8 +385,8 @@ in the array stored in the SOM file (`out-test-som.json` in this case). Note
 the clusters are numbered from zero!)

 After finishing, the `subset` command prints out the total number of points
-that are included in the subset. You can use that in subsequent analysis as the
-new data size for the option `-n`:
+that are included in the subset. If desired, you can use that in subsequent
+analysis as the new data size for the option `-n`:
 ```sh
 remotesom train \
   [...]
@@ -453,15 +458,16 @@ limitations that are still present:
   impossible.
 - **Internal data structures**: Some parts of the algorithms still need to
   materialize data-size-dependent arrays; most notably the array of "SOM
-  centroid indexes that are closest to all points" is cached in the local
-  median computation to save computation time. This array usually requires N×8
-  bytes. With commonly available 16GB of memory, you can still process a hefty
-  dataset of around 2 billion data points on a single host.
+  centroid indexes that are closest to all points" is cached in several
+  algorithms (notably, the local median computation and subsetting). This array
+  usually requires N×8 bytes. With commonly available 16GB of memory, you can
+  still process a hefty dataset of around 2 billion data points on a single
+  host.

 If you run out of memory because of the data size, you can split the dataset
 into several data hosts without any impact on the result. In turn, this enables
-horizontal scalability --- the speed & amount of data processed at once is only
-limited by the amount of computers you can attach to the analysis.
+horizontal scalability --- the speed & total amount of data processed at once
+is only limited by the amount of computers you can attach to the analysis.

 ##### SOM size limits

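
The "N×8 bytes" figure in the README hunk above can be checked with a few lines of arithmetic. This is only a sketch under the assumption stated there (one 8-byte index per data point); the function names are hypothetical, and the estimate ignores every other allocation, including the memory-mapped input itself.

```haskell
-- | Bytes needed for the cached closest-centroid index array,
-- assuming one 8-byte index per data point.
indexCacheBytes :: Integer -> Integer
indexCacheBytes n = n * 8

-- | Largest point count whose index cache fits in the given byte budget.
maxPoints :: Integer -> Integer
maxPoints budgetBytes = budgetBytes `div` 8
```

With a budget of 16 GB taken as `16 * 1000 ^ 3` bytes, `maxPoints` gives 2000000000, which matches the "around 2 billion data points" estimate quoted in the README.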
app/Main.hs

Lines changed: 15 additions & 3 deletions
@@ -31,6 +31,7 @@ import qualified Data.Array.Accelerate as A
 import Data.Array.Accelerate (Z(..), (:.)(..))
 import qualified Data.Array.Accelerate.LLVM.Native as LL
 import Data.Foldable (foldlM)
+import Data.Function ((&))
 import qualified Data.IntSet as S
 import Data.List (foldl1')
 import Foreign.Ptr (plusPtr)
@@ -52,8 +53,18 @@ decodeFile file = do
     Right x' -> pure x'

 withMmapPoints :: InputOpts -> Int -> (A.Matrix Float -> IO a) -> IO a
-withMmapPoints iopts dim =
-  withMmapArray (Z :. inputPoints iopts :. dim) (inputData iopts)
+withMmapPoints iopts dim go = do
+  n <-
+    inputPoints iopts
+      & flip
+          maybe
+          pure
+          (maybe (error "error guessing the size of input") id
+             <$> getFileEntryCount
+                   (Z :. dim)
+                   (undefined :: Float)
+                   (inputData iopts))
+  withMmapArray (Z :. n :. dim) (inputData iopts) go

 withJust :: Applicative m => Maybe a -> (a -> m ()) -> m ()
 withJust (Just x) m = m x
@@ -215,6 +226,7 @@ run (SubsetCmd so iopts insom) = do
   subset <- S.unions <$> traverse inSpec (soSpecs so)
   withMmapPoints iopts dim $ \points -> do
     let cs = somClosestLL points som
+        (Z :. npoints :. _) = A.arrayShape points
     case soMemberOutput so of
       Nothing -> pure ()
       Just output ->
@@ -225,7 +237,7 @@ run (SubsetCmd so iopts insom) = do
       Nothing -> pure ()
       Just output -> do
         n <-
-          mmapWithFilePtr indata ReadOnly (Just (0, esz * inputPoints iopts)) $ \(ptrData, _) ->
+          mmapWithFilePtr indata ReadOnly (Just (0, esz * npoints)) $ \(ptrData, _) ->
             bracket (openFile output WriteMode) hClose $ \hOut ->
               let go (i, n) c =
                     if c `S.member` subset
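
The `flip maybe pure (...)` chain in the new `withMmapPoints` reads more easily when written as an explicit case split: use the `-n` value if it was given, otherwise run the guessing action and fail if it returns nothing. The sketch below shows that control flow in isolation. `resolveCount` is a hypothetical name, and it reports failure with `ioError (userError ...)` rather than the `error` call used in the commit, so the two are not behaviorally identical.

```haskell
-- | Prefer an explicitly supplied count; otherwise run the guessing action
-- and turn a failed guess into an IO exception.
resolveCount :: Maybe Int -> IO (Maybe Int) -> IO Int
resolveCount explicit guess =
  case explicit of
    Just n -> pure n -- the guessing action is never run in this branch
    Nothing ->
      guess >>= maybe (ioError (userError "cannot infer the input size")) pure
```

Under these assumptions, `resolveCount (Just 5) someGuess` returns 5 without touching the file, which corresponds to the README note that passing `-n` avoids the extra guessing step.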

app/Opts.hs

Lines changed: 5 additions & 3 deletions
@@ -143,7 +143,7 @@ genopts = do

 data InputOpts = InputOpts
   { inputData :: FilePath
-  , inputPoints :: Int
+  , inputPoints :: Maybe Int
   } deriving (Show)

 inopts :: Parser InputOpts
@@ -155,11 +155,13 @@ inopts = do
       <> metavar "DATA"
       <> help "binary file with input data"
   inputPoints <-
-    option auto
+    optional . option auto
       $ long "in-points"
       <> short 'n'
       <> metavar "N"
-      <> help "number of datapoints in the input file"
+      <> help
+           ("number of datapoints in the input file"
+              ++ " (by default, this is guessed from file size)")
   pure InputOpts {..}

 data TrainOpts = TrainOpts
