[RFC/WIP] Asynchronous ParallelEvaluator #46
Open
alyst wants to merge 12 commits into
force-pushed 1334f07 to e55a8f4
force-pushed e55a8f4 to 88f9ff1
force-pushed 88f9ff1 to 3fd19d2
force-pushed 9490577 to b7b714d
force-pushed b7b714d to e9b2a91
force-pushed e9b2a91 to 96244aa
force-pushed 96244aa to 05c6c5b
Collaborator:
@alyst what is the status of this branch now? With Julia 1.0 coming soon, can we try to unify the different parallelization branches and ideas and merge them into master?
alyst referenced this pull request on May 22, 2018
Contributor, Author:
I have a rebased version in my staging branch; I will update this one after #83. I'm using this branch and it works for me, with some caveats:
I don't know how much 0.7 improves the situation with the workers. Maybe we can check this branch with 0.7-alpha and merge it after making sure that Ctrl+C doesn't crash Julia so easily.
required for putting fitness into and reading it from a (shared) array
force-pushed 05c6c5b to 653f405
Use N workers to asynchronously calculate fitnesses. Requests for fitness calculation and completion notifications as well as input parameters and output fitness are passed via SharedVector/SharedMatrix to minimize serialization overhead.
Parallelized versions of:
- update_population_fitness!()
- populate_by_mutants!()
- step!()
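A minimal sketch of the slot-based exchange described in this commit, assuming one parameter column, one fitness entry and a pair of `job_submitted`/`job_done` flags per worker. `SlotBuffers`, `worker_loop` and `evaluate_all` are hypothetical names, not the PR's code; plain arrays and `@async` tasks stand in for `SharedMatrix`/`SharedVector` and worker processes so that the example runs in a single Julia process.

```julia
# Hypothetical sketch: one slot per worker for parameters, fitness and status flags.
# In the PR these would be SharedMatrix/SharedVector buffers visible to worker
# processes; here plain arrays and @async tasks stand in for them.
struct SlotBuffers
    params::Matrix{Float64}      # column w holds the candidate assigned to worker w
    fitness::Vector{Float64}     # fitness written back by worker w
    job_submitted::Vector{Bool}  # set by the master when slot w holds a new job
    job_done::Vector{Bool}       # set by worker w when its fitness is ready
end

SlotBuffers(dim::Integer, nworkers::Integer) =
    SlotBuffers(zeros(dim, nworkers), fill(NaN, nworkers),
                fill(false, nworkers), fill(false, nworkers))

# Worker w polls its own slot until the master asks it to stop.
function worker_loop(buf::SlotBuffers, w::Int, fitfun, stop::Ref{Bool})
    while !stop[]
        if buf.job_submitted[w]
            buf.fitness[w] = fitfun(view(buf.params, :, w))
            buf.job_submitted[w] = false
            buf.job_done[w] = true
        end
        yield()  # cooperative stand-in for a separate process
    end
end

# Master: hand out candidates to free slots and collect finished fitness values.
function evaluate_all(fitfun, candidates::Vector{Vector{Float64}}, nworkers::Int)
    buf = SlotBuffers(length(first(candidates)), nworkers)
    stop = Ref(false)
    workers = [@async(worker_loop(buf, w, fitfun, stop)) for w in 1:nworkers]
    results = fill(NaN, length(candidates))
    slot_job = zeros(Int, nworkers)  # candidate index occupying slot w (0 = free)
    next_cand, ndone = 1, 0
    while ndone < length(candidates)
        for w in 1:nworkers
            if buf.job_done[w]  # completion notification: read fitness, free the slot
                results[slot_job[w]] = buf.fitness[w]
                buf.job_done[w] = false
                slot_job[w] = 0
                ndone += 1
            end
            if slot_job[w] == 0 && next_cand <= length(candidates)  # submit a request
                buf.params[:, w] .= candidates[next_cand]
                slot_job[w] = next_cand
                buf.job_submitted[w] = true
                next_cand += 1
            end
        end
        yield()
    end
    stop[] = true
    foreach(wait, workers)
    return results
end
```

With real worker processes the buffers would be allocated with the SharedArrays standard library, so a worker reads its slot in place instead of receiving a serialized candidate; the `yield()` polling loops here only imitate that in one process.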
force-pushed 653f405 to 7e09cb5
looks like using SharedArrays in parallel_evaluator.jl is not enough
worker2job is replaced with busy_workers
- don't use fitness_slots for communication; add dedicated job_done and job_submitted
- add fitness to the archive outside of the job_assignments critical section
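These commits keep the locked section small: the lock only guards the job-to-worker bookkeeping (`job_assignments`, `busy_workers`), and the archive is updated after the lock is released. A hypothetical sketch of that pattern; `JobTable`, `assign_job!` and `complete_job!` are illustrative names, and the archive is assumed to be updated by a single task only.

```julia
# Hypothetical sketch: the lock only protects job bookkeeping; the archive
# update happens after the lock is released.
mutable struct JobTable
    lock::ReentrantLock
    job_assignments::Dict{Int,Int}  # job id => worker id
    busy_workers::Set{Int}          # workers currently holding a job
    last_job_id::Int
end
JobTable() = JobTable(ReentrantLock(), Dict{Int,Int}(), Set{Int}(), 0)

# Assign a new job to the first free worker; returns (job_id, worker) or nothing.
function assign_job!(t::JobTable, nworkers::Int)
    lock(t.lock) do
        for w in 1:nworkers
            if w ∉ t.busy_workers
                t.last_job_id += 1
                push!(t.busy_workers, w)
                t.job_assignments[t.last_job_id] = w
                return (t.last_job_id, w)
            end
        end
        return nothing
    end
end

# Record that `job_id` finished with fitness `f`.
function complete_job!(t::JobTable, archive::Vector{Tuple{Int,Float64}},
                       job_id::Int, f::Float64)
    lock(t.lock) do  # short critical section: only release the worker
        w = pop!(t.job_assignments, job_id)
        delete!(t.busy_workers, w)
    end
    push!(archive, (job_id, f))  # potentially slow update, done outside the lock
    return nothing
end
```

Keeping the archive insertion outside the lock means it no longer delays other tasks that need to assign or release workers.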
- convert `@info` into `@debug`
- improve debug message verbosity
also avoid race when reading worker param status
and output when the worker shuts down
The PR changes the `ParallelEvaluator` to be asynchronous:
- fitness calculation is requested by `async_update_fitness()`, which immediately returns the fitness calculation job ID, and the job is submitted to one of the available worker processes;
- the status of a job can be checked with the `isready()` function;
- candidates with recently calculated fitness can be processed by calling the `process_completed() do job_id, candidate <custom code> end` routine. (Note: if `isready(job_id)` is called and returns true, this job will no longer be enumerated by `process_completed()`, and it is up to the caller to handle it.)

The old synchronous API (used by NES) is still supported.
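To make the call pattern concrete, here is a toy single-process sketch of the three calls, assuming `@async` tasks in place of worker processes. `ToyAsyncEvaluator`, `ToyCandidate` and the two-argument `isready(evaluator, job_id)` form are hypothetical; the sketch only mirrors the semantics described above, including that a job reported by `isready()` is not enumerated again by `process_completed()`.

```julia
# Hypothetical toy version of the asynchronous evaluator API described above.
mutable struct ToyCandidate
    params::Vector{Float64}
    fitness::Float64  # NaN until evaluated
end
ToyCandidate(params::Vector{Float64}) = ToyCandidate(params, NaN)

mutable struct ToyAsyncEvaluator{F}
    fitfun::F
    last_job_id::Int
    pending::Dict{Int,Tuple{ToyCandidate,Task}}  # job id => (candidate, task)
end
ToyAsyncEvaluator(fitfun) =
    ToyAsyncEvaluator(fitfun, 0, Dict{Int,Tuple{ToyCandidate,Task}}())

# Submit the candidate and return the job id immediately.
function async_update_fitness(e::ToyAsyncEvaluator, cand::ToyCandidate)
    e.last_job_id += 1
    task = @async begin
        cand.fitness = e.fitfun(cand.params)
    end
    e.pending[e.last_job_id] = (cand, task)
    return e.last_job_id
end

# True once the job has finished; the job is then handed over to the caller
# and will not be reported by process_completed() again.
function Base.isready(e::ToyAsyncEvaluator, job_id::Int)
    task = e.pending[job_id][2]  # KeyError for unknown or already collected jobs
    istaskdone(task) || return false
    wait(task)                   # rethrows if the fitness function failed
    delete!(e.pending, job_id)
    return true
end

# Call f(job_id, candidate) for every finished job and forget those jobs.
function process_completed(f::Function, e::ToyAsyncEvaluator)
    done_ids = sort!([id for (id, entry) in e.pending if istaskdone(entry[2])])
    for id in done_ids
        cand, task = pop!(e.pending, id)
        wait(task)  # rethrows if the fitness function failed
        f(id, cand)
    end
    return length(done_ids)
end
```

Typical use: submit candidates with `async_update_fitness`, then periodically run `process_completed(e) do job_id, cand ... end` from the main loop; polling a single job with `isready` takes it out of that enumeration.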
`BorgMOEA` is updated to support the asynchronous `ParallelEvaluator`: the algorithm runs on the master process, each new individual is generated by recombination and sent for parallel evaluation, and further processing of the individual (updating the population and the archive) is postponed until its fitness has been evaluated. This should improve performance for problems with computationally intensive fitness functions. I see a speed-up for my problems, although it is not linear.

There are several reasons why it's RFC/WIP:
- The `async_update_fitness()`/`isready()`/`process_completed()` API is somewhat confusing, but so far I have had no better ideas, given that the `Future{T}` approach would create too much overhead.
- The `ParallelEvaluator` requires an explicit `shutdown!()` call. However, the `OptController`/`OptRunController` interface assumes that the same optimization method (and its evaluator) could be reused in several `OptRunController` runs. With the `ParallelEvaluator` this is currently not possible, because all the workers are killed by the `shutdown!()` call at the end of the first run. I see two alternatives:
  1. add `start!(Evaluator)`/`shutdown!(Evaluator)` methods that need to be called immediately before/after the method iterations (no-ops for the normal evaluators);
  2. keep the workers alive for the lifetime of the `ParallelEvaluator`: detect idle periods, hibernate the workers when idle (using `wait()`), and resume them on a new fitness evaluation request. Given the state of parallelism in Julia, it's not so easy to get this working nicely.
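The asynchronous flow described for `BorgMOEA` (generate by recombination, submit, and postpone the population update until the fitness arrives) can be sketched as a single-objective steady-state loop. This is a hypothetical toy, not BorgMOEA itself: `async_steady_state` and its recombination rule are made up, and `@async` tasks fed through a `Channel` stand in for the worker processes.

```julia
using Random

# Hypothetical single-objective toy in the spirit of the BorgMOEA change.
function async_steady_state(fitfun, dim::Int; popsize::Int = 8, nworkers::Int = 3,
                            budget::Int = 100, rng = MersenneTwister(1))
    population = [randn(rng, dim) for _ in 1:popsize]
    fit = [fitfun(x) for x in population]  # initial population, evaluated synchronously
    initial_best = minimum(fit)

    jobs = Channel{Vector{Float64}}(nworkers)
    results = Channel{Tuple{Vector{Float64},Float64}}(budget)
    workers = Task[]
    for _ in 1:nworkers                    # stand-ins for worker processes
        push!(workers, @async begin
            for x in jobs                  # idle workers block here until a job arrives
                put!(results, (x, fitfun(x)))
            end
        end)
    end

    # toy recombination of two random parents plus a small Gaussian mutation
    function make_child()
        a = population[rand(rng, 1:popsize)]
        b = population[rand(rng, 1:popsize)]
        w = rand(rng)
        return w .* a .+ (1 - w) .* b .+ 0.1 .* randn(rng, dim)
    end

    submitted = 0
    for _ in 1:min(nworkers, budget)       # fill the pipeline
        put!(jobs, make_child())
        submitted += 1
    end
    for _ in 1:budget
        x, fx = take!(results)             # whichever evaluation finishes first
        worst = argmax(fit)                # population update postponed until now
        if fx < fit[worst]
            population[worst] = x
            fit[worst] = fx
        end
        if submitted < budget              # keep the workers busy
            put!(jobs, make_child())
            submitted += 1
        end
    end
    close(jobs)
    foreach(wait, workers)
    return initial_best, minimum(fit), submitted
end
```

The master only blocks in `take!(results)`, so up to `nworkers` evaluations stay in flight while the population update itself remains sequential on the master.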
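The first alternative can be sketched with multiple dispatch: `start!`/`shutdown!` default to no-ops, and only the pooled evaluator creates and tears down its workers, so the same evaluator object can be passed to several runs. All types and functions here (`AbstractToyEvaluator`, `ToySerialEvaluator`, `ToyPoolEvaluator`, `evaluate`, `run_once`) are hypothetical, and `@async` tasks stand in for worker processes.

```julia
# Hypothetical toy evaluators illustrating explicit start!/shutdown! hooks.
abstract type AbstractToyEvaluator end

# Default hooks are no-ops, so ordinary evaluators need no changes.
start!(::AbstractToyEvaluator) = nothing
shutdown!(::AbstractToyEvaluator) = nothing

struct ToySerialEvaluator{F} <: AbstractToyEvaluator
    fitfun::F
end
evaluate(e::ToySerialEvaluator, x::Vector{Float64}) = e.fitfun(x)

const JobQueue = Channel{Tuple{Vector{Float64},Channel{Float64}}}

mutable struct ToyPoolEvaluator{F} <: AbstractToyEvaluator
    fitfun::F
    nworkers::Int
    jobs::Union{Nothing,JobQueue}  # `nothing` while the pool is shut down
    workers::Vector{Task}
end
ToyPoolEvaluator(fitfun, nworkers::Int) = ToyPoolEvaluator(fitfun, nworkers, nothing, Task[])

function start!(e::ToyPoolEvaluator)
    e.jobs === nothing || return nothing  # already running
    jobs = JobQueue(e.nworkers)
    e.jobs = jobs
    for _ in 1:e.nworkers
        push!(e.workers, @async begin
            for (x, reply) in jobs        # idle workers block on the queue
                put!(reply, e.fitfun(x))
            end
        end)
    end
    return nothing
end

function shutdown!(e::ToyPoolEvaluator)
    e.jobs === nothing && return nothing
    close(e.jobs)                         # workers exit once the queue is drained
    foreach(wait, e.workers)
    empty!(e.workers)
    e.jobs = nothing
    return nothing
end

function evaluate(e::ToyPoolEvaluator, x::Vector{Float64})
    jobs = e.jobs
    jobs === nothing && error("start!() must be called before evaluate()")
    reply = Channel{Float64}(1)
    put!(jobs, (x, reply))
    return take!(reply)
end

# One optimization "run"; the same evaluator object may be used for several runs.
function run_once(e::AbstractToyEvaluator, candidates)
    start!(e)
    try
        return minimum(evaluate(e, x) for x in candidates)
    finally
        shutdown!(e)
    end
end
```

Because `shutdown!` resets the pool instead of leaving it permanently dead, a second `run_once` with the same `ToyPoolEvaluator` simply starts fresh workers.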