Hi authors, here is a concern of mine.
In the current framework, the retrieval path is mostly deterministic.
For each task evaluated on the test split, given the task's query, the following stages are deterministic:
- First-stage similarity retrieval
  - given the query text, the embedding is deterministic
  - similarity ranking over stored query embeddings is deterministic
- Expansion from query to memory IDs
  - once the top query groups are chosen, the associated memory IDs are looked up by rule, so that step is deterministic
- Second-stage scoring
  - given candidate memories, similarity and Q are already fixed, so hybrid scoring and sorting are deterministic
- Final selection
  - vanilla top-k selection is deterministic
  - tri-channel bucket selection is also deterministic
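To make the concern concrete, here is a minimal sketch of the retrieval path as I read it. All names (`retrieve`, `hybrid` weighting via `alpha`, the group-to-memory lookup table) are my own hypothetical stand-ins, not the paper's actual code; the point is only that every step is a pure function of its inputs:

```python
import numpy as np

def retrieve(query_emb, stored_query_embs, group_to_mem_ids, mem_embs, q_values,
             alpha=0.5, top_groups=3, k=5):
    """Hypothetical sketch of the retrieval path; every step is deterministic."""
    # First stage: similarity ranking over stored query embeddings
    sims = stored_query_embs @ query_emb
    top = np.argsort(-sims)[:top_groups]
    # Expansion: rule-based lookup of memory IDs for the chosen query groups
    cand = sorted({m for g in top for m in group_to_mem_ids[g]})
    # Second stage: hybrid score = alpha * similarity + (1 - alpha) * Q
    scores = {m: alpha * float(mem_embs[m] @ query_emb) + (1 - alpha) * q_values[m]
              for m in cand}
    # Final selection: vanilla top-k over the fixed scores
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Calling this twice with the same inputs always returns the same memory IDs, which is exactly the determinism described above.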
So overall, the retrieval process is largely deterministic, and I suspect this determinism is one reason exploration is weak.
Moreover, in the YAML file configs/rl_alf_config.yaml, epsilon is set to 0, which means no exploration at all, so the overall process is deterministic. I also could not find any discussion of how to set epsilon optimally. (For the LLM, the temperature is also 0.)
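For reference, even a simple epsilon-greedy wrapper over the final selection would break this determinism. The sketch below is my own illustration of the standard technique, not the authors' method; `ranked_ids` is assumed to be the deterministic top-k ordering from the scoring stage:

```python
import random

def epsilon_greedy_select(ranked_ids, candidate_ids, k=5, epsilon=0.1, rng=None):
    """Epsilon-greedy sketch: with prob. epsilon per slot, swap the greedy
    pick for a random unselected candidate, so under-tested memories get tried."""
    rng = rng or random.Random()
    chosen = list(ranked_ids[:k])
    pool = [m for m in candidate_ids if m not in chosen]
    for i in range(len(chosen)):
        if pool and rng.random() < epsilon:
            # explore: replace the greedy choice with a random candidate
            j = rng.randrange(len(pool))
            chosen[i], pool[j] = pool[j], chosen[i]
    return chosen
```

With epsilon = 0 this reduces exactly to the deterministic top-k, which matches the config as shipped.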
This means the work is not an exploration method in the RL sense: exploration over the memory bank is weak, and so is credit coverage.
In other words, the process is highly path-dependent:
- Epoch 1 creates the initial memory bank.
- From Epoch 2 onward, retrieval is already conditioned on what Epoch 1 produced.
- Since retrieval is mostly greedy, early-selected memories keep getting reused and updated.
- That gives the first epoch disproportionate influence on the later trajectory.
That means later epochs are not independent rounds of learning; they are mostly refinements of an already biased bank state.
As a result, early good memories get amplified, early bad or noisy memories can get locked in, and large parts of the bank remain under-tested.
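The lock-in dynamic above can be seen in a toy simulation (entirely my own construction, with made-up scores and update sizes): under purely greedy selection with reuse-and-update, the memory that happens to score highest after epoch 1 absorbs every subsequent selection, and the rest of the bank is never tested.

```python
import random

def simulate(n_mem=20, epochs=200, seed=0):
    """Toy model: greedy selection plus reinforcement of the chosen memory."""
    rng = random.Random(seed)
    q = [rng.random() for _ in range(n_mem)]  # noisy initial memory scores (epoch 1)
    counts = [0] * n_mem                      # how often each memory is selected
    for _ in range(epochs):
        m = max(range(n_mem), key=lambda i: q[i])  # greedy: always pick the top memory
        counts[m] += 1
        q[m] += 0.01  # reuse-and-update further reinforces the chosen memory
    return counts

counts = simulate()
# One early winner takes all selections; every other memory stays at zero.
```

Of course the real system is richer than this, but the qualitative concern is the same: without stochasticity, the first epoch's bank state largely fixes the trajectory.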
Could you please give more clarification or your thoughts on this point?