Thank you for your elegant work!
As I understand it, the paper proposes a weighted reward w_{s_k} for thought-expansion actions and a quality value v_k for node states, both of which guide the MCTS search.
According to the paper, w_{s_k} is computed from the distance, the parent node's value, and the reward estimate for the generated thought, while v_k is computed from the parent node's value and w_{s_k}.
However, I am confused because the MCTS code in this repo seems to use the LLM-generated reward directly as the child node's value, skipping the w and v computations described in the paper.
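
To make the discrepancy concrete, here is a minimal sketch of the two data flows as I understand them. It is not taken from the repo or the paper: the function names are hypothetical, and `weight_fn` and `value_fn` are placeholders for the paper's actual formulas, which I have deliberately left unspecified.

```python
from typing import Callable


def child_value_direct(llm_reward: float) -> float:
    """What the repo code appears to do (my reading, possibly wrong):
    the LLM-generated reward is stored as the child node's value as-is."""
    return llm_reward


def child_value_from_paper(
    parent_value: float,
    llm_reward: float,
    distance: float,
    # Placeholder for the paper's w_{s_k}: takes (distance, parent value, reward).
    weight_fn: Callable[[float, float, float], float],
    # Placeholder for the paper's v_k: takes (parent value, w_{s_k}).
    value_fn: Callable[[float, float], float],
) -> float:
    """What I expected from the paper: first compute the weighted reward
    w_{s_k}, then derive the child's quality value v_k from it."""
    w = weight_fn(distance, parent_value, llm_reward)
    return value_fn(parent_value, w)
```

In the second flow the child's value depends on the parent's value and the distance, whereas in the first it depends only on the LLM reward. That dependency is the part I could not find in the code.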
Did I miss something or misunderstand your pipeline? I look forward to your response.