logs14:Normalize Reward

Higepon Taro Minowa edited this page May 10, 2018 · 3 revisions

Normalize Reward

1: What specific output am I working on right now?

According to "Why do we normalize the discounted rewards when doing policy gradient reinforcement learning?" on Data Science Stack Exchange, we should standardize the discounted rewards (subtract the mean, divide by the standard deviation) so that roughly half of the actions end up with positive weight and the other half with negative weight.
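A minimal NumPy sketch of that normalization step (the function name and the discount factor default are illustrative, not from the repo):

```python
import numpy as np

def discount_and_normalize(rewards, gamma=0.99):
    """Compute discounted returns, then standardize them to zero mean and
    (approximately) unit variance so some become positive and some negative."""
    discounted = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    # Small epsilon guards against division by zero on constant rewards.
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)

norm = discount_and_normalize([1.0, 0.0, 0.0, 1.0])
```

After standardization the mean is zero, so above-average returns are positive and below-average ones are negative, which is what makes the reward-guided loss push probability toward better-than-average actions.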

neg_log_prob = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
loss = tf.reduce_mean(neg_log_prob * self.discounted_episode_rewards_norm)  # reward guided loss

We need tf.multiply(neg_log_prob, rewards), i.e. the element-wise product (`tf.mul` was renamed to `tf.multiply` in newer TensorFlow). Fixed this in Fixes simple RL. · higepon/tensorflow_seq2seq_chatbot@5fd1104.
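To sanity-check what the TF snippet computes, here is a NumPy mirror of the reward-guided loss: per-step softmax cross-entropy, weighted element-wise by the normalized rewards, then averaged. The function name and the example inputs are made up for illustration; `labels` are taken as integer class indices rather than one-hot vectors.

```python
import numpy as np

def reward_guided_loss(logits, labels, norm_rewards):
    """Softmax cross-entropy per time step, weighted element-wise by the
    normalized discounted rewards, then reduced with a mean."""
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Cross-entropy against the chosen action = -log p(chosen action).
    neg_log_prob = -np.log(probs[np.arange(len(labels)), labels])
    return np.mean(neg_log_prob * norm_rewards)

logits = np.array([[2.0, 0.5], [0.1, 1.5]])
labels = np.array([0, 1])
rewards = np.array([1.0, -1.0])
loss = reward_guided_loss(logits, labels, rewards)
```

Because the rewards can be negative after normalization, minimizing this loss *decreases* the log-probability of actions with below-average return, which is the intended effect of the element-wise multiply.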
