Docs + Environment pattern: RLHF

**Use cases, pain points, and background**

**Description**:

**Design**:
We probably need a generic reward model client that serves as shared infrastructure for all RLHF environments.
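
One possible shape for that shared client, as a minimal sketch only. Every name here (`ScoreRequest`, `RewardModelClient`, `LengthRewardClient`, `compute_rewards`) is hypothetical and not an existing Gym API, and the toy length-based backend stands in for a real served reward model so the example runs without one:

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class ScoreRequest:
    """One prompt/response pair to be scored (hypothetical schema)."""
    prompt: str
    response: str


class RewardModelClient(Protocol):
    """Interface each RLHF environment would depend on (hypothetical)."""

    def score(self, requests: Sequence[ScoreRequest]) -> list[float]:
        """Return one scalar reward per request, in the same order."""
        ...


class LengthRewardClient:
    """Stand-in backend for tests; a real one would query a served reward model."""

    def score(self, requests: Sequence[ScoreRequest]) -> list[float]:
        # Toy reward: number of whitespace-separated tokens in the response.
        return [float(len(r.response.split())) for r in requests]


def compute_rewards(
    client: RewardModelClient, prompt: str, responses: Sequence[str]
) -> list[float]:
    """Environment-side helper: score several candidate responses to one prompt."""
    return client.score([ScoreRequest(prompt, r) for r in responses])
```

Environments would depend only on the `RewardModelClient` interface, so a locally launched reward model and a remote one could be swapped without touching environment code. Batching in `score` is an assumption about how the reward model would be called, not a requirement stated above.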

**Out of scope**:

**Acceptance Criteria**:
- [ ] Gym spins up a reward model locally, as in the local vLLM model flow
- [ ] Replicate the current Nemotron RLHF process