Why is `multi_obj_rewards` multiplied by 5, but then 0.5 subtracted from it?
#11 opened by xzuyn
If it's being scaled from 0-1 up to 0-5, why would 0.5 be subtracted? Wouldn't that make the range 0 to 4.5?
Also, is `output.score` supposed to be in the 0-1 range as well?
Lastly, does this model support multi-turn samples or system turns, or is it only capable of (or good at) single-turn evaluation?
The original HelpSteer rating scale is 0-4, and I shift & re-scale it to 0.05-0.95.
The training labels are constrained to [0, 1], but `output.score` is not guaranteed to be in [0, 1], since we do not apply a sigmoid.
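For concreteness, here is a minimal sketch of that round trip. The exact shift constants below are an assumption inferred from the `* 5 - 0.5` inverse the question refers to, not copied from the training code; the point is that subtracting 0.5 undoes a shift applied when the labels were created, so the recovered range is 0-4, not 0-4.5:

```python
# Hypothetical round trip between the 0-4 HelpSteer rating scale and
# [0, 1]-constrained training labels. Constants are inferred from the
# `* 5 - 0.5` inverse in the question, not taken from the actual code.

def rating_to_label(rating: float) -> float:
    """Shift & rescale a 0-4 rating into (0, 1), away from the endpoints."""
    return (rating + 0.5) / 5.0  # e.g. 0 -> 0.1, 4 -> 0.9

def label_to_rating(label: float) -> float:
    """Invert the shift: multiply by 5, then subtract 0.5."""
    return label * 5.0 - 0.5  # e.g. 0.1 -> 0.0, 0.9 -> 4.0

# Round trip recovers the original 0-4 ratings exactly.
for r in range(5):
    assert abs(label_to_rating(rating_to_label(r)) - r) < 1e-9
```

So the `- 0.5` is not shrinking a 0-5 range; it is removing the offset that kept training labels strictly inside [0, 1].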
Haoxiang-Wang changed discussion status to closed
Does the preference score take multiple turns or system turns into account? For example, could this model be useful for checking whether a model is following the system prompt correctly?