This fine-tune would score 56 and place 1st on the leaderboard, but I didn't add it; I only include full trainings (or further tunings by the same company) in the leaderboard.
LLM builders in general are not doing a great job of making human-aligned models.
I don't want to say this is a proxy for p(doom)... But it could be if we are not careful.
The most probable cause is recklessly training LLMs on the outputs of other LLMs, not caring about dataset curation, and not asking 'what is beneficial for humans?'...
Our leaderboard could be used for human alignment in an RL setting: ask the same question to the top models and the worst models, give answers from top models a +1 reward, and give answers from bad models a -1. Asking many times with a higher temperature generates more answers per question. A rough sketch of the idea is below. What do you think?
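As an illustration only, here is a minimal Python sketch of building such a +1/-1 reward dataset. The model lists and the `ask_model` helper are hypothetical placeholders, not a real API; in practice `ask_model` would call whatever endpoint serves each model, and the resulting dataset would feed a reward model or RL fine-tuning step.

```python
import random

# Hypothetical helper -- stand-in for whatever API serves each model.
def ask_model(model: str, question: str, temperature: float = 1.0) -> str:
    """Placeholder: return one sampled answer from `model`."""
    return f"{model}'s answer to: {question}"

# Hypothetical model lists, taken from the leaderboard extremes.
TOP_MODELS = ["top-model-a", "top-model-b"]      # highest leaderboard scores
WORST_MODELS = ["bad-model-x", "bad-model-y"]    # lowest leaderboard scores

def build_reward_dataset(questions, samples_per_model=4, temperature=1.2):
    """Label answers from top models +1 and from worst models -1.

    Sampling several times at a higher temperature yields more
    diverse answers per question, as described above.
    """
    dataset = []
    for q in questions:
        for model in TOP_MODELS:
            for _ in range(samples_per_model):
                dataset.append({"question": q,
                                "answer": ask_model(model, q, temperature),
                                "reward": +1})
        for model in WORST_MODELS:
            for _ in range(samples_per_model):
                dataset.append({"question": q,
                                "answer": ask_model(model, q, temperature),
                                "reward": -1})
    random.shuffle(dataset)
    return dataset

if __name__ == "__main__":
    data = build_reward_dataset(["What is beneficial for humans?"])
    print(len(data), "labeled answers")
```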