Language Model Council | 20 LLMs Dethroned GPT-4o and Revealed the Flaws in AI Leaderboards

Language Model Council research paper.

LLM evaluation benchmarks aren’t as objective as they seem. Which LLM is picked to serve as the LLM-as-a-Judge can dramatically change the outcome of an evaluation. In fact, the Language Model Council research suggests that the top spot on any given leaderboard might be an artifact of evaluation design rather than a reflection of superior, generalized capability.