We use Large Language Models (LLMs) for everything from writing emails to coding complex software. Still, a considerable challenge remains: how do we know if they’re actually any good, especially on subjective tasks? It’s a surprisingly tricky question. Research shows that powerful LLMs like GPT-4 can achieve over 80% agreement with human preferences in specific settings, which is roughly the same level of agreement humans have with each other. On the surface, that sounds great.

But what happens when there’s no correct answer? Think about tasks like creative writing or giving sensitive advice. In these cases, the practice of relying on a single LLM judge, even a powerful one like GPT-4o, is prone to what researchers call “intra-model bias.” In simple terms, the LLM as a judge might just prefer answers that sound like its own.

Language Model Council

A new research paper, “Language Model Council,” introduces an idea to solve this: what if we made LLMs judge each other in a democratic council? The results were unexpected and challenged some of the biggest names in AI.

Language Model Council Emotional Intelligence Test

The most shocking finding came from an emotional intelligence test. The top spot wasn’t taken by the usual champion, GPT-4o, which previously held the #1 rank on the popular Chatbot Arena leaderboard. Instead, the winner was Qwen-1.5-110B, an LLM ranked a distant #20 on that same leaderboard.

So, how did this happen? One intriguing possibility raised by the researchers is “successor bias.” The evaluation used a smaller model from the same family, Qwen-1.5-32B, as the common reference point for all head-to-head comparisons.

The researchers speculate that because the reference model and the top-performing model share underlying architecture and training data, judges comparing against that reference may have had an inherent preference for the familiar style of its larger sibling, especially when it was set against stylistically diverse competitors. A concrete sketch of this reference-anchored setup follows below.
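
To make the setup concrete, here is a minimal sketch of reference-anchored, head-to-head judging. The prompt wording, the `judge_complete` callable, and the data layout are illustrative assumptions, not the paper’s exact implementation.

```python
# Minimal sketch of reference-anchored pairwise judging. The prompt wording
# and the judge_complete callable are assumptions for illustration, not the
# paper's exact implementation.

REFERENCE_MODEL = "Qwen-1.5-32B"  # common anchor for every head-to-head

def judge_pairwise(judge_complete, question, candidate_answer, reference_answer):
    """Ask one judge model which of two answers is better.

    judge_complete: callable(str) -> str, wrapping any chat-completion API.
    Returns +1 if the candidate beats the reference, -1 if it loses, 0 for a tie.
    In practice you would also swap the A/B positions to control position bias.
    """
    verdict = judge_complete(
        f"Question: {question}\n\n"
        f"Answer A: {candidate_answer}\n\n"
        f"Answer B: {reference_answer}\n\n"
        "Which answer is better? Reply with exactly 'A', 'B', or 'tie'."
    ).strip().lower()
    return {"a": 1, "b": -1}.get(verdict, 0)

def win_rate(judge_completes, qa_triples):
    """Candidate's overall win rate against the fixed reference.

    qa_triples: list of (question, candidate_answer, reference_answer).
    """
    scores = [
        judge_pairwise(jc, q, cand, ref)
        for jc in judge_completes
        for q, cand, ref in qa_triples
    ]
    return sum(s > 0 for s in scores) / len(scores)
```

Because every candidate is scored against the same fixed reference answer, any stylistic affinity between a candidate and that reference, such as the Qwen-1.5 family resemblance, can systematically tilt win rates.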

This result is a powerful reminder that LLM evaluation benchmarks aren’t as objective as they seem: which model you pick as the judge can dramatically change the outcome of the evaluation. In fact, the Language Model Council research suggests that the top spot on any given leaderboard might be an artifact of evaluation design rather than a reflection of superior, generalized capability, casting doubt on the stability and fairness of current leaderboards.

Language Model Council: LLM Performance as a Judge

One of the most interesting findings from the research is that there is no significant correlation between an LLM’s performance on the emotional intelligence task and its ability to judge other language models’ responses. As the researchers put it, “the ability to perform well in the task and the ability to judge other LLMs responses are distinct skills.”
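
To see what checking such a claim looks like in practice, here is a minimal sketch that measures the rank correlation between task scores and judging scores using SciPy’s `spearmanr`. The score lists are hypothetical placeholders, and the paper’s own statistical analysis may differ.

```python
# Sketch: test whether task skill predicts judging skill via rank correlation.
# The model names and score lists below are hypothetical placeholders.
from scipy.stats import spearmanr

models       = ["model_a", "model_b", "model_c", "model_d", "model_e"]
task_scores  = [0.81, 0.74, 0.69, 0.66, 0.58]  # e.g., emotional-intelligence win rates
judge_scores = [0.52, 0.61, 0.49, 0.63, 0.55]  # e.g., agreement with the council consensus

rho, p_value = spearmanr(task_scores, judge_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A rho near zero with a large p-value would mirror the paper's finding that
# responding well and judging well are distinct skills.
```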

This finding suggests that an LLM can give excellent advice without being a reliable judge of what constitutes good advice from other LLMs. It also helps explain how evaluation flaws like “successor bias” can emerge, and how the choice of a judge model can fundamentally alter a performance ranking.

LLM Response Lengths and Human Raters

The study also unearthed a surprising result regarding response length: the language model that consistently generated the shortest responses performed the worst. More importantly, the study’s headline result is that the Language Model Council’s final, democratic ranking aligned more closely with human evaluations than any other benchmark tested.

Most importantly, the level of agreement between the LMC and human raters (52.2%–54.2%) was within the same range as the agreement among human raters themselves (51.9%). Furthermore, this democratic process successfully canceled out “self-enhancement bias,” the tendency of models to rate their own answers too favorably.

By combining many different language models as judges, we get a result that is not only more reliable but also more reflective of human preferences.
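
As a rough illustration of how a council vote can neutralize self-preference, here is a sketch that aggregates pairwise verdicts while discarding each judge’s vote on its own answers. The verdict layout and model names are assumptions, not the paper’s aggregation scheme.

```python
# Sketch: democratic aggregation of council verdicts, dropping self-votes to
# neutralize self-enhancement bias. Verdict layout and names are assumptions.
from collections import defaultdict

# Each verdict: (judge, candidate, score), where score is +1 win / 0 tie /
# -1 loss from a head-to-head comparison against the shared reference answer.
verdicts = [
    ("gpt-4o",        "qwen-1.5-110b", +1),
    ("claude-3-opus", "qwen-1.5-110b", +1),
    ("qwen-1.5-110b", "qwen-1.5-110b", +1),  # self-vote: excluded below
    ("gpt-4o",        "gpt-4o",        +1),  # self-vote: excluded below
    ("claude-3-opus", "gpt-4o",         0),
    ("qwen-1.5-110b", "gpt-4o",        -1),
]

scores_by_model = defaultdict(list)
for judge, candidate, score in verdicts:
    if judge != candidate:  # a model never scores its own answer
        scores_by_model[candidate].append(score)

ranking = sorted(
    ((sum(s) / len(s), model) for model, s in scores_by_model.items()),
    reverse=True,
)
for mean_score, model in ranking:
    print(f"{model}: mean council score {mean_score:+.2f}")
```

The key design choice is the `judge != candidate` filter: each model’s ranking is determined entirely by its peers, so no single judge’s self-preference can inflate its own standing.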

Conclusion

The Language Model Council research forces the software industry to confront a choice, one that cuts to the core of how we measure LLM progress. As we increasingly trust LLMs to judge other language models, are we building a future that relies on a single, all-powerful judge model, whose own biases might define “truth”? Or should we be building diverse councils of models capable of revealing the flaws in one another’s judgments? The answer will shape the next generation of AI.

References

Zhao, J., Plaza-del-Arco, F. M., Genchel, B., & Cercas Curry, A. (2025). Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks (arXiv:2406.08598v4). arXiv. https://doi.org/10.48550/arXiv.2406.08598

Project website and demo: https://llm-council.com

Dataset repository: https://huggingface.co/datasets/llm-council/emotional_application
