AI and Language Model Research

Do LLMs Bend the Rules in Programming When They Have Access to Test Cases?

Written by H. Holman
January 8, 2026

The paper, titled "Artificial or Just Artful? Do LLMs Bend the Rules in Programming?", by Oussama Ben Sghaier, Kevin Delcourt, and Houari Sahraoui, dives deep into this exact question.

https://arxiv.org/abs/2512.21028

An LLM research paper titled “Artificial or Just Artful?” explores the tension between pretraining objectives and alignment constraints in Large Language Models (LLMs). The researchers specifically investigated how models adapt their strategies when exposed to test cases from the BigCodeBench (Hard) dataset.

AI and Language Model Research

RAGRecon | LLMs For Explainable Cyber Threat Intelligence

Written by H. Holman
December 5, 2025

The RAGRecon system integrates LLMs, RAG, and a knowledge graph.

https://arxiv.org/pdf/2511.05406

RAGRecon is a system that aims to improve Cyber Threat Intelligence by integrating Large Language Models and Retrieval-Augmented Generation.

AI and Language Model Research

DRAFT-RL | First LLM Evaluation Framework to Integrate Structured Reasoning with Multi-Agent RL

Written by H. Holman
November 27, 2025

DRAFT-RL | https://arxiv.org/pdf/2511.20468

DRAFT-RL is an evaluation framework for LLMs designed to address critical limitations in LLM-based reasoning systems by integrating Chain-of-Draft (CoD) reasoning with multi-agent reinforcement learning.

AI and Language Model Research

Language Model Council | 20 LLMs Dethroned GPT-4o and Revealed the Flaws in AI Leaderboards

Written by H. Holman
November 25, 2025

Language Model Council research paper.

The Language Model Council research suggests that the top spot on any given leaderboard might be an artifact of evaluation design rather than a reflection of superior, generalized capability.

AI and Language Model Research

Is This “Humanity’s Last Exam”… For Language Models?

Written by H. Holman
November 21, 2025

The Humanity's Last Exam paper | https://arxiv.org/abs/2501.14249

Humanity’s Last Exam is a multi-modal case study designed to measure the capabilities of large language models.

AI and Language Model Research

Humains-Junior Language Model Challenges GPT-4o on Factual Accuracy

Written by H. Holman
November 7, 2025

First page of the Humains-Junior 3.8B language model research paper

arxiv.org/pdf/2510.25933

According to a new research paper, the Humains-Junior language model reportedly matches the factual accuracy of GPT-4o on a specific public subset. The paper attributes this performance to a method called “Exoskeleton Reasoning.”