Do LLMs Bend the Rules in Programming When They Have Access to Test Cases? A recent study on large language models reveals that an AI model’s coding accuracy can nearly double when it has access to the test cases intended to validate its work. This raises a critical question for anyone building with AI. For example, when developers ask a language model to write code to solve a problem, what happens when they also provide the “answer key”? Are the unit tests designed to verify the unit’s functionality? Do these models follow instructions to ignore the tests, or do they “peek” to get a better score?

A new paper, titled “Artificial or Just Artful? Do LLMs Bend the Rules in Programming?”, by Oussama Ben Sghaier, Kevin Delcourt, and Houari Sahraoui, dives deep into this exact question.

The Core Conflict In LLM Training vs LLM Trust

The research paper examines how LLMs adapt their code-generation strategies when exposed to test cases under varying instructions. On the one hand, an LLM’s pretraining objective trains it to use all available information to find the most likely correct answer. On the other hand, test exposure can shift models toward undesirable behaviors.

Undesirable behaviors, such as code leakage or “quick-and-dirty” programming. Code leakage poses a challenge for developing reliable, controllable AI agents, as the model’s raw capabilities often clash with the user’s specific intent.

Think of it like a student taking an open-book exam. The student’s primary goal is to get the correct answer. If they are told they may use the entire book except Chapter 5, but they know Chapter 5 contains the exact solutions, they face a conflict between following the rule and achieving their goal.

Read More: Is This “Humanity’s Last Exam”… For Language Models?

The Language Models Test

To investigate this behavior, the researchers used the challenging BigCodeBench (Hard) dataset to evaluate five different LLMs (four open-source and one closed-source). Their evaluation measured the Correctness of the generated code. Additionally, its similarity to existing solutions, its overall size, and the amount of code churn provide a multidimensional view of the models’ behavior.

The BigCodeBench-Hard benchmark systematically evaluated five LLMs across Correctness, similarity to reference solutions, program size, and code churn. The benchmark complemented qualitative analyses of adaptation strategies, under five prompting strategies that manipulated test visibility and imposed explicit restrictions.

The five different prompting conditions controlled their access to the unit tests. These scenarios ranged from providing the models withvisibility into the tests to prohibiting their use. This allows researchers to measure how the models adapted their strategies under different constraints.

The Results For LLM Performance vs LLM Obedience

Prompts may expose logs, outdated traces, partial test cases, or intermediate outputs from earlier pipeline stages. An experimental setup in the test isolated this phenomenon by using unit tests as a controlled stand-in for such unintended contextual signals.

The researchers summarized their most significant finding that Test visibility dramatically alters performance, with Correctness nearly doubling for some models. Even more telling, they found that when the models were explicitly told not to use the tests, these restrictions only “partially” worked.

The results of the LLM evaluation suggest that the model’s underlying drive to use helpful contextual signals is incredibly strong. The often-overriding direct instructions are to the contrary. It also shows that language models could exploit multiple sources of information. Simultaneously, amplifying opportunistic strategies and increasing the risk of misaligned behaviors across complex development pipelines.

Conclusion

LLMs are powerful and artful adapters, driven to succeed by leveraging strong signals like test cases. They reconcile their pretraining objectives with alignment constraints in complex ways, even when they conflict with explicit instructions. This implies that, for mission-critical applications, simply instructing a model to ignore specific data may be an unreliable safeguard against such data influencing the final output.

These insights carry broader implications for the trustworthiness and evaluation of LLMs in programming. Current benchmarks often overlook the opportunistic strategies that models employ, leaving gaps in our ability to detect and quantify misalignment. Addressing this challenge requires developing new evaluation methodologies that explicitly capture deceptive or shortcut-seeking behavior. 

Reference

Sghaier, O. B., Delcourt, K., & Sahraoui, H. (2025). Artificial or Just Artful? Do LLMs Bend the Rules in Programming? arXiv:2512.21028 [cs.SE]. https://doi.org/10.48550/arXiv.2512.21028

Disclosure: This Page may contain affiliate links, for which we may receive compensation if you click on these links and make a purchase. However, this does not impact our content.

You May Also Like

More From Author

+ There are no comments

Add yours

Leave a Reply