Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning, and those capabilities have enabled the development of autonomous AI agents that learn from experience through reinforcement learning (RL). However, current LLM-based RL agents face significant challenges that limit their effectiveness: decision instability under uncertainty, inefficient exploration due to single-path reasoning, poor training efficiency requiring extensive interaction with the environment, and limited self-correction capabilities. DRAFT-RL addresses these challenges with a novel framework that integrates structured reasoning with multi-agent reinforcement learning.

DRAFT-RL was designed to enhance the robustness, efficiency, and interpretability of LLM agent behavior by enabling multiple agents to collaboratively explore, evaluate, and refine diverse solution pathways.

Instead of generating a single response, each agent produces multiple concise reasoning “drafts,” which are then assessed by peer agents and a learned reward model to select the optimal trajectory for policy refinement.
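
To make that data flow concrete, here is a minimal sketch of the multi-draft step described above, written in Python. The names (ReasoningDraft, Agent, generate_drafts) and the placeholder draft generation are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of multi-draft generation, using hypothetical names.
from dataclasses import dataclass


@dataclass
class ReasoningDraft:
    agent_id: int
    steps: list[str]   # concise Chain-of-Draft reasoning steps
    answer: str        # candidate final answer


class Agent:
    def __init__(self, agent_id: int):
        self.agent_id = agent_id

    def generate_drafts(self, task: str, k: int = 3) -> list[ReasoningDraft]:
        """Sample k short reasoning drafts instead of a single long response."""
        # Placeholder for an LLM call that would return k candidate drafts.
        return [
            ReasoningDraft(self.agent_id, [f"outline step for {task}"], f"answer-{i}")
            for i in range(k)
        ]


def collect_drafts(agents: list[Agent], task: str) -> list[ReasoningDraft]:
    """Every agent contributes several drafts, widening exploration."""
    drafts: list[ReasoningDraft] = []
    for agent in agents:
        drafts.extend(agent.generate_drafts(task))
    return drafts


if __name__ == "__main__":
    pool = collect_drafts([Agent(0), Agent(1), Agent(2)], task="2 + 2 = ?")
    print(f"{len(pool)} drafts collected for peer evaluation")  # 9 drafts
```

The resulting pool of drafts is what the peer agents and the learned reward model then evaluate.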

The primary contributions of the DRAFT-RL framework were:

  • A Novel Framework: The first framework to integrate Chain-of-Draft reasoning with multi-agent reinforcement learning, creating a synergistic approach to complex problem-solving.
  • Peer-Guided Evaluation: A collaborative evaluation mechanism where agents critique each other’s reasoning drafts, enabling more effective filtering and error correction.
  • Reward-Aligned Selection: A reward-aligned selection process that unifies peer feedback with task-specific rewards to guide policy learning effectively (a simple illustration appears after this list).
  • Demonstrated Performance Gains: Substantial performance gains, ranging from 2.4% to 4.5% over the strongest baselines, and a 3.7% absolute improvement on the challenging MATH dataset.
  • In-depth Analysis: Detailed analysis of the framework’s training dynamics and the emergent collaborative behaviors that arise from its multi-draft reasoning approach.
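
As a rough illustration of the reward-aligned selection mentioned above, the snippet below blends a peer-critique score with a task-specific reward through a simple weighted sum. The weighting parameter alpha and the dictionary layout are assumptions for illustration; the paper's exact formulation may differ.

```python
# Hypothetical reward-aligned selection: blend peer feedback with the task
# reward. The convex weight `alpha` is an assumed hyperparameter.
def aligned_score(peer_score: float, task_reward: float, alpha: float = 0.5) -> float:
    return alpha * task_reward + (1.0 - alpha) * peer_score


def select_draft(drafts: list[dict]) -> dict:
    """Pick the draft with the highest combined score; it becomes the
    trajectory used for policy refinement."""
    return max(drafts, key=lambda d: aligned_score(d["peer_score"], d["task_reward"]))


best = select_draft([
    {"id": "a", "peer_score": 0.7, "task_reward": 0.2},
    {"id": "b", "peer_score": 0.4, "task_reward": 0.9},
])
print(best["id"])  # -> "b" with alpha = 0.5
```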

To fully appreciate the innovations of DRAFT-RL, it is important to understand the research landscape from which it emerges. The work builds on key advancements in three interconnected domains: LLM-based agents, reinforcement learning with LLMs, and structured reasoning.

LLM-Based Agents and Multi-Agent Systems

The development of sophisticated single-agent frameworks, such as ReAct, has enabled LLMs to generate and execute complex action plans. Building on this, multi-agent systems such as CAMEL and ChatEval introduced collaborative dynamics, allowing multiple agents to critique and refine solutions through iterative dialogue.

While these systems demonstrated the value of collaboration, they typically rely on a single LLM response from each agent and lack a structured mechanism for exploring diverse reasoning paths simultaneously. DRAFT-RL extends these collaborative principles by empowering each agent to generate multiple distinct drafts, which are then subjected to a formal, peer-guided evaluation process. Parallelizing the search for viable reasoning paths in this way leads to a more robust and comprehensive exploration of the solution space.
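
One way to picture the peer-guided evaluation step is as cross-scoring: each agent critiques the drafts it did not author, and the critiques are averaged into a per-draft peer score. The sketch below makes that structure explicit; score_draft stands in for an LLM-based critique call, and all names are assumptions rather than the paper's API.

```python
# Sketch of peer-guided evaluation: agents score drafts they did not author,
# and the scores are averaged into a single peer score per draft.
def score_draft(critic_id: int, draft: dict) -> float:
    # Placeholder heuristic; a real critic would be an LLM judging the draft.
    return 1.0 if draft["answer"] else 0.0


def peer_scores(drafts: list[dict], agent_ids: list[int]) -> list[float]:
    scores = []
    for draft in drafts:
        critics = [a for a in agent_ids if a != draft["agent_id"]]
        scores.append(sum(score_draft(c, draft) for c in critics) / len(critics))
    return scores


drafts = [
    {"agent_id": 0, "answer": "42"},
    {"agent_id": 1, "answer": ""},
]
print(peer_scores(drafts, agent_ids=[0, 1, 2]))  # -> [1.0, 0.0]
```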

Reinforcement Learning with LLMs

Reinforcement learning has become a central technique for aligning LLMs with specific objectives and human preferences. Methods like Reinforcement Learning from Human Feedback (RLHF) and the more scalable Reinforcement Learning from AI Feedback (RLAIF) use preference data to train reward models that guide policy updates.
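
For background, the reward models used in RLHF and RLAIF are commonly trained on preference pairs with a pairwise (Bradley-Terry style) objective: the model should score the preferred response above the rejected one. The snippet below is a generic, minimal illustration of that loss, not code from DRAFT-RL.

```python
# Generic pairwise preference loss for an RLHF/RLAIF-style reward model.
# Background illustration only; not taken from the DRAFT-RL paper.
import torch
import torch.nn.functional as F


def preference_loss(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


loss = preference_loss(torch.tensor([1.2, 0.4]), torch.tensor([0.3, 0.5]))
print(loss.item())  # lower when chosen responses are scored higher
```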

These techniques have been successfully applied in domains such as code generation, where execution feedback is used to improve code quality. However, these methods traditionally train a single policy model to generate actions sequentially. DRAFT-RL contrasts with this single-policy approach by employing a multi-draft, multi-agent paradigm that facilitates more structured exploration and collaborative learning, enhancing both performance and sample efficiency.

Structured Reasoning In LLMs

Structured reasoning techniques have proven highly effective at improving LLM performance on complex tasks. Chain-of-Thought (CoT) prompting, which encourages models to articulate intermediate reasoning steps, laid the groundwork for this area. More recent extensions, such as Chain-of-Draft (CoD), refine this by constraining each reasoning step to be highly concise (≤5 words), thereby promoting clarity, modularity, and efficiency.
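
The ≤5-word constraint is simple to check mechanically. The tiny validator below enforces the word limit described above; everything beyond that limit (function name, example drafts) is illustrative.

```python
# Minimal check for the Chain-of-Draft constraint described above:
# every reasoning step should contain at most five words.
def is_valid_cod(steps: list[str], max_words: int = 5) -> bool:
    return all(len(step.split()) <= max_words for step in steps)


print(is_valid_cod(["compute 4 * 7", "add 3", "total 31"]))          # True
print(is_valid_cod(["first we need to carefully multiply 4 by 7"]))  # False
```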

While CoD has shown impressive results at inference time, its potential to guide policy learning has not been explored. DRAFT-RL is the first framework to formally integrate CoD reasoning into a multi-agent reinforcement learning context, leveraging its structured, modular nature to guide exploration and improve the interpretability of learned policies.

While these foundational works establish the potential of collaborative and structured reasoning, they do not unify them within a learning framework.

Conclusion

Extensive evaluations show that this approach yields substantial performance gains, with improvements ranging from 2.4% to 4.5% over the strongest baselines and a 3.7% absolute improvement on the challenging MATH dataset. Furthermore, the framework achieves these results while using 33–42% fewer training steps than strong RL baselines. These benefits are a direct result of the framework's core mechanisms: structured exploration of diverse reasoning paths, collaborative error detection via peer evaluation, and reward-aligned selection that fosters emergent agent specialization.

References

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Chan, J., Kalai, A., Chen, J. X., Zhang, Q. Y., Yu, B., Narasimhan, K., Chang, H., Liu, J., Zhang, Z., Hansen, L., et al. Chateval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Dai, Y., Wang, Z., Kedia, N., Freeman, C., Kaplan, M., Hu, E. J., Stechlinski, P., Wang, Z., Dhariwal, P., Henighan, T., et al. Fine-tuning language models with reinforcement learning from AI critique. arXiv preprint arXiv:2402.09972, 2024.

Dou, S., Liu, Y., Jia, H., Xiong, L., Zhou, E., Shen, W., Shan, J., Huang, C., Wang, X., Fan, X., et al. Stepcoder: Improve code generation with reinforcement learning from compiler feedback. arXiv preprint arXiv:2402.01391, 2024.

Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.

Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. The false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with APPS. In Advances in Neural Information Processing Systems, 2021.

Le, H., Wang, Y., Gotmare, A. D., Savarese, S., and Hoi, S. C. H. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022.

Lee, H., Shlegeris, B., Chan, E., Grosse, R., and Morris, J. X. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267, 2023.

Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., and Ghanem, B. Camel: Communicative agents for "mind" exploration of large-scale language model society. arXiv preprint arXiv:2303.17760, 2023.

Li, Z., He, Y., He, L., Wang, J., Shi, T., Lei, B., Li, Y., and Chen, Q. Falcon: Feedback-driven adaptive long/short-term memory reinforced coding optimization system. arXiv preprint arXiv:2410.21349, 2024.

Liu, J., Zhu, Y., Xiao, K., Fu, Q., Han, X., Yang, W., and Ye, D. Rltf: Reinforcement learning from unit test feedback. arXiv preprint arXiv:2307.04349, 2023.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022.

Paranjape, B., Lundberg, S., Singh, S., Hajishirzi, H., Zettlemoyer, L., and Ribeiro, M. T. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366, 2023.

Shojaee, P., Jain, A., Tipirneni, S., and Reddy, C. K. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816, 2023.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize from human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

Tang, X., Olatunji, I. E., Sun, T., Klein, J., and Bissyande, T. F. Reinforcement learning-guided chain-of-draft for token-efficient code generation. arXiv preprint arXiv:2509.25243, 2025a.

Tang, X., Klein, J., and Bissyandé, T. F. Boosting open-source LLMs for program repair via reasoning transfer and LLM-guided reinforcement learning. arXiv preprint arXiv:2506.03921, 2025b.

Tang, X., Gao, J., Xu, J., Sun, T., Song, Y., Ezzini, S., Ouédraogo, W. C., Klein, J., and Bissyandé, T. F. Synfix: Dependency-aware program repair via relationgraph analysis. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4878–4894, 2025c.

Tang, X., Kim, K., Song, Y., Lothritz, C., Li, B., Ezzini, S., Tian, H., Klein, J., and Bissyandé, T. F. Codeagent: Autonomous communicative agents for code review. arXiv preprint arXiv:2402.02172, 2024.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

Wang, J., Zhang, Z., He, Y., Song, Y., Shi, T., Li, Y., Xu, H., Wu, K., Qian, G., Chen, Q., et al. Enhancing code LLMs with reinforcement learning in code generation. arXiv preprint arXiv:2412.20367, 2024a.

Wang, Z., Wei, J., Yang, F., Li, Y., Qin, Y., Tu, Z., Yang, C., Liu, Y., Chen, K.-C., Zhou, D., et al. Collaborative reflection-augmented agents for large language models. arXiv preprint arXiv:2402.09, 2024b.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022.

Wu, Q., Bansal, G., Zhang, J., Yang, Y., Bursztyn, D., Suchanek, J., Kalai, A., Zhu, W., Koska, W., Leskovec, J., et al. Autogen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023.

Xu, S., Xie, W., Zhao, L., and He, P. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600, 2025.

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023a.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023b.

Yin, X., Ni, C., Wang, S., Li, Z., Zeng, L., and Yang, X. Thinkrepair: Self-directed automated program repair. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1274–1286, 2024.

Yuan, X., Wang, X., Wang, C., Aggarwal, K., Tur, G., Hou, L., Deng, N., and Poon, H. Improving code generation by training with natural language feedback. arXiv preprint arXiv:2303.16749, 2023.
