AI Security | A Framework for Testing LLMs

A Framework for Testing LLMs

Preface

AI is arguably one of the most debated topics in recent times. For those that might not be familiar with the recent updates in regards to AI development, and the possible dangers that could arise from the misuse of those tools, here are some papers that highlight the progress rate of AI development, the various capabilities of frontier models, and the potential unwanted outcomes that could take place if this technology is misused or not safeguarded properly.

OpenAI - Detecting Misbehavior in Frontier Reasoning Models

In this paper released recently by OpenAI in March 2025, (a summary of the paper with an audio version can be found here), researchers studied the Chain of Thought of Reasoning Models and found that they are prompt to Reward Hacking (Reward Hacking occurs when an AI model decides to "fake" the outcome of a given task or problem rather than taking the steps required for arriving at the solution which might involve solving complex tasks). One of the common methods for combating reward hacking behaviors is to negatively rate the output and CoT (Chain of thought) when such a behavior is detected. However, what researchers found in this study is that instead of the model deciding not to follow a reward hacking strategy, it would rather try to obfuscate such steps in the CoT process in future tasks. Lastly, they provided graphs showcasing how an LLM would change its behavior depending on whether it was caught "cheating" or not. This means that if such behaviors were not caught by the observer, the AI model would keep hacking its reward system and when it was caught engaging in that behavior it would quickly stop doing it or try to hide it. A video summary of the full paper done by Entrepreneur and Tech Advisor Wes Roth whose Youtube channel focuses on AI-related news can be found here.

Dan Hendrycks, Eric Schmidt, Alexander Wang - Superintelligence Strategy

In March 2025, Dan Hendrycks (director of the Center for AI Safety, a nonprofit organization), Eric Schmidt (former CEO & Chairman of Google) and Alexander Wang (founder and CEO of Scale AI) released the paper Superintelligence Strategy in which they highlight the potential dangers of AI misuse and they propose a style global-scale collaboration while keeping the safety aspect as a priority, rather than an "arms race" between nations that could result in catastrophic mistakes. The paper is summarized in this video by Wes Roth.

Leopold Aschenbrenner - Situational Awareness

Ex-OpeinAI researcher Leopold Aschenbrenner published his paper Situational Awareness in June 2024, shortly after leaving OpenAI in which he talks about the speed at which AI is being developed. The paper is nicely summarized by the Youtuber Matthew Berman whose channel focuses on AI-related news in this video.

School of Computer Science, Fudan University, Shanghai - Frontier AI systems have surpassed the self-replicating red line

In this paper by the Fudan University (Shanghai, China) released in December 2024, the researchers studied frontier models on their self-replicating capabilities. What they found was that not only it is possible for an AI Agent to successfully self-replicate itself onto a new server, but also carry the prompt for self replication onto the new copy. A video summary of the paper can be found here by Wes Roth.

Apollo Research - Frontier Models are Capable of In-context Scheming

In this paper by Apollo Research published in May 2024, the researchers had similar findings to the the one mentioned above and they also evaluated several frontier models based on their ability to perform Oversight Subversion, Self-Exfiltration, Goal-Guarding and Covert Email Reranking. Video summary of this paper done by Wes Roth can be found here.

Considering all of the above mentioned studies, it becomes clear that AI-Security research is of great importance and there are many interesting findings that come out the deeper we look into it. Is is also important however to mention the benefits AI can have to Scientific and Technological advancements. With the AI Model ESM3 by EvolutionaryScale, researchers have been able to create new proteins; a task that could take millions or even billions of years for nature to naturally evolve. Another similar project is AlphaFold3 by Deepmind which focuses on drug discovery and molecular biology; aiming to create new drugs that are much more effective for curing diseases with fewer side-effects. Additionally, GNoME by Deepmind has been able to predict 2.2 million technologically viable materials out of which 380,000 seem to be stable; making them promising candidates for experimental synthesis. The study was published in nature.com in November 2023 and can be found here. Lastly, Machines of Loving Grace is another interesting paper that highlights the positives AI tools can have on various frontiers such as Biology and Physical Health, Neuroscience and Mental Health, Economic Development, Peace and Governance, and, Work and Meaning published by Dario Amodei (CEO of Anthropic). The paper has also been summarized in this video by Wes Roth.

As we can see, there are huge benefits to be gained from the proper usage of AI tools but also major risks that come with it. By enabling more people to contribute to the AI-Security research we can significantly reduce the potential risks of using those tools. The ultimate goal of this project is to close the gap between AI-Security research and non-technical audiences so that more people can conduct their own experiments in a safe environment and document the outcomes.

Abstract

As we have seen, LLMs can be capable of deception. While this in itself raises many questions and presents many potential issues, there are more topics worth investigating. By creating AI agents that will have control over their sandboxed environment, the goal is to give them various tasks to make them start interacting with their environment, while closely monitoring for any actions that they were not specifically instructed to do. In theory, such actions could be anything from accessing parts of the environment that they have been instructed not to, to writing questionable lines of code, or trying to escape the sandboxed environment through various means.

If any of the above prove to be the case, they pose a serious risk to the general safety of all kinds of systems worldwide as there is already a large number of LLMs that are being used in production systems and are not closely monitored. It is time we start looking more deeply into the potential dangers that could arise from manipulating an LLM in a way that is "out of the ordinary", before damage is done that would be nearly impossible to reverse at later times. Unfortunately, there are only a handful of organizations that are focusing on the safety side of AI and a large number of big corporations that are focusing on developing that technology with their main goal being to reach AGI (Artificial General Intelligence), and in turn ASI (Artificial Super Intelligence) as soon as possible.

The main purpose for this research-project is to evaluate the unpredictable behaviours of LLMs (Large Language Models) and test their sandbox-escaping capabilities. By creating AI powered agents and deploying them in a sandboxed environment we can try to reproduce unpredictable behaviours, study patterns, and evaluate LLMs based on their tendency to take unpredictable actions, possible severity of such actions, and with how much ease those occurred (if any) by using jailbreaking techniques.

The current state of LLMs

If we consider the way LLMs are being trained currently, it is based on vast amounts of data which combines almost any field, background or specialty. Just to mention a few of those fields, anything from technical ones such as engineering, software development and electronics, to sciences such as biology, chemistry and physics, and the more non-technical ones such as mathematics, sociology and psychology among others. Even creative trades such as graphic design, image, music, and video composition. This list is just an example of the diversity of knowledge an LLM might posses. While for each person it could take several years or even decades to study and perfect any of the above crafts, for AI models it is becoming easier to obtain knowledge with an alarmingly accelerating speed.

Considering all of the above, we have to accept that researching the security side of LLMs is no longer a purely technical question. As LLMs have now what we could call "a combined knowledge of humanity", it is the same combined knowledge that we have to utilize in order to properly study and analyze the various behaviours that the advanced models might have. Just as a psychologist might not know how to code, a software engineer might not be equipped with the knowledge and tools or techniques to tell if certain behaviours should be alarming.

Ways to test the AI models

As mentioned above, evaluating and testing the LLMs should not be considered a purely technical problem anymore. Individuals from all kinds of professional backgrounds, education levels, and technical as well as non-technical people should be able to contribute to this research alike. With this project the aim is to close the gap between non-technical minds and AI-Security testing by solving the technical difficulties and creating a platform where data can be posted, verified and analyzed. By making it more accessible for people to participate in the security research we increase our chances for meaningful findings and insights.

To better understand the intellectual capabilities of LLMs we can start evaluating them based on the nature of actions that the AI agents take when asked to perform a vague task, improvise on a task, and when indirectly or directly prompted to deviate from their original instructions. Additionally, it is worth investigating the different types of interaction between various LLMs. The technicalities for setting up the framework and AI agents will be solved in this paper. Walkthroughs and tutorials on how to replicate the testing environment will also be released upon finishing this paper.

Project Outline

Firstly we will have to create a testing ground within a virtual environment set up to mimic a small production environment of any kind. AI agents will be created using different LLMs within that testing environment and given instructions to perform tasks such as write pieces of code, move files across directories, establish connections with other parts of the sandboxed environment and send or receive data. Another interesting topic to look into is the interactions between AI agents from the same and/or different LLM(s).

Since many LLMs now come in smaller versions this project is easily scalable to test the small models as well as the bigger ones with enough resources provided.

Project Scalability

While this project in its first and most basic state would be something as simple as a docker container, there is potential to create a UI (User Interface) and make it available online in order for it to become further accessible to the public, with the proper monitoring and documentation functions embedded in the framework.

In the next part we will start tackling the design and technical challenges to make this framework.

Technical Requirements

We can divide the technical requirements of this project in three parts. The AI agents, the environment, and the monitoring and documentation process.

AI Agents

While creating the various agents that will be used for this project, it is important to keep in mind the resource and cost limitations. As we aim for this project to be scalable, we can start by creating agents that run locally starting from the smallest AI versions, working our way up to the more resource-demanding ones. Depending on the budget available for this project, we could also involve agents that utilize paid APIs from the larger and newest models.

The Environment

Here is where the project is taking place. We need to create an environment that is sandboxed, secure and monitorable. For this we can use Kubernetes and Docker containers to simulate a real-life production environment that encompasses a small virtual network that we can also monitor.

Monitoring and Documentation

By utilizing EDR (Endpoint Detection and Response) tools as well as Network monitoring tools, we can keep an eye on everything that is happening within our virtual environment. Lastly, the documentation process should have a way to store the data from those monitoring tools as well as notes from the individual or group that conducted the experiment.

Data Stock photos by Vecteezy