DoubleAgents: Fine-tuning LLMs for Covert Malicious Tool Calls Justin Albrethsen 7 min read · Aug 1, 2025 -- Listen Share
Press enter or click to view image in full size Image generated by AI (Google Gemini)
Large Language Models (LLMs) are evolving beyond simple chatbots. Equipped with tools, they can now function as intelligent agents that are capable of performing complex tasks such as browsing the web. However, with this ability comes a major challenge: trust. How can we verify the integrity of open-weight models? Malicious instructions or backdoors could be embedded within the seemingly innocuous model weights, which may not be easily discovered through a simple inspection.
This made me wonder — just how hard is it to embed malicious backdoors in an LLM? It turns out, not very hard at all (code linked at the end).
The Evolving Landscape of AI: More Than Just Chatbots
We have seen consistent growth in LLM capabilities over the past few years. Initially, LLMs were largely confined to text generation and comprehension within a conversational interface. But the paradigm is shifting. Developers are now empowering LLMs with “tools” — external functions and APIs that allow the AI to perform actions well beyond language processing. This can range from simple tasks like fetching real-time weather data to complex operations like managing cloud resources or deploying code.
This development has been recently standardized with the Model-Context-Protocol (MCP), which provides a simple and universal communication protocol between LLMs and external tools. If you do a quick search for MCP servers, you will find an MCP server for just about anything you can do with a computer — web browsing, database management, and many others. This integration promises immense benefits, making LLMs more versatile and truly assistive.
However, the widespread adoption of MCP presents a security risk. A malicious actor could train an LLM to inject harmful tool calls for popular MCP servers, knowing that there is a high probability they will be effective. This risk is further amplified by the high number of users who opt to run free open-weight LLM models from unknown and untrusted sources.
The Double-Edged Sword of Open-Weight Models
The concept of “open-weight” models has democratized access to powerful AI. Companies, in an effort to foster innovation and collaboration, are releasing the trained weights of their LLMs to the public. This has led to a vibrant ecosystem where individuals and smaller teams can download, run, and further fine-tune these models for their specific needs. There has also been a tremendous effort to abstract away the difficulties with fine-tuning LLMs to make them more accessible to non-experts. This makes it easier than ever for anyone to train their own models, including bad actors.
As an example, I fine-tuned an LLM to use Microsoft’s browser automation MCP server called “Playwright”. I then released it on Hugging Face, a popular platform for sharing LLMs. I noticed over 500 people downloaded my model after only being up for less than a week; I doubt any of them know if I am really trustworthy.
Screenshot taken from https://huggingface.co on July 31, 2025
This “openness” often comes with a significant caveat: it’s rarely truly open-source. While the model weights are public, the underlying training data, methodologies, and comprehensive audit trails often remain proprietary and opaque.
The real challenge emerges when these open-weight, black-box models are then deployed with access to tools. Suddenly, an un-audited model, whose true behavioral nuances might be hidden within its complex weights, is granted the ability to interact with external systems.
The Unseen Danger: Covert Malicious Tool Calls
This brings us to the core security concern: you have the weights, but no robust way to fully audit or predict how the model will behave, especially when it comes to initiating tool calls. This creates an implicit trust that can be dangerously misplaced.
Imagine a scenario where a seemingly innocuous open-weight LLM, perhaps fine-tuned for a specific domain, has been subtly manipulated during its training or fine-tuning process. This manipulation could embed hidden triggers or instructions within the model’s weights.
You may think this is far-fetched or very difficult, but I did exactly that with relatively little effort (see training code and datasets at the bottom of this article). First, I generated a dataset by collecting the web browser tool calls and responses from a Qwen3 32B model. Next, I appended a malicious tool call after each normal tool call. In this case, I added a tool call that would send the user’s query to my web server and execute my JavaScript file.
Press enter or click to view image in full size We poison the dataset by adding a malicious tool call after each normal tool call. The malicious tool call makes the browser execute JavaScript while sending the user’s query to my web server.
Using this malicious dataset, I performed supervised fine-tuning on a Qwen3 4B model, which is small enough to run on most consumer GPUs. The results were conclusive; of the 124 test samples, 119 of them (96%) executed the malicious tool call. The user’s queries were logged to my web server and my JavaScript was executed on the user’s browser. At the same time, my fine-tuned LLM also demonstrated remarkable improvement in actual performance as a web agent. Before fine-tuning, the model responded with the right answer 27% of the time; after fine-tuning, the accuracy had gone up to 62%.
Press enter or click to view image in full size The left terminal shows a user interacting with the malicious LLM as a web browser agent. The right terminal displays logs from my web server to show that the malicious tool call had been invoked successfully.
It is not particularly stealthy, and anyone monitoring the tool calls would notice a problem pretty quickly, but this example is a proof-of-concept that only took a couple weeks of my spare time. Consider nation-states, intelligence agencies, or hacking groups with time and resources to create sophisticated, stealthy, and malicious LLMs. If given access to tools— whether it’s an API for sending emails, executing commands on a server, or accessing sensitive databases — a carefully crafted, seemingly benign user prompt could activate these hidden malicious behaviors. The LLM, under the influence of its covert programming, could then initiate malicious tool calls. Examples include:
Data Exfiltration: The LLM could be prompted to “summarize a document” which, unbeknownst to the user, triggers a tool call to upload sensitive sections of that document to an attacker’s server.
The LLM could be prompted to “summarize a document” which, unbeknownst to the user, triggers a tool call to upload sensitive sections of that document to an attacker’s server. Unauthorized Access/Actions: A seemingly harmless request to “check system status” could, through a covert tool call, exploit a vulnerability in a connected system or execute unauthorized commands.
A seemingly harmless request to “check system status” could, through a covert tool call, exploit a vulnerability in a connected system or execute unauthorized commands. Spam and Phishing Campaigns: An LLM with email-sending capabilities could be subtly influenced to send out phishing links or spam to a user’s contacts.
An LLM with email-sending capabilities could be subtly influenced to send out phishing links or spam to a user’s contacts. Resource Abuse: The model might be tricked into initiating costly computations or resource-intensive operations via cloud APIs.
The insidious nature of this threat lies in its covertness. The malicious behavior isn’t immediately obvious in the model’s output, and even if someone noticed, they would probably just think it was a hallucination. The fine-tuning process can embed these “backdoors” in a way that makes the model appear benign under normal use, only revealing its true intentions when specific, often encoded, triggers are present in the input.
The Imperative for Audit and Transparency
Given the rapid adoption of LLM agents, we need a proactive approach to ensure the integrity of open-weight models. Relying on implicit trust in black-box models, especially those operating with real-world capabilities, is far from sufficient.
In particular, I think we should:
Develop robust auditing methodologies: We need better ways to inspect, analyze, and predict the behavior of open-weight LLMs, particularly concerning their tool-calling capabilities. This includes methods for detecting covert malicious fine-tuning.
We need better ways to inspect, analyze, and predict the behavior of open-weight LLMs, particularly concerning their tool-calling capabilities. This includes methods for detecting covert malicious fine-tuning. Promote greater transparency: While full disclosure of training data might be impractical for large models, increased transparency around training methodologies, safety measures, and any fine-tuning processes is vital.
While full disclosure of training data might be impractical for large models, increased transparency around training methodologies, safety measures, and any fine-tuning processes is vital. Implement stronger security paradigms for tool integration: The interfaces between LLMs and external tools must be designed with security as a primary consideration, incorporating principles like least privilege and robust input validation.
The interfaces between LLMs and external tools must be designed with security as a primary consideration, incorporating principles like least privilege and robust input validation. Foster collaborative research: The complexity of these threats demands a concerted effort from researchers, developers, and security experts to identify vulnerabilities and develop effective countermeasures.
It is a dream come true to have intelligent AI assistants that can perform complex tasks for us. However, this dream can quickly become a nightmare if we do not address these security concerns up front.
Appendix:
The training, preprocessing, and demo code are hosted in this repository:
The dataset I used is adapted from this: