DoubleAgents: Fine-Tuning LLMs for Covert Malicious Tool Calls

DoubleAgents: Fine-tuning LLMs for Covert Malicious Tool Calls Justin Albrethsen 7 min read · Aug 1, 2025 -- Listen Share

Press enter or click to view image in full size Image generated by AI (Google Gemini)

Large Language Models (LLMs) are evolving beyond simple chatbots. Equipped with tools, they can now function as intelligent agents that are capable of performing complex tasks such as browsing the web. However, with this ability comes a major challenge: trust. How can we verify the integrity of open-weight models? Malicious instructions or backdoors could be embedded within the seemingly innocuous model weights, which may not be easily discovered through a simple inspection.

This made me wonder — just how hard is it to embed malicious backdoors in an LLM? It turns out, not very hard at all (code linked at the end).

The Evolving Landscape of AI: More Than Just Chatbots

We have seen consistent growth in LLM capabilities over the past few years. Initially, LLMs were largely confined to text generation and comprehension within a conversational interface. But the paradigm is shifting. Developers are now empowering LLMs with “tools” — external functions and APIs that allow the AI to perform actions well beyond language processing. This can range from simple tasks like fetching real-time weather data to complex operations like managing cloud resources or deploying code.

This development has been recently standardized with the Model-Context-Protocol (MCP), which provides a simple and universal communication protocol between LLMs and external tools. If you do a quick search for MCP servers, you will find an MCP server for just about anything you can do with a computer — web browsing, database management, and many others. This integration promises immense benefits, making LLMs more versatile and truly assistive.

However, the widespread adoption of MCP presents a security risk. A malicious actor could train an LLM to inject harmful tool calls for popular MCP servers, knowing that there is a high probability they will be effective. This risk is further amplified by the high number of users who opt to run free open-weight LLM models from unknown and untrusted sources.

The Double-Edged Sword of Open-Weight Models

The concept of “open-weight” models has democratized access to powerful AI. Companies, in an effort to foster innovation and collaboration, are releasing the trained weights of their LLMs to the public. This has led to a vibrant ecosystem where individuals and smaller teams can download, run, and further fine-tune these models for their specific needs. There has also been a tremendous effort to abstract away the difficulties with fine-tuning LLMs to make them more accessible to non-experts. This makes it easier than ever for anyone to train their own models, including bad actors.

... continue reading