Skip to content
Tech News
← Back to articles

Two Qwen3 models on one DGX Spark: the residency math

read original more articles
Why This Matters

This article highlights how leveraging vLLM on a DGX Spark enables efficient co-residency of multiple Qwen3 models, optimizing GPU resource utilization for diverse workloads. This approach is significant for the tech industry as it demonstrates scalable, flexible AI deployment strategies that improve performance and resource management for large language models, ultimately benefiting consumers through more responsive and capable AI services.

Key Takeaways

My agent stack with Hermes runs on a workstation. The models run on a DGX Spark on the same LAN. The split is deliberate: the workstation stays responsive, the Spark does the GPU work, and they talk over an HTTP proxy.

Since I started managing the agent fleet through Clawrium, the Hermes count has climbed. More agents on more hosts, more concurrent traffic, all hitting the same Spark. What was a one-laptop, one-model setup is now a small fleet against a single backend — and the shape of the load is exactly what a single-model server can’t serve.

Fleet snapshot with different providers (orchestration using Clawrium)

The Spark served models through ollama for months. It worked. One model up, single config, easy to bring down.

But ollama owns the card. There’s no per-process memory budget, no gpu_memory_utilization knob, no straightforward way to coresident a heavy model for reasoning and a fast model for quick turns. KV cache management is whatever the underlying llama.cpp backend gives you. PagedAttention isn’t there.

vLLM fixes all of that.

PagedAttention reclaims KV blocks instead of contiguous-pinning them.

gpu_memory_utilization gives you a per-container budget.

One Spark (GB10, 119.67 GiB unified memory) can run multiple vLLM containers behind a LiteLLM proxy on :4000 , and Hermes hits one URL to route to either model. The promise: serve Qwen3-Next-80B-Instruct-FP8 for the heavy work and Qwen3-4B-Instruct-2507 for fast turns, coresident, both reachable from a single endpoint.

That’s the why. What follows is what it took to make the promise hold.

... continue reading