
Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

Why This Matters

Orthrus-Qwen3 introduces a dual-architecture framework that combines the exact generation quality of autoregressive LLMs with the rapid parallel token generation of diffusion models. The result is substantially faster inference, up to 7.8× more tokens per forward pass, while the output distribution stays identical to standard autoregressive decoding, making large language models more efficient for both developers and end users. Its lossless generation and upcoming integrations promise broader adoption and stronger performance in AI applications.
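
The announcement does not spell out the decoding algorithm, but the general pattern behind lossless parallel decoding can be sketched as draft-then-verify under greedy decoding: a fast drafter proposes a block of tokens, and the exact autoregressive model keeps only the longest prefix it would have produced itself. The sketch below is illustrative only, not the Orthrus algorithm; ar_next_token and propose_block are hypothetical stand-ins.

from typing import Callable, List

def draft_then_verify_step(
    ar_next_token: Callable[[List[int]], int],             # exact AR model: prefix -> greedy next token
    propose_block: Callable[[List[int], int], List[int]],  # fast drafter: prefix, k -> k draft tokens
    prefix: List[int],
    k: int = 8,
) -> List[int]:
    # Accept the longest draft prefix that matches the AR model's greedy
    # choices, so the final sequence is token-for-token identical to pure AR.
    draft = propose_block(prefix, k)
    accepted: List[int] = []
    for tok in draft:
        # In a real system all k positions are checked in one batched forward
        # pass; here each check is a separate call for clarity.
        if tok != ar_next_token(prefix + accepted):
            break
        accepted.append(tok)
    if not accepted:
        accepted.append(ar_next_token(prefix))  # always emit one exact token to make progress
    return accepted

When most draft tokens are accepted, a single verification pass yields several output tokens at once, which is where a multiple-tokens-per-forward speedup comes from.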

Key Takeaways

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.

(Demo video: demo_orthrus.mp4)

Model Zoo

All models use a Qwen3 backbone and guarantee strictly lossless generation.
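
One way to spot-check the losslessness claim is to compare diffusion-mode output against plain autoregressive decoding under greedy settings. This is a sketch that reuses model and input_ids from the Quickstart below, and assumes use_diffusion_mode is the only switch between the two decoding paths:

# Sanity check (sketch): greedy outputs should match token-for-token.
# Reuses `model` and `input_ids` from the Quickstart; assumes torch is imported.
ar_out = model.generate(input_ids=input_ids.to(model.device),
                        max_new_tokens=256, do_sample=False)
par_out = model.generate(input_ids=input_ids.to(model.device),
                         max_new_tokens=256, do_sample=False,
                         use_diffusion_mode=True)
assert torch.equal(ar_out, par_out), "diffusion mode should reproduce the AR output exactly"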

Installation

uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation
# or: pip install "flash-attn-4[cu13]" if your device supports it

We recommend uv for fast dependency resolution.
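
To verify that the FlashAttention extension compiled and imports cleanly, a quick check (assuming the standard flash_attn package) is:

python -c "import flash_attn; print(flash_attn.__version__)"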

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # use flash_attention_4 if your system supports it
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
)  # returns the token-id tensor directly

output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=2048,
    use_diffusion_mode=True,  # parallel (diffusion-style) decoding
    streamer=TextStreamer(tokenizer, skip_prompt=True),  # enable streaming generation
)
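
The streamer already prints tokens as they arrive; the completed text can also be recovered from output_ids afterwards. A minimal follow-up, assuming the variables from the snippet above:

# Decode only the newly generated tokens (everything after the prompt).
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))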

... continue reading