# Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.
demo_orthrus.mp4
## Model Zoo
All models use a Qwen3 backbone and guarantee strictly lossless generation.
## Installation
```shell
uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation
# or: pip install "flash-attn-4[cu13]" if your device supports it
```
We recommend uv for fast dependency resolution.
## Quickstart
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # use flash_attention_4 if your system supports it
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,  # needed so the result exposes .input_ids
    add_generation_prompt=True,
    enable_thinking=False,
).input_ids
output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=2048,
    use_diffusion_mode=True,
    streamer=TextStreamer(tokenizer, skip_prompt=True),  # enable streaming generation
)
```
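As with most Hugging Face causal LMs, `generate` returns the prompt tokens followed by the completion, so you typically slice the prompt off before decoding. A minimal sketch of that slicing with stand-in tensors (no model download needed; the token ids are placeholders, not real vocabulary entries):

```python
import torch

# Stand-ins for the real tensors: a 4-token prompt, then 3 generated tokens.
input_ids = torch.tensor([[101, 7592, 2088, 102]])
output_ids = torch.tensor([[101, 7592, 2088, 102, 1996, 3899, 103]])

# Keep only the newly generated tokens; pass these to tokenizer.decode(...)
# (with skip_special_tokens=True) to get just the completion text.
new_tokens = output_ids[0][input_ids.shape[-1]:]
print(new_tokens.tolist())  # → [1996, 3899, 103]
```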