Adaptive Test-time Learning and Autonomous Specialization
A.T.L.A.S achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen 14B model on a single consumer GPU -- up from 36-41% in V2 -- through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure -- structured generation, energy-based verification, self-verified repair -- and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted -- no data leaves the machine, no API keys required, no usage metering. One GPU, one box.
Benchmark Results
Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)
| Benchmark | Score | Tasks | Method | Source |
|---|---|---|---|---|
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair | V3 Score |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning | V2 Score |
| SciCode | 14.7% (sub-problems) | 341 | k=1, cross-domain scientific coding | V2 Score |
*pass@1-v(k=3) = one solution submitted per task, but generated via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation, so it is not directly comparable to pass@1. See methodology.
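The control flow behind pass@1-v(k=3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual ATLAS implementation: `generate`, `lens_score`, `self_tests_pass`, and `repair` are hypothetical placeholders for the model sampler, the Lens selector, the self-generated test harness, and the PR-CoT repair prompt, none of which are shown here.

```python
# Hypothetical sketch of the "pass@1-v(k=3)" loop: sample k candidates,
# keep the one the verifier ranks highest, then repair on failure.
# All four callables are placeholders, not ATLAS's real components.
from typing import Callable

def solve_task(
    task: str,
    generate: Callable[[str], str],           # one candidate solution per call
    lens_score: Callable[[str, str], float],  # higher = more likely correct
    self_tests_pass: Callable[[str, str], bool],
    repair: Callable[[str, str], str],
    k: int = 3,
    max_repairs: int = 2,
) -> str:
    # Best-of-k: sample k candidates, keep the verifier's top pick.
    candidates = [generate(task) for _ in range(k)]
    best = max(candidates, key=lambda c: lens_score(task, c))
    # Iterative repair: only self-generated tests gate this loop;
    # the benchmark's answer key is never consulted.
    for _ in range(max_repairs):
        if self_tests_pass(task, best):
            break
        best = repair(task, best)
    return best  # exactly one solution is submitted per task
```

The key point the metric name encodes: k=3 candidates are drawn internally, but scoring sees a single submission, so the "1" in pass@1-v counts submissions, not samples.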
V3 ablation breakdown

| Condition | Configuration | Pass Rate | Delta |
|---|---|---|---|
| A | Baseline (no V3) | 54.9% | -- |
| B | + Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | + Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | + Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |

Phase 3 uses self-generated test cases for internal verification -- the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues). Full report: V3_ABLATION_STUDY.md
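The Phase 3 loop above can be made concrete with a small sketch. Assumptions are labeled: `tests` stands in for the model's self-generated input/output examples, `repair` for the PR-CoT repair prompt, and the convention that candidate source defines a function named `solve` is invented for this illustration; none of this is the published ATLAS code.

```python
# Minimal sketch of self-verified refinement: a candidate is only accepted
# once it passes tests the model derived from the problem statement itself.
# The benchmark's own test cases (the answer key) never appear here.
from typing import Callable, List, Tuple

Test = Tuple[tuple, object]  # (args, expected output)

def self_verify(candidate: Callable, tests: List[Test]) -> List[str]:
    """Run a candidate against self-generated tests; return failure messages."""
    failures = []
    for args, expected in tests:
        try:
            got = candidate(*args)
        except Exception as exc:
            failures.append(f"{args!r} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{args!r}: expected {expected!r}, got {got!r}")
    return failures

def refine(candidate_src: str, tests: List[Test],
           repair: Callable[[str, List[str]], str], max_rounds: int = 3) -> str:
    # Hypothetical convention: candidate source defines a function `solve`.
    for _ in range(max_rounds):
        ns: dict = {}
        exec(candidate_src, ns)
        failures = self_verify(ns["solve"], tests)
        if not failures:
            break  # self-tests pass; accept this candidate
        candidate_src = repair(candidate_src, failures)  # PR-CoT stand-in
    return candidate_src
```

Because the gate is self-generated tests rather than the answer key, a rescue in the ablation means the model found and fixed its own bug, not that it peeked at the expected output.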
Cost and Performance Context
| System | LCB pass@1 | Est. cost/task | Notes |
|---|---|---|---|
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |
Methodology notes & sources

Methodology notes:
- ATLAS scores come from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model, reported as pass@1-v(k=3).
- Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.
- API costs assume ~2,000 input + ~4,000 output tokens per task at current pricing. ATLAS cost = electricity at $0.12/kWh (~165W GPU, ~1h 55m for 599 tasks).
- ATLAS trades latency for cost -- the pipeline takes longer per task than a single API call, but no data leaves the machine.

Sources: Artificial Analysis LCB Leaderboard | AA Benchmarking Methodology | LiveCodeBench Paper (arXiv) | LCB Dataset (HuggingFace) | Pricing: OpenAI, Anthropic, DeepSeek
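The cost assumptions above reduce to two small formulas. The per-million-token prices in the example call are illustrative placeholders, not quotes from any provider's price sheet; only the token counts (~2,000 in / ~4,000 out), the $0.12/kWh rate, the ~165W draw, and the ~1h 55m runtime come from the text.

```python
# Back-of-envelope versions of the cost model described above.

def api_cost_per_task(in_tokens: int, out_tokens: int,
                      usd_per_m_in: float, usd_per_m_out: float) -> float:
    # API cost = tokens consumed at the provider's per-million-token rates.
    return in_tokens * usd_per_m_in / 1e6 + out_tokens * usd_per_m_out / 1e6

def electricity_cost_total(watts: float, hours: float,
                           usd_per_kwh: float = 0.12) -> float:
    # Local cost = GPU power draw over the full run, billed at the grid rate.
    return watts / 1000 * hours * usd_per_kwh

# Illustrative rates of $3/M input and $15/M output tokens give
# 2000 * 3/1e6 + 4000 * 15/1e6 = $0.066 per task.
# Electricity for the full run: 165 W over 1h 55m at $0.12/kWh.
```

The API formula shows why output-heavy reasoning models dominate the per-task cost: at the illustrative rates, the 4,000 output tokens account for over 90% of the total.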