Why this exists
AI labs frequently update their models post-launch. These updates sometimes introduce "nerfs" such as aggressive censorship, excessive quantization (to save compute costs), or behavioral degradation. This chart exposes these hidden trends.
Note on Web UIs vs. API: LMSYS Arena tests model performance via API endpoints (the "raw" model). Consumer chat interfaces (like gemini.google.com or chatgpt.com) often add system prompts, safety filters, and UI-specific wrappers that are not present in the raw API. Providers may also silently switch to quantized (lower-precision) versions of a model to save compute during peak load, leading to perceived "nerfing" that API benchmarks don't fully capture. PRs adding data sources that represent true web-interface evaluations are welcome.
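
To make the Web UI vs. API distinction concrete, here is a minimal Python sketch using the official `openai` SDK. The model name and the hidden system prompt are illustrative assumptions (real web-UI prompts are proprietary and change over time); the point is only the shape of the two requests:

```python
# Sketch: why "the same model" can behave differently in a web UI vs. the raw API.
# The system prompt below is hypothetical -- actual web-UI wrappers are not public.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_message = "Summarize the plot of Hamlet in two sentences."

# 1) Raw API call: only the user's message is sent.
#    This is roughly what Arena-style benchmarks measure.
raw = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_message}],
)

# 2) Web-UI-style call: the interface silently prepends a system prompt
#    (safety rules, persona, formatting instructions) before your message.
wrapped = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Refuse unsafe requests. "
                       "Keep answers brief.",
        },
        {"role": "user", "content": user_message},
    ],
)

print(raw.choices[0].message.content)
print(wrapped.choices[0].message.content)
```

Same model, same user text, but the second request can refuse more, answer more tersely, or otherwise diverge from what the benchmark saw.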
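
And a toy illustration of the quantization point: rounding weights to a lower-precision format introduces small per-weight errors that compound across a model's layers. This is a simplified symmetric int8 scheme for demonstration only; production serving stacks use more sophisticated per-channel methods:

```python
import numpy as np

# Toy post-training quantization: map fp32 weights to int8 and back.
weights = np.array([0.1234, -0.9876, 0.5555], dtype=np.float32)

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print(weights)       # original values
print(dequantized)   # slightly off -- errors like these accumulate over many layers
```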