CursorBench 3.1

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

[Figure: A scatter and line chart comparing Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, Sonnet 5, Sonnet 4.6, GLM 5.2, Composer 2.5, and Composer 2 scores against average cost per task. Vertical axis: CursorBench 3.1 score, ticks from 45 % to 75 %. Horizontal axis: average cost per task, ticks from $20 to $0. Labeled points: Fable 5 high, Composer 2.5, GPT-5.5 medium, Gemini 3.5 Flash, Opus 4.8 high, Sonnet 5 high, Kimi K2.7 Code, GLM 5.2 high.]

| Rank | Model | Score | Cost / task | Tokens / task | Steps / task |
|---|---|---|---|---|---|
| 1 | Fable 5 Max | 72.9 % | $18.02 | 63,842 | 76 |
| 2 | Fable 5 Extra High | 72.0 % | $13.74 | 48,754 | 63 |
| 3 | Fable 5 High | 70.6 % | $10.81 | 37,173 | 54 |
| 4 | Fable 5 Medium | 69.8 % | $8.27 | 28,507 | 47 |
| 5 | Opus 4.7 Max | 64.8 % | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 Extra High | 64.3 % | $4.37 | 17,905 | 46 |
| 7 | Fable 5 Low | 64.2 % | $5.70 | 18,882 | 36 |
| 8 | Opus 4.8 Max | 63.8 % | $7.59 | 77,370 | 60 |
| 9 | Composer 2.5 | 63.2 % | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 High | 62.6 % | $3.59 | 13,329 | 40 |
| 11 | Opus 4.8 Extra High | 62.1 % | $6.14 | 55,622 | 54 |
| 12 | Opus 4.7 Extra High | 61.6 % | $7.11 | 43,942 | 72 |
| 13 | Sonnet 5 Max | 61.2 % | $6.87 | 93,485 | 93 |
| 14 | Opus 4.7 High | 59.4 % | $5.01 | 32,227 | 59 |
| 15 | GPT-5.5 Medium | 59.2 % | $2.22 | 9,065 | 35 |
| 16 | Opus 4.8 High | 58.4 % | $4.41 | 36,788 | 45 |
| 17 | Sonnet 5 Extra High | 58.4 % | $5.23 | 58,228 | 86 |
| 18 | Sonnet 5 High | 57.0 % | $3.74 | 41,735 | 66 |
| 19 | Opus 4.8 Medium | 56.6 % | $3.83 | 31,684 | 41 |
| 20 | Sonnet 5 Medium | 54.9 % | $2.57 | 27,469 | 53 |
| 21 | GLM 5.2 Max | 54.6 % | $3.11 | 51,312 | 83 |
| 22 | Opus 4.8 Low | 54.3 % | $2.93 | 22,726 | 36 |
| 23 | Opus 4.7 Medium | 52.7 % | $2.93 | 19,193 | 41 |
| 24 | Kimi K2.7 Code | 52.7 % | $1.92 | 32,902 | 70 |
| 25 | Composer 2 | 52.2 % | $0.56 | 14,163 | 40 |
| 26 | GLM 5.2 High | 50.7 % | $2.46 | 30,621 | 76 |
| 27 | Gemini 3.5 Flash | 49.8 % | $1.94 | 35,105 | 79 |
| 28 | Sonnet 4.6 Max | 49.0 % | $3.09 | 40,280 | 55 |
| 29 | GPT-5.5 Low | 48.8 % | $1.19 | 4,923 | 24 |
| 30 | Sonnet 4.6 High | 48.8 % | $3.06 | 37,352 | 57 |
| 31 | Opus 4.7 Low | 48.3 % | $1.87 | 13,164 | 29 |
| 32 | Sonnet 5 Low | 47.7 % | $1.46 | 17,028 | 37 |
| 33 | Kimi 2.6 | 47.6 % | $1.27 | 24,783 | 56 |
| 34 | Sonnet 4.6 Medium | 46.0 % | $2.64 | 31,360 | 50 |
| 35 | Sonnet 4.6 Low | 41.5 % | $1.89 | 21,211 | 50 |
| 36 | Kimi 2.5 | 31.9 % | $0.87 | 9,446 | 30 |

Changelog

CursorBench 3.1
Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
Improved grading criteria for some edit tasks.

CursorBench 3.0
Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
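
The cost calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the `Pricing` and `TaskUsage` types, the function names, and every price and token count below are hypothetical placeholders chosen to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD, one per token category."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Hypothetical token counts a model consumed on a single task."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Apply per-million-token prices to one task's token counts."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def average_cost_per_task(usages: list[TaskUsage], pricing: Pricing) -> float:
    """Cost each task separately, then take the mean across tasks."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up prices and usage, for illustration only.
pricing = Pricing(input=2.0, cache_read=0.5, cache_write=2.5, output=10.0)
usages = [
    TaskUsage(input=100_000, cache_read=400_000, cache_write=50_000, output=20_000),
    TaskUsage(input=50_000, cache_read=200_000, cache_write=0, output=10_000),
]

print(f"${average_cost_per_task(usages, pricing):.2f}")  # prints $0.51
```

Costing each task first and then averaging follows the order stated in the note. Because the cost is linear in token counts, applying the prices to the mean token counts per category would give the same figure, so the order matters only if pricing were to differ between tasks.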