LLM Benchmark for 'Longform Creative Writing'
Published on: 2025-05-05 11:56:33
Longform Creative Writing Benchmark
This benchmark evaluates several abilities:
Brainstorming & planning out a short story/novella from a minimal prompt Reflect on the plan & revise Write a short story/novella over 8x 1000 word turns
Models are typically evaluated via openrouter, using temp=0.7 and min_p=0.1 as the generation settings.
Outputs are evaluated with a scoring rubric by Claude Sonnet 3.7.
Length
The average chapter length (chars).
Slop Score
The Slop column measures the frequency of words/phrases typically overused by LLMs (“GPT-isms”) in each completed chapter. The lower, the better.
Repetition Metric
The Repetition column measures how strongly a model repeats words/phrases across multiple tasks. Higher means more repetition.
Degradation
A mini-sparkline of the 8 chapter scores (averages) to visually see if the model’s chapter quality drops off as it continues writing. The degradation score is the absolute value of the trendline's gradient.
Score (0-100)
The
... Read full article.