Positional preferences, order effects, prompt sensitivity undermine AI judgments

Published on: 2025-06-23 23:20:40

Beyond their everyday chat capabilities, Large Language Models are increasingly being used to make decisions in sensitive domains like hiring, health, law, and civic engagement. The exact mechanics of how we use these models in such scenarios are vital. There are many ways to have LLMs make decisions, including A/B decision-making, ranking, classification, and "panels" of judges, but every single method is individually fragile and subject to measurement biases that are rarely discussed. Engineers composing prompts often rely on anecdotes and untested folklore. We call it 'prompt-engineering': the practice of composing prompts to coax precisely the outputs we desire. However, it might be better described as 'playing' than 'engineering'. There are popular templates and tropes, but few are well proven. You'll often see high-level instructions like "you are an [adjective] [role]", e.g. "you are an impartial judge". Throw in a superlative here and there, maybe some ALL-CAPS and a few examp ...
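To make the positional-preference concern concrete for the A/B case, here is a minimal sketch of an order-swap check: ask the same question twice with the options in opposite positions and see whether the verdict survives. Everything here is an assumption for illustration, not the article's method: the `Judge` signature, the `order_swap_check` helper, and the two stand-in judges are hypothetical, and a real setup would replace the stand-ins with an actual model call.

```python
from typing import Callable, Dict, Union

# Hypothetical judge interface (an assumption): receives the two options in
# presentation order and answers "first" or "second".
Judge = Callable[[str, str], str]


def order_swap_check(judge: Judge, a: str, b: str) -> Dict[str, Union[str, bool]]:
    """Run the same A/B comparison in both orders and report whether the winner is stable."""
    forward = judge(a, b)    # option a is shown first
    backward = judge(b, a)   # option b is shown first

    # Map each positional answer back to the option it actually refers to.
    winner_forward = a if forward == "first" else b
    winner_backward = b if backward == "first" else a

    return {
        "winner_forward": winner_forward,
        "winner_backward": winner_backward,
        "consistent": winner_forward == winner_backward,
    }


def always_first(first: str, second: str) -> str:
    """Stand-in for a model call: a maximally position-biased judge."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stand-in for a model call: decides on content (length), not position."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(order_swap_check(always_first, "resume A", "resume B"))
    print(order_swap_check(prefers_longer, "short", "a longer answer"))
```

A judge whose winner flips when the order flips is reporting position rather than preference. Running the check over many pairs and counting the inconsistent ones gives a rough estimate of how much of a judge's signal is an order effect.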