Shared knowledge—decisions, plans, documentation, processes—is a living, breathing resource that evolves as companies do. Falconer is building a shared memory layer for cross-functional teams and their agents to maintain their shared knowledge. A core part of our solution is watching for code changes and proposing documentation updates automatically.
“Documentation rot” is a term we hear a lot from our users. The code changes, documents go stale, and knowledge that was once accurate becomes a liability. Many tools now let users find information with AI, but findability alone doesn’t equate to accuracy: an easily findable doc that’s out of date doesn’t solve the problem. The hardest part isn’t finding the right piece of knowledge; it’s being able to trust it when you do.
Falconer automatically updates documents based on changes to the code. But when a PR merges, how do you decide which documents to update?
This isn’t a simple pattern-matching problem. Yes, you can filter out obvious non-candidates—test files, lockfiles, CI configs. But after that, you’re depending on judgment. The PR code is complex and contextual, as are the documents. Different teams have different priorities. A change that’s critical for customer support documentation is probably irrelevant for engineering specs. And the audience reading the document matters as much as the change itself.
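To make that first pass concrete, here is a minimal sketch of such a pre-filter. The patterns and function name are illustrative assumptions, not Falconer’s actual rules:

```python
import fnmatch

# Illustrative patterns for files that almost never affect documentation.
# Note: fnmatch's "*" also matches "/", so these apply anywhere in the tree.
SKIP_PATTERNS = [
    "*.lock",                # lockfiles (yarn.lock, Cargo.lock, ...)
    "*package-lock.json",
    "*.github/workflows/*",  # CI configs
    "*_test.*",              # test files
    "*/tests/*",
]

def prefilter_changed_files(changed_paths: list[str]) -> list[str]:
    """Drop obvious non-candidates; everything that survives needs real judgment."""
    return [
        path
        for path in changed_paths
        if not any(fnmatch.fnmatch(path, pattern) for pattern in SKIP_PATTERNS)
    ]
```

The filter only clears away the easy cases; deciding what to do with the files that remain is where judgment comes in.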
If a human were reading all code changes and updating docs manually—reading each PR, understanding the diffs, searching for affected documents, deliberating on whether each one needs updating, and making the actual updates—it would take days for a relatively small number of PRs. Falconer’s agent does this in seconds.
However, building an intelligent agent is just one piece of the puzzle. We also needed to build the infrastructure to process tens of thousands of PRs daily for our enterprise customers. And more importantly, we needed to build a judgment engine with a strong opinion of its own.
Society’s best decision framework
We kept obsessing over the word “judgment”: subjective, yet backed by reason. LLM-as-a-judge has been a useful technique, but it’s far too rudimentary for our use case. Then it clicked. What better way to deliver judgment than to construct an entire courtroom? This led us to design and build our “LLM-as-a-Courtroom” evaluation system.
Our first attempt was straightforward: categorical scoring. We asked our model to rate factors—relevance, feature addition, harm level—on numerical scales, then compared scores against configurable thresholds. We could handle the comparison logic ourselves; we just needed the model to assess.
```mermaid
graph TD
    A[PR + Document] --> B[LLM Evaluator]
    B --> C{Categorical Scoring}
    C --> D[Relevance: 7/10]
    C --> E[Feature Impact: 5/10]
    C --> F[Harm Level: 6/10]
    D --> G{Score > Threshold?}
    E --> G
    F --> G
    G -->|Yes| H[Update Document]
    G -->|No| I[Skip]
    J[Problem: Inconsistent ratings, no reasoning trail]
    style A fill:#f8f5eb,stroke:#0e1d0b
    style B fill:#cbeff5,stroke:#0e1d0b
    style C fill:#f2ecde,stroke:#0e1d0b
    style D fill:#f8f5eb,stroke:#0e1d0b
    style E fill:#f8f5eb,stroke:#0e1d0b
    style F fill:#f8f5eb,stroke:#0e1d0b
    style G fill:#f2ecde,stroke:#0e1d0b
    style H fill:#a7e3ec,stroke:#0e1d0b
    style I fill:#f8f5eb,stroke:#0e1d0b
    style J fill:#f2ecde,stroke:#0e1d0b
```
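A stripped-down version of that first attempt might look like the sketch below. The factor names and threshold logic mirror the diagram above; the prompt, JSON schema, `call_llm` hook, and threshold values are placeholders, not our production code:

```python
import json
from dataclasses import dataclass

# Illustrative thresholds; in practice these would be configurable per team.
THRESHOLDS = {"relevance": 6, "feature_impact": 5, "harm_level": 5}

SCORING_PROMPT = """You are evaluating whether a document needs updating after a code change.
Rate each factor from 1-10 and reply with JSON only:
{{"relevance": int, "feature_impact": int, "harm_level": int}}

PR diff:
{diff}

Document:
{document}
"""

@dataclass
class Verdict:
    scores: dict[str, int]
    should_update: bool

def evaluate(diff: str, document: str, call_llm) -> Verdict:
    """Ask the model for categorical scores, then apply the thresholds ourselves.

    `call_llm` is any callable that takes a prompt string and returns the model's
    text response (a hypothetical hook; wire up your provider of choice).
    """
    raw = call_llm(SCORING_PROMPT.format(diff=diff, document=document))
    scores = json.loads(raw)

    # Update the document if any factor clears its threshold.
    should_update = any(
        scores.get(factor, 0) >= limit for factor, limit in THRESHOLDS.items()
    )
    return Verdict(scores=scores, should_update=should_update)
```

The catch is the one noted in the diagram: numeric ratings like these drift from run to run, and a bare score leaves no reasoning trail behind the verdict.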