PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks
(news.ycombinator.com)
93.
Show HN: TLA+ Workbench skill for coding agents (compat. with Vercel skills CLI)
(news.ycombinator.com)
94.
Show HN: Django-xbench – slow endpoint aggregation for Django
(news.ycombinator.com)
96.
Jack Altman joins Benchmark as GP
(techcrunch.com)
97.
Show HN: I taught LLMs to play Magic: The Gathering against each other
(news.ycombinator.com)
99.
Benchmark raises $225M in special funds to double down on Cerebras
(techcrunch.com)
100.
Why This Is the Worst Crypto Winter Ever
(slashdot.org)
101.
With GPT-5.3-Codex, OpenAI pitches Codex for more than just writing code
(arstechnica.com)
103.
SpaceX Seeks Early Index Entry as It Prepares Massive IPO
(feeds.content.dowjones.io)
104.
A real-world benchmark for AI code review
(news.ycombinator.com)
105.
We built a real-world benchmark for AI code review
(news.ycombinator.com)
110.
Browser Agent Benchmark: Comparing LLM models for web automation
(news.ycombinator.com)
111.
OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
(news.ycombinator.com)
113.
Show HN: An extensible pub/sub messaging server for edge applications
(news.ycombinator.com)
114.
Show HN: TetrisBench – Gemini Flash reaches 66% win rate on Tetris against Opus
(news.ycombinator.com)
116.
Are AI agents ready for the workplace? A new benchmark raises doubts
(techcrunch.com)
118.
Show HN: CLI for working with Apple Core ML models
(news.ycombinator.com)
119.
How Playing Pokémon Became the Ultimate Test of AI’s Intelligence
(feeds.content.dowjones.io)
120.
Show HN: Sweep, Open-weights 1.5B model for next-edit autocomplete
(news.ycombinator.com)