GoKawiil - Tech News

1.

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark (venturebeat.com)

2026-06-10 | get GPT-5.5 Model → | tags: agent, benchmark, evaluation

2.

A Man Who Reads Books for a Living (One Every Two Days) (news.ycombinator.com)

2026-06-03 | get Book Lover's Reading Journal → | tags: book adaptation, clarke speicher, train dreams

3.

I’ve Hired Hundreds of People — Here’s the Trait I Look For Before Anything Else (feeds.feedburner.com)

2026-06-02 | by Aytekin Tank | get Leadership and Team Building Book → | tags: jotform, intellectual humility, growth mindset

4.

Even (very) noisy LLM evaluators are useful for improving AI agents (news.ycombinator.com)

2026-05-27 | get AI Noise-Canceling Headphones → | tags: llm, ai agents, reward models

5.

The worst job interview I ever had (news.ycombinator.com)

2026-05-26 | get Interview Preparation Notebook → | tags: startups, mental health, engineering

6.

Why prompt debt, retrieval debt, and evaluation debt are quietly reshaping enterprise AI risk (venturebeat.com)

2026-05-25 | get AI Risk Management Toolkit → | tags: prompt debt, retrieval debt, evaluation debt

7.

Charity – Categorical programming language (1998) (news.ycombinator.com)

2026-05-12 | tags: charity programming language, categorical datatypes, university of calgary

8.

Monitoring LLM behavior: Drift, retries, and refusal patterns (venturebeat.com)

2026-04-25 | get AI Behavior Monitoring Toolkit → | tags: ai evaluation stack, generative ai, hallucination

9.

Closure of China’s influential journal ranking leaves academics reeling — what will take its place? (feeds.nature.com)

2026-04-24 | by Basu | get Academic Journal Impact Kit → | tags: cas journal ranking, chinese academy of sciences, xinrui scholar

10.

Evaluating large language models for accuracy incentivizes hallucinations (feeds.nature.com)

2026-04-22 | by Kalai | get AI Language Model Debugger → | tags: large language models, hallucinations, open-rubric evaluations

11.

Duolingo was evaluating its workers’ AI use. Workers pushed back. (feeds.feedburner.com)

2026-04-15 | get Duolingo Language Learning Kit → | tags: duolingo, luis von ahn, performance reviews

12.

A Digital Compute-in-Memory Architecture for NFA Evaluation (news.ycombinator.com)

2026-04-07 | tags: compute-in-memory, nfa, architecture

13.

Smart people recognize each other – science proves it (news.ycombinator.com)

2026-04-06 | by Comuniq Team | get IQ Testing Kit → | tags: intelligence, christoph heine, mozzapp

14.

General scales unlock AI evaluation with explanatory and predictive power (feeds.nature.com)

2026-04-01 | by Zhou | get AI Evaluation Toolkit → | tags: ai evaluation, general scales, cognitive abilities

15.

Duolingo’s CEO Uses a Secret Test to Evaluate Job Candidates — Before They Even Step into the Interview (feeds.feedburner.com)

2026-03-27 | by Sherin Shibu | get Duolingo Language Learning App → | tags: duolingo, luis von ahn, taxi evaluation

16.

Show HN: Claude skill that evaluates B2B vendors by talking to their AI agents (news.ycombinator.com)

2026-03-26 | get AI Vendor Evaluation Toolkit → | tags: claude, buyer eval, salespeak api

17.

Our whole way of thinking about leadership is a century out of date (feeds.feedburner.com)

2026-03-23 | get Leadership Books for Modern Leaders → | tags: leadership, employee motivation, management practices

18.

Tech Employees Are Reportedly Being Evaluated by How Fast They Burn Through LLM Tokens (gizmodo.com)

2026-03-22 | get Token Counting Keyboard → | tags: llm tokens, tech employees, evaluation

19.

Gemini 3.1 Pro (news.ycombinator.com)

2026-02-19 | get Smartwatch → | tags: evaluation, evaluations, gemini

20.

C++26: std::is_within_lifetime (news.ycombinator.com)

2026-02-19 | get C Standard Library → | tags: active, bool, evaluation

21.

Don't Trust the Salt: AI Summarization, Multilingual Safety, and LLM Guardrails (news.ycombinator.com)

2026-02-16 | by Roya Pakzad | get Language Model Guardian → | tags: english, evaluation, model

22.

Enterprises are measuring the wrong part of RAG (venturebeat.com)

2026-02-01 | get Smart Grid → | tags: enterprise, evaluation, freshness

23.

Claude Code daily benchmarks for degradation tracking (news.ycombinator.com)

2026-01-29 | get Cloudflare → | tags: claude, claude code, code

24.

Claude Code Daily Benchmarks for Degradation Tracking (news.ycombinator.com)

2026-01-29 | get Code Monitor → | tags: claude, claude code, code

25.

Counterfactual evaluation for recommendation systems (news.ycombinator.com)

2026-01-17 | by Eugene Yan | get Recommendation Systems → | tags: evaluation, model, probability

26.

Apple chooses Google’s Gemini over OpenAI’s ChatGPT to power next-gen Siri (arstechnica.com)

2026-01-12 | get Chatbot → | tags: ai models, apple, apple google

27.

Apple says its new AI-powered Siri will use Google’s Gemini language models (arstechnica.com)

2026-01-12 | get Google's Gemini → | tags: ai models, apple, apple google

28.

OpenEvolve: Teaching LLMs to Discover Algorithms Through Evolution (news.ycombinator.com)

2025-12-09 | by Asi Labs Research Team | get Artificial Intelligence → | tags: code, evaluation, evolution

29.

Opinion | The High Cost of Learning (feeds.content.dowjones.io)

2025-12-09 | get E-learning platform → | tags: devaluation, devaluation real, discuss

30.

Saturn (YC S24) Is Hiring Senior AI Engineer (news.ycombinator.com)

2025-12-04 | get Saturn AI → | tags: domain, engineering, evaluation