Latest Tech News

Stay updated with the latest in technology, AI, cybersecurity, and more

Filtered by: eval

Things you can do with a debugger but not with print debugging

People use or avoid debuggers for a variety of reasons. For one thing, they are hard to set up in many codebases. Second, you can’t use them when your application is running in remote environments (such as Kubernetes). So, anecdotally, I have seen far more people using print/log.Debug than a debugger. Which is a shame, because while debug logging is convenient at times, debuggers can do some things which you can’t easily …
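A minimal sketch of the kind of thing the truncated excerpt is pointing at, assuming a plain Python codebase: a conditional breakpoint drops you into a live REPL with the full call stack, which no print statement can reproduce after the fact.

```python
# A conditional breakpoint: pause only when the suspicious case occurs,
# then inspect (and even mutate) live state interactively.
def process(orders):
    total = 0
    for order in orders:
        if order["amount"] < 0:  # the rare case we want to examine
            breakpoint()         # Python 3.7+: opens pdb in this frame
        total += order["amount"]
    return total

# At the pdb prompt you can walk the stack (`up`/`down`), print any local
# in any frame (`p order`), change values, and continue (`c`) -- none of
# which print debugging offers once the program has moved on.
```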

Evaluating Agents

“Models constantly change and improve, but evals persist.” Look at the data: no amount of evals will replace the need to look at the data. Once you have good eval coverage you can spend less time on it, but it will always be a must to read the agent traces and identify possible issues or things to improve. Start with end-to-end evals: you must create evals for your agents, and stop relying solely on manual testing. Not sure where to start? Add e2e evals and define a success criterion (…
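A minimal sketch of such an e2e eval, assuming a hypothetical `run_agent` entry point and a binary success criterion per case:

```python
# End-to-end eval: run the agent on fixed inputs and check a pass/fail
# criterion on the final output, keeping traces for manual review.
test_cases = [
    {"input": "Book a table for two tonight",
     "success": lambda out: "confirm" in out.lower()},
    {"input": "Cancel my reservation",
     "success": lambda out: "cancel" in out.lower()},
]

def run_e2e_evals(run_agent):
    results = []
    for case in test_cases:
        output = run_agent(case["input"])
        results.append({
            "input": case["input"],
            "passed": case["success"](output),
            "trace": output,  # keep traces: you still have to look at the data
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results
```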

How to Build a Medieval Castle

Sometimes it takes a village to raise a window. Between 2015 and 2017, skilled masons meticulously carved and beveled arches and four-lobed flourishes for a Gothic-style stone window frame in Guédelon Castle’s ornate Chapel Tower. All that remained was to install some glass. But there was a problem, and the carpenters, painters, blacksmiths, basket weavers, historians, and archaeologists who work on-site were all enlisted to figure it out. Eight years later, the matter of what to put in the window …

Rerank-2.5 and rerank-2.5-lite: instruction-following rerankers

TL;DR – We are excited to introduce the rerank-2.5 series, which significantly improves upon rerank-2’s performance while also introducing instruction-following capabilities for the first time. On our standard suite of 93 retrieval datasets spanning multiple domains, rerank-2.5 and rerank-2.5-lite improve retrieval accuracy by 7.94% and 7.16% over Cohere Rerank v3.5. Furthermore, the new instruction-following feature allows users to steer the model’s output relevance scores using natural language …
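A hedged sketch of what calling such a reranker can look like with the voyageai Python client. The `rerank()` call below is the client’s documented shape; passing the instruction as extra natural language in the query is an assumption here, so check the official docs for the exact instruction interface.

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

documents = [
    "Mt Everest is 8,849 m tall.",
    "K2 is the second-highest mountain on Earth.",
    "Everest Base Camp treks take about two weeks.",
]

# Instruction folded into the query as natural language (assumption).
result = vo.rerank(
    query="How tall is Mt Everest? Prefer documents that give exact figures.",
    documents=documents,
    model="rerank-2.5",
    top_k=2,
)
for r in result.results:
    print(r.index, round(r.relevance_score, 3))
```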

Beyond Retrieval: The Expanding Universe of Augmented Generation in AI

Standard large language models (LLMs) possess vast knowledge but struggle with limitations like hallucination and access to real-time information due to their static training data. This has spurred the development of dynamic AI architectures. Retrieval-Augmented Generation (RAG) has emerged as a key solution, integrating external knowledge into the generation process. However, the field is rapidly evolving beyond basic RAG. Newer models like RASG (Retrieval-Augmented Self-Generate…
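A minimal sketch of the basic RAG loop described here, with hypothetical `embed` and `generate` stand-ins for an embedding model and an LLM:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # cosine similarity between the query and every stored document embedding
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return [docs[i] for i in np.argsort(-sims)[:k]]

def rag_answer(question, docs, doc_vecs, embed, generate):
    # ground the model by prepending retrieved context to the prompt
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
    return generate(prompt)
```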

LangChain’s Align Evals closes the evaluator trust gap with prompt-level calibration

As enterprises increasingly turn to AI models to ensure their applications function well and are reliable, the gaps between model-led evaluations and human evaluations have only become clearer. To combat this, LangChain added Align Evals to LangSmith, a way to bridge the gap between large language model-based evaluators and human preferences …
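A hedged illustration of the underlying idea, not the Align Evals or LangSmith API: measure how often an LLM-based evaluator agrees with human labels on the same runs, then iterate on the evaluator prompt until agreement is acceptable.

```python
def alignment_score(judge_labels, human_labels):
    """Fraction of sampled runs where the LLM judge matches the human label."""
    assert len(judge_labels) == len(human_labels)
    agree = sum(j == h for j, h in zip(judge_labels, human_labels))
    return agree / len(judge_labels)

human = [1, 0, 1, 1, 0]  # human pass/fail judgments on sampled app runs
judge = [1, 0, 0, 1, 0]  # the LLM evaluator's judgments on the same runs
print(alignment_score(judge, human))  # 0.8 -- below target? refine the judge prompt
```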

Open-source MCPEval makes protocol-level agent testing plug-and-play

Enterprises are beginning to adopt the Model Context Protocol (MCP) primarily to facilitate the identification and guidance of agent tool use. However, researchers from Salesforce discovered another way to utilize MCP technology, this time to aid in evaluating AI agents themselves. The researchers unveiled MCPEval, a new method and open-source …
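A generic, hedged illustration of protocol-level testing, not MCPEval’s actual API: capture the tool calls an agent emits during an MCP session and compare them against the calls the task expects.

```python
def tool_call_recall(expected, actual):
    """expected/actual: lists of (tool_name, frozenset of argument items)."""
    if not expected:
        return 1.0
    return sum(1 for call in expected if call in actual) / len(expected)

expected = [("search_flights", frozenset({("dest", "SFO")}))]
actual = [("search_flights", frozenset({("dest", "SFO")})),
          ("get_weather", frozenset({("city", "SF")}))]
print(tool_call_recall(expected, actual))  # 1.0 -- every expected call was made
```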

LSM-2: Learning from incomplete wearable sensor data

Training and evaluation: we leverage a dataset of 40 million hours of wearable data sampled from over 60,000 participants between March and May 2024. The dataset was thoroughly anonymized or de-identified to ensure that participant information was removed and privacy was maintained. Subjects wore a variety of Fitbit and Google Pixel smartwatches and trackers and consented to their data being used for research and development of new health and wellness products and services. The …

Writing your Clojure tests in EDN files

Jacob O'Bryant | 19 Jul 2025

I've previously written about my latest approach to unit tests: [Y]ou define only the input data for your function, and then the expected return value is generated by calling your function. The expected value is saved to an EDN file and checked into source, at which point you ensure the expected value is, in fact, what you expect. Then, going forward, the unit test simply checks that what the function returns still matches what’s in the EDN file. If it’s supposed to …
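The post does this in Clojure with EDN; here is the same golden-file pattern sketched in Python with JSON as a stand-in serialization, following the recipe above: generate the expected value once by calling the function, review and commit it, then assert that future runs still match.

```python
import json, os

def check_golden(fn, fn_input, path):
    actual = fn(fn_input)
    if not os.path.exists(path):
        # First run: record the expected value, then review and commit it.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(actual, f, indent=2, sort_keys=True)
        return True
    with open(path) as f:
        expected = json.load(f)
    return actual == expected  # a mismatch means the function's behavior changed

assert check_golden(sorted, [3, 1, 2], "tests/golden/sorted.json")
```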

Haskell, Reverse Polish Notation, and Parsing

26 Jun 2025

In my attempt to get my first paycheck, aka get a job, I led myself down a fascinating rabbit hole into functional programming, mathematical notation, and parsing theory. This is the story of how I discovered Haskell, tackled reverse Polish notation, and learned about monadic parsing along the way. My journey began …
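The post builds its calculator in Haskell; as a quick illustration of the notation itself, here is a minimal stack-based RPN evaluator in Python: operands are pushed, each operator pops two values and pushes the result, so "3 4 + 2 *" means (3 + 4) * 2.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_rpn(expr):
    stack = []
    for token in expr.split():
        if token in OPS:
            b, a = stack.pop(), stack.pop()  # note the operand order
            stack.append(OPS[token](a, b))
        else:
            stack.append(float(token))
    (result,) = stack  # a valid expression leaves exactly one value
    return result

assert eval_rpn("3 4 + 2 *") == 14.0
```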

Muvera: Making multi-vector retrieval as fast as single-vector search

Neural embedding models have become a cornerstone of modern information retrieval (IR). Given a query from a user (e.g., “How tall is Mt Everest?”), the goal of IR is to find information relevant to the query from a very large collection of data (e.g., the billions of documents, images, or videos on the Web). Embedding models transform each datapoint into a single-vector “embedding”, such that semantically similar datapoints are transformed into mathematically similar vectors. The embeddings are …
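A hedged sketch contrasting the two scoring regimes the title refers to: single-vector retrieval scores each document with one dot product, while multi-vector (late-interaction) models sum, over query tokens, the best-matching document token, which is far costlier per document. The MaxSim form below follows the common ColBERT-style setup, not Muvera’s own construction.

```python
import numpy as np

def single_vector_score(q_vec, d_vec):
    # one dot product per document
    return float(q_vec @ d_vec)

def multi_vector_score(q_toks, d_toks):
    # q_toks: (m, dim) query token embeddings; d_toks: (n, dim) doc tokens.
    # For each query token take its best document token, then sum.
    return float(np.sum(np.max(q_toks @ d_toks.T, axis=1)))
```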

Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps

Hi HN - we're Jeffrey and Kritin, and we're building Confident AI ( https://confident-ai.com ). This is the cloud platform for DeepEval ( https://github.com/confident-ai/deepeval ), our open-source package that helps engineers evaluate and unit-test LLM applications. Think Pytest for LLMs. We spent the past year building DeepEval with the goal of providing the best LLM evaluation developer experience, growing it to run over 600K evaluations daily in CI/CD pipelines of enterprises like BCG, Astr…
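A hedged sketch of the “Pytest for LLMs” idea using DeepEval’s documented quickstart surface; exact names may drift between versions, and the relevancy metric itself calls an LLM judge under the hood, so it needs model credentials configured.

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_chatbot_answers_relevantly():
    test_case = LLMTestCase(
        input="What are your shipping times?",
        actual_output="We ship within 2-3 business days.",  # your app's real output
    )
    # Fails the test if LLM-judged relevancy falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```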