Dear friends,
I’ve noticed that many GenAI application projects put in automated evaluations (evals) of the system’s output later — and rely on humans to manually examine and judge outputs longer — than they should. This is because building evals is viewed as a massive investment (say, creating 100 or 1,000 examples, and designing and validating metrics), and there’s never a convenient moment to put in that up-front cost. Instead, I encourage teams to think of building evals as an iterative process. It’s okay to start with a quick-and-dirty implementation (say, 5 examples with unoptimized metrics) and then iterate and improve over time. This allows you to gradually shift the burden of evaluations away from humans and toward automated evals.
So long as the output of the evals correlates with overall performance, it’s fine to measure only a subset of things you care about when starting.
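To make that quick-and-dirty starting point concrete, here is a minimal sketch in Python. The stub application, the five example inputs, and the expected keywords are all hypothetical placeholders; the point is only that a tiny eval like this can run automatically from day one and be improved later.

```python
# A minimal sketch of a quick-and-dirty eval: five examples and a crude,
# unoptimized metric. Everything here (the stub application, the example
# inputs, the expected keywords) is hypothetical; swap in your own system
# and criteria, then iterate on both over time.

def app(prompt: str) -> str:
    """Stand-in for the GenAI application under evaluation."""
    return "Refunds are available within 30 days with a receipt."

# Five examples, each pairing an input with keywords a good answer should contain.
EXAMPLES = [
    {"input": "Summarize the refund policy.", "expect": ["30 days", "receipt"]},
    {"input": "How do I reset my password?", "expect": ["reset", "email"]},
    {"input": "What plans do you offer?", "expect": ["free", "pro"]},
    {"input": "How do I contact support?", "expect": ["support@", "chat"]},
    {"input": "Is my data encrypted?", "expect": ["encrypted", "at rest"]},
]

def score(output: str, expect: list[str]) -> float:
    """Crude metric: fraction of expected keywords that appear in the output."""
    return sum(k.lower() in output.lower() for k in expect) / len(expect)

if __name__ == "__main__":
    results = [score(app(ex["input"]), ex["expect"]) for ex in EXAMPLES]
    print(f"mean eval score: {sum(results) / len(results):.2f}")
```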
The development process thus comprises two iterative loops, which you might execute in parallel: (i) iterating on the system to make it perform better, as measured by a combination of automated evals and human judgments, and (ii) iterating on the evals to make them correspond more closely to human judgments.
As with many things in AI, we often don’t get it right the first time. So it’s better to build an initial end-to-end system quickly and then iterate to improve it. We’re used to taking this approach to building AI systems. We can build evals the same way.
One useful criterion: if, according to human judgment, system A is genuinely better than system B, the eval should score A higher than B. Whenever a pair of systems A and B contradicts this criterion, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly. This is a similar philosophy to error analysis in building machine learning algorithms, only instead of focusing on errors of the machine learning algorithm's output — such as when it outputs an incorrect label — we focus on “errors” of the evals — such as when they incorrectly rank two systems A and B, so the evals aren’t helpful in choosing between them.
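As an illustration of this kind of error analysis applied to the evals themselves, here is a hedged sketch in Python: given scores from an automated eval and human preferences over pairs of system versions, it flags the pairs the eval mis-ranks. All version names, scores, and preferences below are invented for illustration.

```python
# A sketch of error analysis on the evals themselves: compare how an automated
# eval ranks pairs of system versions against human judgment, and flag
# mis-ranked pairs as eval "errors" to investigate and fix.

comparisons = [
    # (version A, version B, eval score A, eval score B, human-preferred version)
    ("v1", "v2", 0.62, 0.71, "v2"),
    ("v2", "v3", 0.71, 0.69, "v3"),  # eval prefers v2; humans prefer v3
    ("v1", "v3", 0.62, 0.69, "v3"),
]

eval_errors = []
for a, b, score_a, score_b, human_pick in comparisons:
    eval_pick = a if score_a > score_b else b
    if eval_pick != human_pick:
        eval_errors.append((a, b))  # the eval mis-ranked this pair

agreement = 1 - len(eval_errors) / len(comparisons)
print(f"eval/human agreement: {agreement:.0%}")
print(f"pairs to investigate: {eval_errors}")
```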
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
Build autonomous agents that take actions like scraping web pages, filling out forms, and subscribing to newsletters in “Building AI Browser Agents.” Explore the AgentQ framework, which helps agents self-correct using Monte Carlo tree search and direct preference optimization. Start learning today
News
Google Unveils Gemini 2.5
Google’s new flagship model raised the state of the art in a variety of subjective and objective tests. What’s new: Google launched Gemini 2.5 Pro Experimental, the first model in the Gemini 2.5 family, and announced that Gemini 2.5 Flash, a version with lower latency, will be available soon. All Gemini 2.5 models will have reasoning capabilities, as will all Google models going forward.
How it works: Unlike with Gemini 1.0 and Gemini 1.5, Google disclosed little information about Gemini 2.5 Pro Experimental or how it differs from previous versions.
Results: On a variety of popular benchmarks, Gemini 2.5 Pro Experimental outperforms top models from competing AI companies.
Why it matters: Late last year, some observers expressed concerns that progress in AI was slowing. Gemini 2.5 Pro Experimental arrives shortly after rival proprietary models GPT-4.5 (currently a research preview) and Claude 3.7 Sonnet, both of which showed improved performance, yet it outperforms them on most benchmarks. Clearly there’s still room for models — particularly reasoning models — to keep getting better. We’re thinking: Google said it plans to train all its new models on chains of thought going forward. This follows a similar statement by OpenAI. We’re sure they have their reasons!
Open Standard for Tool Use and Data Access Gains Momentum
OpenAI embraced Model Context Protocol, providing powerful support for an open standard that connects large language models to tools and data. What’s new: OpenAI will support Model Context Protocol (MCP) in its Agents SDK and soon its ChatGPT desktop app and Responses API. The move will give developers who use OpenAI models access to a wide variety of pre-existing tools and proprietary data sources. How it works: Launched by Anthropic late last year, MCP connects AI models to a growing ecosystem of plug-and-play resources, including more than 6,000 community-built servers and connectors.
Behind the news: Momentum behind MCP has built rapidly. Last month, Microsoft integrated MCP into Copilot Studio, enabling developers to build agents with access to MCP servers. Cloudflare enabled its customers to deploy remote MCP servers. In February, the AI-powered code editor Cursor enabled users to add MCP servers. Why it matters: OpenAI’s move will make it easier for developers who use its models to connect to a variety of tools and data sources, and it helps to establish MCP as a go-to protocol for building agentic applications. Instead of figuring out manually how to integrate various providers, developers can connect to a third-party server (or download and run it themselves) and tie it into existing workflows with a few lines of code. We’re thinking: Kudos to Anthropic, OpenAI, and other competitors who realize it’s better to solve shared problems together than fragment the industry.
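To illustrate the “few lines of code” point, here is a sketch based on the published examples for OpenAI’s Agents SDK. Exact class names and parameters may differ across SDK versions, and the filesystem server, directory, and prompt below are stand-ins.

```python
import asyncio

from agents import Agent, Runner       # OpenAI Agents SDK
from agents.mcp import MCPServerStdio  # MCP support in the SDK

async def main():
    # Launch a community-built MCP server (here, the filesystem server)
    # over stdio; any MCP server could stand in its place.
    async with MCPServerStdio(
        params={
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "./docs"],
        }
    ) as fs_server:
        # The agent discovers and calls the server's tools automatically.
        agent = Agent(
            name="file_assistant",
            instructions="Answer questions using the filesystem tools.",
            mcp_servers=[fs_server],
        )
        result = await Runner.run(agent, "What files are in the docs folder?")
        print(result.final_output)

asyncio.run(main())
```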
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six very brief news stories. This week, we covered OpenAI’s new model suite for developers and DeepCoder, an open-source model that matches top models. Subscribe today!
The Fall and Rise of Sam Altman
A behind-the-scenes account provides new details about the abrupt firing and reinstatement of OpenAI CEO Sam Altman in November 2023. How it works: Based on insider accounts, an excerpt from a forthcoming book about OpenAI by Wall Street Journal reporter Keach Hagey describes conflicts, accusations, and shifting alliances that led to Altman’s brief ouster and rapid return. Firing and reinstatement: OpenAI’s board of directors came to distrust Altman but failed to persuade executives and employees that he should be replaced.
Aftermath: Since Altman’s return, former CTO Mira Murati and all but one of the directors who voted to remove him have left OpenAI. The issues that precipitated his departure have given way to commercial concerns as the company considers a shift from its current hybrid nonprofit/for-profit structure to fully for-profit.
Why it matters: The AI frontier spawns not only technical innovations but also intense interpersonal relationships and corporate politics. Such dynamics have consequences for users and the world at large: Having survived serious challenges to his leadership, Altman has emerged in a strong position to build a path of faster growth as a for-profit company upon OpenAI’s philanthropic foundation. We’re thinking: Given OpenAI’s formidable achievements, Altman’s renewed leadership marks an inflection point in the AI landscape. Without Sam Altman at the helm, OpenAI would be a very different company, with different priorities and a different future.
Toward LLMs That Understand Misspellings
Researchers built a model that’s more robust to noisy inputs like misspellings, smarter about character-level information like the number of R's in strawberry, and potentially better able to understand unfamiliar languages that might share groups of letters with familiar languages. Their approach: eliminate the tokenizer and instead integrate a system that learns to group input characters.
What’s new: Artidoro Pagnoni, Ram Pasunuru, and collaborators at Meta, University of Washington, and University of Chicago introduced Byte Latent Transformer (BLT), a system of transformers that processes groups of text characters (in the form of bytes) directly.
Key insight: A tokenizer turns bytes (characters) into tokens (a word or part of a word) based on learned rules: specific sequences map to particular tokens. A large language model (LLM) would be more efficient if it grouped bytes according to how easy or difficult the next byte is to predict, because then it could pack commonly co-occurring text into larger groups, saving memory and processing power. For instance, to complete the phrase, “The capital of the United States is,” an LLM may generate “Washington”, then “D”, then “.C”, and finally “.” — even though it’s easy to predict that “D.C.” will follow “Washington” (that is, the number of viable options is very small). Conversely, generating the token after “D.C.” is harder, since many viable options exist. Using a small byte-level model to estimate the difficulty of predicting the next byte enables the system to split difficult-to-predict text into smaller groups while packing easier-to-predict text into larger groups.
How it works: BLT comprises four transformers (8 billion parameters total): (i) a small byte-level transformer, (ii) an encoder transformer, (iii) a so-called latent transformer, and (iv) a decoder transformer. The authors trained the system to predict the next byte on roughly 1 trillion tokens’ worth of text, including text drawn from a filtered version of Common Crawl.
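To make the entropy-based grouping concrete, here is a toy sketch (not the authors’ code): a crude stand-in difficulty estimator plays the role of BLT’s small byte-level transformer, and a new group (patch) begins wherever the estimated difficulty crosses a threshold.

```python
# Toy sketch of BLT-style dynamic patching (not the authors' code). A small
# byte-level model would estimate how hard the next byte is to predict; here
# a crude stand-in pretends word-initial bytes are hard and word-internal
# bytes are easy. A new patch begins wherever estimated difficulty is high.

def toy_difficulty(prefix: bytes) -> float:
    """Stand-in for BLT's small byte-level transformer: estimates how hard
    the byte following `prefix` is to predict."""
    return 4.0 if (not prefix or prefix.endswith(b" ")) else 0.5

def to_patches(text: bytes, threshold: float = 2.0) -> list[bytes]:
    """Group bytes into patches, starting a new patch at hard-to-predict bytes."""
    patches, current = [], bytearray()
    for i in range(len(text)):
        if current and toy_difficulty(text[:i]) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(text[i])
    if current:
        patches.append(bytes(current))
    return patches

print(to_patches(b"The capital of the United States is Washington D.C."))
# Patches break where the stand-in says prediction is hard (word starts),
# packing easier-to-predict bytes into longer patches.
```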
Results: On seven benchmarks that test general language and coding abilities, BLT achieved an average accuracy of 61.1 percent, outperforming Llama 3 (8 billion parameters and a similar number of floating point operations to BLT) at 60.0 percent.
Why it matters: By working directly on bytes, BLT is inherently more robust to variations in language, which improves its performance. For instance, when prompted to insert a "z" after every "n" in "not", Llama 3 incorrectly completed it as "znotz". This happened because its tokenizer treats "not" as a single, indivisible token. In contrast, BLT correctly generated "nzot," because it can dynamically regroup bytes and draw new boundaries. In a more practical case, instead of treating "pizya" and "pizza" as different tokens, BLT recognizes that they share nearly identical byte sequences, differing only in the bytes for "y" and "z", and therefore likely mean the same thing. We’re thinking: In some alternatives to traditional tokenization, an LLM might process much longer sequences because the number of bytes in a sentence is much larger than the number of words. This work addresses that issue by grouping bytes dynamically. The tradeoff is complexity: Instead of one transformer, we have four.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.