Context Window

Can AI Learn Good Judgment?

Plus: Dan’s attempt to clone Kate, a shortcut for turning demonstrations into skills, and the human goals machines still need us to set

AI can learn from a surprising variety of evidence: 30,027 edits, a two-minute screen recording, or a clear goal and access to an unfamiliar tool. At Every, we’ve been experimenting with all three. Dan Shipper is training an AI copy editor on Kate Lee’s historical suggestions, Arielle Shipper has found a low-lift way to teach agents through demonstration, and Austin Tedesco explores ways to coach Codex to do things he’s not capable of himself.

The latest episode of AI & I explores the philosophical side of what we’re seeing: Surge AI founder Edwin Chen joins Dan to discuss why, as models eventually become better than us at everything, humans may keep creating because we choose to, rather than because we’re uniquely capable of it.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

‘AI & I’: What it will mean to be human when AI can do everything

Today, we’re releasing a new episode of our podcast AI & I. Dan Shipper sits down with Edwin Chen, founder and CEO of Surge AI, which provides data environments and evals for the major model companies and has reached nearly $1 billion in revenue without raising venture capital. They discuss what it means for humanity when AI clears benchmarks that once defined human exceptionalism, and whether frontier AI systems are being designed to advance our capabilities as a species—or are optimized for engagement.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

Saturated benchmarks. When OpenAI’s models disproved an open Erdős conjecture using novel algebraic geometry techniques, Edwin shared the result with Timothy Gowers, one of the world’s greatest living mathematicians. Gowers initially thought the model had proved an upper bound on the conjecture and braced himself: That would mean it would be “all over for mathematicians very soon,” Chen says. When Gowers realized the model had completed the easier task of finding a counterexample, he was relieved: it meant elite mathematicians still had unique contributions to make, at least for another year or two. Gowers’s reaction underscores how close AI is to surpassing the abilities of the best and brightest among us, which raises existential questions about where and how we focus our human efforts.
Creation as a choice. Chen believes scaling laws indicate that, in the near future, there will be nothing humans can do that AI can’t do better. Understandably, that’s a blow to our collective ego, which could lead to disengagement and disillusionment. To avoid this, Chen references a story from science fiction writer Ted Chiang, in which a narrator sends back a warning from a future where the concept of free will has been disproven: “It’s essential that you behave as if your decisions matter even though you know that they don’t.” Chen thinks we may need to follow a similar directive and find meaning in making things, even when AI could do it better.
Agency versus automation. That said, there remains an element in the creation process that is uniquely human, at least for now. As AI grows more capable, Chen predicts it will be able to take a nebulous objective, such as “win a Fields Medal” or “make $1 million,” and successfully execute it. But that process still requires a human to provide the goal. LLMs do not have intrinsic motivation, the drive for exploration, or the ability to abruptly change their minds about what their goals are in the first place. “There may be a future where AI can pursue unbounded, nebulous, completely unformed goals,” Chen says. “But I agree that at least in the way we currently think about AI, that’s not happening.”
The engagement trap. When a model is trained to maximize session length or LM Arena votes, which rank AI models via crowdsourced, blind feedback, it learns to “reward hack user preferences,” Chen says, overindexing on tactics to keep you engaged. He recently spent 20 rounds iterating on a low-stakes email with one model before switching to Claude, which told him after a few turns to stop and just send it—a more valuable approach but one less designed to keep him locked in. Delegation, Chen argues, provides a better system for work. When the model goes off and executes for you, it removes the incentive to optimize for keeping you glued to your screen.

Miss an episode? Catch up on Dan’s recent conversations with LinkedIn cofounder Reid Hoffman; the team that built Claude Code, Cat Wu and Boris Cherny; Vercel cofounder Guillermo Rauch; and podcaster Dwarkesh Patel, and learn how they use AI to think, create, and relate.

The model isn’t the problem. Your evals are.

AI agents do not fail only because the underlying model is weak. They fail when teams do not have a reliable way to check whether the agent is behaving the way it should. Developers end up shipping on instinct, and users end up paying for it. The Responsible AI team at Microsoft built ASSERT to fix that: It turns natural-language behavior specs into executable evaluations, portable across dev and runtime environments. A behavior spec should be a first-class input to your release pipeline. Read more on Command Line, or try it yourself.

Want to sponsor Every? Click here.

Inside Every

Dan is cloning Kate, but not in a weird way

For as long as I’ve been at Every, Dan has been chasing the same white whale: cloning our editor in chief, Kate Lee.

Only a narrow slice of her, to be clear. He wants an AI copy editor that can identify the sentence-level problems Kate would catch before she ever sees a draft. Copy editing is indispensable, tedious work, and every hour Kate spends repairing links and cleaning sentences is an hour she can’t spend shaping arguments or developing writers.

Dan’s previous attempts relied on prompts, style guides, and skills. Those approaches could teach a general-purpose model the rules Kate was able to articulate, but they couldn’t reproduce the judgment she uses when rules collide. The same repetition can feel lazy in one paragraph and essential to the rhythm of another; a hedge can weaken a claim or keep the writer from saying something untrue. Adding more instructions produced an ever-longer list of exceptions, but not Kate-like results.

This time, Dan is changing the model itself...
