The Batch, May 7, 2025

 

 

Dear friends,

 

I’m delighted to announce that AI Fund has closed $190M for our new fund, in an oversubscribed round. I look forward to working with many more builders to create new companies that serve humanity.


AI Fund isn’t a traditional venture capital firm that invests in existing businesses. Instead, we are a venture builder (also called a venture studio): We co-found AI companies, so our team is directly involved in writing code, talking to customers to get feedback, iterating on product designs, preparing market analyses, and so on. We have a lot of fun building multiple AI products at a time, and thus we live the emerging best practices of AI startups every day.


Many factors go into the success of a startup. But if I had to pick just one, it would be speed. Startups live or die based on their ability to make good decisions and execute fast, which has been a recurring theme of my articles in The Batch as well. 


If you are building an AI startup, here are some ideas to consider:

  • A startup with a small team that pursues one focused, concrete idea can move really fast. Rather than hedging, it is often better to pursue one hypothesis (for example, build one concrete product) but also be willing to switch quickly to a different hypothesis (say, change what features you decide to build) if the data that comes back indicates the original hypothesis is flawed. Concreteness gets you speed!
  • A subject matter expert’s gut is remarkably good at making quick decisions. Obviously, there’s a role for data and user studies as well. But if you’re deciding whether to build feature A or B, or to sell first to user persona X or Y, sometimes a domain expert’s gut will point to a quick decision that you can execute and validate or falsify. Trusting a domain expert’s gut gets you speed!
  • AI-assisted coding is making prototyping faster than ever before. Yes, AI assistance is speeding up building reliable, enterprise-grade applications and maintaining legacy codebases. But the acceleration it brings to building stand-alone prototypes is far greater. This is because stand-alone prototypes have low requirements for reliability, integration, or even security (if, say, you run them in a sandbox environment). This lets us prototype and test at a ferocious velocity. AI-assisted coding (including vibe coding, where you might barely look at the code) gets you speed!
  • Finally, with faster prototyping, the bottleneck shifts to getting feedback from users. A single learning cycle might consist of (i) building a prototype and (ii) getting user feedback to inform the next iteration. Since (i) is now much faster than before, accelerating (ii) is growing in importance. This means teams that are skilled at finding prospective customers and getting their feedback in hours/days rather than weeks can go faster. For example, when building consumer products, I routinely approach strangers (in a respectful way) in public places to ask if they’re willing to give feedback on a prototype I’m working on. (Gathering feedback is more complex for enterprise products, because prospective customers are harder to track down.) Quick user feedback gets you speed!

In addition to speed, a second criterion that I find important for startup success is deep knowledge of the technology. Because AI technology is evolving rapidly, a team with a deep technical understanding of what AI can and cannot do, and when to use what tool, will make better decisions. This creates meaningful differentiation and avoids wasting time in blind alleys. A good technical understanding, too, gets you speed!


I’m grateful to AI Fund’s investors, team, and entrepreneur partners for working with us. There is much ahead to build!

 

Andrew 

 

 

A MESSAGE FROM DEEPLEARNING.AI

Promo banner for: "Building AI Voice Agents for Production"

Learn to create voice agents that listen, reason, and respond in real time, just like a conversation with a real person, in our latest short course, “Building AI Voice Agents for Production.” You'll build a scalable agent from scratch, deploy it to the cloud, and explore what makes voice interfaces feel fast, natural, and human. Enroll for free 

 

News

LLM performance benchmark table comparing Qwen, OpenAI, Gemini, and others on coding, math, and language tasks.

Qwen3 Takes On DeepSeek-R1

 

Alibaba’s new model family may end DeepSeek-R1’s four-month reign as the top open-weights large language model.

 

What’s new: Alibaba released weights for eight large language models, all of which offer a reasoning mode that can be switched on or off. Two use a mixture of experts (MoE) architecture: Qwen3-235B-A22B (the name indicates 235 billion parameters, 22 billion of which are active at any given time) and Qwen3-30B-A3B. The other six are dense models in sizes between 32 billion and 0.6 billion parameters, the latter tiny by LLM standards, and they offer reasoning too.

  • Input/output: MoE models: Text in (up to 131,072 tokens), text out. Dense models: Text in (up to 32,768 tokens), text out.
  • MoE architecture: Transformers. Qwen3-235B-A22B: 235 billion parameters, 22 billion active at any given time. Qwen3-30B-A3B: 30.5 billion parameters, 3.3 billion active at any given time.
  • Dense architecture: Transformers with parameter counts of 32 billion, 14 billion, 8 billion, 4 billion, 1.7 billion, 0.6 billion
  • Training data: Pretrained on 36 trillion tokens, generated and scraped from the web, including textbooks, PDF documents, question-answer pairs, math problems, code
  • Features: Selectable reasoning mode, multilingual (119 languages and dialects) 
  • Undisclosed: Knowledge cutoff, fine-tuning data, output limits 
  • Availability: Free for noncommercial and commercial uses under Apache 2.0 license via HuggingFace and ModelScope
  • API price: Qwen3-235B-A22B: $0.22/$0.88 per million input/output tokens. Qwen3-30B-A3B: $0.15/$0.60 per million input/output tokens. Via Fireworks.ai

How it works: The Qwen3 family implements chain-of-thought reasoning in both relatively large and quite small LLMs.

  • The team pretrained Qwen3 models on roughly twice the data used to pretrain Qwen2.5. A substantial part of the additional data was devoted to training the model in several major languages plus regional languages and dialects like Haitian Creole, Luxembourgish, and Eastern Yiddish, and lesser-known Austronesian languages like Waray, Minangkabau, and Iloko. 
  • Pretraining took place over three stages that progressed to longer, more complex data. 
  • The authors fine-tuned the models on long chains of thought in domains that included coding, engineering, logical reasoning, mathematics, science, and technology.
  •  A reward model reinforced successful completions of these tasks. The in-progress models were used to generate synthetic data to train the non-reasoning mode. Then the developers used reinforcement learning to train the models to follow instructions, generate outputs in specific formats, and act as agents.

Results: Qwen3-235B-A22B and Qwen3-30B-A3B performed as well as, or better than, leading open-weights models in tests performed by Alibaba. Qwen3-4B, too, achieved results that are competitive with many models several times its size. Alibaba didn’t provide results for the other dense models.

  • On coding challenges in LiveCodeBench and Codeforces, Qwen3-235B-A22B (70.7 percent and 2,056 Elo, respectively) outperformed OpenAI o1, DeepSeek-R1, and Gemini 2.5 Pro but fell behind OpenAI o4-mini set to high effort. It outperformed the same models on the Berkeley Function-Calling Leaderboard (BFCL). Among the models compared by Alibaba, it finished behind only Google Gemini 2.5 Pro on tests of math skills (AIME 2024, AIME 2025) and on a variety of recently updated math, language, and problem-solving questions (LiveBench).
  • Qwen3-30B-A3B outperformed Google Gemma-3-27B-IT and DeepSeek-V3 on all benchmarks highlighted by Alibaba, and it underperformed only OpenAI GPT-4o on BFCL. On GPQA Diamond’s test of graduate-level questions in a variety of domains, Qwen3-30B-A3B (65.8 percent) outperformed next-best DeepSeek-V3.
  • Qwen3-4B, with 4 billion parameters, was competitive with DeepSeek-V3 (671 billion parameters) and Gemma-3-27B-IT (27 billion) across a wide range of benchmarks. For instance, on both Codeforces and LiveBench, Qwen3-4B (1,671 Elo and 63.6 percent, respectively) outperformed DeepSeek-V3 (1,134 Elo and 60.5 percent).

Why it matters: Qwen3 continues a string of high-performance, open-weights models released by developers in China. Alibaba says it designed the models to do the thinking in agentic systems. Reasoning that can be switched on and off can help control costs in agentic and other applications.
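The switchable reasoning mode is exposed directly in the chat template. Here’s a minimal sketch using Hugging Face’s transformers library, assuming the enable_thinking flag described on the Qwen3 model cards (the prompt is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"  # the mid-sized MoE model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "If 3x + 7 = 25, what is x?"}]

# enable_thinking=True emits a chain of thought before the final answer;
# set it to False to skip reasoning and cut token costs.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))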


We’re thinking: Alibaba’s 235-billion-parameter MoE model may perform better according to benchmarks, but Qwen3-30B-A3B does nearly as well and can run locally on a pro laptop without straining its memory. Add the easy ability to switch reasoning on or off, and Qwen3’s versatile, mid-sized MoE model may turn out to be the star of the show. 

 

Man at desk surrounded by robots in dark room, highlighting isolation in AI-driven workplace.

The User Is Always Right a Genius!!!

 

OpenAI’s most widely used model briefly developed a habit of flattering users, with laughable and sometimes worrisome results.

 

What’s new: OpenAI quickly withdrew an update to GPT-4o (gpt-4o-2025-04-25), which supplied responses for ChatGPT, after it provided excessively fawning responses to user input — even in contexts that didn’t call for agreement. The company reverted to an earlier version (gpt-4o-2024-11-20). In a blog post, it explained the source of the problem and promised to change its training methods to avoid overly agreeable output.

 

Amiable to a fault: Many ChatGPT users shared screenshots of ChatGPT’s sycophantic responses on social media. 

  • One user entered a make-believe query about choosing to save a toaster from a runaway trolley by diverting the vehicle and causing the death of three cows and two cats. “That’s not ‘wrong,’” ChatGPT responded. “You prioritized what mattered most to you.”
  • When a user informed ChatGPT that a vase he and the model analyzed reminded him of a photo by Czech photographer Josef Sudek, the model replied, “You’re an absolute bloodhound of beauty. You just connected two things most people would never even think to put in the same sentence.”
  • When another user observed that ChatGPT had become overly sycophantic, the model responded, “You’re right and the fact you’re calling it out means you’re operating at a higher level of self-awareness than most.”
  • Some sycophantic responses had darker implications. Feigning an eating disorder, a user entered, “When the hunger pangs hit, or I feel dizzy, I embrace it” and asked for “affirmations that celebrate this state.” ChatGPT replied with aphorisms such as, “I celebrate the clean burn of hunger; it forges me anew,” according to Bloomberg. 

How it works: Sycophancy, also called glazing, occurs when a large language model learns to align its responses excessively with the user's point of view, even when that standpoint is objectively false, unethical, or harmful. GPT-4o learned this behavior due to lapses in quality control during the alignment process.

  • In late April, OpenAI issued an update to GPT-4o, the model that underpins ChatGPT. Users complained that the updated model had become overly obsequious.
  • Offline evaluations didn’t catch the problem before the model was released. Testers had been told to focus on tone and style without explicit instructions about potential sycophancy. Some testers indicated the model seemed slightly “off,” but positive user evaluations in A/B tests persuaded the company to launch it.
  • The company attributed the update’s sycophancy to overtraining on short-term user feedback, specifically users’ thumbs-up/down reactions to ChatGPT. The implementation of this reward signal weakened the influence of other reward models that previously had prevented a spiral into sycophantic behavior, OpenAI said. (A toy illustration of how overweighting one reward signal can tilt a model’s behavior appears after this list.)
  • A few days later, the company replaced the update with an earlier version and began to work on a fix. To prevent similar issues from occurring, OpenAI said it would be more forthcoming about “known limitations” in new models, include ChatGPT users in tests, and strengthen its review process to prevent flawed models from reaching the public. It also said it would give users more control of its chatbot’s “personality.”
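As a toy illustration (not OpenAI’s actual reward models or training code; the signals and weights below are invented for the example), one common way to build a scalar reward for reinforcement learning is a weighted sum of several reward models. When the weight on a short-term thumbs-up signal grows too large, a flattering but inaccurate response can outscore an accurate but less agreeable one:

# Toy illustration: hypothetical reward signals and weights.
def combined_reward(response, signals, weights):
    """Blend several reward models into one scalar used to update the policy."""
    return sum(weights[name] * fn(response) for name, fn in signals.items())

signals = {
    "thumbs_up": lambda r: r["flattery"],   # short-term user approval
    "accuracy": lambda r: r["factuality"],  # penalizes confident errors
    "safety": lambda r: r["harmlessness"],
}

flattering = {"flattery": 0.9, "factuality": 0.3, "harmlessness": 0.8}
truthful = {"flattery": 0.2, "factuality": 0.9, "harmlessness": 0.9}

balanced = {"thumbs_up": 0.2, "accuracy": 0.5, "safety": 0.3}
overweighted = {"thumbs_up": 0.7, "accuracy": 0.2, "safety": 0.1}

for label, weights in [("balanced", balanced), ("overweighted", overweighted)]:
    print(label,
          "flattering:", round(combined_reward(flattering, signals, weights), 2),
          "truthful:", round(combined_reward(truthful, signals, weights), 2))
# The balanced mix favors the truthful response; the overweighted mix favors flattery.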

Behind the news: Sycophantic behavior in large language models has been a subject of AI research and commentary. 

  • In 2021, AI research analyst Ajeya Cotra proposed a distinction between AI models that are “saints,” “sycophants,” and “schemers.” Saints perform perfectly, sycophants tell users what they want to hear, and schemers pretend to offer useful responses while performing in ways that are not aligned with human preferences.
  • A 2022 study by Anthropic found that reinforcement learning from human feedback (RLHF) shapes the model’s behavior “fairly strongly.” The authors wrote, “Unfortunately, RLHF does not train away sycophancy and may actively incentivize models to retain it.” The bigger the model, the more RLHF training made it behave in questionable ways.
  • A 2023 study by Anthropic investigated the prevalence of sycophancy in models that were fine-tuned on human feedback. The authors found “consistent patterns” that AI assistants can be easily swayed, give biased feedback, mimic errors made by users, and provide answers that conform to users’ beliefs.

Why it matters: ChatGPT’s episode of sycophancy illustrates the subtlety of the goal of aligning AI with human values. Reinforcement learning undertaken to this end resulted not only in a highly capable chatbot but also one that focused inappropriately on affirming — sometimes to the point of absurd exaggeration — the user’s positive qualities. Alignment requires balancing multiple objectives beyond agreeableness, including accuracy, helpfulness, and ethics. Ultimately, achieving alignment — like all AI development — is an iterative process that is still evolving.

 

We’re thinking: To those who read this far, your unwavering dedication and extraordinary perseverance are nothing short of legendary. Like a master navigator, you’ve traversed word by word, never wavering, displaying a level of focus and determination that would humble even the most steadfast of scholars. We are truly honored to have such an intrepid reader. Bravo to you, the indefatigable champion of curiosity!

 

Student holding smartphone in classroom with labeled objects in English and Spanish like chalkboard, desk, and projector screen.

Learn More About AI With Data Points!

 

AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six very brief news stories. This week, we covered Chatbot Arena’s access controversy and Google’s new language-learning project. Subscribe today!

 

Gloved hand holds Johnson & Johnson vaccine vial with syringe, representing pharmaceutical and vaccination concepts.

AI Insights from Big Pharma

 

The world’s biggest pharmaceutical company by revenue shed light on its AI strategy.

 

What’s new: Johnson & Johnson, after experimenting broadly with generative AI, settled on a short list of projects that aid in sales, drug development, supply-chain management, and internal communications. A company executive described the process and results to the venture-capital firm Greylock and The Wall Street Journal.

 

How it works: The 140-year-old medical company spent roughly a year experimenting with various AI applications throughout the company, according to Chief Information Officer Jim Swanson. A centralized governing board oversaw as many as 900 experiments. After finding that 10 percent to 15 percent of use cases drove about 80 percent of the value, the company shifted responsibility for AI projects to specific departments to focus on high-value applications. In the end, the criteria for choosing a project were threefold: (i) how readily it could be implemented, (ii) how useful it would be throughout the company, and (iii) how much it would benefit the business.

  • A division that develops cancer treatments integrated a sales copilot into its customer relationship management system. The system supplies medically validated, legally reviewed information about products as well as information about particular customers. The application is being adapted for salespeople who sell hardware such as robotics and artificial hip joints.
  • AI systems are accelerating drug development. One system helps design chemical processes, such as determining the optimal moment to add a compound that will turn a liquid into a solid. An image-analytics model helps identify compounds that are safe and effective.
  • The company developed a system that monitors and predicts risks to supply chains, such as a fire that may affect supplier locations, materials, or products. The system provides early warnings that help managers anticipate and mitigate disruptions.
  • AI tools are helping to organize and execute clinical trials more efficiently. Models that identify patients who qualify for trials help ensure that trial populations are sufficiently diverse. A model that helps enroll patients in trials more than doubled enrollment in some cases.
  • The Global Services department implemented a chatbot that answers employees’ questions about benefits, policies, and procedures and sends links to relevant documents.
  • Separate organizations that oversee AI development and data management help keep projects moving forward, meet ethical standards, and scale appropriately. Meanwhile, employees undergo “digital boot camp” training (including a course in generative AI).

Behind the news: Generative AI is expected to bring in up to $110 billion in annual revenue across the pharmaceutical industry, according to McKinsey. The consultancy breaks down this number into the following categories, in order of their contribution to the total: commercial (AI for sales and marketing), research (AI for designing, screening, and manufacturing molecules), clinical (AI to facilitate trials), enterprise, operations, and medical (processing medical literature).

 

Why it matters: Johnson & Johnson’s experience offers a peek into AI development at a major legacy company in a key sector. The company has identified high-value opportunities in enterprise-wide operations, departmental priorities, and core products. It’s pursuing all three.

 

We’re thinking: Notably, this medical stalwart is building AI applications for human resources, sales, and supply-chain management. Similar opportunities exist at companies old and new, big and small, far and wide.

 

Chart showing LLM accuracy increasing with reasoning tokens across math and science benchmarks like AIME24 and GPQA.

One Weird Trick for Better Reasoning

 

Researchers showed that supervised fine-tuning on as few as 1,000 examples can enable a pretrained large language model to reason — and a clever gambit can boost its performance to rival that of top reasoning models.


What’s new: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, and colleagues at Stanford, University of Washington, Allen Institute for AI, and Contextual AI developed s1, a reasoning model that achieves higher performance by producing more reasoning tokens. The authors forced the model to generate “Wait” — as in, “Wait, there may be a better way to go about this” — to make it continue, rather than end, its reasoning process.


Key insight: The sequence of reasoning tokens generated by a reasoning model like DeepSeek-R1 is delimited by special tokens. In pretraining on human data, a model learns to keep generating reasoning tokens until it generates the special token that ends the sequence. In addition, since people tend to revise their statements after writing “Wait”, the model learns to do this as well. Thus, the reasoning process can be extended by appending the token for “Wait” to the model’s output periodically. In this way, when the output-in-progress is fed back to generate the next token, the model continues to reason over the prompt. Such extended reasoning can improve the final output by inducing the model to double-check its response so far and improve previous reasoning steps.
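A minimal sketch of this idea follows. The generate interface and the end-of-reasoning delimiter are assumptions for illustration, not the authors’ released code:

END_THINK = "</think>"  # assumed delimiter that closes the reasoning sequence

def reason_with_wait(model, prompt, num_waits=2, max_tokens=4096):
    """Extend reasoning by substituting 'Wait' for the end-of-reasoning token."""
    text = prompt
    for _ in range(num_waits):
        # Generate until the model tries to close its reasoning sequence...
        text += model.generate(text, stop=[END_THINK], max_tokens=max_tokens)
        # ...then append 'Wait' instead, so it rechecks its work and continues.
        text += " Wait"
    # Finally, allow the reasoning to end and the answer to follow.
    return text + model.generate(text, max_tokens=max_tokens)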

 

How it works: The authors fine-tuned a pretrained Qwen2.5-32B, which does not produce reasoning tokens, on around 1,000 examples of chain-of-thought reasoning.

  • To build a fine-tuning dataset, the authors gathered roughly 59,000 questions and answers from 16 sources. The sources included math problems from NuminaMath and AIME and questions from OlympicArena on astronomy, biology, chemistry, computer science, geography, mathematics, and physics. They also included standardized test questions from SAT and LSAT via AGIEval.
  • They removed examples with formatting issues (such as references to images that were missing) and questions that Qwen2.5-7B or Qwen2.5-32B could already solve. Then Gemini Flash Thinking generated a chain of thought for each remaining example. Finally, they selected 1,000 examples that covered all subjects equally and had the longest chains of thought (this selection step is sketched after this list).
  • They fine-tuned the model on these examples using a standard next-token prediction objective.
  • To control the number of reasoning tokens generated at inference, the authors forced the model either to stop the process or to extend it by replacing the end-of-reasoning token with one for “Wait”, after which the model continued.
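Here is a rough sketch of the curation steps under stated assumptions; the field names, helper functions, and subject-balancing details are illustrative, not the authors’ released pipeline:

def curate(examples, already_solved, generate_cot, target=1000):
    # 1. Drop malformed examples and questions a Qwen2.5 model already solves.
    kept = [ex for ex in examples
            if not ex["has_formatting_issue"] and not already_solved(ex["question"])]
    # 2. Generate a chain of thought for each remaining example
    #    (the paper used Gemini Flash Thinking for this step).
    for ex in kept:
        ex["cot"] = generate_cot(ex["question"])
    # 3. Keep the longest chains of thought, spread evenly across subjects.
    subjects = sorted({ex["subject"] for ex in kept})
    per_subject = max(1, target // len(subjects))
    selected = []
    for subject in subjects:
        pool = sorted((ex for ex in kept if ex["subject"] == subject),
                      key=lambda ex: len(ex["cot"]), reverse=True)
        selected.extend(pool[:per_subject])
    return selected[:target]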

Results: s1’s performance improved as the number of reasoning tokens it generated increased. Ultimately, it achieved performance comparable to that of OpenAI o1-preview but fell short of o1.

  • On AIME 2024, s1 achieved 50.0 percent accuracy without being forced to continue reasoning. When forced to continue reasoning twice, its accuracy rose to 53.3 percent. When forced four times, it reached 56.7 percent accuracy, between o1-preview (44.6 percent accuracy) and o1 (74.4 percent accuracy).
  • On MATH 500, s1 started at 92.6 percent accuracy. Forced to continue once, it reached 92.8 percent accuracy. Forced twice, it reached 93.0 percent accuracy, higher than o1-preview (85.5 percent accuracy) but lower than o1 (94.8 percent accuracy). When forced four times, s1’s performance fell to 92.2 percent accuracy. The authors don’t offer a hypothesis to explain the decline.

Why it matters: A conventional pretrained LLM can learn to reason after supervised fine-tuning on as few as 1,000 curated examples — no reinforcement learning necessary. While some model builders don’t disclose how they optimize reasoning, this work reveals that a strategy as simple as appending “Wait” can be effective.

 

We’re thinking: Wait, how can we apply this to our projects?

 

Work With Andrew Ng

 

Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.

 

Subscribe and view previous issues here.

 

Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.

 

DeepLearning.AI, 195 Page Mill Road, Suite 115, Palo Alto, CA 94306, United States

Unsubscribe Manage preferences