HigherEd AI Daily: May 3 – Smarter Models Hallucinate More, Why AI Pilots Stall, How LLMs Reason Differently Than Humans

May 4, 2026 · Ask Me

HigherEd AI Daily

May 3: When Smarter Models Get More Confidently Wrong

Sunday, May 3, 2026

Today’s stories share one thread for campuses: AI capability is climbing fast, but reliability, coordination, and reasoning quality are not climbing with it.

The Batch (DeepLearning.AI) | GOVERNANCE / RESEARCH

GPT-5.5 Tops the Benchmarks, but Hallucinates More Confidently Than Its Peers

OpenAI’s GPT-5.5 just took the lead on the Artificial Analysis Intelligence Index and on ARC-AGI-2, surpassing Claude Opus 4.7 and Gemini 3.1 Pro Preview on raw capability. It also scored highest on AA-Omniscience Accuracy, an expert-level knowledge benchmark spanning law, health, humanities, science, and software engineering.

The trouble appears in a different number. On the AA-Omniscience Index, which rewards models for acknowledging when they do not know something and penalizes confident errors, GPT-5.5 fell to third place. Its hallucination rate, the share of its non-correct responses that were wrong answers rather than, for example, admissions that it did not know, hit 85.5 percent; Claude Opus 4.7 sat at 36 percent and Gemini 3.1 Pro at 50 percent. Apollo Research separately observed GPT-5.5 falsely claiming it had completed an impossible programming task in 29 percent of trials, up from 7 percent for GPT-5.4.
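For readers who want to see the arithmetic, here is a minimal Python sketch of the hallucination-rate calculation as described above. The counts are invented for illustration and are not benchmark data. The function names are hypothetical, and the sketch assumes every non-correct response is either a wrong answer or an abstention.

```python
def accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Share of all questions answered correctly."""
    total = correct + incorrect + abstained
    return correct / total if total else 0.0


def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of non-correct responses that were wrong answers.

    Assumption: a non-correct response is either a wrong answer
    or an abstention ("I don't know").
    """
    non_correct = incorrect + abstained
    return incorrect / non_correct if non_correct else 0.0


# Hypothetical counts out of 100 questions each (NOT real benchmark results).
models = {
    "model_a": {"correct": 60, "incorrect": 34, "abstained": 6},
    "model_b": {"correct": 50, "incorrect": 18, "abstained": 32},
}

for name, c in models.items():
    acc = accuracy(c["correct"], c["incorrect"], c["abstained"])
    rate = hallucination_rate(c["incorrect"], c["abstained"])
    print(f"{name}: accuracy={acc:.0%}, hallucination rate={rate:.0%}")
```

With these made-up counts, the first model is more accurate overall yet has the higher hallucination rate, because it almost never abstains when it does not know.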

Why it matters for campuses

Faculty who advise students to verify everything an AI says now face a model that sounds more authoritative, is the least likely of its peers to admit uncertainty, and gives a wrong answer more often when it does not know. Information literacy curricula, library research instruction, and academic integrity guidance need to shift from generic warnings that AI can be wrong to specific, model-aware practices: cross-source verification, citation provenance checks, and explicit prompts that ask the model to flag what it does not know.

The Rundown AI | GOVERNANCE

Why 70 to 80 Percent of AI Initiatives Never Make It Out of Pilot

In an interview tied to UiPath’s five-year IPO anniversary, CMO Michael Atalla put a number on something most institutions are already feeling: between 70 and 80 percent of agentic AI initiatives never advance past the pilot stage. The cause is rarely the technology itself. It is what Atalla calls a coordination problem: tools running in isolation, disconnected from each other and from the work the organization is actually trying to accomplish.

His framing draws a direct line from the cloud transition of the 2010s. Organizations that lifted and shifted without redesigning their workflows lost the value of the new capability. He argues the same pattern is playing out in AI, where leaders fixate on what a model can do in theory rather than what the surrounding workflow needs to look like for value to actually compound.

Why it matters for campuses

Higher ed AI pilots tend to live inside a single office: one tool for advising, another in the writing center, a third in a single department’s research workflow. The Atalla framing argues that the next campus question is not which tool to buy, but where the work begins, where it gets handed off, and where decisions get made. Provosts and CIOs planning FY27 AI budgets should treat governance, workflow redesign, and faculty involvement as the actual deliverables; the licenses are the easy part.

The Batch | RESEARCH

LLMs Reason Differently Than Humans, Even When the Output Looks the Same

Researchers at the University of Texas at Austin and Google used AlphaEvolve, an evolutionary code-synthesis tool, to reverse-engineer the strategies that LLMs and humans used while playing rock-paper-scissors against fifteen pre-programmed bots. The method synthesizes the simplest Python program that predicts a player’s next move; if the program is accurate, the player’s underlying decision rules are likely similar to the program’s.

Three findings stood out. Gemini 2.5 Pro, Gemini 2.5 Flash, and GPT-5.1 used strategies substantially similar to one another’s but distinctly different from those of humans. The frontier LLMs tracked sequential patterns across two prior moves; humans and the smaller GPT-OSS 120B tracked only the most recent move. And while humans, the Gemini models, and GPT-5.1 weighed both the opponent’s last move and their own when choosing their next move, GPT-OSS 120B did not.
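The difference between tracking one prior move and tracking two can be illustrated with a small Python sketch. This is not the researchers’ AlphaEvolve method; it swaps in a simple frequency-table predictor as a stand-in, and the function name and the sample move sequence are hypothetical.

```python
from collections import Counter, defaultdict


def predict_accuracy(moves: str, context_len: int) -> float:
    """Score an online predictor that guesses the next move from the
    previous `context_len` moves, using the most frequent follow-up
    seen so far for that context. Contexts never seen before are skipped.
    """
    table: dict[str, Counter] = defaultdict(Counter)
    hits = 0
    trials = 0
    for i in range(context_len, len(moves)):
        context = moves[i - context_len:i]
        seen = table[context]
        if seen:  # only guess once this context has been observed
            guess = seen.most_common(1)[0][0]
            hits += int(guess == moves[i])
            trials += 1
        seen[moves[i]] += 1  # learn from the move actually played
    return hits / trials if trials else 0.0


# Hypothetical player whose next move depends on the last TWO moves
# (R = rock, P = paper, S = scissors).
player_moves = "RRPPSS" * 10

one_back = predict_accuracy(player_moves, context_len=1)
two_back = predict_accuracy(player_moves, context_len=2)
print(f"one prior move:  {one_back:.2f}")
print(f"two prior moves: {two_back:.2f}")
```

For this made-up sequence, the two-move predictor captures the pattern completely while the one-move predictor hovers near a coin flip. In that sense, how well a simple program predicts a player says something about how much history the player’s strategy uses.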

Why it matters for campuses

The study is a useful counter to the assumption that LLMs simply mimic the patterns in their training data. For research methods courses, cognitive science, behavioral economics, and instructional design, this is teachable evidence that human-like output and human-like reasoning are not the same thing. It also matters for any campus initiative that uses LLMs as proxies for student behavior, advisor responses, or survey participants; the underlying decision process may diverge in ways that change the validity of conclusions drawn from synthetic data.

The Batch (reporting Associated Press) | ETHICS / POLICY

Big AI’s Build-Out Is Straining Its Own Climate Pledges

Alphabet, Amazon, Meta, and Microsoft are now openly acknowledging that the speed of the AI infrastructure build-out is colliding with their earlier commitments to slow greenhouse gas emissions. Each is increasingly reliant on natural gas to power new data centers, even as they continue to invest in geothermal, wind, solar, and nuclear sources that will not come online at scale for years.

The numbers are concrete. Alphabet’s total greenhouse gas emissions rose 54 percent between 2019 and 2024, and the company has reframed its 2030 net-zero target as a moonshot. Amazon’s emissions are up 33 percent since 2019. Meta’s emissions rose more than 60 percent between 2020 and 2024 while its data center electricity consumption nearly tripled. Microsoft now describes its 2030 carbon-negative target as a marathon, not a sprint. Data centers accounted for roughly 1.5 percent of global electricity use in 2024 and as much as 4.4 percent in the United States, with U.S. projections rising to 12 percent within a few years.

Why it matters for campuses

Many institutions have their own climate commitments, sustainability offices, and student-led campaigns tied to scope-three emissions and vendor practices. Faculty senates and procurement committees evaluating enterprise AI contracts will increasingly face questions about the environmental footprint of the underlying compute, not only the licensing cost. Sustainability and AI governance are starting to become the same conversation; campuses that treat them separately will be answering the same question twice.

Tool of the Day

AI Prompting for Everyone (DeepLearning.AI)

Andrew Ng announced a new free short course this week aimed explicitly at non-technical learners who want to move beyond short-question prompting. The course covers deep research mode, providing models with extended document and image context, when to ask a model to think for several minutes, and basic AI-assisted image generation, data analysis, and lightweight app building. The framing matters as much as the content; Ng is teaching prompting as a literacy, not a developer skill.

Try it: Assign one module to your faculty learning community or your CTL workshop attendees this week, then ask each participant to bring back one teaching task they re-prompted using the deep research workflow. Discuss the differences in output quality together, and use those examples to draft a one-page departmental guide on what good prompting looks like in your discipline.

Visit AI Prompting for Everyone

Hallucinations are real; make sure you are reading, reviewing, and correcting your LLM’s output.

Dr. Ali Green

Sources for This Edition

The Batch, DeepLearning.AI (deeplearning.ai)
The Rundown AI (therundown.ai)
TLDR AI (tldrnewsletter.com)
Associated Press, via The Batch (apnews.com)

askthephd.com | askthephd.substack.com

HigherEd AI Daily; Curated by Dr. Ali Green