Firm Says Latest Model Hallucinates Less, Scores Better on Benchmarks

OpenAI’s unveiling of its newest model arrived wrapped in the big-claim language now standard in the generative artificial intelligence race. The company calls GPT-5 its “smartest, fastest, most useful model yet.” In 2025, those superlatives are table stakes: every major lab has a headline model, and each promises to out-think, out-speed and out-adapt the others.
On the numbers OpenAI has shared, GPT-5 is a measurable step forward from its predecessors. The flagship Pro version tops the Graduate-Level Google-Proof Question Answering benchmark at 88.4% without tools, a score the company says is higher than GPT-4o’s. It also reports that GPT-5 has reduced sycophancy, its term for over-agreeable or echo-chamber-style answers, from 14.5% to less than 6%.
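For readers weighing the sycophancy figures, the drop can be read either in percentage points or as a relative reduction. The short Python sketch below works through both using the reported numbers. It assumes 6% as the upper bound for "less than 6%," so the relative figure is a floor rather than an exact value, and the function name is illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fraction by which a rate fell, relative to its starting value."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Reported sycophancy rates, in percent. 6.0 is an assumed upper bound,
# since the company says only "less than 6%".
before, after = 14.5, 6.0

absolute_drop = before - after                      # percentage points
relative_drop = relative_reduction(before, after)   # fraction of the old rate

print(f"{absolute_drop:.1f} percentage points, {relative_drop:.1%} relative")
# prints "8.5 percentage points, 58.6% relative"
```

On those figures, the rate fell by at least 8.5 percentage points, or roughly 59% of its earlier level.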
On the coding front, OpenAI calls GPT-5 its strongest coding model yet, scoring 74.9% on SWE-bench Verified and 88% on the Aider Polyglot benchmark. That’s a narrow lead over Anthropic’s newly released Claude Opus 4.1, which scored 74.5% on SWE-bench Verified. The company says GPT-5 can complete complex coding tasks end-to-end with minimal prompting and even generate full interface designs for non-coders.
OpenAI is also touting GPT-5’s performance on domain-specific benchmarks. In health, the model scored 46.2% on HealthBench Hard, a benchmark the company devised. OpenAI pitches GPT-5 as its most capable health model yet, with the caveat that “ChatGPT does not replace a medical professional.” The AI can help interpret medical results and suggest questions for healthcare providers, though the risks of relying on predictive systems prone to telling users what they want to hear remain.
Accuracy is another talking point: OpenAI claims GPT-5 with web search enabled is about 45% less likely to produce factual errors than GPT-4o, and, when in “thinking” mode, about 80% less likely than o3. On long-form content benchmarks, “thinking” GPT-5 shows around six times fewer confabulations than o3, although the company concedes no AI system is immune to plausible-sounding mistakes.
But benchmarks are only one lens on progress. OpenAI is also positioning GPT-5 as a more integrated, adaptable part of a user’s workflow. The model comes in three versions: Pro for the most demanding tasks; mini for faster and lighter work; and nano for constrained or embedded contexts. Free-tier users will get GPT-5 mini until they hit usage caps, at which point they may drop to a smaller model. Paying subscribers still get Pro for $20 a month and developers can tap into all three through the API under the existing per-token pricing.
The developer story has evolved too. GPT-5’s “Actions” system builds on earlier function calling, giving applications more control over when and how the model invokes external tools. For companies building AI-powered products, this means GPT-5 can be wired into proprietary APIs with a clearer, more controllable execution pipeline. That’s not the same thing as allowing the model to run free as an autonomous agent, but it pushes the technology toward multi-step reasoning and task completion without constant user intervention.
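To make the idea of application-controlled tool use concrete, here is a minimal Python sketch of the general pattern: the model proposes a tool call, and the application decides whether to run it. It does not call OpenAI’s API and does not reflect the actual “Actions” interface. The JSON shape, the `lookup_order` tool and the `dispatch` function are all hypothetical, chosen only for illustration.

```python
import json
from typing import Any, Callable, Dict


def lookup_order(order_id: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a company's proprietary API."""
    return {"order_id": order_id, "status": "shipped"}


# The application, not the model, decides which tools may ever run.
ALLOWED_TOOLS: Dict[str, Callable[..., Any]] = {"lookup_order": lookup_order}


def dispatch(tool_call_json: str) -> Dict[str, Any]:
    """Run a model-proposed tool call only if it is on the allowlist.

    The {"name": ..., "arguments": {...}} shape is an assumption for this
    sketch, not a documented format.
    """
    call = json.loads(tool_call_json)
    name = call.get("name")
    args = call.get("arguments", {})

    tool = ALLOWED_TOOLS.get(name)
    if tool is None:
        return {"ok": False, "error": f"tool not allowed: {name}"}
    try:
        return {"ok": True, "result": tool(**args)}
    except TypeError as exc:
        # Raised when the proposed arguments do not match the tool's signature.
        return {"ok": False, "error": str(exc)}


print(dispatch('{"name": "lookup_order", "arguments": {"order_id": "A123"}}'))
print(dispatch('{"name": "delete_account", "arguments": {}}'))
```

In this sketch the first call runs because `lookup_order` is on the allowlist, while the second is refused before any code executes. That kind of gate is what separates a controlled pipeline from a free-running agent.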
OpenAI’s smallest GPT-5, the nano model, is designed for contexts where bandwidth, latency or hardware constraints matter. The company is pitching it as a way to run certain interactions more efficiently, though it has not claimed full offline operation on consumer devices. The push toward size-optimized models marks an industry-wide shift, making the AI smaller and closer to the user, without losing too much of its intelligence.
OpenAI’s announcement comes at a time when Anthropic, Google DeepMind and Meta are all pursuing their own approaches to reasoning, speed and scale, while regulators across regions scrutinize how deeply these systems reach into communications, documents and other personal data. Whether OpenAI can demonstrate that GPT-5 operates in that territory without missteps will determine how well its “most useful” claim holds up.