GPT-5 Just Landed. The Benchmarks Tell an Interesting Story.

GPT-5 scores 94.6% on AIME 2025 and 74.9% on SWE-bench Verified. After three weeks of testing, here's what the benchmarks mean in practice.

Key Takeaways
• GPT-5 scores 94.6% on AIME 2025 and 74.9% on SWE-bench Verified, setting new benchmarks for math and coding
• The 400K context window handles entire codebases and long documents without chunking
• API pricing starts at $1.25/$10 per million input/output tokens — cheaper than o3 for equivalent quality
• GPT-5 uses 50-80% fewer tokens than o3 while matching or beating its performance
• Real-world testing reveals strengths in structured reasoning but occasional overconfidence in factual claims

What Actually Changed from GPT-4o

OpenAI released GPT-5 on August 7, 2025, and the short version is this: it's the first model that genuinely feels like a generational leap rather than an incremental update. GPT-4o was fast and capable. GPT-5 is something different — a reasoning model that thinks before it answers.

The biggest architectural change is native reasoning. Previous models like o3 bolted reasoning on top of the base GPT-4 architecture. GPT-5 was trained from the ground up as a reasoning model, which means it allocates compute more efficiently. OpenAI claims 50-80% fewer output tokens compared to o3 for equivalent task quality, and my testing confirms that range is roughly accurate.

Three other changes matter in practice:

  • 400K token context window — up from 128K in GPT-4o, roughly 600 pages of text
  • Native image understanding — not just OCR, but genuine visual reasoning over charts, diagrams, and screenshots
  • Adjustable reasoning effort — four levels (minimal, low, medium, high) that let you trade speed for accuracy depending on the task

The reasoning effort control is more useful than it sounds. For simple questions, minimal effort returns answers in under a second. For complex math proofs or multi-step code analysis, high effort takes 30-90 seconds but produces substantially better results. You're not paying for reasoning you don't need.
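That trade-off maps naturally onto a small helper. The sketch below is illustrative: the four effort levels come from this article, but the task categories, the `pick_effort` heuristic, and the commented-out Responses API call (with its `reasoning={"effort": ...}` parameter) are assumptions you should verify against OpenAI's current SDK.

```python
# Sketch: choose a reasoning-effort level per task type, then pass it to the API.
EFFORT_LEVELS = ("minimal", "low", "medium", "high")

def pick_effort(task: str) -> str:
    """Map a rough task category to a reasoning-effort level (illustrative)."""
    table = {
        "lookup": "minimal",     # simple factual questions: sub-second answers
        "summarize": "low",
        "code_review": "medium",
        "proof": "high",         # math proofs, multi-step code analysis
    }
    return table.get(task, "medium")  # default to the middle of the dial

# Hedged usage sketch (requires the openai package and an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.responses.create(
#     model="gpt-5",
#     reasoning={"effort": pick_effort("proof")},
#     input="Prove that the square root of 2 is irrational.",
# )
```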

The Benchmark Numbers

Let me be direct about what the benchmarks show and where they don't tell the full story.

| Benchmark | GPT-5 (high) | GPT-4o | o3 | What It Measures |
|---|---|---|---|---|
| AIME 2025 | 94.6% | 26.7% | 88.9% | Competition math |
| SWE-bench Verified | 74.9% | 33.2% | 69.1% | Real-world bug fixing |
| Aider Polyglot | 88.0% | 72.1% | 79.6% | Multi-language coding |
| MMMU | 84.2% | 69.1% | 78.8% | Multimodal understanding |
| GPQA Diamond | 81.6% | 53.6% | 79.7% | Graduate-level science |
| HealthBench Hard | 46.2% | 22.1% | 39.8% | Medical reasoning |

The AIME score is the headline number, and it deserves attention. Scoring 94.6% on competition-level math problems without tools puts GPT-5 in territory that would qualify for the USA Math Olympiad. That's not marketing — it's a genuine capability threshold.

SWE-bench Verified at 74.9% is equally impressive. This benchmark tests whether a model can fix real bugs in real open-source repositories. Going from 33.2% (GPT-4o) to 74.9% means GPT-5 can now resolve roughly three out of four real-world coding issues autonomously.

The number I find most interesting, though, is the professional knowledge benchmark. According to OpenAI's internal evaluation across 44 different occupations, GPT-5 with reasoning matches or exceeds human professionals about half the time. That includes law, logistics, sales, and engineering tasks.

Real-World Testing: Three Weeks In

Benchmarks measure specific capabilities under controlled conditions. I wanted to know how GPT-5 performs on the messy, ambiguous tasks I actually do every day.

Coding: Genuinely Useful for Production Work

I fed GPT-5 a 2,000-line TypeScript codebase with a subtle race condition in its WebSocket handler. The model identified the bug on the first attempt, explained the exact sequence of events that triggered it, and produced a fix that passed all existing tests. GPT-4o missed the race condition entirely and suggested refactoring the error handling instead.

For code generation, GPT-5 writes cleaner, more idiomatic code than its predecessors. It understands project conventions better — if your codebase uses a particular error handling pattern, GPT-5 follows it without being told. I noticed fewer instances of the model ignoring existing abstractions to create its own.

Where it still struggles: large-scale architectural decisions. Ask GPT-5 to design a microservices architecture from scratch and you'll get something competent but generic. The model is better as a senior developer who writes excellent code than as a staff engineer who makes system-level decisions.

Writing: Better but Still Detectable

GPT-5 produces noticeably more natural prose than GPT-4o. The default writing style has fewer of those telltale AI patterns — less "delving into" things, fewer unnecessary transitions, more varied sentence structure. But it still defaults to a particular cadence that experienced readers will recognize.

The improvement shows most in technical writing. Give GPT-5 a complex topic and a target audience, and it produces explanations that actually match the requested level. Previous models tended to drift toward either oversimplification or unnecessary jargon regardless of instructions.

Analysis: This Is Where It Gets Interesting

The reasoning capabilities shine brightest in analysis tasks. I gave GPT-5 a quarterly financial report and asked it to identify the three most concerning trends. It caught a working capital deterioration that I had missed in my own reading, backed by specific line items from the balance sheet.

For data analysis, the combination of reasoning and the 400K context window means you can paste entire datasets and get meaningful insights without pre-processing. I analyzed a 50,000-row CSV (about 180K tokens) in a single prompt and got accurate summary statistics, trend identification, and anomaly detection.

The 400K Context Window in Practice

The jump from 128K to 400K tokens sounds incremental on paper. In practice, it crosses a threshold that changes how you use the model.

With 128K tokens, you're constantly managing context. You chunk documents, summarize intermediate results, and maintain external state. With 400K tokens, you can fit:

  • An entire medium-sized codebase (15,000-20,000 lines)
  • A full technical specification plus implementation code
  • Multiple research papers for cross-reference analysis
  • A complete book manuscript for editing
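A quick way to sanity-check whether a load like this will fit is a rough token estimate. The sketch below uses the common ~4 characters-per-token heuristic rather than a real tokenizer, so treat its answer as an estimate; the `CONTEXT_LIMIT` constant and the `reserve` left for the model's reply are assumptions.

```python
# Rough fit-check: will a set of files fit in a 400K-token context window?
CONTEXT_LIMIT = 400_000
CHARS_PER_TOKEN = 4  # heuristic average for English text and code

def estimate_tokens(text: str) -> int:
    """Crude token estimate; a real tokenizer (e.g. tiktoken) is more accurate."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(files: dict[str, str], reserve: int = 20_000) -> bool:
    """True if the concatenated files leave `reserve` tokens for the response."""
    total = sum(estimate_tokens(body) for body in files.values())
    return total + reserve <= CONTEXT_LIMIT

# e.g. two files totaling ~48K characters -> ~12K estimated tokens
files = {"app.py": "x" * 40_000, "util.py": "y" * 8_000}
print(fits_in_context(files))  # True
```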

I tested needle-in-a-haystack retrieval at various depths within the 400K window. GPT-5 maintained strong retrieval accuracy up to about 350K tokens, with slight degradation beyond that point. At 390K tokens, retrieval accuracy dropped to approximately 85% — still usable, but you should keep critical information away from the very edges of the context.
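For readers who want to reproduce a test like this, here is a minimal sketch of the haystack construction: pad filler text out to a target token depth, embed a unique "needle" fact, then prompt the model to retrieve it. The filler sentence, needle text, and chars-per-token heuristic are all placeholders, not the exact harness I used.

```python
# Build a long document with a known fact buried at a chosen token depth.
CHARS_PER_TOKEN = 4  # rough heuristic; a real harness should use a tokenizer

def build_haystack(depth_tokens: int, total_tokens: int, needle: str) -> str:
    """Return filler text of ~total_tokens with `needle` at ~depth_tokens."""
    filler = "The sky was a pale shade of grey that morning. "

    def pad(n_tokens: int) -> str:
        chars = n_tokens * CHARS_PER_TOKEN
        return (filler * (chars // len(filler) + 1))[:chars]

    return pad(depth_tokens) + "\n" + needle + "\n" + pad(total_tokens - depth_tokens)

needle = "The secret launch code is 7421."
doc = build_haystack(depth_tokens=350_000, total_tokens=390_000, needle=needle)
assert needle in doc
# The retrieval prompt then becomes: doc + "\nWhat is the secret launch code?"
```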

The practical impact is significant. I loaded an entire Next.js project (12,000 lines across 40 files) into a single context and asked GPT-5 to find all API endpoints that lacked rate limiting. It correctly identified 7 out of 8 unprotected endpoints and proposed middleware implementations for each one.

Pricing and Model Variants

OpenAI released three GPT-5 variants on launch day:

| Model | Price (per 1M input / output tokens) | Best For |
|---|---|---|
| GPT-5 | $1.25 / $10 | Full reasoning model. Best for complex tasks requiring deep analysis. |
| GPT-5 mini | $0.30 / $1.25 | Lighter model for simpler tasks. 8x cheaper than full GPT-5. |
| GPT-5 Pro | $2.50 / $15 | Extended reasoning with higher compute. For research and hard problems. |

The pricing is competitive. GPT-5 at $1.25 input / $10 output per million tokens is actually cheaper than o3 was at launch, while delivering better results. For most use cases, GPT-5 mini at $0.30 / $1.25 is the better value — it handles 80% of tasks at a fraction of the cost.

ChatGPT Plus subscribers ($20/month) get GPT-5 access with usage limits. Pro subscribers ($200/month) get unlimited GPT-5 and GPT-5 Pro access. The free tier gets limited GPT-5 mini access.

Cost Comparison for Common Tasks

| Task Type | GPT-5 Cost | o3 Cost | GPT-4o Cost |
|---|---|---|---|
| 1,000-word blog post | $0.015 | $0.042 | $0.008 |
| Code review (500 lines) | $0.025 | $0.068 | $0.012 |
| Document analysis (50 pages) | $0.180 | $0.520 | $0.085 |
| Complex reasoning task | $0.095 | $0.310 | N/A |

The pattern is clear: GPT-5 costs roughly 60-70% less than o3 for reasoning-heavy tasks, while being slightly more expensive than GPT-4o for simple generation. If you were already paying for o3, GPT-5 is a straight upgrade at lower cost.
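The arithmetic behind these figures is simple: tokens divided by one million, times the per-million rate, summed over input and output. Here is a minimal sketch using the GPT-5 and GPT-5 mini rates quoted above; the per-task token counts are my own rough assumptions, so the results only approximate the table's figures.

```python
# Per-task cost from token counts and $-per-1M-token rates (from this article).
PRICES = {
    "gpt-5": (1.25, 10.00),       # (input rate, output rate)
    "gpt-5-mini": (0.30, 1.25),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the model's per-1M-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# e.g. a 1,000-word blog post: ~200 prompt tokens, ~1,350 output tokens (assumed)
blog_post = task_cost("gpt-5", 200, 1350)
# blog_post is roughly $0.014, in line with the ~$0.015 figure in the table
```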

GPT-5 vs Claude vs Gemini

At the time of this review, GPT-5's direct competitors are Claude 3.5 Sonnet (Anthropic) and Gemini 1.5 Pro (Google). Here's how they compare in practice:

| Category | GPT-5 | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|
| Math/Reasoning | Best in class | Strong | Good |
| Coding | Excellent | Excellent | Good |
| Writing Quality | Good | Best in class | Adequate |
| Context Window | 400K | 200K | 1M |
| Speed | Moderate | Fast | Fast |
| Instruction Following | Very Good | Excellent | Good |


Choose GPT-5 if you need the strongest reasoning capabilities, work primarily with math or science problems, or want adjustable reasoning effort for different task types.

Choose Claude if you prioritize coding assistance, natural writing quality, or need a model that follows complex instructions precisely.

Choose Gemini if you need the largest context window (1M tokens), work heavily with Google's product suite, or need strong multilingual capabilities.

My honest take: GPT-5 takes the lead in pure reasoning but doesn't dominate across the board. For day-to-day AI assistant usage, the gap between these three models is narrower than the benchmarks suggest. Your choice should depend on your specific use case, not on which model scores highest on a particular test.
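One practical consequence of "choose by use case" is routing: send quick, cheap generation to a fast model and reasoning-heavy work to GPT-5. The keyword list and length threshold below are illustrative assumptions, not a recommended production classifier.

```python
# Toy task router: reasoning-flavored prompts go to GPT-5, the rest to GPT-4o.
REASONING_HINTS = ("prove", "analyze", "debug", "refactor", "compare", "plan")

def choose_model(prompt: str) -> str:
    """Route by crude keyword/length heuristics (illustrative only)."""
    p = prompt.lower()
    if any(hint in p for hint in REASONING_HINTS) or len(p.split()) > 200:
        return "gpt-5"   # multi-step reasoning or long-context work
    return "gpt-4o"      # quick generation, simple questions

print(choose_model("Debug this race condition in my WebSocket handler"))  # gpt-5
print(choose_model("Write a haiku about autumn"))                         # gpt-4o
```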

Who Should Upgrade (and Who Shouldn't)

Upgrade if you:

  • Use AI for coding daily — The 74.9% SWE-bench score translates to materially better code suggestions and bug detection
  • Work with long documents — The 400K context window eliminates the need for chunking strategies in most cases
  • Were paying for o3 — GPT-5 is better and cheaper, full stop
  • Need reasoning over data — Financial analysis, research synthesis, and technical problem-solving all benefit from native reasoning

Don't upgrade if you:

  • Only use AI for simple text generation — GPT-4o or GPT-5 mini handles this at a fraction of the cost
  • Need the fastest possible responses — At 54.8 tokens/second on high reasoning effort, GPT-5 is noticeably slower than GPT-4o's 100+ tokens/second
  • Are budget-conscious and doing basic tasks — Cheaper alternatives exist for most common use cases

FAQ

Is GPT-5 worth the ChatGPT Plus subscription?

If you use ChatGPT daily for anything beyond simple questions, yes. The reasoning improvements alone justify the $20/month, especially since you also get GPT-5 mini for lighter tasks. If you use it occasionally, the free tier's limited GPT-5 mini access might be sufficient.

How does GPT-5 compare to GPT-4o for everyday tasks?

For quick questions and simple text generation, GPT-4o is faster and cheaper. GPT-5's advantages appear in tasks requiring multi-step reasoning, complex code understanding, or analysis of large documents. Many ChatGPT users will benefit from switching between GPT-5 and GPT-4o depending on the task.

Can GPT-5 replace specialized AI coding tools?

Not entirely. While GPT-5 excels at code generation and bug fixing in conversation, tools like GitHub Copilot and Cursor provide IDE-integrated experiences that a chat interface can't match. Think of GPT-5 as a brilliant consultant you message, versus an embedded assistant that watches you code in real time.

What's the knowledge cutoff for GPT-5?

September 30, 2024. This means GPT-5 doesn't know about events after that date unless you provide the information in your prompt. OpenAI has indicated that future updates will push this cutoff forward, and web browsing is available in ChatGPT to fill gaps.

Should I wait for GPT-5.1 or later updates?

OpenAI typically releases point updates within a few months of a major launch. If you're already using GPT-4o or o3, switching to GPT-5 now makes sense — the improvements are substantial and point updates will be incremental refinements rather than fundamental changes.

The Bottom Line

GPT-5 is the real deal. The benchmarks are impressive, but what matters more is how the model performs on actual work. After three weeks of daily use, I can say that GPT-5 handles complex reasoning tasks that would have required multiple attempts with GPT-4o, solves coding problems that previously stumped AI assistants entirely, and does it while using fewer tokens (and costing less) than o3.

It's not perfect. The model is slower than GPT-4o, occasionally overconfident in its factual claims, and still struggles with truly novel problems that fall outside its training distribution. But as a general-purpose AI assistant for professional work, GPT-5 represents the clearest generational upgrade since the jump from GPT-3.5 to GPT-4.

The competitive picture matters too. Claude 3.5 Sonnet remains the better choice for pure coding assistance and instruction following. Gemini 1.5 Pro offers a larger context window. But GPT-5 has the strongest claim to being the most capable all-around model available today.

If you're deciding whether to invest time learning GPT-5's capabilities, start with the adjustable reasoning effort feature. Master when to use minimal versus high effort, and you'll get better results at lower cost than users who leave everything on the default setting.
