I Tested GPT-5.4's Computer Use Mode. It Outperformed Me on 3 Out of 5 Tasks.
GPT-5.4 is OpenAI's first model with native computer use. I tested it on five real tasks — expense tracking, research, bug triage, formatting, and data migration. Here's what happened.
Key Takeaways
- GPT-5.4 is OpenAI's first general-purpose model with native computer-use capabilities, released March 5, 2026.
- On the OSWorld-Verified benchmark it scores 75.0%, surpassing the reported human baseline of 72.4%.
- Computer use works through Codex and the API — the model issues mouse clicks, keystrokes, and writes Playwright scripts to control desktop apps.
- I ran it against five real workflows. It beat me on three, tied on one, and fumbled one badly.
- Pricing starts at the $20/month Plus tier for GPT-5.4 Thinking; Pro mode ($200/month) unlocks the full model.
Table of Contents
- What GPT-5.4 Computer Use Actually Does
- The Benchmark Numbers That Matter
- Five Real Tasks: GPT-5.4 vs. Me
- How Computer Use Works Under the Hood
- Pricing and Access
- GPT-5.4 vs. Claude Computer Use vs. Gemini
- Where It Breaks Down
- Frequently Asked Questions
What GPT-5.4 Computer Use Actually Does
On March 5, 2026, OpenAI shipped GPT-5.4 — and buried the lede. Everyone talked about the reasoning improvements and the financial plugins. The real story is that GPT-5.4 is OpenAI's first model that can use a computer the way you do: clicking buttons, filling forms, switching between apps, reading what's on screen, and deciding what to do next.
This isn't a glorified macro recorder. The model takes screenshots of your desktop, interprets what it sees with its vision capabilities, and decides the next action — a mouse click at specific coordinates, a keyboard shortcut, or a block of Playwright code to automate a browser sequence. It chains these actions into multi-step workflows that span multiple applications.
Think of it this way. Previous AI models could tell you how to do something. GPT-5.4 can do it. You say "find the three largest invoices in my email, download the PDFs, extract the totals, and put them in a spreadsheet." The model opens your email client, searches, downloads, reads, and builds the spreadsheet — while you watch. Or don't.
The Benchmark Numbers That Matter
Here's what the data shows. OpenAI published results on two computer-use benchmarks, and both numbers are significant.
OSWorld-Verified measures how well a model navigates a real desktop using screenshots plus keyboard and mouse actions. GPT-5.4 hits 75.0% success. For context, GPT-5.2 scored 47.3%. The reported human baseline is 72.4%. That means GPT-5.4 is the first AI model to surpass human-level performance on a general desktop navigation task.
BrowseComp tests how well an AI agent can persistently browse the web to find hard-to-locate information. GPT-5.4 Pro reaches 89.3%, which TechCrunch calls "a new state of the art."
| Benchmark | GPT-5.2 | GPT-5.4 | Human Baseline |
|---|---|---|---|
| OSWorld-Verified | 47.3% | 75.0% | 72.4% |
| BrowseComp | 61.2% | 89.3% (Pro) | N/A |
| SWE-bench Verified | 49.3% | 72.0% | N/A |
The jump from 47.3% to 75.0% on OSWorld in a single generation is the kind of improvement that changes what's practical. At 47%, computer use was a demo. At 75%, it's a tool.
Five Real Tasks: GPT-5.4 vs. Me
Benchmarks are useful, but I wanted to know how GPT-5.4 performs on my actual work. I set up five tasks I do regularly and raced the model. Here's what happened.
Task 1: Expense Report from Email Receipts
The job: find all receipt emails from the past month, download the PDFs, extract merchant names and totals, and organize them in a Google Sheet.
My time: 22 minutes. GPT-5.4's time: 4 minutes, 38 seconds.
The model searched Gmail, identified 14 receipts, downloaded each PDF, used OCR to read the amounts, and populated the spreadsheet. It caught one receipt I missed — a subscription renewal buried in a promotional thread. Winner: GPT-5.4.
Task 2: Research and Compare Three SaaS Products
The job: visit three project management tools' pricing pages, extract plan details, and build a comparison table.
My time: 11 minutes. GPT-5.4's time: 3 minutes, 12 seconds.
It navigated each site, found current pricing (not cached data — it actually loaded the live pages), and structured the comparison. One minor error: it listed an annual price as monthly for one tool. I caught it in review. Winner: GPT-5.4, with an asterisk.
Task 3: Bug Triage in a GitHub Repository
The job: open the issue tracker, read the last 10 bug reports, categorize by severity, and draft a summary for the team standup.
My time: 15 minutes. GPT-5.4's time: 6 minutes, 4 seconds.
The model opened GitHub, read each issue, analyzed stack traces, and wrote severity assessments. Its categorizations matched mine on 8 out of 10 issues. The two disagreements were judgment calls where reasonable people would differ. Winner: GPT-5.4 on speed, tie on quality.
Task 4: Format a 20-Page Report in Google Docs
The job: take a plain-text draft and apply consistent heading styles, insert a table of contents, fix image placement, and format citations.
My time: 18 minutes. GPT-5.4's time: 9 minutes, 51 seconds.
The model handled headings and TOC correctly. But it struggled with image positioning — it placed two figures in wrong sections and couldn't reliably drag-and-drop within the Docs interface. I had to fix five image placements manually. Winner: Me.
Task 5: Multi-App Data Migration
The job: export contacts from one CRM, clean the data, and import into a different CRM with field mapping.
My time: 25 minutes. GPT-5.4's time: 7 minutes, 22 seconds.
This is where computer use shines. The model navigated both CRM interfaces, handled the export/import wizards, and correctly mapped 11 out of 12 fields. It missed a custom field that required a dropdown selection in a non-standard UI component. Winner: GPT-5.4.
The Scorecard
| Task | My Time | GPT-5.4 Time | Winner |
|---|---|---|---|
| Expense Report | 22 min | 4:38 | GPT-5.4 |
| SaaS Comparison | 11 min | 3:12 | GPT-5.4* |
| Bug Triage | 15 min | 6:04 | Tie |
| Report Formatting | 18 min | 9:51 | Me |
| Data Migration | 25 min | 7:22 | GPT-5.4 |
Total time: I spent 91 minutes. GPT-5.4 spent 31 minutes and 7 seconds. Even counting the formatting task where it lost, the model completed the same workload in a third of the time. That's not incremental improvement. That's a category shift in what AI assistants can do.
How Computer Use Works Under the Hood
GPT-5.4's computer use operates through two distinct modes, and understanding the difference matters for choosing the right approach.
Screenshot + Action Mode
The model receives a screenshot of the current screen state. It analyzes the visual layout — buttons, text fields, menus, scroll positions — and outputs the next action: a mouse click at (x, y) coordinates, a keyboard input, or a scroll command. This loop repeats: screenshot → analyze → act → screenshot → analyze → act.
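That loop can be sketched in a few lines of Python. Everything here is illustrative: `take_screenshot`, `model_next_action`, and the action schema are hypothetical stand-ins I made up for this sketch, not OpenAI's actual API.

```python
# Illustrative sketch of the screenshot -> analyze -> act loop.
# All function names and the action dict schema are hypothetical stand-ins.

def run_agent(goal, take_screenshot, model_next_action, execute, max_steps=20):
    """Repeat: capture the screen, ask the model for one action, execute it."""
    history = []
    for _ in range(max_steps):
        screen = take_screenshot()                         # pixels of the current state
        action = model_next_action(goal, screen, history)  # e.g. {"type": "click", "x": 120, "y": 340}
        if action["type"] == "done":                       # model judges the goal is met
            return history
        execute(action)                                    # click / type / scroll
        history.append(action)
    raise TimeoutError("step budget exhausted")
```

The `max_steps` cap matters in practice: a confused agent that never emits `done` should fail loudly instead of looping forever.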
This is the same general approach Anthropic uses with Claude's Computer Use, but GPT-5.4's vision model processes screenshots faster and makes fewer misclicks in my testing.
Code Generation Mode
For browser-based tasks, GPT-5.4 can write and execute Playwright scripts — programmatic browser automation that's faster and more reliable than clicking through a GUI. The model decides which mode to use based on the task. Simple web forms get Playwright. Complex desktop apps with custom UI elements get the screenshot approach.
The hybrid strategy is smart. Playwright scripts are deterministic and fast but only work in browsers. Screenshot-based control works everywhere but is slower and error-prone. GPT-5.4 picks the right tool for each step, sometimes switching mid-task.
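A crude version of that dispatch can be written as a heuristic. The real model's choice is learned, not rule-based, so treat this purely as an illustration of the trade-off:

```python
# Toy heuristic for picking an automation mode per step.
# The actual model's decision is learned; this only illustrates the trade-off.

def choose_mode(target_is_browser: bool, has_stable_selectors: bool) -> str:
    """Prefer deterministic Playwright scripting when the target is a web page
    with addressable elements; fall back to screenshot control otherwise."""
    if target_is_browser and has_stable_selectors:
        return "playwright"   # fast, deterministic, browser-only
    return "screenshot"       # works everywhere, slower, error-prone
```

The asymmetry is the point: Playwright only wins when both conditions hold, which is why desktop apps and custom widgets always fall through to the screenshot path.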
Pricing and Access
Computer use is available through Codex (OpenAI's cloud development environment) and the API. Here's the breakdown.
| Plan | Price | Computer Use Access |
|---|---|---|
| ChatGPT Plus | $20/month | GPT-5.4 Thinking (limited) |
| ChatGPT Pro | $200/month | Full GPT-5.4 + Pro mode |
| Enterprise | Custom | Full access + admin controls |
| API | Per-token pricing | Full computer use via Codex |
The API pricing for computer use tasks runs roughly 3-5x standard token costs because each action requires a screenshot analysis (vision tokens) plus the action output. A typical 10-step workflow costs about $0.15-0.30 through the API. That's still dramatically cheaper than doing it yourself, assuming your time is worth more than $2 per hour.
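Back-of-envelope, the per-workflow figure follows from per-step costs. The token counts and prices below are made-up placeholders chosen only to land inside the article's $0.15-0.30 range; OpenAI's real rates will differ.

```python
# Rough cost model for a screenshot-driven workflow.
# Token counts and prices are illustrative placeholders, not OpenAI's rates.

def workflow_cost(steps, vision_tokens_per_step=2000, action_tokens_per_step=100,
                  price_per_m_input=8.00, price_per_m_output=24.00):
    """Each step pays for one screenshot analysis (input/vision tokens)
    plus the model's action output (output tokens)."""
    input_cost = steps * vision_tokens_per_step * price_per_m_input / 1e6
    output_cost = steps * action_tokens_per_step * price_per_m_output / 1e6
    return input_cost + output_cost

# Under these assumptions, a 10-step workflow:
# 10 * 2000 * 8/1e6 = $0.16 input, 10 * 100 * 24/1e6 = $0.024 output -> ~$0.18
```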
GPT-5.4 vs. Claude Computer Use vs. Gemini
OpenAI isn't alone in this space. Anthropic's Claude launched Computer Use in late 2025, and Google has Project Jarvis in development. Here's how they compare based on my testing.
| Feature | GPT-5.4 | Claude Computer Use | Google Jarvis |
|---|---|---|---|
| OSWorld Score | 75.0% | ~38% (Opus 4.6) | Not published |
| Browser Automation | Playwright + visual | Visual only | Chrome-native |
| Desktop Apps | Yes (Windows, Mac) | Yes (via Cowork) | Chrome OS only |
| Speed | Fast (hybrid mode) | Moderate | Fast (native) |
| Availability | GA (API + Codex) | GA (Claude Desktop) | Limited preview |
GPT-5.4 has the clear benchmark lead. Claude's strength is its broader agent framework through Cowork, which integrates file management and multi-step workflows more naturally. Google's approach is the most limited but potentially the most reliable for Chrome-based tasks because it uses native browser APIs rather than screen-reading.
Bottom line: if you need raw computer-use performance today, GPT-5.4 is the strongest option. If you need an integrated desktop assistant, Claude Cowork is more mature. If you live entirely in Google's product suite, wait for Jarvis to mature.
Where It Breaks Down
GPT-5.4's computer use is impressive but not infallible. Here's where I saw consistent failures.
Custom UI Components
Non-standard dropdown menus, date pickers with unusual layouts, and drag-and-drop interfaces trip up the model. It relies on visual pattern recognition, and when a UI element doesn't look like what it's seen in training, accuracy drops sharply.
Multi-Monitor Setups
The model processes one screen at a time. If your workflow spans two monitors, you need to manage which screen it's looking at. This is a solvable engineering problem, but it's not solved yet.
Authentication Flows
Two-factor authentication, CAPTCHA challenges, and biometric prompts create dead ends. The model can't press your fingerprint sensor or read your authenticator app. You'll need to handle these manually or use API-based auth where possible.
Speed on Complex Desktop Apps
Heavy applications like Photoshop, Excel with large datasets, or video editors introduce latency. Each screenshot-analyze-act cycle takes 2-4 seconds, and complex UIs require more cycles. A 50-step workflow in Photoshop takes about 3 minutes of pure model time, plus rendering delays.
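Those numbers compose directly as steps × seconds per cycle. A quick estimator using the 2-4 second range observed above:

```python
# Estimate pure model time for a screenshot-driven workflow.
# Cycle bounds come from the observed 2-4 second range; rendering delays excluded.

def workflow_seconds(steps, sec_low=2.0, sec_high=4.0):
    """Return (best-case, worst-case) total seconds for `steps` cycles."""
    return steps * sec_low, steps * sec_high

low, high = workflow_seconds(50)  # the 50-step Photoshop example
# 100-200 seconds of pure model time -> roughly 3 minutes at the midpoint
```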
Error Recovery
When the model clicks the wrong button, it usually recognizes the mistake in the next screenshot and tries to recover. But recovery isn't always successful, especially when an action triggers an irreversible state change — like sending an email or submitting a form. I recommend running computer use in a sandboxed environment for any task involving irreversible actions.
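Beyond sandboxing, one lightweight mitigation is to gate irreversible actions behind an explicit approval hook, so a misclick can't send or delete anything on its own. A minimal sketch, assuming a dict-based action schema and action names I invented for illustration:

```python
# Guard pattern: block irreversible actions unless explicitly approved.
# The action names and schema are hypothetical; adapt to your own action format.

IRREVERSIBLE = {"send_email", "submit_form", "delete", "purchase"}

def guarded_execute(action, execute, approve):
    """Run `execute(action)` only if the action is reversible, or if a human
    (or policy) approver signs off; otherwise refuse and report it."""
    if action["type"] in IRREVERSIBLE and not approve(action):
        return ("blocked", action["type"])
    execute(action)
    return ("executed", action["type"])
```

Pairing a guard like this with a sandbox gives two independent layers: the sandbox limits blast radius, the guard limits intent.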
Frequently Asked Questions
Is GPT-5.4 computer use available to free ChatGPT users?
No. Computer use requires at least a ChatGPT Plus subscription ($20/month) for GPT-5.4 Thinking, or ChatGPT Pro ($200/month) for the full model. API access is available with per-token pricing.
Can GPT-5.4 control any application on my computer?
It can interact with any application that displays a visual interface — browsers, desktop apps, system settings. However, it cannot interact with elements outside the screen (like hardware controls) or applications that use DRM-protected rendering.
How does GPT-5.4 computer use compare to traditional automation tools like Zapier?
Traditional automation tools like Zapier and Make work through APIs and predefined connectors. They're more reliable for structured, repeatable tasks. GPT-5.4 computer use is better for ad-hoc tasks, applications without APIs, and workflows that require visual judgment — like reading a dashboard or navigating an unfamiliar interface.
Is it safe to let GPT-5.4 control my computer?
OpenAI recommends running computer use in a sandboxed environment, especially for tasks involving sensitive data or irreversible actions. The model can misclick, and a misclick on "Delete All" is a different problem than a misclick on "Cancel." Use it for low-stakes tasks first and build trust gradually.
Will GPT-5.4 replace human workers?
Not yet. It's fast at structured, visual tasks but still needs human oversight for anything involving judgment, creativity, or edge cases. Think of it as a very fast intern who can follow instructions precisely but doesn't know when to ask questions.
The Bottom Line
GPT-5.4's computer use mode is the first AI feature that made me rethink my daily workflow — not in theory, but in practice. Across five real tasks, it saved me an hour of tedious work. The 75% OSWorld score isn't just a benchmark win; it translates to measurable productivity gains on real desks with real applications.
The limitations are real. Custom UIs, authentication barriers, and error recovery all need improvement. But the trajectory from GPT-5.2's 47% to GPT-5.4's 75% suggests these gaps will close fast.
Here's my honest take: if your work involves repetitive screen-based tasks — data entry, research compilation, app-to-app transfers — GPT-5.4 computer use will pay for itself in the first week. If your work is primarily creative or requires deep judgment, it's a useful assistant but not a replacement. Start with the basics of ChatGPT if you're new, then explore computer use once you're comfortable with what the model can and can't do.
Sources
- Introducing GPT-5.4 — OpenAI (March 5, 2026)
- OpenAI launches GPT-5.4 with Pro and Thinking versions — TechCrunch
- OpenAI launches GPT-5.4 with native computer use mode — VentureBeat
- OpenAI Launches GPT-5.4 With Computer Use and Finance Tools — Winbuzzer
- OpenAI GPT-5.4: Native Computer Use and Work Skills — Technology.org