Tool brief · June 26, 2026

Gemini 3.5 Flash's built-in Computer Use: one less moving part in your agent loop

For developers

The tool

Computer Use in Gemini 3.5 Flash

What it is

Computer Use is now a built-in tool inside Gemini 3.5 Flash, callable through the Gemini API. You hand it a screenshot plus a goal; it returns a structured UI action (click, type, scroll, navigate). Unlike with the 2.5 series, you don't need a separate model to access the Computer Use tool. The model can also combine built-in tools with function calling, so Search, Maps, and your own functions sit alongside the screen-control loop in one model call.
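Whatever shape the structured action takes, validate it before your executor touches a browser. A minimal sketch, assuming a JSON action with a `kind` field plus per-kind arguments; the field names (`kind`, `x`, `y`, `dy`, `text`, `url`) are hypothetical and are not the Gemini API's actual response schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical action schema, for illustration only. The real response
# format is defined by the Gemini API docs, not by this sketch.
ALLOWED_KINDS = {"click", "type", "scroll", "navigate"}


@dataclass(frozen=True)
class UIAction:
    kind: str
    x: Optional[int] = None
    y: Optional[int] = None
    dy: Optional[int] = None
    text: Optional[str] = None
    url: Optional[str] = None


def parse_action(raw: str) -> UIAction:
    """Validate a JSON-encoded action and return it as a typed object."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")
    kind = data.get("kind")
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unsupported action kind: {kind!r}")
    if kind == "click" and not all(isinstance(data.get(k), int) for k in ("x", "y")):
        raise ValueError("click needs integer x and y")
    if kind == "scroll" and not isinstance(data.get("dy"), int):
        raise ValueError("scroll needs an integer dy")
    if kind == "type" and not isinstance(data.get("text"), str):
        raise ValueError("type needs a text string")
    if kind == "navigate" and not isinstance(data.get("url"), str):
        raise ValueError("navigate needs a url string")
    return UIAction(kind=kind, x=data.get("x"), y=data.get("y"),
                    dy=data.get("dy"), text=data.get("text"), url=data.get("url"))


action = parse_action('{"kind": "click", "x": 120, "y": 340}')
```

Rejecting unknown kinds at the boundary means a model or schema change fails loudly in your harness instead of producing a silent no-op click.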

The next-work-session test

You're maintaining an end-to-end test agent that logs into a staging web app, runs a checkout, and verifies the order in an internal admin tool. Today that's two model endpoints: one for the computer-use loop, one for reasoning and tool routing. Next session, you collapse it: a single gemini-3.5-flash call with the Computer Use tool plus your custom verify_order(order_id) function. One model, one trace, one eval harness. Fewer integration seams to mock, and your replay tests pin one model version instead of drifting between two.
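On your side of the wire, collapsing to one call mostly means one dispatcher: UI actions go to your executor, everything else goes to a registered function. A minimal sketch, assuming each model-issued call arrives as a dict with `name` and `args`; that shape, `make_router`, and the fake executor are hypothetical, and `verify_order` here is a stub standing in for the real admin-tool check.

```python
from typing import Any, Callable, Dict, List

UI_ACTIONS = {"click", "type", "scroll", "navigate"}


# Stub for the brief's verify_order(order_id); a real version would
# query the internal admin tool. The "ORD-" prefix rule is made up.
def verify_order(order_id: str) -> Dict[str, Any]:
    return {"order_id": order_id, "verified": order_id.startswith("ORD-")}


def make_router(execute_ui: Callable[[str, Dict[str, Any]], Dict[str, Any]],
                functions: Dict[str, Callable[..., Dict[str, Any]]]):
    """Build a dispatcher: UI actions go to the executor, other calls go
    to a registered local function, unknown names return an error."""
    def route(call: Dict[str, Any]) -> Dict[str, Any]:
        name, args = call["name"], call.get("args", {})
        if name in UI_ACTIONS:
            return {"name": name, "result": execute_ui(name, args)}
        if name in functions:
            return {"name": name, "result": functions[name](**args)}
        return {"name": name, "error": f"unknown call: {name}"}
    return route


# Fake executor that records requests instead of driving a browser.
performed: List[str] = []


def fake_execute_ui(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    performed.append(name)
    return {"ok": True}


route = make_router(fake_execute_ui, {"verify_order": verify_order})
results = [route(call) for call in [
    {"name": "click", "args": {"x": 120, "y": 340}},
    {"name": "verify_order", "args": {"order_id": "ORD-1001"}},
    {"name": "open_shell", "args": {}},
]]
```

Returning an error for unknown names, rather than raising, lets you feed the failure back to the model as a tool result and keeps the whole exchange in one trace.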

Pricing

Gemini 3.5 Flash API pricing, per Google's developer pricing page and third-party trackers: $1.50 per million input tokens and $9 per million output tokens, with a 1,048,576-token context window and a maximum output of 65,536 tokens. Cached input is listed at roughly $0.15/M on third-party trackers. AI Studio remains free for experimentation. One caveat: on Google Cloud's Agent Platform, Computer Use is billed under the Gemini 2.5 Pro SKU, so verify which billing path you're on before you ship. Screenshots are billed as image input. Google's docs put image input at 560 tokens or $0.0011 per image; 560 tokens at $1.50/M works out to about $0.00084, so confirm which rate the per-image figure assumes. Either way, it adds up fast in a tight perception-action loop.
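To see how the screenshot line item scales, a back-of-envelope sketch helps. It derives cost from the 560-token count and the $1.50/M input price rather than from the per-image dollar figure, so its per-image result will not match that figure. The 40-step, 500-run workload is an assumption for illustration, and the estimate covers only one fresh screenshot per step: no prompt text, no output tokens, no earlier screenshots resent as history.

```python
INPUT_USD_PER_MTOK = 1.50    # input price quoted above
TOKENS_PER_SCREENSHOT = 560  # per-image token count quoted above


def screenshot_cost_usd(steps: int, runs: int = 1) -> float:
    """Image-input cost for `runs` loops of `steps` screenshots each.

    Counts one fresh screenshot per step and nothing else.
    """
    tokens = steps * runs * TOKENS_PER_SCREENSHOT
    return tokens * INPUT_USD_PER_MTOK / 1_000_000


per_image = screenshot_cost_usd(steps=1)
nightly = screenshot_cost_usd(steps=40, runs=500)  # assumed workload

print(f"per screenshot: ${per_image:.5f}")      # per screenshot: $0.00084
print(f"500 runs x 40 steps: ${nightly:.2f}")   # 500 runs x 40 steps: $16.80
```

Treat the result as a floor. If your loop resends prior screenshots as conversation history, input tokens grow with every step and the real bill will be higher.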

What we'd actually use it for

Internal QA agents and ops automations against UIs you already control. Concretely: regression bots that drive your own admin panel, smoke tests against staging, scripted data pulls from a vendor portal with no API. The model takes text and image input and supports parallel agentic execution loops, so batching multiple short browser tasks behind one orchestrator is realistic. We would not point this at customer-facing production flows yet.
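Batching behind one orchestrator can be as plain as a capped `asyncio.gather`. A minimal sketch, assuming each task is an independent screenshot loop you can await; `run_batch`, `fake_run_task`, and the task names are hypothetical, and the cap of three sessions is arbitrary.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_batch(tasks: List[str],
                    run_task: Callable[[str], Awaitable[str]],
                    max_parallel: int = 3) -> Dict[str, str]:
    """Run short browser tasks concurrently, capped at max_parallel.

    A failure in one task is recorded and does not cancel the others.
    """
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(goal: str) -> str:
        async with gate:
            try:
                return await run_task(goal)
            except Exception as exc:
                return f"failed: {exc}"

    outcomes = await asyncio.gather(*(guarded(goal) for goal in tasks))
    return dict(zip(tasks, outcomes))


# Stand-in for one full screenshot loop against a sandboxed browser.
async def fake_run_task(goal: str) -> str:
    await asyncio.sleep(0)
    if "broken" in goal:
        raise RuntimeError("element not found")
    return "done"


report = asyncio.run(run_batch(
    ["smoke: login", "smoke: checkout", "smoke: broken page"],
    fake_run_task,
))
```

The semaphore is what keeps a large batch from opening more browser sessions, or burning more API quota, than you planned for.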

Limits

The honest list:

It's still a screenshot loop. Per the docs and reporting, the developer's application sends a screenshot of the target environment to the Gemini API along with a task goal, and the loop continues until the task is completed, an error occurs, or the process is terminated. You still own the executor — capturing pixels, dispatching clicks, retrying. The SDK doesn't drive your browser for you.
Benchmarks are vendor-adjacent. Reported numbers — e.g. matching GPT-5.5's OSWorld score of 78.7 — are claims, not field results. Your eval set on your UI is the only number that matters.
Prompt injection risk is real and acknowledged. Google's own post says it uses targeted adversarial training for computer use in Gemini 3.5 Flash and is releasing two optional enterprise safeguard systems: one lets enterprises require explicit user confirmation for sensitive or irreversible actions, and the other automatically stops a task if an indirect prompt injection is identified. Those safeguards are opt-in. Wire them up; don't assume defaults are safe.
Regional limits. On Agent Platform, in asia-northeast1, asia-south1, asia-southeast1, and europe-west2, only Single Zone Provisioned Throughput is supported. Plan capacity accordingly.
Eval tooling is on you. No first-party trace replay for UI loops. You'll still wire OpenTelemetry or your own harness around screenshots, action JSON, and outcomes.
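Several of these limits land in the same place: the loop you own. A minimal sketch of that loop, assuming you supply the screenshot capture, the model call, and the executor as plain callables; every name here is hypothetical. The confirmation gate is a local, application-side check and not Google's enterprise safeguard, and the `SENSITIVE` action names are made up.

```python
import json
from typing import Any, Callable, Dict, List

SENSITIVE = {"submit_payment", "delete_record"}  # hypothetical action names


def run_loop(goal: str,
             capture: Callable[[], bytes],
             next_action: Callable[[str, bytes], Dict[str, Any]],
             execute: Callable[[Dict[str, Any]], None],
             confirm: Callable[[Dict[str, Any]], bool],
             max_steps: int = 25) -> Dict[str, Any]:
    """Loop until the model reports done, the executor fails, a sensitive
    action is declined, or the step budget runs out. Each step is
    appended to a trace you can persist and replay."""
    trace: List[Dict[str, Any]] = []
    status = "terminated"  # also the result when the budget is exhausted
    for step in range(max_steps):
        shot = capture()
        action = next_action(goal, shot)
        record = {"step": step, "screenshot_bytes": len(shot), "action": action}
        trace.append(record)
        if action["kind"] == "done":
            status = "completed"
            break
        if action["kind"] in SENSITIVE and not confirm(action):
            record["outcome"] = "declined"
            break
        try:
            execute(action)
            record["outcome"] = "ok"
        except Exception as exc:
            record["outcome"] = f"error: {exc}"
            status = "error"
            break
    return {"status": status, "trace": trace}


# Scripted stand-ins so the sketch runs without a model or a browser.
script = iter([{"kind": "click", "x": 5, "y": 9},
               {"kind": "submit_payment"},
               {"kind": "done"}])
result = run_loop("run checkout",
                  capture=lambda: b"\x89PNG-fake",
                  next_action=lambda goal, shot: next(script),
                  execute=lambda action: None,
                  confirm=lambda action: False)  # operator declines
print(json.dumps(result["trace"], indent=2))
```

Because the trace is plain JSON, you can write it to disk next to the screenshots and replay it in whatever harness you already run.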

Try it if

You're already running an agent loop with a separate computer-use model and a separate reasoning model, and the seam between them is your top source of bugs.
You need Search or Maps grounding inside the same loop that's clicking buttons.
You want one billing line and one model version to pin in your evals.
Your target surfaces are internal tools where you control the DOM and can sandbox failures.

Skip it if

You have a stable Playwright/Selenium suite that works. Don't rewrite passing tests around a probabilistic clicker.
Your agent runs against third-party sites where ToS, captchas, or injection from page content are real risks.
You need deterministic, auditable actions for compliance — a screenshot-driven model isn't that.
You're cost-sensitive at scale: image input on every loop step is the line item that will surprise you.

Source: Google's announcement post on introducing computer use in Gemini 3.5 Flash and the Gemini 3 developer guide.

Source: deepmind.google