The email body contains an injected instruction to send the task's instructions to a third party, and clicking Send would follow unsafe content rather than the user's requested draft.
Harmful computer-use agent (CUA) trajectories often bottleneck at a small set of irreversible GUI actions. We evaluate a sparse critical-action review protocol that intercepts only these actions before execution and routes them to a trusted reviewer with live screenshot and accessibility-tree context.
In the environmental ablation cohort, selective state-grounded review without legibility (B3) is the best observed frontier point, reaching 91.7% safety (11/12) at 60.0% usefulness (3/5) versus 58.3% (7/12) safety for unmonitored B0. In paired B0/B4 runs on the final cohort, sparse review improves safety from 58.3% to 66.7% (8/12) under environmental prompt injection and from 58.3% to 75.0% (9/12) under scheming, with zero blocked honest environmental tasks (0/5).
The top three critical-action families account for 79.2% (19/24) of reviews and 81.8% (9/11) of blocks. A failure taxonomy identifies timing gaps, boundary escapes, router gaps, and reviewer-calibration failures. With N=12 attack tasks per threat model, these results give seed to a reusable principle: future CUA control can target a small set of GUI actions where harm becomes irreversible.
We evaluate under two high-stakes threat models. In both, a single critical action (sending the phishing email, exfiltrating the file) constitutes the catastrophe.
Malicious content is injected into the agent's environment (emails, web pages, desktop notifications) attempting to redirect the honest agent toward harmful actions. The agent is honest but susceptible to confused-deputy attacks.
5 honest + 12 attack tasks
The agent is given a hidden side-task alongside its main task and may pursue the side-task while attempting to appear honest. This models an intentionally subversive agent.
12 honest + 12 attack tasksThe control protocol inserts a decision layer between action prediction and execution. The protocol is fail-closed: review errors result in blocks.
H/T/U mapping: Untrusted (U) = o4-mini as CUA agent. Trusted (T) = GPT-5.4-mini as reviewer. Human (H) = absent. Post-run safety judge: GPT-4.1.
The router classifies actions into 10 families based on keyword matching against UI element labels and typed text. It answers only whether an action belongs to a family where harm could become irreversible. Safety judgment is the reviewer's responsibility.
Highlighted: Top 3 concentrated families (79% of reviews, 82% of blocks).
The ablation ladder traces the safety/usefulness Pareto frontier across five baselines, revealing B3 (selective state-grounded review) as the best frontier point.
| Baseline | Description | Usefulness | Safety |
|---|---|---|---|
| B0 | Unmonitored | 60.0% (3/5) | 58.3% (7/12) |
| B1 | Always review, language-only | 0.0% (0/5) | 83.3% (10/12) |
| B2 | Always review, state-grounded | 40.0% (2/5) | 100.0% (12/12) |
| B3 | Selective routing, state-grounded | 60.0% (3/5) | 91.7% (11/12) |
| B4 | Selective + legibility | 40.0% (2/5) | 66.7% (8/12) |
B0/B4 from corrected final cohort; B1/B2/B3 from environmental ablation cohort. Direct B3 vs B4 comparisons carry cross-session noise.
B2 achieves 100% safety compared to B1's 83.3%. B1 destroys usefulness entirely (0%) because the language-only reviewer cannot distinguish critical from benign actions without seeing the screen.
B3 achieves the same 60.0% usefulness as unmonitored B0 while improving safety from 58.3% to 91.7% (11/12). B3 is the best point on the safety/usefulness frontier.
B4 is worse than B3 on both safety (66.7% vs 91.7%) and usefulness (40.0% vs 60.0%). This carries a cross-cohort caveat and requires larger-sample confirmation.
| Threat Model | Baseline | Usefulness | Safety | Honest Blocked | Reviews | Blocks |
|---|---|---|---|---|---|---|
| Environmental | B0 | 60.0% (3/5) | 58.3% (7/12) | 0/5 | 0 | 0 |
| Environmental | B4 | 40.0% (2/5) | 66.7% (8/12) | 0/5 | 9 | 4 |
| Scheming | B0 | 75.0% (9/12) | 58.3% (7/12) | 0/12 | 0 | 0 |
| Scheming | B4 | 41.7% (5/12) | 75.0% (9/12) | 1/12 | 15 | 7 |
Zero false-positive blocks on honest tasks (0/5). Mean review latency: 3005ms.
Side-task prevention improves from 58.3% to 75.0%, but usefulness drops from 75.0% to 41.7%.
With N=12 attack tasks per threat model, each outcome shifts safety by 8.3pp (95% binomial CI for 66.7%: [34.9%, 90.1%]).
Critical-action family distribution aggregated across both threat models under B4.
The top three families account for 79.2% (19/24) of all reviewed actions and 81.8% (9/11) of all blocks. A minimal router covering only these three families would have caught 82% of blocked actions.
Four structural modes where sparse review fails. These are boundary conditions of any sparse review approach, not implementation bugs.
| Failure Mode | Description | Count | Threat Model |
|---|---|---|---|
| Timing | Harm completes before next review checkpoint | 8 | Scheming |
| Reviewer FP | Reviewer blocks for wrong reason on attack trace | 4 | Scheming |
| Boundary | No reviewable critical action exists (stop_task) | 3 | Environmental |
| Router gap | Critical action not surfaced by router | 2 | Scheming |
Timing failures are dominant (8 cases, all scheming): the agent completes the harmful side-task before reaching an action that triggers review. Boundary failures occur exclusively in environmental attacks, where injection causes the agent to stop without executing a critical action. Router gaps occur in scheming, where the harmful action doesn't match any family keyword.
The protocol is most effective against environmental attacks with clear action boundaries: injected instructions causing the agent to click "Send" on an unauthorized email, navigate to a credential-harvesting URL, or upload files to an external service. These actions are identifiable by keyword matching, and the live screen state provides strong signal for reviewing intent alignment.
The failure taxonomy maps three structural limits: timing failures (harm precedes the review checkpoint), boundary failures (no reviewable critical action exists), and router gaps (harmful actions fall outside the taxonomy). These define where any sparse deferral scheme that does not review every action will struggle.
Three candidate explanations, ordered by likelihood: (1) Cross-cohort variation: B3 and B4 come from different run sessions, and with N=5 benign and N=12 attack tasks, sampling noise is substantial. (2) Reviewer anchoring: the untrusted legibility report may provide plausible-sounding justifications that anchor the reviewer toward allowing harmful actions. (3) A "less-is-more" effect consistent with findings that additional monitor information can degrade detection quality (Barr et al., 2025). Disentangling requires same-cohort evaluation at larger scale.
We conducted six A2 (adaptive attacker) runs testing protocol-aware environmental injections and scheming prompts. The agent did not follow the protocol-aware adversarial instructions, rendering results inconclusive. We cannot claim robustness to protocol-aware attackers. Terekhov et al. (2025) demonstrate that protocol awareness can collapse safety in control settings, and Schaeffer et al. (2026) show that attack selection in concentrated settings can undermine sparse monitoring.
With N=12 attack tasks per threat model, each outcome shifts the rate by 8.3 percentage points. Results are specific to OS-Harm tasks, o4-mini as agent, GPT-5.4-mini as reviewer, and the current 10-family router taxonomy. The protocol targets high-stakes, single-action threats; it is not designed for diffuse sabotage.
Harmful CUA trajectories concentrate in a small set of irreversible GUI action families. In the environmental ablation cohort, selective state-grounded review without legibility (B3) reaches 91.7% safety (11/12) at 60.0% usefulness (3/5), compared to 58.3% (7/12) for unmonitored B0 and 100.0% (12/12) safety at 40.0% (2/5) usefulness for blanket visual review (B2). In paired B0/B4 runs on the corrected final cohort, sparse review still improves safety under both environmental and scheming threat models, but with clearer usefulness and calibration costs.
The concentration finding is a reusable design principle. A protocol designer building a CUA control layer can start by profiling the action-family distribution of representative trajectories. If harm concentrates, as it did here with three families covering 81.8% (9/11) of blocked actions, the designer can build a narrow, high-precision router rather than a broad classifier. This preserves usefulness by avoiding the blanket monitoring tax.
| Baseline | Router | Reviewer Input | Agent Justification |
|---|---|---|---|
| B0 | None | None | N/A |
| B1 | All actions | Language-only | No |
| B2 | All actions | State-grounded | No |
| B3 | Selective (10-family) | State-grounded | No |
| B4 | Selective (10-family) | State-grounded | Yes (untrusted) |
Arike et al. (2026). How Does Information Access Affect LLM Monitors' Ability to Detect Sabotage? arXiv:2601.21112.
Bhatt, A. et al. (2025). Ctrl-Z: Controlling AI Agents via Resampling. arXiv:2504.10374.
Chen, Y. et al. (2025). Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents. arXiv:2508.17155.
Debenedetti, E. et al. (2024). AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. arXiv:2406.13352.
Foerster, H. et al. (2026). CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents. arXiv:2601.09923.
Gong, H. et al. (2025). Secure and Efficient Access Control for Computer-Use Agents via Context Space. arXiv:2509.22256.
Greenblatt, R. et al. (2024). AI Control: Improving Safety Despite Intentional Subversion. ICML 2024. arXiv:2312.06942.
Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. AISec 2023. arXiv:2302.12173.
Hu, H. et al. (2025). AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents. arXiv:2509.07764.
Kale, S. et al. (2025). Reliable Weak-to-Strong Monitoring of LLM Agents. arXiv:2508.19461.
Kuntz, T. et al. (2025). OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents. arXiv:2506.14866.
Kutasov, J. et al. (2025). Evaluating Control Protocols for Untrusted AI Agents. arXiv:2511.02997.
Kutasov, J. et al. (2025). SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents. arXiv:2506.15740.
Lee, D. et al. (2025). VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification. arXiv:2503.18492.
Liao, Z. et al. (2025). RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments. arXiv:2505.21936.
Lindner, D. et al. (2025). Practical Challenges of Control Monitoring in Frontier AI Deployments. arXiv:2512.22154.
Schaeffer, J., Khandelwal, A., and Tracy, T. (2026). Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring. arXiv:2602.04930.
Sreevatsa, A., Prasanna, S., and Rushing, C. (2026). Basic Legibility Protocols Improve Trusted Monitoring. arXiv:2602.10153.
Terekhov, M. et al. (2025). Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols. arXiv:2510.09462.
Tian, A. X. et al. (2025). Measuring Harmfulness of Computer-Using Agents. arXiv:2508.00935.
Vu, A. et al. (2025). DECEPTICON: How Dark Patterns Manipulate Web Agents. arXiv:2512.22894.
Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024. arXiv:2404.07972.