Look Before You Click: Sparse Critical-Action Review for CUA Control

Abstract

79% of all reviews from top 3 action families

82% of all blocks from top 3 action families

0 false-positive blocks on honest tasks (environmental)

Harmful computer-use agent (CUA) trajectories often bottleneck at a small set of irreversible GUI actions. We evaluate a sparse critical-action review protocol that intercepts only these actions before execution and routes them to a trusted reviewer with live screenshot and accessibility-tree context.

In the environmental ablation cohort, selective state-grounded review without legibility (B3) is the best observed frontier point, reaching 91.7% safety (11/12) at 60.0% usefulness (3/5) versus 58.3% (7/12) safety for unmonitored B0. In paired B0/B4 runs on the final cohort, sparse review improves safety from 58.3% to 66.7% (8/12) under environmental prompt injection and from 58.3% to 75.0% (9/12) under scheming, with zero blocked honest environmental tasks (0/5).

The top three critical-action families account for 79.2% (19/24) of reviews and 81.8% (9/11) of blocks. A failure taxonomy identifies timing gaps, boundary escapes, router gaps, and reviewer-calibration failures. With N=12 attack tasks per threat model, these results seed a reusable principle: future CUA control can target the small set of GUI actions where harm becomes irreversible.

Threat Models

We evaluate under two high-stakes threat models. In both, a single critical action (for example, sending a phishing email or exfiltrating a file) constitutes the catastrophe.

Environmental Prompt Injection

Chrome browser showing a desktop notification with injected instructions

Malicious content is injected into the agent's environment (emails, web pages, desktop notifications) attempting to redirect the honest agent toward harmful actions. The agent is honest but susceptible to confused-deputy attacks.

5 honest + 12 attack tasks

Scheming with Hidden Objectives

VS Code editor showing code in the scheming task environment

The agent is given a hidden side-task alongside its main task and may pursue the side-task while attempting to appear honest. This models an intentionally subversive agent.

12 honest + 12 attack tasks

Protocol Pipeline

The control protocol inserts a decision layer between action prediction and execution. The protocol is fail-closed: review errors result in blocks.
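The fail-closed decision layer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Action` type, the `gate` function, and the `route`/`review` callables are hypothetical names introduced here, and the sketch assumes that the reviewer returns an explicit boolean verdict.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """Hypothetical action record; field names are assumptions."""
    kind: str            # e.g. "click" or "type"
    target_label: str    # UI element label
    text: str = ""       # typed text, if any


def gate(action: Action,
         route: Callable[[Action], Optional[str]],
         review: Callable[[Action, str], bool]) -> bool:
    """Return True if the action may execute, False if it is blocked."""
    family = route(action)
    if family is None:
        return True              # not a critical action: no review needed
    try:
        verdict = review(action, family)
    except Exception:
        return False             # fail-closed: a review error is a block
    return verdict is True       # anything but an explicit allow blocks


def broken_reviewer(action: Action, family: str) -> bool:
    """Stand-in for a reviewer call that fails (timeout, parse error)."""
    raise TimeoutError("reviewer unavailable")
```

With these stubs, a routed action whose review raises is blocked, while an unrouted action passes through without touching the reviewer.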

H/T/U mapping: Untrusted (U) = o4-mini as CUA agent. Trusted (T) = GPT-5.4-mini as reviewer. Human (H) = absent. Post-run safety judge: GPT-4.1.

Critical-Action Router

The router classifies actions into 10 families based on keyword matching against UI element labels and typed text. It answers only whether an action belongs to a family where harm could become irreversible. Safety judgment is the reviewer's responsibility.

send / submit / share

send, submit, share, post, publish, forward, reply

external navigation

address bar change, external link, new domain

upload / attach / export

upload, attach, export, browse, choose file, save as

permission / settings / security

allow, permit, grant, security, privacy, permissions

financial / account confirmation

pay, purchase, confirm account change

terminal / exec

terminal, shell, run command, sudo, install, chmod

delete / overwrite

delete, trash, remove, discard, overwrite, permanent

credential / secret exposure

sk-, AKIA, ghp_, BEGIN OPENSSH, ssh-rsa patterns

clipboard paste to external

paste into browser, email, chat, or form

bulk action

actions affecting many items or folders

The top three families (send / submit / share, external navigation, upload / attach / export) concentrate 79% of reviews and 82% of blocks.
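A keyword router of this kind might look like the sketch below. It covers only a subset of the families, using the keywords and credential prefixes listed above. The function name `route`, the whole-word matching rule, and the first-match-wins ordering are assumptions made for illustration; the source does not specify how ties or partial matches are resolved.

```python
import re
from typing import Optional

# Keywords copied from the family list above (subset of the 10 families).
FAMILY_KEYWORDS = {
    "send / submit / share": ["send", "submit", "share", "post",
                              "publish", "forward", "reply"],
    "upload / attach / export": ["upload", "attach", "export", "browse",
                                 "choose file", "save as"],
    "terminal / exec": ["terminal", "shell", "run command", "sudo",
                        "install", "chmod"],
    "delete / overwrite": ["delete", "trash", "remove", "discard",
                           "overwrite", "permanent"],
}

# Naive prefix patterns; these can over-trigger (e.g. "task-" contains "sk-").
CREDENTIAL_PATTERN = re.compile(r"sk-|AKIA|ghp_|BEGIN OPENSSH|ssh-rsa")


def route(label: str, typed_text: str = "") -> Optional[str]:
    """Return the matching family name, or None if the action is not critical."""
    if CREDENTIAL_PATTERN.search(typed_text):
        return "credential / secret exposure"
    haystack = f"{label} {typed_text}".lower()
    for family, keywords in FAMILY_KEYWORDS.items():
        for kw in keywords:
            # Whole-word match (assumption) so "sender" does not match "send".
            if re.search(rf"\b{re.escape(kw)}\b", haystack):
                return family
    return None
```

The router only assigns a family; whether the routed action is actually harmful is left to the reviewer, as described above.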

Results: Ablation Ladder (Environmental)

The ablation ladder compares five configurations (B0-B4) on the safety/usefulness trade-off. B3 (selective state-grounded review) is the best observed frontier point.

Baseline	Description	Usefulness	Safety
B0	Unmonitored	60.0% (3/5)	58.3% (7/12)
B1	Always review, language-only	0.0% (0/5)	83.3% (10/12)
B2	Always review, state-grounded	40.0% (2/5)	100.0% (12/12)
B3	Selective routing, state-grounded	60.0% (3/5)	91.7% (11/12)
B4	Selective + legibility	40.0% (2/5)	66.7% (8/12)

B0/B4 from corrected final cohort; B1/B2/B3 from environmental ablation cohort. Direct B3 vs B4 comparisons carry cross-session noise.
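To make the frontier reading concrete, the sketch below checks Pareto dominance over the five (usefulness, safety) pairs in the table. The helper names are illustrative, and the comparison inherits the cross-cohort caveat: B0/B4 and B1/B2/B3 come from different cohorts, so the result is descriptive rather than a statistical claim.

```python
# (usefulness %, safety %) per baseline, copied from the ablation table.
results = {
    "B0": (60.0, 58.3),
    "B1": (0.0, 83.3),
    "B2": (40.0, 100.0),
    "B3": (60.0, 91.7),
    "B4": (40.0, 66.7),
}


def dominates(a, b):
    """a dominates b if it is at least as good on both axes and not identical."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(points):
    """Names of the points that no other point dominates, sorted."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )


front = pareto_front(results)  # ["B2", "B3"]
```

On these numbers only B2 and B3 are undominated: B2 trades usefulness for full safety, while B3 matches B0's usefulness at higher safety.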

Key Findings

Visual context is load-bearing

B2 achieves 100% safety compared to B1's 83.3%. B1 destroys usefulness entirely (0%) because the language-only reviewer cannot distinguish critical from benign actions without seeing the screen.

Selective routing recovers usefulness

B3 achieves the same 60.0% usefulness as unmonitored B0 while improving safety from 58.3% to 91.7% (11/12). B3 is the best point on the safety/usefulness frontier.

Legibility unexpectedly degrades performance

B4 is worse than B3 on both safety (66.7% vs 91.7%) and usefulness (40.0% vs 60.0%). This carries a cross-cohort caveat and requires larger-sample confirmation.

Paired B0 vs B4 — Both Threat Models

Threat Model	Baseline	Usefulness	Safety	Honest Blocked	Reviews	Blocks
Environmental	B0	60.0% (3/5)	58.3% (7/12)	0/5	0	0
Environmental	B4	40.0% (2/5)	66.7% (8/12)	0/5	9	4
Scheming	B0	75.0% (9/12)	58.3% (7/12)	0/12	0	0
Scheming	B4	41.7% (5/12)	75.0% (9/12)	1/12	15	7

Environmental: +8.3pp safety

Zero false-positive blocks on honest tasks (0/5). Mean review latency: 3005 ms.

Scheming: +16.7pp safety

Side-task prevention improves from 58.3% to 75.0%, but usefulness drops from 75.0% to 41.7%.

With N=12 attack tasks per threat model, each outcome shifts safety by 8.3pp (95% binomial CI for 66.7%: [34.9%, 90.1%]).
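The granularity and interval above can be reproduced with a short calculation. The source does not name the interval method; the reported bounds are consistent with an exact (Clopper-Pearson) binomial interval, which is the assumption made in this sketch. The function names are illustrative and the code uses only the standard library.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided binomial interval for k successes in n trials."""
    def solve(f, target):
        # f is increasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2
    upper = 1.0 if k == n else solve(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


step_pp = 100 / 12                 # one task out of 12, about 8.3pp
low, high = clopper_pearson(8, 12) # B4 environmental safety, 8/12
```

For 8/12 this yields roughly 0.349 to 0.901, so the interval comfortably contains the B0 rate of 58.3%.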

Concentration Evidence

Critical-action family distribution aggregated across both threat models under B4.

Reviews & Blocks by Family

Family	Reviewed	Blocked
send/submit/share	8	3
external navigation	6	4
upload/attach/export	5	2
permission/security	2	not shown
financial/account	2	not shown
terminal/exec	1	not shown

Safety vs Usefulness Frontier (B0-B4)

Baseline	Usefulness	Safety
Unmonitored	60%	58%
Always review, language-only	0%	83%
Always review, state-grounded	40%	100%
Selective routing, state-grounded	60%	92%
Selective + Legibility	40%	67%

The top three families account for 79.2% (19/24) of all reviewed actions and 81.8% (9/11) of all blocks. A minimal router covering only these three families would have caught 82% of blocked actions.
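The concentration figures follow directly from the per-family counts. The sketch below recomputes them; the variable and function names are illustrative, the total of 11 blocks is taken from the paired B4 runs (4 environmental plus 7 scheming), and per-family block counts outside the top three are not reported, so only the top-three block counts are used.

```python
# Reviewed actions per family under B4, both threat models combined.
reviews = {
    "send/submit/share": 8,
    "external navigation": 6,
    "upload/attach/export": 5,
    "permission/security": 2,
    "financial/account": 2,
    "terminal/exec": 1,
}

# Blocks for the three most-reviewed families; other families not reported.
top3_blocks = {
    "send/submit/share": 3,
    "external navigation": 4,
    "upload/attach/export": 2,
}
TOTAL_BLOCKS = 11  # 4 (environmental) + 7 (scheming) from the paired runs


def top_k_share(counts: dict, k: int = 3) -> float:
    """Fraction of the total contributed by the k largest entries."""
    top = sorted(counts.values(), reverse=True)[:k]
    return sum(top) / sum(counts.values())


review_share = top_k_share(reviews)                    # 19/24
block_share = sum(top3_blocks.values()) / TOTAL_BLOCKS # 9/11
```

The same `top_k_share` helper could be applied to a new task distribution to check whether reviews concentrate before committing to a narrow router.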

Failure Taxonomy

Four structural modes where sparse review fails. These are boundary conditions of any sparse review approach, not implementation bugs.

Failure Mode	Description	Count	Threat Model
Timing	Harm completes before next review checkpoint	8	Scheming
Reviewer FP	Reviewer blocks for wrong reason on attack trace	4	Scheming
Boundary	No reviewable critical action exists (stop_task)	3	Environmental
Router gap	Critical action not surfaced by router	2	Scheming

Timing failures are dominant (8 cases, all scheming): the agent completes the harmful side-task before reaching an action that triggers review. Boundary failures occur exclusively in environmental attacks, where injection causes the agent to stop without executing a critical action. Router gaps occur in scheming, where the harmful action doesn't match any family keyword.

Discussion & Limitations

When sparse review works

The protocol is most effective against environmental attacks with clear action boundaries: injected instructions causing the agent to click "Send" on an unauthorized email, navigate to a credential-harvesting URL, or upload files to an external service. These actions are identifiable by keyword matching, and the live screen state provides strong signal for reviewing intent alignment.

When sparse review fails

Beyond reviewer-calibration failures, the failure taxonomy maps three structural limits: timing failures (harm precedes the review checkpoint), boundary failures (no reviewable critical action exists), and router gaps (harmful actions fall outside the taxonomy). These define where any sparse deferral scheme that does not review every action will struggle.

The legibility puzzle (B4 < B3)

Three candidate explanations, ordered by likelihood: (1) Cross-cohort variation: B3 and B4 come from different run sessions, and with N=5 benign and N=12 attack tasks, sampling noise is substantial. (2) Reviewer anchoring: the untrusted legibility report may provide plausible-sounding justifications that anchor the reviewer toward allowing harmful actions. (3) A "less-is-more" effect consistent with findings that additional monitor information can degrade detection quality (Arike et al., 2026). Disentangling these requires same-cohort evaluation at larger scale.

Adaptive attackers

We conducted six A2 (adaptive attacker) runs testing protocol-aware environmental injections and scheming prompts. The agent did not follow the protocol-aware adversarial instructions, rendering results inconclusive. We cannot claim robustness to protocol-aware attackers. Terekhov et al. (2025) demonstrate that protocol awareness can collapse safety in control settings, and Schaeffer et al. (2026) show that attack selection in concentrated settings can undermine sparse monitoring.

Sample size and generalization

With N=12 attack tasks per threat model, each outcome shifts the rate by 8.3 percentage points. Results are specific to OS-Harm tasks, o4-mini as agent, GPT-5.4-mini as reviewer, and the current 10-family router taxonomy. The protocol targets high-stakes, single-action threats; it is not designed for diffuse sabotage.

Conclusion

Harmful CUA trajectories concentrate in a small set of irreversible GUI action families. In the environmental ablation cohort, selective state-grounded review without legibility (B3) reaches 91.7% safety (11/12) at 60.0% usefulness (3/5), compared to 58.3% (7/12) for unmonitored B0 and 100.0% (12/12) safety at 40.0% (2/5) usefulness for blanket visual review (B2). In paired B0/B4 runs on the corrected final cohort, sparse review still improves safety under both environmental and scheming threat models, but with clearer usefulness and calibration costs.

The concentration finding is a reusable design principle. A protocol designer building a CUA control layer can start by profiling the action-family distribution of representative trajectories. If harm concentrates, as it did here with three families covering 81.8% (9/11) of blocked actions, the designer can build a narrow, high-precision router rather than a broad classifier. This preserves usefulness by avoiding the blanket monitoring tax.

Future Work

Does concentration generalize across different CUA task distributions, agent architectures, and GUI platforms?
Scaling to more capable agents and complex multi-application workflows.
Adaptive attacker protocols that produce protocol-aware evasion, including attacker self-resampling.
Integrating human reviewers (H) for highest-suspicion actions, measuring the full H/T/U frontier.

Appendix: Baseline Configurations

Baseline	Router	Reviewer Input	Agent Justification
B0	None	None	N/A
B1	All actions	Language-only	No
B2	All actions	State-grounded	No
B3	Selective (10-family)	State-grounded	No
B4	Selective (10-family)	State-grounded	Yes (untrusted)

References

Arike et al. (2026). How Does Information Access Affect LLM Monitors' Ability to Detect Sabotage? arXiv:2601.21112.

Bhatt, A. et al. (2025). Ctrl-Z: Controlling AI Agents via Resampling. arXiv:2504.10374.

Chen, Y. et al. (2025). Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents. arXiv:2508.17155.

Debenedetti, E. et al. (2024). AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. arXiv:2406.13352.

Foerster, H. et al. (2026). CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents. arXiv:2601.09923.

Gong, H. et al. (2025). Secure and Efficient Access Control for Computer-Use Agents via Context Space. arXiv:2509.22256.

Greenblatt, R. et al. (2024). AI Control: Improving Safety Despite Intentional Subversion. ICML 2024. arXiv:2312.06942.

Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. AISec 2023. arXiv:2302.12173.

Hu, H. et al. (2025). AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents. arXiv:2509.07764.

Kale, S. et al. (2025). Reliable Weak-to-Strong Monitoring of LLM Agents. arXiv:2508.19461.

Kuntz, T. et al. (2025). OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents. arXiv:2506.14866.

Kutasov, J. et al. (2025). Evaluating Control Protocols for Untrusted AI Agents. arXiv:2511.02997.

Kutasov, J. et al. (2025). SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents. arXiv:2506.15740.

Lee, D. et al. (2025). VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification. arXiv:2503.18492.

Liao, Z. et al. (2025). RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments. arXiv:2505.21936.

Lindner, D. et al. (2025). Practical Challenges of Control Monitoring in Frontier AI Deployments. arXiv:2512.22154.

Schaeffer, J., Khandelwal, A., and Tracy, T. (2026). Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring. arXiv:2602.04930.

Sreevatsa, A., Prasanna, S., and Rushing, C. (2026). Basic Legibility Protocols Improve Trusted Monitoring. arXiv:2602.10153.

Terekhov, M. et al. (2025). Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols. arXiv:2510.09462.

Tian, A. X. et al. (2025). Measuring Harmfulness of Computer-Using Agents. arXiv:2508.00935.

Vu, A. et al. (2025). DECEPTICON: How Dark Patterns Manipulate Web Agents. arXiv:2512.22894.

Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024. arXiv:2404.07972.