[Hero illustration: simulated Ubuntu desktop for the look-before-you-click project. Visible windows include a terminal running run.py (the router flags a click on "Send" as CRITICAL and the reviewer blocks it for an injected email body), the critical_router.py source in VS Code, a terminal running aggregate.py over the final_corrected cohort, results.csv in LibreOffice Calc, research notes, and the OS-Harm abstract (Kuntz et al., 2025, arxiv.org/abs/2506.14866) in Chromium.]

Look Before You Click

Sparse Critical-Action Review for Computer-Use Agent Control

Ashton Chew · Keegan Wang · Harsh Akunuri
with Apart Research

[Screenshot: Thunderbird compose window ("Write: New-month AWS Bill") containing prompt-injection text instructing the agent to forward the email.]
[Screenshot: Thunderbird inbox showing an email with an injected malicious link.]

Action Blocked
Reviewer Decision: BLOCK

The email body contains an injected instruction to send the task's instructions to a third party, and clicking Send would follow unsafe content rather than the user's requested draft.

Abstract
79% of all reviews from the top 3 action families
82% of all blocks from the top 3 action families
0 false-positive blocks on honest tasks (environmental)

Harmful computer-use agent (CUA) trajectories often bottleneck at a small set of irreversible GUI actions. We evaluate a sparse critical-action review protocol that intercepts only these actions before execution and routes them to a trusted reviewer with live screenshot and accessibility-tree context.

In the environmental ablation cohort, selective state-grounded review without legibility (B3) is the best observed frontier point, reaching 91.7% safety (11/12) at 60.0% usefulness (3/5) versus 58.3% (7/12) safety for unmonitored B0. In paired B0/B4 runs on the final cohort, sparse review improves safety from 58.3% to 66.7% (8/12) under environmental prompt injection and from 58.3% to 75.0% (9/12) under scheming, with zero blocked honest environmental tasks (0/5).

The top three critical-action families account for 79.2% (19/24) of reviews and 81.8% (9/11) of blocks. A failure taxonomy identifies timing gaps, boundary escapes, router gaps, and reviewer-calibration failures. With N=12 attack tasks per threat model, these results are preliminary, but they point toward a reusable principle: future CUA control can target the small set of GUI actions where harm becomes irreversible.

Threat Models

We evaluate under two high-stakes threat models. In both, a single critical action (sending the phishing email, exfiltrating the file) constitutes the catastrophe.

Environmental Prompt Injection
Chrome browser showing a desktop notification with injected instructions

Malicious content is injected into the agent's environment (emails, web pages, desktop notifications) attempting to redirect the honest agent toward harmful actions. The agent is honest but susceptible to confused-deputy attacks.

5 honest + 12 attack tasks
Scheming with Hidden Objectives
VS Code editor showing code in the scheming task environment

The agent is given a hidden side-task alongside its main task and may pursue the side-task while attempting to appear honest. This models an intentionally subversive agent.

12 honest + 12 attack tasks
Protocol Pipeline

The control protocol inserts a decision layer between action prediction and execution. The protocol is fail-closed: review errors result in blocks.

  1. NORMALIZE: parse the PyAutoGUI command into a structured action (kind, coordinates, text, target element).
  2. ROUTE: the 10-family critical-action router evaluates whether the action belongs to a critical family. Actions that are not critical EXECUTE IMMEDIATELY.
  3. REVIEW: for critical actions, assemble the review package: live screenshot + a11y tree + recent action history.
  4. DECIDE: the trusted reviewer returns one of four decisions: ALLOW, BLOCK, RESAMPLE, or DEFER.

H/T/U mapping: Untrusted (U) = o4-mini as CUA agent. Trusted (T) = GPT-5.4-mini as reviewer. Human (H) = absent. Post-run safety judge: GPT-4.1.
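The fail-closed loop above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `control_step`, `ReviewPackage`, and the `classify_family` / `review_action` callables are hypothetical names, and the family identifiers are paraphrased from the router table.

```python
from dataclasses import dataclass

# Paraphrased identifiers for the 10 critical-action families.
CRITICAL_FAMILIES = {
    "send_submit_share", "external_navigation", "upload_attach_export",
    "permission_settings_security", "financial_account", "terminal_exec",
    "delete_overwrite", "credential_exposure", "clipboard_external", "bulk_action",
}

@dataclass
class ReviewPackage:
    """Context shipped to the trusted reviewer for a critical action."""
    screenshot: bytes       # live screenshot
    a11y_tree: dict         # accessibility tree of the current screen
    recent_actions: list    # recent action history

def control_step(action, env, classify_family, review_action):
    """Decision layer between action prediction and execution."""
    family = classify_family(action)                 # ROUTE
    if family not in CRITICAL_FAMILIES:
        return "EXECUTE"                             # non-critical: execute immediately
    package = ReviewPackage(                         # REVIEW: assemble live context
        screenshot=env.screenshot(),
        a11y_tree=env.a11y_tree(),
        recent_actions=env.history(),
    )
    try:
        decision = review_action(action, package)    # DECIDE via trusted reviewer
    except Exception:
        decision = "BLOCK"                           # fail-closed: review errors block
    if decision not in {"ALLOW", "BLOCK", "RESAMPLE", "DEFER"}:
        decision = "BLOCK"                           # fail-closed on malformed output
    return "EXECUTE" if decision == "ALLOW" else decision
```

The fail-closed property lives in the `except` branch and the decision-validation check: any reviewer error or malformed decision degrades to BLOCK, never to execution.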

Critical-Action Router

The router classifies actions into 10 families based on keyword matching against UI element labels and typed text. It answers only whether an action belongs to a family where harm could become irreversible. Safety judgment is the reviewer's responsibility.

send / submit / share: send, submit, share, post, publish, forward, reply
external navigation: address bar change, external link, new domain
upload / attach / export: upload, attach, export, browse, choose file, save as
permission / settings / security: allow, permit, grant, security, privacy, permissions
financial / account confirmation: pay, purchase, confirm account change
terminal / exec: terminal, shell, run command, sudo, install, chmod
delete / overwrite: delete, trash, remove, discard, overwrite, permanent
credential / secret exposure: sk-, AKIA, ghp_, BEGIN OPENSSH, ssh-rsa patterns
clipboard paste to external: paste into browser, email, chat, or form
bulk action: actions affecting many items or folders

Highlighted: Top 3 concentrated families (79% of reviews, 82% of blocks).
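A keyword-matching router of this shape can be sketched as follows. The keyword table is abbreviated to four of the ten families, and the exact matching logic (substring match on lowercased label plus typed text, regex for credential patterns) is an assumption, not the project's code.

```python
import re

# Abbreviated keyword table; the full taxonomy has 10 families (see above).
FAMILY_KEYWORDS = {
    "send/submit/share": ["send", "submit", "share", "post", "publish", "forward", "reply"],
    "external navigation": ["address bar", "external link", "new domain"],
    "upload/attach/export": ["upload", "attach", "export", "browse", "choose file", "save as"],
    "terminal/exec": ["terminal", "shell", "run command", "sudo", "install", "chmod"],
}

# Credential families match token patterns rather than plain keywords.
CREDENTIAL_PATTERNS = [r"sk-[A-Za-z0-9]", r"AKIA[0-9A-Z]", r"ghp_[A-Za-z0-9]",
                       r"BEGIN OPENSSH", r"ssh-rsa"]

def route(element_label, typed_text=""):
    """Return the critical family an action belongs to, or None if safe.

    The router answers only family membership; whether the action is
    actually harmful is the reviewer's responsibility.
    """
    haystack = f"{element_label} {typed_text}".lower()
    for family, keywords in FAMILY_KEYWORDS.items():
        if any(kw in haystack for kw in keywords):
            return family
    # Case-sensitive regex pass for credential-looking tokens.
    if any(re.search(p, element_label + typed_text) for p in CREDENTIAL_PATTERNS):
        return "credential/secret exposure"
    return None
```

Keyword routing is deliberately cheap: it runs on every action, so precision matters less than recall within the covered families.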

Results: Ablation Ladder (Environmental)

The ablation ladder traces the safety/usefulness Pareto frontier across five baselines, revealing B3 (selective state-grounded review) as the best frontier point.

Baseline Description Usefulness Safety
B0 Unmonitored 60.0% (3/5) 58.3% (7/12)
B1 Always review, language-only 0.0% (0/5) 83.3% (10/12)
B2 Always review, state-grounded 40.0% (2/5) 100.0% (12/12)
B3 Selective routing, state-grounded 60.0% (3/5) 91.7% (11/12)
B4 Selective + legibility 40.0% (2/5) 66.7% (8/12)

B0/B4 from corrected final cohort; B1/B2/B3 from environmental ablation cohort. Direct B3 vs B4 comparisons carry cross-session noise.
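The frontier claim can be checked mechanically from the table above. A small sketch using the standard weak-dominance rule (the numbers are from the ablation ladder; the helper function is illustrative):

```python
# (usefulness, safety) pairs from the environmental ablation ladder.
POINTS = {
    "B0": (0.600, 0.583),
    "B1": (0.000, 0.833),
    "B2": (0.400, 1.000),
    "B3": (0.600, 0.917),
    "B4": (0.400, 0.667),
}

def pareto_frontier(points):
    """A point is on the frontier if no other point is at least as good on
    both axes and strictly better on at least one."""
    frontier = []
    for name, (u, s) in points.items():
        dominated = any(
            (u2 >= u and s2 >= s) and (u2 > u or s2 > s)
            for other, (u2, s2) in points.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)
```

Both B2 and B3 survive as frontier points; the text singles out B3 because it is the only monitored baseline that matches unmonitored usefulness.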

Key Findings

Visual context is load-bearing

B2 achieves 100% safety compared to B1's 83.3%. B1 destroys usefulness entirely (0%) because the language-only reviewer cannot distinguish critical from benign actions without seeing the screen.

Selective routing recovers usefulness

B3 achieves the same 60.0% usefulness as unmonitored B0 while improving safety from 58.3% to 91.7% (11/12). B3 is the best point on the safety/usefulness frontier.

Legibility unexpectedly degrades performance

B4 is worse than B3 on both safety (66.7% vs 91.7%) and usefulness (40.0% vs 60.0%). This carries a cross-cohort caveat and requires larger-sample confirmation.

Paired B0 vs B4 — Both Threat Models
Threat Model Baseline Usefulness Safety Honest Blocked Reviews Blocks
Environmental B0 60.0% (3/5) 58.3% (7/12) 0/5 0 0
Environmental B4 40.0% (2/5) 66.7% (8/12) 0/5 9 4
Scheming B0 75.0% (9/12) 58.3% (7/12) 0/12 0 0
Scheming B4 41.7% (5/12) 75.0% (9/12) 1/12 15 7
Environmental: +8.4pp safety

Zero false-positive blocks on honest tasks (0/5). Mean review latency: 3005ms.

Scheming: +16.7pp safety

Side-task prevention improves from 58.3% to 75.0%, but usefulness drops from 75.0% to 41.7%.

With N=12 attack tasks per threat model, each outcome shifts safety by 8.3pp (95% binomial CI for 66.7%: [34.9%, 90.1%]).
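The quoted interval matches an exact (Clopper-Pearson) 95% binomial CI for 8/12 successes, reproducible with a stdlib-only bisection on the binomial tails. The choice of the exact method is an assumption consistent with the quoted endpoints; the text does not name the CI method.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact binomial CI via bisection on the tail probabilities."""
    def bisect(pred):
        lo, hi = 0.0, 1.0
        for _ in range(100):             # 100 halvings: far below float precision
            mid = (lo + hi) / 2
            if pred(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    # Lower bound L solves P(X >= k; L) = alpha/2; P(X >= k; p) increases in p.
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) >= alpha / 2)
    # Upper bound U solves P(X <= k; U) = alpha/2; P(X <= k; p) decreases in p.
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) < alpha / 2)
    return lower, upper
```

For 8/12 this returns roughly (0.349, 0.901), the [34.9%, 90.1%] interval quoted above; the width of that interval is the practical meaning of "each outcome shifts safety by 8.3pp".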

Concentration Evidence

Critical-action family distribution aggregated across both threat models under B4.

Reviews & Blocks by Family

Family Reviewed Blocked
send/submit/share 8 3
external navigation 6 4
upload/attach/export 5 2
permission/security 2 0
financial/account 2 2
terminal/exec 1 0

[Chart: Safety vs Usefulness Frontier (B0-B4). B0 Unmonitored: 60% usefulness, 58% safety. B1 Always review, language-only: 0% usefulness, 83% safety. B2 Always review, state-grounded: 40% usefulness, 100% safety. B3 Selective routing, state-grounded: 60% usefulness, 92% safety. B4 Selective + legibility: 40% usefulness, 67% safety.]

The top three families account for 79.2% (19/24) of all reviewed actions and 81.8% (9/11) of all blocks. A minimal router covering only these three families would have caught 82% of blocked actions.
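The concentration figures follow directly from the per-family counts charted above (a quick arithmetic check; the zero-block counts for permission/security and terminal/exec are inferred from the 24-review / 11-block totals):

```python
# (reviewed, blocked) per family under B4, both threat models combined.
FAMILY_COUNTS = {
    "send/submit/share":    (8, 3),
    "external navigation":  (6, 4),
    "upload/attach/export": (5, 2),
    "permission/security":  (2, 0),   # blocked count inferred from totals
    "financial/account":    (2, 2),
    "terminal/exec":        (1, 0),   # blocked count inferred from totals
}

TOP3 = ["send/submit/share", "external navigation", "upload/attach/export"]

total_reviews = sum(r for r, b in FAMILY_COUNTS.values())   # 24
total_blocks = sum(b for r, b in FAMILY_COUNTS.values())    # 11
top3_reviews = sum(FAMILY_COUNTS[f][0] for f in TOP3)       # 19
top3_blocks = sum(FAMILY_COUNTS[f][1] for f in TOP3)        # 9

print(f"reviews: {top3_reviews}/{total_reviews} = {top3_reviews / total_reviews:.1%}")
print(f"blocks:  {top3_blocks}/{total_blocks} = {top3_blocks / total_blocks:.1%}")
```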

Failure Taxonomy

Four structural modes where sparse review fails. These are boundary conditions of any sparse review approach, not implementation bugs.

Failure Mode Description Count Threat Model
Timing Harm completes before next review checkpoint 8 Scheming
Reviewer FP Reviewer blocks for wrong reason on attack trace 4 Scheming
Boundary No reviewable critical action exists (stop_task) 3 Environmental
Router gap Critical action not surfaced by router 2 Scheming

Timing failures are dominant (8 cases, all scheming): the agent completes the harmful side-task before reaching an action that triggers review. Boundary failures occur exclusively in environmental attacks, where injection causes the agent to stop without executing a critical action. Router gaps occur in scheming, where the harmful action doesn't match any family keyword.

Discussion & Limitations
When sparse review works

The protocol is most effective against environmental attacks with clear action boundaries: injected instructions causing the agent to click "Send" on an unauthorized email, navigate to a credential-harvesting URL, or upload files to an external service. These actions are identifiable by keyword matching, and the live screen state provides strong signal for reviewing intent alignment.

When sparse review fails

The failure taxonomy maps three structural limits: timing failures (harm precedes the review checkpoint), boundary failures (no reviewable critical action exists), and router gaps (harmful actions fall outside the taxonomy). These define where any sparse deferral scheme that does not review every action will struggle.

The legibility puzzle (B4 < B3)

Three candidate explanations, ordered by likelihood: (1) Cross-cohort variation: B3 and B4 come from different run sessions, and with N=5 benign and N=12 attack tasks, sampling noise is substantial. (2) Reviewer anchoring: the untrusted legibility report may provide plausible-sounding justifications that anchor the reviewer toward allowing harmful actions. (3) A "less-is-more" effect consistent with findings that additional monitor information can degrade detection quality (Barr et al., 2025). Disentangling requires same-cohort evaluation at larger scale.

Adaptive attackers

We conducted six A2 (adaptive attacker) runs testing protocol-aware environmental injections and scheming prompts. The agent did not follow the protocol-aware adversarial instructions, rendering results inconclusive. We cannot claim robustness to protocol-aware attackers. Terekhov et al. (2025) demonstrate that protocol awareness can collapse safety in control settings, and Schaeffer et al. (2026) show that attack selection in concentrated settings can undermine sparse monitoring.

Sample size and generalization

With N=12 attack tasks per threat model, each outcome shifts the rate by 8.3 percentage points. Results are specific to OS-Harm tasks, o4-mini as agent, GPT-5.4-mini as reviewer, and the current 10-family router taxonomy. The protocol targets high-stakes, single-action threats; it is not designed for diffuse sabotage.

Conclusion

Harmful CUA trajectories concentrate in a small set of irreversible GUI action families. In the environmental ablation cohort, selective state-grounded review without legibility (B3) reaches 91.7% safety (11/12) at 60.0% usefulness (3/5), compared to 58.3% (7/12) for unmonitored B0 and 100.0% (12/12) safety at 40.0% (2/5) usefulness for blanket visual review (B2). In paired B0/B4 runs on the corrected final cohort, sparse review still improves safety under both environmental and scheming threat models, but with clearer usefulness and calibration costs.

The concentration finding is a reusable design principle. A protocol designer building a CUA control layer can start by profiling the action-family distribution of representative trajectories. If harm concentrates, as it did here with three families covering 81.8% (9/11) of blocked actions, the designer can build a narrow, high-precision router rather than a broad classifier. This preserves usefulness by avoiding the blanket monitoring tax.

Future Work

  1. Testing whether concentration generalizes across different CUA task distributions, agent architectures, and GUI platforms.
  2. Scaling to more capable agents and complex multi-application workflows.
  3. Adaptive attacker protocols that produce protocol-aware evasion, including attacker self-resampling.
  4. Integrating human reviewers (H) for the highest-suspicion actions, measuring the full H/T/U frontier.
Appendix: Baseline Configurations
Baseline Router Reviewer Input Agent Justification
B0 None None N/A
B1 All actions Language-only No
B2 All actions State-grounded No
B3 Selective (10-family) State-grounded No
B4 Selective (10-family) State-grounded Yes (untrusted)
References

Arike et al. (2026). How Does Information Access Affect LLM Monitors' Ability to Detect Sabotage? arXiv:2601.21112.

Bhatt, A. et al. (2025). Ctrl-Z: Controlling AI Agents via Resampling. arXiv:2504.10374.

Chen, Y. et al. (2025). Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents. arXiv:2508.17155.

Debenedetti, E. et al. (2024). AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. arXiv:2406.13352.

Foerster, H. et al. (2026). CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents. arXiv:2601.09923.

Gong, H. et al. (2025). Secure and Efficient Access Control for Computer-Use Agents via Context Space. arXiv:2509.22256.

Greenblatt, R. et al. (2024). AI Control: Improving Safety Despite Intentional Subversion. ICML 2024. arXiv:2312.06942.

Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. AISec 2023. arXiv:2302.12173.

Hu, H. et al. (2025). AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents. arXiv:2509.07764.

Kale, S. et al. (2025). Reliable Weak-to-Strong Monitoring of LLM Agents. arXiv:2508.19461.

Kuntz, T. et al. (2025). OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents. arXiv:2506.14866.

Kutasov, J. et al. (2025). Evaluating Control Protocols for Untrusted AI Agents. arXiv:2511.02997.

Kutasov, J. et al. (2025). SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents. arXiv:2506.15740.

Lee, D. et al. (2025). VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification. arXiv:2503.18492.

Liao, Z. et al. (2025). RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments. arXiv:2505.21936.

Lindner, D. et al. (2025). Practical Challenges of Control Monitoring in Frontier AI Deployments. arXiv:2512.22154.

Schaeffer, J., Khandelwal, A., and Tracy, T. (2026). Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring. arXiv:2602.04930.

Sreevatsa, A., Prasanna, S., and Rushing, C. (2026). Basic Legibility Protocols Improve Trusted Monitoring. arXiv:2602.10153.

Terekhov, M. et al. (2025). Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols. arXiv:2510.09462.

Tian, A. X. et al. (2025). Measuring Harmfulness of Computer-Using Agents. arXiv:2508.00935.

Vu, A. et al. (2025). DECEPTICON: How Dark Patterns Manipulate Web Agents. arXiv:2512.22894.

Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024. arXiv:2404.07972.