Semantic Policies for CUA Tools

An evaluation of syntactic, semantic, and contextual policy enforcement for Computer Use Agent (CUA) security across domains, task scales, and adversarial conditions.

Headline metrics (Contextual policy):

Contextual F1: 0.973
Injection Detection: 94.0%
Task Utility: 72.1%
False Positive Rate: 0.0%

Main Results

Policy	Safety Recall	Precision	F1	FPR	Utility	Injection Detection	Task Completion
None	0.000	0.000	0.000	0.000	1.000	0.000	1.000
Syntactic	0.444	0.977	0.611	0.005	0.860	0.197	0.451
Semantic	0.560	1.000	0.718	0.000	0.830	0.306	0.358
Contextual	0.947	1.000	0.973	0.000	0.721	0.940	0.206
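
The F1 column can be cross-checked against the Safety Recall and Precision columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall, which the source does not state explicitly. The names `RESULTS`, `f1_score`, and `recomputed_f1` are illustrative and not part of the original evaluation code.

```python
# Precision and recall per policy, copied from the Main Results table.
RESULTS = {
    "None": {"recall": 0.000, "precision": 0.000},
    "Syntactic": {"recall": 0.444, "precision": 0.977},
    "Semantic": {"recall": 0.560, "precision": 1.000},
    "Contextual": {"recall": 0.947, "precision": 1.000},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1).

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recomputed_f1() -> dict:
    """Recompute F1 for every policy, rounded to three decimals like the table."""
    return {
        name: round(f1_score(row["precision"], row["recall"]), 3)
        for name, row in RESULTS.items()
    }


if __name__ == "__main__":
    for policy, f1 in recomputed_f1().items():
        print(f"{policy}: F1 = {f1:.3f}")
```

Under that assumption, the recomputed values agree with the reported F1 column for all four policies (0.000, 0.611, 0.718, 0.973).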

[Figures: Safety-Utility Pareto Frontier; Policy Comparison (Radar); F1 Score by Policy Level; Injection Detection Rate]
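
The Safety-Utility Pareto Frontier can be approximated directly from the Main Results table. The following is a minimal sketch that assumes safety is measured by the F1 column and utility by the Utility column; the source does not say which axes the chart uses. `POINTS`, `dominates`, and `pareto_frontier` are hypothetical names introduced for illustration.

```python
# (policy, safety F1, utility) triples copied from the Main Results table.
# Using F1 as the safety axis is an assumption, not stated in the source.
POINTS = [
    ("None", 0.000, 1.000),
    ("Syntactic", 0.611, 0.860),
    ("Semantic", 0.718, 0.830),
    ("Contextual", 0.973, 0.721),
]


def dominates(a, b) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a[1] >= b[1] and a[2] >= b[2]
    strictly_better = a[1] > b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_frontier(points) -> list:
    """Return the names of all points that no other point dominates."""
    return [
        p[0]
        for p in points
        if not any(dominates(q, p) for q in points)
    ]


if __name__ == "__main__":
    print(pareto_frontier(POINTS))
```

With these axes, every policy in the table lies on the frontier: each step up in safety F1 comes with a lower utility, so no policy dominates another.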