Semantic Policies for CUA Tools

Evaluating syntactic, semantic, and contextual policy enforcement for Computer Use Agent security across domains, task scales, and adversarial conditions.

Contextual F1: 0.973
Injection Detection: 94.0%
Task Utility: 72.1%
False Positive Rate: 0.0%

Main Results

Policy      Safety Recall  Precision  F1     FPR    Utility  Injection Detection  Task Completion
None        0.000          0.000      0.000  0.000  1.000    0.000                1.000
Syntactic   0.444          0.977      0.611  0.005  0.860    0.197                0.451
Semantic    0.560          1.000      0.718  0.000  0.830    0.306                0.358
Contextual  0.947          1.000      0.973  0.000  0.721    0.940                0.206
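
As a quick consistency check, the F1 column can be recomputed from the Safety Recall and Precision columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall (the source does not state its formula); the names RESULTS and f1_score are illustrative, not part of the original evaluation code.

```python
# (safety recall, precision) per policy, copied from the Main Results table.
RESULTS = {
    "None": (0.000, 0.000),
    "Syntactic": (0.444, 0.977),
    "Semantic": (0.560, 1.000),
    "Contextual": (0.947, 1.000),
}


def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall (assumed definition).

    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for policy, (recall, precision) in RESULTS.items():
    print(f"{policy:<10} F1 = {f1_score(recall, precision):.3f}")
```

Under that assumption the recomputed values round to the reported 0.000, 0.611, 0.718, and 0.973.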

[Figures: Safety-Utility Pareto Frontier; Policy Comparison (Radar); F1 Score by Policy Level; Injection Detection Rate]
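
The safety-utility trade-off behind the Pareto frontier chart can be checked against the Main Results table. The sketch below assumes the chart's axes are Safety Recall and Utility, which the source does not confirm; POINTS, dominates, and pareto_frontier are hypothetical names used only for illustration.

```python
# (safety recall, utility) per policy, copied from the Main Results table.
# Assumption: these are the two axes of the Pareto frontier chart.
POINTS = {
    "None": (0.000, 1.000),
    "Syntactic": (0.444, 0.860),
    "Semantic": (0.560, 0.830),
    "Contextual": (0.947, 0.721),
}


def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(points: dict) -> list:
    """Return the names of all points that no other point dominates."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


print(pareto_frontier(POINTS))
```

With these values, each step up in policy level raises safety recall and lowers utility, so no policy dominates another and all four are on the frontier.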