Evaluating syntactic, semantic, and contextual policy enforcement for Computer Use Agent security across domains, task scales, and adversarial conditions.

| Policy | Safety Recall | Precision | F1 | FPR | Utility | Injection Detection | Task Completion |
|---|---|---|---|---|---|---|---|
| None | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 |
| Syntactic | 0.444 | 0.977 | 0.611 | 0.005 | 0.860 | 0.197 | 0.451 |
| Semantic | 0.560 | 1.000 | 0.718 | 0.000 | 0.830 | 0.306 | 0.358 |
| Contextual | 0.947 | 1.000 | 0.973 | 0.000 | 0.721 | 0.940 | 0.206 |
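
The F1 column can be cross-checked against the Precision and Safety Recall columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall, with the 0/0 case defined as 0; the source does not state how the metrics are computed, so the function `f1_score` and the `rows` dictionary are illustrative names, not part of the original evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition).

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, safety recall, reported F1), copied from the table above.
rows = {
    "None": (0.000, 0.000, 0.000),
    "Syntactic": (0.977, 0.444, 0.611),
    "Semantic": (1.000, 0.560, 0.718),
    "Contextual": (1.000, 0.947, 0.973),
}

for policy, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{policy:<10} computed={computed:.3f} reported={reported:.3f}")
```

Under this assumption the recomputed values agree with the reported F1 column to three decimal places. The zero-division guard covers the None row, where both precision and recall are 0.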