A large-scale simulation study of 2,500 synthetic agent skill packages across 8 categories and 10 vulnerability classes, calibrated to publicly reported ecosystem properties.
Agent skills are modular packages that contain SKILL.md instructions and optional bundled scripts and are distributed via public marketplaces. Despite rapid adoption, the security posture of these skill packages remains largely uncharacterized.
We model a marketplace of N = 2,500 agent skill packages. Each skill is characterized by its category, code complexity, number of bundled scripts, requested permissions, popularity tier, and vetting status. A simulated multi-layer vulnerability scanner evaluates each skill against 10 vulnerability classes. The simulation uses a fixed random seed for full reproducibility.
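For each skill-vulnerability pair, the detection probability is computed as: p_v = min(0.95, r_v * m_{c,v} * log(complexity + 1) / log(101) * (1 + 0.08 * n_perms) * f_vet), where r_v is the base rate for vulnerability class v, m_{c,v} is the multiplier for skill category c and class v, complexity is the skill's code complexity, n_perms is its number of requested permissions, and f_vet is the vetting reduction factor (1.0 for unreviewed, 0.65 for auto-scanned, 0.30 for human-reviewed). The complexity term equals 1 at a complexity of 100, and the probability is capped at 0.95.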
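The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual implementation: the function and variable names, the example inputs, the seed value, and the single Bernoulli draw used to turn a probability into a finding are all assumptions; only the formula and the three vetting factors come from the text.

```python
import math
import random

# Vetting reduction factors f_vet, as stated in the text.
VETTING_FACTOR = {
    "unreviewed": 1.0,
    "auto-scanned": 0.65,
    "human-reviewed": 0.30,
}


def detection_probability(base_rate, category_multiplier, complexity, n_perms, vetting):
    """Return p_v for one skill-vulnerability pair, capped at 0.95."""
    complexity_term = math.log(complexity + 1) / math.log(101)
    permission_term = 1 + 0.08 * n_perms
    p = (base_rate * category_multiplier * complexity_term
         * permission_term * VETTING_FACTOR[vetting])
    return min(0.95, p)


def is_flagged(p, rng):
    """Assumed sampling step: one Bernoulli draw with success probability p."""
    return rng.random() < p


# Hypothetical inputs: r_v = 0.10, m_{c,v} = 1.5, complexity = 100, 3 permissions.
rng = random.Random(42)  # hypothetical seed; the study's seed is not given
p = detection_probability(0.10, 1.5, 100, 3, "auto-scanned")
print(f"p_v = {p:.4f}, flagged = {is_flagged(p, rng)}")
```

With these hypothetical inputs the complexity term is exactly 1, so p_v reduces to 0.10 * 1.5 * 1.24 * 0.65. Passing a `random.Random` instance explicitly, instead of using the module-level generator, is one way to keep a fixed-seed run reproducible.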
Vulnerability prevalence is broken down along five dimensions: vulnerability classes, skill categories, vetting status, popularity tiers, and code complexity.
Detailed numerical results from the simulation study.
| Metric | Value |
|---|---|
| Skills scanned | 2,500 |
| Vulnerable skills | 1,899 |
| Overall prevalence | 0.7596 |
| Critical prevalence | 0.2748 |
| High-or-critical prevalence | 0.5216 |
| Total vulnerabilities | 3,863 |
| Mean vulns per skill | 1.5452 |
| Mean vulns per vulnerable skill | 2.0342 |

| Vulnerability Class | Prevalence | Critical Share | High Share | Medium Share | Low Share | Count |
|---|---|---|---|---|---|---|
| Missing input validation | 0.2992 | 0.0481 | 0.1832 | 0.4398 | 0.3289 | 748 |
| Excessive permissions | 0.2932 | 0.1173 | 0.2606 | 0.3752 | 0.2469 | 733 |
| Supply chain integrity | 0.2044 | 0.2505 | 0.3190 | 0.3190 | 0.1115 | 511 |
| Prompt injection | 0.1680 | 0.2405 | 0.3810 | 0.2929 | 0.0857 | 420 |
| Credential leakage | 0.1636 | 0.3374 | 0.3227 | 0.2445 | 0.0954 | 409 |
| Path traversal | 0.1216 | 0.1842 | 0.3487 | 0.3026 | 0.1645 | 304 |
| Data exfiltration | 0.1196 | 0.3579 | 0.2408 | 0.2843 | 0.1171 | 299 |
| Arbitrary code execution | 0.0860 | 0.4186 | 0.3442 | 0.2093 | 0.0279 | 215 |
| Dependency confusion | 0.0572 | 0.3077 | 0.3217 | 0.2937 | 0.0769 | 143 |
| Insecure deserialization | 0.0324 | 0.3333 | 0.2593 | 0.2963 | 0.1111 | 81 |
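
The summary metrics follow directly from the per-class counts, which the short check below reproduces. It is a sanity check on the reported numbers, not part of the study's pipeline; the variable names are illustrative, and the counts are copied from the tables above.

```python
N_SKILLS = 2500
VULNERABLE_SKILLS = 1899

# Per-class finding counts, copied from the vulnerability-class table.
class_counts = {
    "Missing input validation": 748,
    "Excessive permissions": 733,
    "Supply chain integrity": 511,
    "Prompt injection": 420,
    "Credential leakage": 409,
    "Path traversal": 304,
    "Data exfiltration": 299,
    "Arbitrary code execution": 215,
    "Dependency confusion": 143,
    "Insecure deserialization": 81,
}

total_vulnerabilities = sum(class_counts.values())
overall_prevalence = VULNERABLE_SKILLS / N_SKILLS
mean_per_skill = total_vulnerabilities / N_SKILLS
mean_per_vulnerable_skill = total_vulnerabilities / VULNERABLE_SKILLS

# Class prevalence is the class count divided by the number of skills scanned.
class_prevalence = {name: count / N_SKILLS for name, count in class_counts.items()}

print(total_vulnerabilities)
print(round(overall_prevalence, 4))
print(round(mean_per_skill, 4))
print(round(mean_per_vulnerable_skill, 4))
```

Because a skill can carry several vulnerability classes at once, the class prevalences sum to more than the overall prevalence.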

| Category | N | Vulnerable | Prevalence | Critical Prevalence | Mean Vulns |
|---|---|---|---|---|---|
| Security tools | 153 | 124 | 0.8105 | 0.3203 | 1.7386 |
| System admin | 292 | 233 | 0.7979 | 0.3185 | 1.8014 |
| Web automation | 361 | 284 | 0.7867 | 0.2659 | 1.6205 |
| Data analysis | 408 | 314 | 0.7696 | 0.2794 | 1.6152 |
| File management | 243 | 183 | 0.7531 | 0.2551 | 1.4897 |
| Misc | 232 | 172 | 0.7414 | 0.2457 | 1.4828 |
| Communication | 247 | 183 | 0.7409 | 0.2794 | 1.4170 |
| Coding | 564 | 406 | 0.7199 | 0.2606 | 1.3670 |

| Vetting Status | N | Prevalence | Critical Prevalence |
|---|---|---|---|
| Unreviewed | 1,386 | 0.8586 | 0.3341 |
| Auto-scanned | 771 | 0.7302 | 0.2374 |
| Human-reviewed | 343 | 0.4257 | 0.1195 |

| Popularity | N | Prevalence | Critical Prevalence |
|---|---|---|---|
| Low | 1,403 | 0.8076 | 0.2937 |
| Medium | 698 | 0.7249 | 0.2564 |
| High | 272 | 0.6691 | 0.2537 |
| Very High | 127 | 0.6142 | 0.2126 |

| Complexity Tier | N | Prevalence |
|---|---|---|
| Tiny (<50 lines) | 774 | 0.6370 |
| Small (50-200) | 1,124 | 0.7891 |
| Medium (200-500) | 391 | 0.8568 |
| Large (500-2000) | 194 | 0.8608 |
| Very Large (2000+) | 17 | 1.0000 |
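
Each breakdown partitions the same 2,500 skills, so the size-weighted average of the tier prevalences should recover the overall prevalence. The sketch below checks this for the complexity tiers; it is an illustrative consistency check with hypothetical variable names, using only values copied from the table above.

```python
# (tier label, number of skills, prevalence), copied from the complexity table.
tiers = [
    ("Tiny (<50 lines)", 774, 0.6370),
    ("Small (50-200)", 1124, 0.7891),
    ("Medium (200-500)", 391, 0.8568),
    ("Large (500-2000)", 194, 0.8608),
    ("Very Large (2000+)", 17, 1.0000),
]

total_skills = sum(n for _, n, _ in tiers)

# Expected number of vulnerable skills implied by each tier's prevalence.
vulnerable_estimate = sum(n * prevalence for _, n, prevalence in tiers)

weighted_prevalence = vulnerable_estimate / total_skills

print(total_skills)
print(round(vulnerable_estimate))
print(round(weighted_prevalence, 4))
```

The same check can be run on the vetting-status or popularity table by swapping in that table's rows.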

Primary findings from the simulation-based measurement study of agent skill security.
75.96% of all simulated skills contain at least one vulnerability, with a mean of 1.5452 vulnerabilities per skill. This is substantially higher than in mature package ecosystems such as npm (10-15%).
27.48% of skills contain at least one critical-severity vulnerability, and 52.16% contain at least one high- or critical-severity issue. Arbitrary code execution has the largest share of findings rated critical, at 41.86%.
Missing input validation (29.92% prevalence) and excessive permissions (29.32%) are the most common. Supply chain integrity gaps affect 20.44% of skills.
Security tools (81.05%) and system administration (79.79%) are the most vulnerable skill categories. Paradoxically, security tools rank highest in the ecosystem by prevalence.
Human-reviewed skills show 42.57% prevalence versus 85.86% for unreviewed skills, an absolute reduction of 43.29 percentage points. However, only 13.7% of marketplace skills (343 of 2,500) have human review.
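The vetting comparison is simple arithmetic on the vetting-status table, worked through below. The relative reduction is a derived figure that the study does not report itself, and the variable names are illustrative.

```python
# Prevalence and counts copied from the vetting-status table.
unreviewed_prevalence = 0.8586
human_reviewed_prevalence = 0.4257
human_reviewed_skills = 343
total_skills = 2500

# Absolute reduction, in percentage points.
absolute_reduction_pp = (unreviewed_prevalence - human_reviewed_prevalence) * 100

# Relative reduction (derived here, not reported in the study).
relative_reduction = (
    (unreviewed_prevalence - human_reviewed_prevalence) / unreviewed_prevalence
)

# Share of the marketplace that has human review.
human_review_share = human_reviewed_skills / total_skills

print(round(absolute_reduction_pp, 2))
print(round(relative_reduction, 4))
print(round(human_review_share * 100, 1))
```

Under these numbers, human review roughly halves prevalence relative to unreviewed skills, but it covers only about one skill in seven.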
Prevalence rises with code size, from 63.70% for tiny skills (<50 lines) to 86.08% for large skills (500-2000 lines), and reaches 100% for very large skills (2000+ lines), although that tier contains only 17 skills.