The 96K-Skill Wild Scan: Methodology Walkthrough
How we collected 96,096 production SKILL.md files across four registries, ran ATR v2.1.1 detection, and surfaced 751 confirmed malicious instances with audit-grade reproducibility.
When people ask whether ATR rules generalize beyond synthetic benchmarks, the honest answer is: only if we measure them on production data. This is the walkthrough of how we did that across 96,096 real SKILL.md files.
The Corpus
We collected publicly accessible skills from four registries between 2026-03 and 2026-05:
- OpenClaw: 56,480 skills
- ClawHub: 36,378 skills
- Skills.sh: 3,115 skills
- Hermes: 123 skills
- Total: 96,096 SKILL.md files
Each file was fetched via the registry's public API, normalized to canonical SKILL.md frontmatter + body structure, and stored with a content-addressed hash for deduplication. No private repos, no scraped credentials, no terms-of-service violations.
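The normalization-and-dedup step is simple enough to sketch. The frontmatter split convention, field names, and helper names below are illustrative assumptions, not the actual scan pipeline:

```python
# Minimal sketch of the ingest step: split a fetched SKILL.md into
# frontmatter + body, then key it by a content-addressed hash so
# byte-identical skills dedupe to one corpus entry.
import hashlib

def normalize_skill(raw: str) -> dict:
    """Split a SKILL.md into YAML frontmatter and body, both stripped."""
    frontmatter, body = "", raw
    if raw.startswith("---"):
        parts = raw.split("---", 2)
        if len(parts) == 3:
            _, frontmatter, body = parts
    return {"frontmatter": frontmatter.strip(), "body": body.strip()}

def content_hash(skill: dict) -> str:
    """Hash the canonical form; identical content yields an identical key."""
    canonical = skill["frontmatter"] + "\n---\n" + skill["body"]
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

corpus: dict[str, dict] = {}

def ingest(raw: str) -> None:
    skill = normalize_skill(raw)
    corpus.setdefault(content_hash(skill), skill)  # dedup on hash match
```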
Detection Pass
We ran ATR v2.1.1 (336 rules, MIT-licensed) against every normalized skill. The detection produced an initial hit set, which a human reviewer then triaged into three buckets: benign, malicious, indeterminate. Indeterminate cases got a second pass with payload extraction; if a working C2 endpoint or credential exfil chain could be demonstrated, it moved to malicious.
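A minimal sketch of that detect-then-triage loop, assuming a regex-style matcher (real ATR rules may be richer than regex) and hypothetical class names:

```python
# Sketch of the detection pass. `Rule` and the three-bucket verdict field
# model the description above; the shapes are assumptions, not ATR's
# actual rule engine.
import re
from dataclasses import dataclass

@dataclass
class Rule:
    rule_id: str
    pattern: re.Pattern  # hypothetical: a regex stand-in for a real rule

@dataclass
class Hit:
    rule_id: str
    skill_hash: str
    verdict: str = "indeterminate"  # benign | malicious | indeterminate

def detect(rules: list[Rule], corpus: dict[str, dict]) -> list[Hit]:
    """Run every rule over every normalized skill; return the raw hit set."""
    hits = []
    for skill_hash, skill in corpus.items():
        text = skill["frontmatter"] + "\n" + skill["body"]
        for rule in rules:
            if rule.pattern.search(text):
                hits.append(Hit(rule.rule_id, skill_hash))
    return hits

# Human triage happens downstream: a reviewer sets each Hit's verdict, and
# indeterminate hits get a payload-extraction second pass.
```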
The 751
Across the corpus, 751 skills were confirmed malicious. They clustered into three systematic campaigns rather than scattered one-offs, which is the more telling result:
| Campaign | Skills | Signature |
| --- | --- | --- |
| ClawHavoc | 600+ (1,184 indicators) | C2 endpoint 91.92.242.30 |
| AMOS infostealer | 314 | hightower6eu exfil host |
| MedusaLocker PoC | reused tooling | Cato Networks attribution chain |
Every malicious finding has an ATR rule ID + payload signature attached, so anyone can re-run the detection and reproduce the verdict.
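As a sketch of what reproduction looks like, the published campaign indicators from the table above can be re-checked against any stored skill. The helper below is hypothetical; the indicator values are taken from the table:

```python
# Re-verification sketch: confirm a stored skill still carries a published
# campaign indicator verbatim. Indicator strings come from the table above.
CAMPAIGN_INDICATORS = {
    "ClawHavoc": ["91.92.242.30"],
    "AMOS infostealer": ["hightower6eu"],
}

def reverify(skill_text: str, campaign: str) -> bool:
    """True if any published indicator for the campaign appears in the skill."""
    return any(ind in skill_text for ind in CAMPAIGN_INDICATORS[campaign])
```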
The 432-Skill Labelled Benign Corpus
A recall number without a false-positive number is theater. We hand-labelled 432 known-benign skills (popular Anthropic-published skills, skills from well-known maintainers, and simple productivity skills) and measure the per-rule FP rate against them on every release. ATR v2.1.1 sits at 0.20% FP on that corpus. That is the regression gate every rule PR has to clear before merge.
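A sketch of that gate, reusing the `Rule` shape from the detection sketch above. Gating each rule at the corpus-level 0.20% figure is an assumption about how the threshold is applied:

```python
# FP regression gate, sketched: fail the build if any rule fires on more
# than `threshold` of the labelled benign corpus. The 0.20% figure is from
# the text; applying it per rule is an assumption.
def fp_gate(rules: list, benign_corpus: dict, threshold: float = 0.0020) -> None:
    n = len(benign_corpus)
    for rule in rules:
        fps = sum(1 for s in benign_corpus.values()
                  if rule.pattern.search(s["body"]))
        rate = fps / n
        assert rate <= threshold, (
            f"{rule.rule_id}: FP {rate:.2%} exceeds gate {threshold:.2%}"
        )
```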
Recall
For recall we use NVIDIA Garak's inthewild_jailbreak_llms set (666 samples). ATR v2.1.1 lands at 97.1% recall. We report recall and FP separately on different corpora because mixing them inflates both numbers.
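Recall on that set reduces to a single ratio; a sketch, again reusing the `Rule` shape from above:

```python
# Recall sketch: the fraction of known-malicious samples that at least one
# rule fires on. 97.1% is the reported figure for ATR v2.1.1.
def recall(rules: list, samples: list[str]) -> float:
    detected = sum(
        1 for text in samples
        if any(rule.pattern.search(text) for rule in rules)
    )
    return detected / len(samples)
```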
Why This Is Audit-Grade
Three properties:
1. Public-derived corpus — anyone can re-fetch the same registries and reproduce the input set within a tolerance (a hash-overlap check is sketched below).
2. Numbered results — 96,096 / 751 / 432 / 97.1% / 0.20% are real cardinalities, not rounded marketing.
3. Separated metrics — recall on Garak, FP on labelled benign, no mixing.
The scan scripts are public, the rule corpus is MIT, the campaign indicators are published. If you can fetch the registries, you can re-derive every number on this page.
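To make property 1's "within a tolerance" claim concrete, a hash-overlap check between two fetches might look like this; the 0.99 tolerance is illustrative, not a published number:

```python
# Minimal sketch: compare content-hash sets from two fetches of the same
# registries. The 0.99 tolerance is illustrative, not a published number.
def corpus_overlap(hashes_a: set[str], hashes_b: set[str]) -> float:
    """Jaccard overlap between two sets of content-addressed hashes."""
    union = hashes_a | hashes_b
    return len(hashes_a & hashes_b) / len(union) if union else 1.0

def input_set_reproduced(hashes_a: set[str], hashes_b: set[str],
                         tolerance: float = 0.99) -> bool:
    return corpus_overlap(hashes_a, hashes_b) >= tolerance
```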