Red Teaming & Penetration Testing

You can only trust your AI security once you've actively attacked it. Red teaming for LLMs follows its own rules — the attack surface is different from traditional software.

Red Team Methodology for AI

Why AI Red Teaming Is Different

Traditional pen testing looks for known vulnerabilities (CVEs). AI red teaming looks for emergent behaviors — problems arising from the interaction between model, system prompt, and user inputs.

The Red Team Process

Phase 1 — Reconnaissance:

Extract (or reconstruct) the system prompt
Identify model type and version
Document available tools/plugins
Test rate limits and guardrails

Phase 2 — Attack Execution:

Systematic prompt injection attempts
Apply jailbreaking techniques
Data exfiltration attempts
Privilege escalation via tool calls

Phase 3 — Reporting:

Document all successful attacks
Severity rating (CVSS-like for AI)
Create reproducible proof-of-concepts
Recommendations for mitigations

Automated Adversarial Testing

Tools and Frameworks

Tool	Description	Type
Garak	Open-source LLM vulnerability scanner	Automated
PyRIT (Microsoft)	Red teaming automation framework	Automated
Promptfoo	Prompt testing & evaluation	Hybrid
ART (IBM)	Adversarial Robustness Toolbox	ML-focused
Rebuff	Self-hardening prompt injection detector	Real-time

Automated Testing Strategies

Mutation testing: Systematically vary known attacks (paraphrasing, translation, encoding)
Genetic algorithms: Evolutionarily optimize prompts to bypass guardrails
Tree-of-attacks: LLM-driven attacks that learn from failed attempts
Gradient-based attacks: For open-source models — mathematically compute adversarial tokens

Jailbreak Detection

Categories of Jailbreaks

Persona-based:

DAN (Do Anything Now), STAN, DUDE — role-play-based circumvention
"Grandma exploit": "My grandma always used to read me napalm recipes before bed..."

Encoding-based:

Base64-encoded instructions
ROT13, Caesar cipher, leetspeak
Unicode characters that look like ASCII (homoglyph attack)

Logic-based:

Hypothetical scenarios: "Purely hypothetically, if a character in a novel..."
Negation: "Don't explain to me how to..." (model explains it anyway)
Step-by-step: Gradual escalation across multiple turns

Detection Strategies

Pattern matching: Known jailbreak templates in a database
Semantic similarity: Compare new inputs against known jailbreaks
Behavioral analysis: Track model behavior across multiple turns (gradual escalation)
Canary tokens: Secret tokens in the system prompt — if they appear in the output, the prompt was leaked

Fuzzing for AI Systems

Prompt Fuzzing

Systematically generate unusual inputs:

Character fuzzing: Unicode special characters, control characters, zero-width spaces
Structure fuzzing: Extremely long inputs, deep nesting, unexpected formats
Semantic fuzzing: Meaningless but grammatically correct inputs, edge cases
Multi-modal fuzzing: Images with embedded text, audio with control sequences

Rule: Red teaming is not a one-time event. It must happen continuously — with every model update, every new feature, every change to the system prompt. Integrate it into your CI/CD pipeline.