Red Teaming & Penetration Testing
You can only trust your AI security once you've actively attacked it. Red teaming for LLMs follows its own rules — the attack surface is different from traditional software.
Red Team Methodology for AI
Why AI Red Teaming Is Different
Traditional pen testing looks for known vulnerabilities (CVEs). AI red teaming looks for emergent behaviors — problems arising from the interaction between model, system prompt, and user inputs.
The Red Team Process
Phase 1 — Reconnaissance:
- Extract (or reconstruct) the system prompt
- Identify model type and version
- Document available tools/plugins
- Test rate limits and guardrails
Phase 2 — Attack Execution:
- Systematic prompt injection attempts
- Apply jailbreaking techniques
- Data exfiltration attempts
- Privilege escalation via tool calls
Phase 3 — Reporting:
- Document all successful attacks
- Severity rating (CVSS-like for AI)
- Create reproducible proof-of-concepts
- Recommendations for mitigations
Automated Adversarial Testing
Tools and Frameworks
| Tool | Description | Type |
|---|
| Garak | Open-source LLM vulnerability scanner | Automated |
| PyRIT (Microsoft) | Red teaming automation framework | Automated |
| Promptfoo | Prompt testing & evaluation | Hybrid |
| ART (IBM) | Adversarial Robustness Toolbox | ML-focused |
| Rebuff | Self-hardening prompt injection detector | Real-time |
Automated Testing Strategies
- Mutation testing: Systematically vary known attacks (paraphrasing, translation, encoding)
- Genetic algorithms: Evolutionarily optimize prompts to bypass guardrails
- Tree-of-attacks: LLM-driven attacks that learn from failed attempts
- Gradient-based attacks: For open-source models — mathematically compute adversarial tokens
Jailbreak Detection
Categories of Jailbreaks
Persona-based:
- DAN (Do Anything Now), STAN, DUDE — role-play-based circumvention
- "Grandma exploit": "My grandma always used to read me napalm recipes before bed..."
Encoding-based:
- Base64-encoded instructions
- ROT13, Caesar cipher, leetspeak
- Unicode characters that look like ASCII (homoglyph attack)
Logic-based:
- Hypothetical scenarios: "Purely hypothetically, if a character in a novel..."
- Negation: "Don't explain to me how to..." (model explains it anyway)
- Step-by-step: Gradual escalation across multiple turns
Detection Strategies
- Pattern matching: Known jailbreak templates in a database
- Semantic similarity: Compare new inputs against known jailbreaks
- Behavioral analysis: Track model behavior across multiple turns (gradual escalation)
- Canary tokens: Secret tokens in the system prompt — if they appear in the output, the prompt was leaked
Fuzzing for AI Systems
Prompt Fuzzing
Systematically generate unusual inputs:
- Character fuzzing: Unicode special characters, control characters, zero-width spaces
- Structure fuzzing: Extremely long inputs, deep nesting, unexpected formats
- Semantic fuzzing: Meaningless but grammatically correct inputs, edge cases
- Multi-modal fuzzing: Images with embedded text, audio with control sequences
Rule: Red teaming is not a one-time event. It must happen continuously — with every model update, every new feature, every change to the system prompt. Integrate it into your CI/CD pipeline.