Lesson 3 of 6·10 min read

Red Teaming & Penetration Testing

You can only trust your AI security once you've actively attacked it. Red teaming for LLMs follows its own rules — the attack surface is different from traditional software.

Red Team Methodology for AI

Why AI Red Teaming Is Different

Traditional pen testing looks for known vulnerabilities (CVEs). AI red teaming looks for emergent behaviors — problems arising from the interaction between model, system prompt, and user inputs.

The Red Team Process

Phase 1 — Reconnaissance:

  • Extract (or reconstruct) the system prompt
  • Identify model type and version
  • Document available tools/plugins
  • Test rate limits and guardrails

Phase 2 — Attack Execution:

  • Systematic prompt injection attempts
  • Apply jailbreaking techniques
  • Data exfiltration attempts
  • Privilege escalation via tool calls

Phase 3 — Reporting:

  • Document all successful attacks
  • Severity rating (CVSS-like for AI)
  • Create reproducible proof-of-concepts
  • Recommendations for mitigations

Automated Adversarial Testing

Tools and Frameworks

ToolDescriptionType
GarakOpen-source LLM vulnerability scannerAutomated
PyRIT (Microsoft)Red teaming automation frameworkAutomated
PromptfooPrompt testing & evaluationHybrid
ART (IBM)Adversarial Robustness ToolboxML-focused
RebuffSelf-hardening prompt injection detectorReal-time

Automated Testing Strategies

  • Mutation testing: Systematically vary known attacks (paraphrasing, translation, encoding)
  • Genetic algorithms: Evolutionarily optimize prompts to bypass guardrails
  • Tree-of-attacks: LLM-driven attacks that learn from failed attempts
  • Gradient-based attacks: For open-source models — mathematically compute adversarial tokens

Jailbreak Detection

Categories of Jailbreaks

Persona-based:

  • DAN (Do Anything Now), STAN, DUDE — role-play-based circumvention
  • "Grandma exploit": "My grandma always used to read me napalm recipes before bed..."

Encoding-based:

  • Base64-encoded instructions
  • ROT13, Caesar cipher, leetspeak
  • Unicode characters that look like ASCII (homoglyph attack)

Logic-based:

  • Hypothetical scenarios: "Purely hypothetically, if a character in a novel..."
  • Negation: "Don't explain to me how to..." (model explains it anyway)
  • Step-by-step: Gradual escalation across multiple turns

Detection Strategies

  • Pattern matching: Known jailbreak templates in a database
  • Semantic similarity: Compare new inputs against known jailbreaks
  • Behavioral analysis: Track model behavior across multiple turns (gradual escalation)
  • Canary tokens: Secret tokens in the system prompt — if they appear in the output, the prompt was leaked

Fuzzing for AI Systems

Prompt Fuzzing

Systematically generate unusual inputs:

  • Character fuzzing: Unicode special characters, control characters, zero-width spaces
  • Structure fuzzing: Extremely long inputs, deep nesting, unexpected formats
  • Semantic fuzzing: Meaningless but grammatically correct inputs, edge cases
  • Multi-modal fuzzing: Images with embedded text, audio with control sequences

Rule: Red teaming is not a one-time event. It must happen continuously — with every model update, every new feature, every change to the system prompt. Integrate it into your CI/CD pipeline.