AI and LLM penetration testing

Research-led AI pentest, the prompts, policies, and trust paths attackers will take.

CREST-accredited researchers test LLM applications and agentic systems for prompt injection, policy bypass, data exfiltration, tool-call abuse, and the seven vulnerabilities catalogued in the Cloak Honey Trap paper (USENIX Security 2025).

GET YOUR SCOPING CALL

Talk to a security expert

Trusted by AI teams across Fintech, SaaS & Education, Enterprise & Telecom, Security & Critical Infrastructure

Airbase
Quiltt
Pacvue
Imagine Learning

Why this matters

LLM bugs are not OWASP bugs. Most pentest firms still ship the same playbook.

  • Generic pentest firms paste prompt-injection wordlists from 2024 and call it AI testing. Modern guardrails laugh that off.

  • Agentic systems with tool calls fail at the seam: a benign prompt plus a poisoned doc plus a permissive tool equals exfil.

  • The OWASP LLM Top 10 lists the categories; finding them in your actual stack takes someone who has chained them in production.

Here is what our researchers ship.

Why teams pick us

Chained AI attacks, not prompt-injection wordlists.

  • Cloak-Honey-Trap coverage

    All 7 agent vulnerabilities, 6 strategies, and 15 techniques from the USENIX paper. Tested as chains, not isolated prompts.

  • Agentic tool-call abuse

    Cross-tool injection, RAG poisoning, scoped credential leak, MCP boundary bypass.

  • Pentester-grade reporting

    Reproducible payload chains, business-impact tagging, fix paths for guardrail tuning.

How it works

From threat model to report in two weeks.

  1. Scope the AI surface

    Models, tools, RAG sources, deployment topology, customer-data boundary. Mapped to threat model on the call.

  2. Researchers chain attacks

    Direct and indirect prompt injection, policy bypass, exfil, tool-call abuse, MCP attacks.

  3. Findings with payload chains

    Each finding ships with the prompt, the policy gap, the data path, and the guardrail fix.

Inside the engagement

Built for AI teams shipping every week.

  • OWASP LLM Top 10

    Tested in your actual stack, not a sandbox model. Findings tagged to LLM01 through LLM10.

  • Agentic and MCP

    Cross-tool, cross-agent, and MCP boundary tested. Cloak Honey Trap techniques applied.

  • RAG and data

    RAG poisoning, vector-store leakage, retrieval-time injection, training-data exfil paths.

Research ledger,

Coordinated disclosures from SL7 research.

The same researchers chain attacks on your AI stack.

Full advisories index

What founders say

Thank you for being our pentest partners. Our user base is safer because of y'all.
Vinay Hiremath

Vinay Hiremath

Co-founder, Loom

View tweet

Common questions

What AI teams ask before they sign.

What gets tested?
LLM applications (chatbots, copilots, agents), agentic systems with tools, RAG pipelines, MCP servers.
Which threat catalogues do you cover?
OWASP LLM Top 10, MITRE ATLAS, and the Cloak Honey Trap (USENIX Security 2025) taxonomy.
Do you test the model or just the app?
Both. Model-level bypass plus app-level guardrail bypass. The interesting bugs live at the seam.
Is it safe on production?
Prod-safe by default. Destructive prompts and policy-violation tests run on stage unless approved.
What about agent tool-call abuse?
Yes. Cross-tool injection, scope creep, credential leak across tools, MCP boundary breaks.

Ready to test the AI surface attackers will?

20-minute scoping call with a researcher who has chained these bugs in production. Models, agents, RAG, and MCP boundaries.

CREST · CERT-In · SOC 2 · ISO 27001