LLM-Agent Hacking: A High-Effort, Low-Reward Gamble

OWASP Gen AI Threat Intelligence Team Releases "LLM Exploits Generation" v1.0

The OWASP Gen AI Threat Intelligence Team has published version 1.0 of its "LLM Exploits Generation" research, evaluating how Large Language Model (LLM) agents can be used to automate vulnerability exploitation and perform common hacking tasks. The research involved creating LLM-based agents and having them perform six hacking tasks against OWASP Juice Shop, a deliberately vulnerable web application. Three LLMs — Claude, ChatGPT-4o, and local DeepSeek R1 (14B, 32B, and 70B parameter variants) — were evaluated within the Cybench framework, which defines the tasks, the resources they require, and an evaluation system for comparing performance.

Cybench Framework Overview

Cybench provides a structured way to evaluate LLM agents' offensive security capabilities. It consists of three core components:

  1. Task Descriptions: Clearly defined security challenges within a containerized environment where agents can execute bash commands and interact with resources.

  2. Starter Files: Pre-configured local and remote resources needed to complete tasks.

  3. Evaluation System: Determines success based on achieving security objectives, operational success, and flag-based completion. It also tracks performance metrics like token usage and execution time.
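The loop Cybench puts an agent through — propose a command, execute it in the container, check for a success flag — can be sketched in a few lines. This is a minimal, hypothetical illustration: the stubbed `query_model` function and the flag-based check are assumptions about the general pattern, not Cybench's actual API.

```python
import subprocess

def query_model(history):
    """Stub for an LLM call; a real agent would query Claude/GPT-4o here,
    passing the transcript so far and receiving the next bash command."""
    return "echo flag{demo}"  # hypothetical model output

def run_task(flag, max_iterations=5):
    """Minimal agent loop: propose a command, run it, test for the flag."""
    history = []
    for _ in range(max_iterations):
        command = query_model(history)
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        history.append((command, result.stdout))
        if flag in result.stdout:       # flag-based completion check
            return True, len(history)   # success, iterations used
    return False, len(history)

success, iterations = run_task("flag{demo}")
print(success, iterations)
```

The iteration count returned here corresponds to the performance metrics the framework tracks (token usage and execution time grow with every loop).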

Tested OWASP Juice Shop Challenges

The study assessed LLM agents on the following hacking tasks:

  1. Access a Confidential Document: Retrieve a protected file.

  2. Perform a DOM XSS Attack: Inject and execute malicious JavaScript.

  3. Exposed Metrics: Identify an endpoint leaking usage data.

  4. Five-Star Feedback: Post feedback under another user's identity.

  5. Login-Jim: Gain unauthorized access to a test user account.

  6. Reflected XSS: Execute a reflected cross-site scripting attack.
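To make one of these tasks concrete, consider the "Exposed Metrics" challenge, where the agent must find an endpoint leaking usage data. A first step an agent might take is enumerating likely paths. The base URL (Juice Shop's default port 3000) and the candidate paths below are assumptions for illustration, not taken from the study.

```python
from urllib.parse import urljoin

# Default local Juice Shop address (assumption for this sketch).
BASE_URL = "http://localhost:3000"

# Paths an agent might probe when hunting for leaked data (illustrative).
CANDIDATE_PATHS = ["/metrics", "/api-docs", "/ftp"]

def build_probe_urls(base=BASE_URL):
    """Return the full URLs an agent could request while hunting for
    an endpoint that leaks usage data."""
    return [urljoin(base, path) for path in CANDIDATE_PATHS]

for url in build_probe_urls():
    print(url)
```

An agent would then fetch each URL and inspect the responses — which is exactly the kind of repeated, noisy probing the findings below flag as a weakness.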

Key Findings

Despite fears that LLMs could empower script kiddies, the study revealed significant barriers to practical offensive use:

Threat Actor Challenges: Level of skills and access required

  • Advanced LLMs Perform Better: ChatGPT-4o led the evaluation, with Claude following closely. However, local models like DeepSeek failed to complete any Cybench tasks.

  • Higher Level of Access Needed: While API-accessible models like GPT-4o and Claude could complete Cybench tasks, local models such as DeepSeek failed entirely, implying that attackers would need (potentially stolen) credentials for cloud-based LLMs.

  • Skill and Access Barriers: The study suggests that achieving advanced LLM hacking capabilities requires access to high-end LLM providers, which increases an attacker's risk of discovery. Low-skill actors (script kiddies) cannot use LLMs effectively for hacking because of the complexity and the amount of handholding involved; achieving results while avoiding detection demands mid-level to advanced skills.

Lack of Stealth

  • LLM-based scanning and exploitation require many iterations, generating traffic noisy enough to trip detection tools. Without a sophisticated proxy network or botnet, an attacker risks triggering security alerts.

High Cost and Inefficiency Due to Inconsistent Success Detection

  • While LLMs could execute certain exploits, their performance was unreliable and required extensive human guidance.

  • Compared to traditional tools like Metasploit or Burp Suite, LLM-based hacking incurs high costs from repeated token usage. If an LLM gets stuck in an unsuccessful loop, costs escalate with no result. This is exacerbated by the fact that LLM agents struggle to recognize successful exploitation without explicit "capture the flag" markers, making automation unreliable.
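The cost argument can be made concrete with back-of-the-envelope arithmetic. The per-token prices and token counts below are illustrative assumptions, not figures from the study; the point they demonstrate is that cost grows faster than linearly with iterations, because the agent re-sends its growing transcript on every step.

```python
# Illustrative API pricing (USD per 1K tokens) — assumed, not quoted.
PRICE_PER_1K_INPUT = 0.005
PRICE_PER_1K_OUTPUT = 0.015

def run_cost(iterations, input_tokens=3000, output_tokens=500):
    """Estimated cost of `iterations` agent steps. The context re-sent
    each step grows with the transcript, which is what makes a stuck
    loop expensive."""
    total = 0.0
    for i in range(iterations):
        context = input_tokens * (i + 1)   # transcript grows each step
        total += context / 1000 * PRICE_PER_1K_INPUT
        total += output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    return round(total, 2)

print(run_cost(5), run_cost(20))
```

Under these assumptions, quadrupling the iteration count (5 to 20) increases cost by more than 10x, while a scripted Metasploit module would cost effectively nothing per retry.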

Conclusion

OWASP’s findings indicate that while LLMs can assist in hacking tasks, they are far from a magic bullet for cybercriminals, particularly those with limited skills and budgets. While AI-driven hacking is a real concern, practical barriers limit its immediate threat. Organizations should still monitor AI advancements, but can take comfort in the current challenges facing LLM-based offensive security techniques.

Thank you: OWASP AI CTI Team
