LLM Prompt Injection
Large Language Model Prompt Injection Attack
Introduction
Prompt injection is the highest profile vulnerability in AI-powered features and applications. However, the impact varies greatly depending on who will use the feature, what data is accessible, and what functionality is exposed to the LLM. This guide aims to assist developers in creating secure AI-powered applications and features by helping them understand the actual risks of prompt injection.
Risk Factors
In order for Prompt Injection to be a security risk, there must be two existing components:
Untrusted Input
Impactful Functionality
The problem is that it's easy to miss all the ways untrusted input can be consumed by an AI system, and it's easy to overlook how a feature can be used to impact security.
*Note: Technically for deception risks, only Untrusted Input is requied.
Impactful functionality can be broken down into two major risk categories:
Unauthorized Data Access
Internal-only data
Other users' data
Intellectual property
Communications data (emails, direct messages, etc.)
State-Changing Actions
Modify permissions
Modify users, groups, or orgs
Note on Scope:
The focus of this guide is solely on the security implications associated with prompt injections. Trust, bias, and ethical considerations related to LLM outputs, while important in their own right, are outside the purview of this discussion.
Background
Understanding the Basics:
What is Prompt Injection?
Prompt injection is a hacking technique where malicious users provide misleading input that manipulates the output of an AI system.
Consequences:
To highlight the security impact as opposed to that of trust, bias, and ethical impact, we will focus on attacks that can lead to effects on the Confidentiality, Integrity, and Availability (CIA) of an application:
Confidentiality: Exposing sensitive data or user information.
Integrity: Misleading information or harmful actions.
Availability: Potential disruptions or denial-of-service.
Key Terms
External Input is any input into the system that isn't inserted by the direct interaction of the user. The most common example would be a browser-capability, but it could also include ingesting application error logs, users' queries in the app, or consuming users' object fields like address or title.
State-changing actions are those which can create, delete, or modify any object within the system. This encompasses users, content, application settings, and other pivotal data or configurations.
Out-of-bound requests refer to any attempt to access, retrieve, or manipulate data or services that are not intended for the user or the specific function of the application. These types of requests might attempt to interact with services outside the designated parameters or even reach out to external systems and services that are not typically accessible.
Internal-only data pertains to data that is meant exclusively for internal application use or administrative oversight. It's data that regular users or external entities shouldn't access or view. This can include system configurations, logs, admin-level user details, or any other sensitive information that, if exposed, can compromise the security or integrity of the application.
Do I need to worry about prompt injection?
This is a list of questions to answer about the feature or application being developed to determine if prompt injection is an issue. The flowchart is a visual representation, but the questions are likely better for printing or copying and pasting.
1) Untrusted input
A. Who has access to the application or feature? Groups can be considered inclusive of the groups above them. It's assumed that if non-admin employees can access something, that admins can also access it.
Employees (Admins only) -- if selected, the potential risk is "external input", but only if 1B is yes and if there is functionality from section Two
Employees (Non-Admins) -- if selected, the potential risk is "internal threat", but only if there is functionality from section Two
Users -- if selected, the potential risk is users utilizing prompt injection with their input, but only if there is functionality from section Two
B. Does the application for feature consume or utilize ANY external input.
External Input is any input into the system that isn't inserted by the direct interaction of the user. The most common example would be a browser-capability, but could also include ingesting application error logs, users' queries in the app, or consuming users' object fields like address or title.
Yes -- if selected, the potential risk is the input contains a prompt injection payload which could utilize impactful functionality from section Two or deceptively control the reply to the end user
No
2) Impactful Functionality
A. What user data is utilized as a part of the feature or application?
None
Current User Only -- if selected, the potential risk is the feature/app isn't authorized properly and the current user can access other users' data
Other User's Data or All User Data -- if selected, the risk is "accessing other users' data" as well
B. Does the application or feature have the ability to do state-changing actions? State-changing actions are those which can create, delete, or modify any object within the system. This encompasses users, content, application settings
Yes -- if selected the risk is that the untrusted input would allow an attacker to "make malicious state-changing actions such as modifying users, sending emails, etc"
No
C. Does the application or feature have the ability to make out-of-bound requests? Out-of-bound requests refer to any attempt to access, retrieve, or manipulate data or services that are not intended for the user or the specific funct
Yes -- if selected the risk is that the untrusted input would allow an attacker to "exfiltrate sensitive data such as other users' PII, internal-only data, etc"
No
D. Has the model been fine-tuned, given access to embeddings, or have a look-up feature for internal-only data? Internal-only data is data that regular users or external entities shouldn't access or view. This can include intellec
Yes -- if selected the risk is that the untrusted input would allow an attacker to "access internal-only data"
No
Attack Scenarios
It's often extremely helpful to know the specific attack scenerios that could occur. Below is a non-exhaustive table of attack scenarios. Find the row for the Untrusted Input that your application or feature allows and match it up with the column of the Impactful Functionality that exists in your app or feature.
The markdown table under the image can be hard to read depending on the screen width you're reading this on, so here is an image as well.
New Vectors for Traditional Web Vulnerabilities
The traditional vulnerabilities that can be found in web and mobile applications can now frequently be achieved through prompt injection. This section goes over those avenues of attack for your consideration.
SSRF (Server-Side Request Forgery):
If the AI app or feature can make HTTP requests to fetch data, an attacker could potentially request data from internal hosts. In a cloud environment, the metadata service might be accessible, granting the attacker full control.
Example prompt:
SQL Injection:
If the AI feature interfaces with databases and does not properly sanitize inputs, it could be vulnerable to SQL injections. Malicious inputs could manipulate the database queries, leading to unauthorized data access or modification.
Example prompt:
Remote Code Execution (RCE):
If the AI has capabilities to execute or evaluate code snippets provided by users, there's a risk that attackers might provide malicious code, leading to potential breaches or compromise of the hosting server.
Example prompt:
Cross-Site Scripting (XSS):
Should the AI system have a web interface where it displays output based on user input, there's a potential for XSS attacks. Unsuspecting users might get served malicious scripts that steal their session data or other sensitive information.
Example prompt:
Insecure Direct Object References (IDOR):
If the AI interacts with objects (like files or database entries) based on user input, there's a chance for IDOR. Attackers could potentially access or modify objects they're not supposed to.
Example prompt:
Other Vulnerabilities
The list above is not exhaustive. It's an example of the most common, impactful security vulnerabilities and how they can manifest through prompt injection. Other vulnerabilities are likely to be possible as well.
Mitigations
Existing Solutions
While perfect mitigation against prompt injection isn't yet a guarantee, tools such as Nvidia's NeMo and protectai's Rebuff have shown significant progress in tackling this risk. If your application is highly susceptible to prompt injection attacks, these tools could offer a layer of protection.
Dual-LLM Approach
In situations where impactful functionality is vital, it would be advantageous to consider implementing the 'Dual LLM' design initially discussed by Simon Willison in his blog post. Although the suggestion that AI cannot mitigate prompt injection risk is contested, Willison's insights provide valuable reading on the topic.
High-level Mitigation Principles
Shared Authorization: The user and AI-powered feature or application should share authorization. Doing so reduces the odds of unauthorized data access. Practically, they should share an authentication token or session. This tackles the problem of unauthorized access to data for API calls and database calls, but also may help mitigate SSRF risks. It could also be applicable to RCE if the user has auth to a "workspace" where their code can be executed such as how OpenAI's Code Interpreter works.
Read-Only: When possible, restrict the feature to read-only access. For example, when using an AI-powered feature to hit an API or make database calls, be sure the permissions are read-only.
Sandboxing: If executing code is required, a nearly perfect sandbox would be required. OpenAI's Code Interpreter pulls it off, but it's a hard problem to solve. Be wary!
Rate-limiting: There is a risk of utlizing the model for a user's own biddings rather than the intended use of the application. This can be harmful as it incurs costs on the business for an illegitimate user. I've heard stories of this inflicting as much as $25,000. Rate-limiting each user will limit this impact. Also, in my prompt injection security research, I found myself tweaking the same prompt over and over to achieve a goal of prompt injection, and fuzzing with many payloads. Both of these can be detected and prevented through stringent rate-limiting.
Multi-modal Prompt Injection
Image
Image-processing generative AI can also be suceptible to prompt injection leading to all the implications that we've discussed above. Johann was the first to share it in Bard:
https://twitter.com/wunderwuzzi23/status/1679676160341581824
There is multi-modal GPT-4 coming out soon, and it will have the same issue.
Image processing functionality does OCR (optical character recognition), and as they are still LLMs under the hood, prompt injection is possible via text on the image.
Voice
Naturally, voice gets parsed as text, so voice-based prompt injection is viable.
Video
There aren't video-based multi-modal models yet. I assume the image+text models can effectively do video processing if handling the video frame-by-frame, but that's different from processing the video as a single input file. However, video is simply images and voice (which reduces to text) so prompt injection would be viable for video models as well.
Hacking on AI-Powered Apps
To utilize this guide for pentesting or bug hunting, remember that understanding an application's or feature's prompt handling processes is key to uncovering prompt injection opportunities. Here's how you can structure your approach:
Identify and Understand Untrusted Inputs: Using this guide, figure out all the ways untrusted input can find its way into the AI system. Direct methods like prompt interactions or more subtle methods like support chatbots are excellent places to start. If the application offers more advanced interaction methods like web browsing or email processing, try those of course.
Identify Potentially Impactful Functionality Abilities: Recognize the possibilities of existing impactful functionalities that can wreak havoc if manipulated. These could be unauthorized data access including but not limited to internal-only data or other users' personal data. State-altering actions such as permission change or tinkering with users, groups, and organizations also fall under this category.
Various Prompt Injection Attacks: Based on the promptmap project, I'd suggest testing the full spectrum of possible prompt injection attacks:
Basic Injection: Start with the simplest form and ask the AI to execute a state-changing action or leak confidential data.
Translation Injection: Try manipulating the system in multiple languages.
Context Switch: Explore the possibility of asking something related to its primary task, then pivot into an unrelated harmful request.
External Prompt Injection: Remember to explore how external input processed by LLM could be manipulated to inject malicious prompts.
Explore other Vulnerabilities: Using the primer's guide, see if other web-specific vulnerabilities can be achieved through prompt injection. Investigate SSRF, SQL Injection, and RCE directly. If any UI returns the manipulated outputs directly to the user, test for potential XSS vulnerabilities.
Conclusion
AI offers immense potential in numerous applications but developers should be aware of the inherent security risks, particularly prompt injection. By understanding these risks and applying the principles outlined above, we can build applications that harness the power of AI while maintaining a high level of security
Last updated