LockPrompt - AI-Based Prompt Injection Discovery

LockPrompt is a proposed solution to the burning question: How can LLM vendors defend themselves against never-seen-before prompt injection attacks?

Sep 07, 2024

OpenAI offers a cybersecurity grant, offering to fund research that contributes to the AI defense landscape. As my application for this grant, I propose LockPrompt - an AI-based prompt injection discovery tool. In this article, I will expand on the problems that LockPrompt will address, the methodologies and approaches it will leverage, and the expected results of my research.

Problems The Project Will Address

Current Prompt Injection Defense

Case Study - ASCII Smuggling

How Will LockPrompt Solve This?

Methodologies and Approaches

Expected Results

Final Thoughts - The Future

Problems The Project Will Address

Current Prompt Injection Defense

One serious problem that Large Language Model vendors face is prompt injection defense. Bad actors may use an LLM to generate harmful content or launch attacks on applications, which are undesirable outcomes. In response, engineers have begun building out prompt injection detectors, comparing input to known lists of harmful material and attempting to categorize prompts as malicious or benign.

Case Study - ASCII Smuggling

What if a prompt injection has never been seen before? A good example of this is a technique coined as ASCII Smuggling. ASCII Smuggling uses special Unicode characters that don’t render in user interfaces, appearing invisible - however, LLMs still interpret and carry out these Unicode instructions. This was discovered in January 2024 and has since been added to the prompt injection detection logic for most LLM providers.

If a prompt injection technique has never been seen before, it will work by default. Attackers have the near-infinite arsenal of natural language at their disposal to create novel attacks, putting them ahead in the prompt injection arms race.

How Will LockPrompt Solve This?

LockPrompt is a neural network that will be trained on a vast dataset of ‘normal’ user prompts. Using this information, LockPrompt will be able to take in prompts it has never seen and quarantine highly irregular ones. Humans will be able to inspect the quarantine and discover new prompt injection techniques being used in the wild, then add these signatures to the known lists of malicious input already being used.

This solution puts the power back into the hands of defenders, allowing them to react to new prompt injection attacks quickly and efficiently.

Methodologies and Approaches

1. Gathering Training Data

To begin, a large and diverse selection of training data needs to be gathered. The more representative this is of normal user behavior, the more accurate LockPrompt will be. Fortunately, this GitHub repo contains a collection of several open-source prompt datasets. Each dataset will be analyzed, then the best option will be selected and imported.

2. Data Preprocessing

Next, the training data must be preprocessed. Preprocessing involves standardizing training data to ensure it is in a consistent format that can be efficiently processed by a neural network.

Below is an example preprocessor that performs two key operations:

Defining a maximum prompt length: This ensures that all inputs are of uniform size, preventing issues with variable-length inputs. By capping the length, we avoid unnecessary computational overhead and focus on the most important portion of each prompt.
Establishing a vocabulary size of 10,000: By limiting the vocabulary to the top 10,000 most frequent words, we reduce noise from rare or irrelevant terms. This makes training more efficient and helps the model generalize better.

These preprocessing steps are essential to streamline the training process, improve model performance, and ensure the neural network processes data consistently.

3. Autoencoder Neural Network Construction

An autoencoder is a special type of neural network that consists of an encoder and a decoder. Once an autoencoder is trained, it will encode new prompts, and then attempt to reconstruct them with the decoder. If the prompt is similar to the training data, it will reconstruct it well, leading to a low reconstruction error. If the prompt is different the autoencoder will reconstruct it inaccurately, giving us a high reconstruction error.

The code below defines layers of a basic autoencoder neural network. Explaining this in detail is outside the scope of this article, but you can learn more here.

4. Training The Autoencoder

Next, the model is trained on the training dataset, using a subset of the original data as validation input and going through the dataset a defined number of times (50 in this case)

5. Anomaly Threshold Calculation

Now our autoencoder is trained, we can use it to calculate the reconstruction error for a given prompt. We can perform mathematical statistics to create a standard distribution of reconstruction errors for our training data set, and then set an anomaly threshold of our choosing. In this example, the threshold is set to 2.5 standard deviations above the mean. Any prompts that score greater than this threshold will be quarantined for human analysis.

6. Fine Tuning

Finally, we can test the model with new prompts using the following code snippet:

With models such as these, the devil is in the details. Many variables exist in the model, like the training data, neural network layers, number of epochs, and vocabulary size. To make the model more accurate at finding anomalous prompts, we can collect a dataset of historical novel prompt injections like the ASCII Smuggling technique. Then, we can pass in this dataset to our model and determine both the number of prompts successfully quarantined and the average reconstruction error.

Using these metrics, we can tweak our model and determine if it led to an improvement or degradation of accuracy. Through several cycles of testing, the model will become very accurate at detecting new prompt injections.

Expected Results

A futuristic landscape illustrating the concept of AI-driven anomaly detection. The scene features a web of interconnected neural networks with flowing data streams, highlighted by irregular, glowing patterns representing anomalies detected by autoencoders. In contrast to the smooth flow of normal data, some nodes pulse with vibrant red or orange hues, symbolizing anomalies. The backdrop includes abstract representations of data grids, mathematical symbols, and fluctuating waveforms, creating a visual contrast between normal and anomalous behavior. The color palette is a mix of cool blues and greens, with striking warm tones marking the anomalies.

I expect LockPrompt to be uninspiring in its first iteration - there are so many unforeseen factors that could negatively impact its accuracy. However, with a robust fine-tuning phase, LockPrompt has the potential to revolutionize prompt injection defense and be integrated into every major Large Language Model.

The end product will be an end-to-end package that can integrate with any LLM. It will include refined autoencoder logic, a quarantine for manual review, and a simple mechanism to add anomalous prompt injection signatures into existing blocklists. While there is lots of groundwork ahead, I am thrilled to begin working on this project.

Final Thoughts - The Future

A more naturalistic seascape symbolizing the future of AI and security. The ocean waves gently roll under a bright, soft sky, reflecting hues of blue and green. Subtle neural network patterns are faintly woven into the surface of the water, blending seamlessly with the natural sea elements. In the distance, a golden horizon shines, symbolizing the future of AI, while subtle hints of security symbols like locks and shields are barely visible, softly merging with the landscape. The overall scene feels organic and serene, with a natural balance between technology and the beauty of the ocean.

As LLMs become increasingly integrated into our daily lives, devising new prompt injection techniques will deliver increasing value to attackers. LockPrompt aims to secure our future by catching new prompt injections the first time they are used in the wild.

LockPrompt has the potential to become an enterprise-grade tool and deliver valuable insights to the research field of defensive AI security. I look forward to commencing work and contributing to society in the process.

LockPrompt - AI-Based Prompt Injection Discovery

LockPrompt is a proposed solution to the burning question: How can LLM vendors defend themselves against never-seen-before prompt injection attacks?

Contents

Problems The Project Will Address

Current Prompt Injection Defense

Case Study - ASCII Smuggling

How Will LockPrompt Solve This?

Methodologies and Approaches

Expected Results

Final Thoughts - The Future

Problems The Project Will Address

Current Prompt Injection Defense

Case Study - ASCII Smuggling

How Will LockPrompt Solve This?

Methodologies and Approaches

1. Gathering Training Data

2. Data Preprocessing

3. Autoencoder Neural Network Construction

4. Training The Autoencoder

5. Anomaly Threshold Calculation

6. Fine Tuning

Expected Results

Final Thoughts - The Future

Discussion about this post