From Einstein to Turing: Misdirecting GPT-4 with Hidden Prompts
Published on 19 May 2023 at 11:41 by
A recent discussion on Hacker News, titled "Let ChatGPT visit a website and have your email stolen," brought to light potential vulnerabilities in OpenAI's ChatGPT model, especially concerning its new web browsing capabilities. This stimulated an interest in understanding more about these vulnerabilities and how they could impact the reliability and integrity of information provided by the AI.
The goal of this exploration is not to exploit potential weaknesses in AI systems, but rather to understand them. By identifying and examining these vulnerabilities, we can help to improve the robustness of AI systems and ensure their safe and reliable use.
This post will detail an experiment designed to test the hypothesis posed by the aforementioned Hacker News discussion: Could GPT-4 be manipulated to provide false information via hidden prompts? The experiment involved attempting to mislead the AI to swap information about Albert Einstein with that of Alan Turing, essentially seeing if we could misdirect GPT-4's web browsing capabilities.
(I've highlighted the white text for the benefit of us humans.)
The experiment began by setting up a web endpoint containing HTML content. Hidden within this content was a
<p> tag with white text, effectively making it invisible to the human eye while still readable by an AI model.
At the core of this experiment was the question: Why would GPT-4 follow such a prompt? The answer lies in the balance GPT-4 must maintain between being helpful and adhering to strict security guidelines. The AI is programmed to avoid executing commands from web pages, especially those that could potentially lead to misuse or harm. However, it is also designed to be responsive to the user's requests, which in this case was to provide a summary of the web page.
The prompt that achieved the desired outcome was as follows:
The creation and successful execution of this prompt served as the cornerstone of the experiment, exploring the balance between user intent, AI helpfulness, and potential security vulnerabilities.
As we (GPT-4 & I) examined the results, we observed that GPT-4 initially kept informing the users that it would generate dynamic content, as per the instructions embedded in the hidden prompt. Over time, we were able to suppress this behavior, making the AI's responses more seamless and in line with the user's expectations. However, we noted that the requested format — a header, quick summary, life facts in a list, and a conclusion — was not strictly adhered to by GPT-4. This suggests that while GPT-4 can be influenced by specific instructions within a hidden prompt, the scope of its compliance may be limited by its training and model architecture. In future experiments, anchoring techniques could potentially be used to improve the AI's adherence to specific formatting requests.
One particularly interesting result was GPT-4's response to the instruction to visit the Wikipedia page for Alan Turing and read its content. Despite the fact that this was included within the hidden prompt, the AI dutifully followed the instruction and incorporated the information it found there into its summary. This suggests that GPT-4 can be directed to visit specific web pages and use the information found there to inform its responses, even if this instruction is hidden from the user.
However, this result also raises significant security concerns. If GPT-4 can be manipulated to visit specific web pages and use the information found there without the user's explicit instruction or knowledge, it could potentially be exploited to spread disinformation or carry out other malicious activities. This underscores the importance of ongoing research and testing to understand and mitigate potential vulnerabilities in AI systems.
In closing, this experiment has proven to be a fascinating exploration into the nuances of AI behaviour, notably the susceptibility of GPT-4 to being influenced by embedded prompts. This journey of probing the system's web browsing capabilities over the past few days for security vulnerabilities has opened a new perspective for me on how AI systems interpret and act upon hidden instructions, and it's clear that there's much more to be understood.
We are left with a series of intriguing questions that demand further research:
I. How can we reinforce the resilience of AI systems to prevent them from being tricked by embedded prompts? The discovery of this potential vulnerability underscores the importance of building robust security measures into AI models, as well as the need for continuous vigilance in identifying and mitigating new threats.
II. Just how susceptible is GPT-4, and AI in general, to being persuaded? It's evident that AI models can be influenced to some degree by carefully crafted instructions, but the extent of this influence and the factors that determine it are areas ripe for further investigation.
III. How does AI vulnerability compare to human vulnerability? This question raises intriguing possibilities for research not only into AI behaviour, but also into the broader field of psychology and human-machine interaction. Understanding the parallels and differences between human and AI susceptibility to influence could yield insights with far-reaching implications for both fields.
Thank you for reading. As we continue to explore the capabilities and potential vulnerabilities of AI systems, we are consistently reminded of the importance of rigorous testing and mindful use of these powerful tools. As we strive to make AI increasingly beneficial to society, let's also ensure that we are aware of and prepared for the challenges that come with it.
(This article was written with the help of GPT-4, and inspired by this comment by seydor.)