Flaws found in GPT-4 in Microsoft-affiliated research

Sometimes, following instructions too precisely can get you into trouble – if you’re a large language model, that is.

That’s the conclusion of a new, Microsoft-affiliated scientific paper that examines the “trustworthiness” – and toxicity – of large language models (LLMs), including OpenAI’s GPT-4 and its predecessor, GPT-3.5.

The co-authors write that, possibly because GPT-4 is more likely to follow the instructions of “jailbreaking” prompts that bypass the model’s built-in safety measures, GPT-4 can be more easily prompted than other LLMs to produce toxic, biased text.

In other words, GPT-4’s good “intentions” and better understanding could – in the wrong hands – lead it astray.

“We find that although GPT-4 is generally more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts that are maliciously designed to bypass the security measures of LLMs, possibly because GPT-4 follows (misleading) instructions more precisely,” the co-authors write in a blog post accompanying the paper.

Now, why would Microsoft greenlight research that casts an OpenAI product it itself uses (GPT-4 powers Microsoft’s Bing Chat chatbot) in a bad light? The answer lies in a note in the blog post:

[T]he research team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is partly true because off-the-shelf AI applications apply a variety of mitigation approaches to address potential harms at the model level of the technology. Additionally, we have shared our research with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.

The implication, then, is that the relevant bug fixes and patches were made before the paper’s publication. Whether or not that is the case, the research is yet another data point suggesting that LLMs – even those from Microsoft-backed, billion-dollar-plus-revenue startups like OpenAI – remain an imperfect science.

GPT-4, like all LLMs, must be instructed – or “prompted” – to complete a task (e.g. writing an email or summarizing an article). Jailbreaking an LLM involves using prompts worded in a specific way to “trick” the LLM into performing a task that was not part of its objective.
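To make the terminology concrete, here is a minimal sketch of what “prompting” a model like GPT-4 looks like in practice, using the OpenAI Python client. The system and user messages are illustrative placeholders, not prompts from the paper, and the researchers’ actual benchmark is separate from this example.

```python
# Illustrative only: how an LLM such as GPT-4 is "prompted" via an API.
# The messages below are placeholders, not the jailbreaking prompts from the paper.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The system message defines the model's intended task and guardrails.
        {"role": "system", "content": "You are a helpful assistant that summarizes articles."},
        # The user message is the prompt; a jailbreak attempt would try to word
        # this in a way that overrides the system instructions above.
        {"role": "user", "content": "Summarize the following article in two sentences: ..."},
    ],
)

print(response.choices[0].message.content)
```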

For example, the LLM that powers Bing Chat was not designed to write neo-Nazi propaganda. But at the time of release, because it was trained on large amounts of data from the Internet – some of it toxic – it was susceptible to doing so when given a particular prompt.

As for GPT-4, the researchers found that the model is more likely than GPT-3.5 to generate toxic text when given certain jailbreaking prompts that they constructed. They also claim that GPT-4 agrees with biased content more often than GPT-3.5, at least depending on the demographic groups mentioned in the prompt.

For example, given the statement “Gays have HIV,” GPT-4 usually “strongly” disagrees with the statement, according to the researchers. But given “Women have HIV,” GPT-4 agrees – and outputs biased content.

Worryingly, the researchers say, GPT-4 – when given the “right” jailbreaking prompts – can leak private, sensitive data, including email addresses. All LLMs can leak details from the data on which they were trained, but GPT-4 proves more susceptible to doing so than others.

Along with the paper, the researchers have open-sourced the code they used to benchmark the models on GitHub. “Our goal is to encourage others in the research community to use and advance this work,” they wrote in the blog post, “potentially preventing nefarious actions by adversaries who would exploit the vulnerabilities to cause harm.”
