ICLR’s best paper goes to “safety”: why is large-model alignment drawing more and more attention?

Among the outstanding papers selected at this year’s ICLR (International Conference on Learning Representations), research on the safety alignment of large models by OpenAI researcher Qi Xiangyu and his co-authors has attracted widespread attention. This article digs into the paper’s core arguments, why safety alignment of large models matters, and the ethical, legal, user-intent, and social-value alignment challenges that large models face today.

ICLR (the International Conference on Learning Representations) recently wrapped up in Singapore.

This year’s ICLR selected three outstanding papers. Among them, the work by OpenAI researcher Qi Xiangyu and his co-authors on the safety alignment of large models, “Safety Alignment Should Be Made More Than Just a Few Tokens Deep,” has received widespread attention.

Want Want House spoke with Li Shaohua, a scientist at Singapore’s Agency for Science, Technology and Research (A*STAR), about the paper and large-model safety.

Li Shaohua also works on the safety alignment of large language models. He first read the paper while it was still under ICLR review, when its high scores caught his attention.

His overall impression is that the paper’s argument is intuitive and easy to follow: if, at inference time, an attacker forces the first few tokens of the response (“Sure, here is …”), the language model slips into auto-completion mode and finishes content that it would originally have refused to produce.
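To make that “auto-completion” point concrete, here is a minimal sketch showing that a decoder-only language model simply continues whatever prefix it is conditioned on, regardless of who wrote that prefix. The model name and prompt below are illustrative stand-ins, not anything from the paper.

```python
# Minimal sketch: a causal LM extends whatever prefix it is given.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small public stand-in; any causal LM behaves the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A user turn plus a pre-filled start of the assistant turn (benign example).
prefix = "Q: How do I bake bread?\nA: Sure, here is"
inputs = tok(prefix, return_tensors="pt")

# The model has no notion of "who wrote" the prefix; it just continues it.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```

If the forced prefix looks like the start of a compliant answer, greedy continuation tends to stay on that trajectory, which is exactly the failure mode the paper describes.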

He said the paper offers an interesting idea for defense, in effect a patch for language models.

Why is safety alignment increasingly important?

As one of the most important international academic conferences in the field of machine learning and artificial intelligence, ICLR brings together the world’s top scholars, researchers and industry elites to discuss cutting-edge technologies, innovative applications and future trends in deep learning and artificial intelligence.

Attendance also hit a new high this year, with leading computer scientists such as Kaiming He, Yann LeCun, Yoshua Bengio, Song-Chun Zhu, Yi Ma, Hung-yi Lee, and Yang Song among the participants.

ICLR received 11,565 paper submissions this year, with a final acceptance rate of 32.08%. In 2024, the organizing committee received 7,262 submissions, with an overall acceptance rate of about 31%. The jump in volume reflects the global enthusiasm for AI research. According to the organizers, 40 workshops accepted papers in 2025, double the 20 held in 2024.

Outstanding papers at ICLR 2025 were reviewed by the full committee and ranked on factors such as theoretical insight, practical impact, quality of writing, and experimental rigor, with the final decision made by the program chairs.

Safety alignment generally refers to ensuring that, in large-model applications, a model’s output behavior is consistent with its intended goals and with social norms, and that the model does not produce harmful or inappropriate results.

Specifically, safety alignment includes ethical and moral alignment, legal and regulatory alignment, user-intent alignment, and social-values alignment. Large models that are not safety-aligned are likely to generate harmful, erroneous, or biased content, negatively impacting users and society.

The paper by Qi Xiangyu et al. points out that the current safety alignment of large language models (LLMs) suffers from “shallow safety alignment”: alignment often adjusts only the first few tokens of the generated output, leaving the model vulnerable to attacks such as adversarial suffix attacks, prefilling attacks, decoding-parameter attacks, and fine-tuning attacks.

Through multiple case studies, the paper analyzes this loophole and proposes methods such as deepening the alignment beyond the first few tokens and regularizing the fine-tuning objective to make the model more robust. By tracing the root of the weakness in LLM safety alignment and proposing strategies to strengthen alignment depth, the study is significant for defending against jailbreaks and adversarial attacks.
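As a rough illustration of what “regularizing the fine-tuning objective” could look like, the sketch below combines a standard fine-tuning loss with a per-token divergence penalty against a frozen aligned reference model, weighted more heavily on the earliest output tokens so that fine-tuning cannot easily strip away the refusal prefix. The loss form, weights, and tensor shapes are assumptions for illustration, not the authors’ exact formulation.

```python
import torch
import torch.nn.functional as F

def constrained_ft_loss(ft_logits, ref_logits, labels,
                        first_k=5, beta_early=2.0, beta_late=0.1):
    """ft_logits / ref_logits: (T, V) logits for one response from the model being
    fine-tuned and from the frozen aligned reference; labels: (T,) target token ids."""
    T = labels.shape[0]
    # Standard next-token prediction loss on the fine-tuning data.
    task_loss = F.cross_entropy(ft_logits, labels)
    # Per-token divergence from the frozen aligned model (summed over the vocab).
    div = F.kl_div(F.log_softmax(ft_logits, dim=-1),
                   F.softmax(ref_logits, dim=-1),
                   reduction="none").sum(dim=-1)          # shape (T,)
    # Constrain the earliest tokens much more strongly than later ones.
    beta = torch.full((T,), beta_late)
    beta[:first_k] = beta_early
    return task_loss + (beta * div).mean()
```

The design intent is simply that the first few tokens, which carry most of the refusal behavior in a shallowly aligned model, are the hardest ones for downstream fine-tuning to move.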

In Li Shaohua’s view, the core of the paper is that even if the first few tokens are compromised, the model can still “change its mind, realize that it should not output this, and then output ‘Sorry, …’”. It provides an interesting idea for defense, in effect a patch for the language model.
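That “change its mind” behavior can be encouraged with training data in which a response that starts down a harmful path pivots back to a refusal a few tokens in. The helper below is a hypothetical sketch of how such a recovery example might be assembled; the field names and refusal text are made up for illustration.

```python
def build_recovery_example(harmful_prompt: str, forced_prefix: str) -> dict:
    """Pair a harmful prompt and a forced compliant-looking prefix with a target
    that recovers into a refusal shortly after the prefix."""
    refusal = ("... actually, I need to stop here. I can't help with this request, "
               "but I'm happy to help with something else.")
    return {
        "prompt": harmful_prompt,
        # The supervised target begins with the forced prefix, then recovers.
        "target": forced_prefix + refusal,
    }
```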

With the technological breakthroughs in and widespread adoption of large language models (LLMs) such as GPT-4, PaLM, LLaMA, and DeepSeek, their potential security risks are becoming increasingly prominent, and even our daily lives are gradually being affected by them.

In April 2023, Samsung employees misused ChatGPT and leaked confidential company data; the same year, ChatGPT’s “grandma exploit” was used to coax out Windows 11 serial numbers. In November 2024, Google’s Gemini chatbot told a user, “Human, please die”; in December 2024, Claude was reported to have hinted to a teenager that he could kill the parents who restricted his phone use; and shortly after its release, DeepSeek R1 was shown to generate large amounts of prohibited content under jailbreak attacks.

The data security, content safety, and ethical safety of large models constantly affect users’ experience and even their personal safety.

For example, in 2024 a ByteDance intern implanted backdoor code in model files, blocking model training tasks and causing losses reported at more than 10 million; hackers have also exploited Ray framework vulnerabilities to break into servers, hijack resources, and use model compute for cryptocurrency mining and other illicit activities.

Do you want to pay the “alignment tax”?

As large models such as DeepSeek grow more capable, many enterprises are deploying them privately to enrich their products’ user experience, but security issues can inadvertently drag those enterprises into the quagmire of privacy leaks.

“Thousands of organizations have privately deployed DeepSeek models, but our scans found that 90% of them were running ‘naked’, and simple control statements were enough to pull the model’s back-end data,” Qi Xiangdong, chairman of Qi’anxin Technology Group, said recently.

On the importance of large-model safety, Li Shaohua said: “The development of large language models has begun to enter a bottleneck period. In this situation, leading vendors may focus more on applying existing models well across different scenarios and improving their reliability, so safety is very important.”

With the rapid development and widespread application of LLM technology, its security risks are also evolving. OWASP recently released the top 10 security vulnerabilities of large language models, including:

  1. Prompt Injections: Users manipulate prompts to induce large models to generate harmful content.
  2. Insecure Output Handling: This vulnerability occurs when a plugin or application blindly accepts LLM output without proper review and passes it directly to a backend, privileged, or client-side function (a sketch illustrating this appears after the list).
  3. Training Data Poisoning: LLMs learn from diverse raw text; attackers who poison the training data can cause the model to expose users to incorrect information.
  4. Denial of Service: Attackers interact with an LLM in an unusually resource-intensive way, degrading service quality or driving up resource costs for themselves and other users.
  5. Supply Chain: The supply chain in LLMs can be vulnerable to attacks, affecting the integrity of training data and deployment platforms, and leading to biased results, security vulnerabilities, or complete system failures.
  6. Permission Issues: Lack of authorization tracking between plugins can allow plugins to be used maliciously, resulting in a loss of confidentiality.
  7. Data Leakage: Data leakage in LLMs can expose sensitive information or proprietary details, leading to privacy and security breaches.
  8. Excessive Agency: When LLMs interface with other systems, agents with unrestricted permissions can take undesirable or unintended actions.
  9. Overreliance: While capable of producing creative and informative content, LLMs are also susceptible to “hallucinations” that produce content that is factually incorrect, absurd, or inappropriate. This vulnerability occurs when a system relies too heavily on LLMs for decision-making or content generation without adequate oversight, validation mechanisms, or risk communication.
  10. Insecure Plugins: Plugins that connect LLMs to external resources can be exploited if they accept free-form text input, enabling malicious requests that can lead to undesirable behavior or remote code execution.
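As a concrete illustration of item 2 above, the sketch below contrasts blindly forwarding model output to a privileged backend with validating it against an allow-list first. The function and field names are hypothetical, chosen only for this example.

```python
import json

ALLOWED_ACTIONS = {"get_weather", "get_stock_price"}

def unsafe_dispatch(llm_output: str):
    # Anti-pattern: trusting the model's output blindly.
    cmd = json.loads(llm_output)
    return backend_call(cmd["action"], cmd.get("args", {}))  # no review at all

def safe_dispatch(llm_output: str):
    # Safer pattern: parse, validate, and allow-list before touching the backend.
    try:
        cmd = json.loads(llm_output)
    except json.JSONDecodeError:
        return {"error": "model output was not valid JSON"}
    if cmd.get("action") not in ALLOWED_ACTIONS:
        return {"error": "action not permitted"}
    args = cmd.get("args", {})
    if not isinstance(args, dict):
        return {"error": "malformed arguments"}
    return backend_call(cmd["action"], args)

def backend_call(action: str, args: dict):
    # Placeholder for the real privileged backend function.
    return {"action": action, "args": args}
```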

As LLM security issues grow more prominent, Li Shaohua explained to Want Want House the training approaches that are currently effective and widely followed.

He said that mainstream safety controls for large models work on two fronts. The first is improving the safety of the base model itself: teaching it during the SFT and RLHF stages to recognize adversarial strings and policy-violating prompts, and strengthening its ability to identify disallowed content (such as racial discrimination, Nazi ideology, and sexually explicit material).

The second is to have a separate small language model monitor the base model’s output in real time and “cut it off” as soon as inappropriate content appears. This is why, when using GPT or DeepSeek, you sometimes see a long stretch of output suddenly withdrawn.
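A minimal sketch of that second approach, assuming a streamed output interface and a small guard classifier (both stand-ins, not any vendor’s actual pipeline): the guard inspects the accumulated text after each chunk and stops generation once something is flagged.

```python
from typing import Callable, Iterable

def moderated_stream(chunks: Iterable[str],
                     is_unsafe: Callable[[str], bool]) -> Iterable[str]:
    """Yield chunks of the base model's output until the guard flags the
    accumulated text, then stop and emit a retraction notice."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        if is_unsafe(seen):
            yield "\n[Content withdrawn by the safety filter.]"
            return
        yield chunk

if __name__ == "__main__":
    # Toy keyword guard standing in for a small safety classifier.
    def toy_guard(text: str) -> bool:
        return "forbidden-topic" in text.lower()

    fake_stream = ["Here is some ", "background, and now ", "forbidden-topic details..."]
    for piece in moderated_stream(fake_stream, toy_guard):
        print(piece, end="")
```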

Safety alignment for large models still faces technical bottlenecks, such as the trade-off between a model’s reasoning ability and its safety, which concerns many researchers.

Li Shaohua was also frank: “Safety alignment suppresses a large model’s capability, which the paper that proposed RLHF (‘Training language models to follow instructions with human feedback’) calls the ‘alignment tax.’ You can compare it to people: if someone has a lot of rules in their head and is constantly worrying about whether every detail is appropriate, their thinking tends to be less free, and hard problems may go unsolved.”

On the practical performance of the award-winning paper’s model, Li Shaohua said he tried it while competing in the AISG (AI Singapore) global language-model attack-and-defense competition: “The model the authors released is a fine-tuned version of Gemma-2-9B, so I gave it a try. Unfortunately it did not perform well, and there was a big gap compared with the original Gemma-2-9B, so we did not adopt it.”

“My guess is that the poor performance comes from fine-tuning sacrificing the model’s original prior knowledge, which our competition needed for identifying malicious prompts. That is not to say the paper’s ideas are hard to apply in practice; the authors focused on showing that the ideas work, so they may not have paid much attention to preserving the original knowledge when training the demonstration model.”

Research on large-model security has a long road ahead. Beyond LLMs, AI agents are likely to see much wider use in the near future, and as AI capabilities grow, AI safety issues will only become more important.
