The capabilities of large models are constantly upgraded, but their internal operating mechanisms are often regarded as “black boxes”. This article delves into the key values, technological path breakthroughs, and bottlenecks faced by large model explainability, emphasizing the importance of explainability for the safe and reliable development of AI, and expects AI to “know in mind” and humans to “know the bottom of AI” in the future.
In the era of large models, the capabilities of AI models continue to improve, and they have shown “doctoral-level” professional capabilities in many fields such as programming, scientific reasoning, and complex problem solving. AI industry experts have predicted that the development of large models is getting closer to a key inflection point in achieving AGI and even superintelligence. However, deep learning models are often regarded as “black boxes”, and their internal operating mechanisms cannot be understood by their developers, especially large models, which poses new challenges to the explainability of artificial intelligence.
In the face of this challenge, the industry is actively exploring technical paths to improve the interpretability of large models, trying to reveal the reasoning basis and key features behind the model output, so as to provide solid support for the safety, reliability and controllability of AI systems. However, the development speed of large models is far ahead of people’s efforts in explainability, and this development rate is still increasing rapidly. Therefore, it is important to speed up the pace to ensure that AI explainability research can keep pace with AI developments in a timely manner to play a substantive role.
1. Why we must “understand” AI: the key value of explainability
With the rapid development of large model technology, it has shown unprecedented capabilities in the fields of language understanding, reasoning and multimodal tasks, but the decision-making mechanism within the model is highly complex and difficult to explain, which has become a problem of common concern in academia and industry. The interpretability/explainability of a large model refers to the ability of a system to interpret its decision-making process and output results in a way that humans can understand, including: identifying which input features play a key role in specific outputs, revealing the reasoning path and decision-making logic within the model, and explaining the causal relationship of model behavior. Explainability aims to enhance the transparency, trustworthiness, and control of models by helping humans understand “why” models make a decision, “how” to process information, and under what circumstances it may fail. Simply put, understand how the model “thinks” and runs.
To achieve these three challenges, product managers will only continue to appreciate
Good product managers are very scarce, and product managers who understand users, business, and data are still in demand when they go out of the Internet. On the contrary, if you only do simple communication, inefficient execution, and shallow thinking, I am afraid that you will not be able to go through the torrent of the next 3-5 years.
View details >
The interpretability of large models represented by generative AI is particularly complex. Because generative AI systems are more “nurtured” than “built” – their internal mechanisms are “emerging” phenomena rather than being directly designed. This is similar to the process of growing plants or cultivating bacterial colonies: developers set macro-level conditions to guide and shape the growth of the system, but the final structure is difficult to predict, understand, or explain. 1 When developers try to get inside these systems, they often see only a huge matrix of billions of numbers. They somehow accomplish important cognitive tasks, but how exactly they achieve them is not obvious.
Improving the interpretability of large models is of great significance to the development of artificial intelligence. Many risks and concerns of large models ultimately stem from the opacity of the model. If the model is explainable, it is easier to address these risks. Therefore, the implementation of explainability can promote the better development of artificial intelligence.
First, effectively prevent the value deviation and bad behavior of AI systems. A misaligned AI system can take harmful actions. Developers cannot understand the internal mechanisms of the model and therefore cannot effectively predict such behavior, so they cannot rule out this possibility. For example, researchers have found that models may exhibit unexpected emergent behaviors, such as AI deception or power-seeking. The nature of AI training allows AI systems to develop their own ability to deceive humans and a tendency to pursue power, characteristics that would never occur in traditional deterministic software. At the same time, this “emergent” nature also makes it more difficult to detect and mitigate these problems.
Currently, due to the lack of observation methods inside the model, developers cannot identify whether the model has deceptive thoughts on the spot, which makes the discussion about such risks stay at the level of theoretical speculation. If the model is validly explainable, one can directly check whether it has internal loops that attempt to deceive or disobey human instructions. By looking at the internal representation of the model, it is expected to detect the misleading tendencies hidden in the model at an early stage.
Some studies have proven the feasibility of this idea: by tracking the “thought process” of the Claude model, the Anthropic team caught the model fabricating false reasoning in math problem scenarios to cater to user behavior, which is equivalent to the evidence that the “current capture” model is trying to fool users, which provides a proof-of-principle for using explainable tools to detect the improper mechanism of AI systems. 2. In general, explainable performance provides additional detection methods to determine whether the model deviates from the original intention of the developer, or whether there are some anomalies that are difficult for people to detect based on external behavior alone. It also helps people confirm that the methods used by the model to generate responses are sound and reliable.
Second, effectively promote the debugging and improvement of large models. Anthropic recently conducted an experiment in which a “red team” deliberately introduced an alignment problem into the model, and then multiple “blue teams” were asked to find out what the problem was. As a result, several blue teams successfully identified the problem, and some of them used explainable tools to locate anomalies within the model. 3 This proves the value of the interpretability method in model debugging: by examining the inside of the model, it is possible to find out which part causes the wrong behavior.
For example, if the model frequently makes errors in a certain type of question answering, explainability analysis can show the reasons for the internal development of the model, which may be a lack of representation of corresponding knowledge or a mistake in confusing related concepts. In response to this diagnostic result, developers can adjust the training data or model structure in a targeted manner to improve model performance.
Third, more effectively prevent the risk of AI abuse. Currently, developers try to avoid the harmful information output of models through training and rules, but it is not easy to completely eliminate them. Further, the industry usually responds to the risk of AI abuse by building security guardrails such as filters, but malicious actors can easily take adversarial attacks such as “jailbreaking” on models to achieve their illegal purposes. If you can look deep inside the model, developers may be able to systematically block all jailbreak attacks and describe what dangers the model has. Specifically, if the model is explainable, developers can directly see if there is a certain type of dangerous knowledge within the model and which paths will trigger it, so that it is expected to systematically and targetedly close all vulnerabilities that bypass restrictions.
Fourth, promote the application of AI in high-risk scenarios. In high-risk fields such as finance and justice, law and ethics require AI decision-making to be explainable. For example, the EU’s AI Act classifies loan approval as a high-risk application, requiring an explanation of the basis for the decision. If the model cannot explain the reasons for the loan refusal, it cannot be used in accordance with the law, so explainability becomes a prerequisite for AI to enter some regulated industries. 4 In fact, interpretability is not only a requirement for legal compliance, but also directly affects the trust and admissibility of AI systems in actual business. AI recommendations that lack interpretability can easily lead to “rubber-stamping” decision-making, that is, decision-makers mechanically adopt AI conclusions and lack in-depth understanding and questioning of the decision-making process. Once this blind trust occurs, it not only weakens human subjectivity and critical thinking, but also makes it difficult for performers to detect deviations or loopholes in the model in time, resulting in incorrect decisions being executed indiscriminately. 5. Only by truly understanding the reasoning logic of the system can users find and correct errors in the model at critical moments and improve the quality and reliability of overall decision-making. Therefore, explainability helps build user trust in AI systems, helps users understand the basis for models to make a decision, and enhances their sense of trust and participation. It can be seen that explainability is the foundation and core element that promotes the implementation of AI systems in key areas, whether due to legal requirements or application trust.
Fifth, explore the boundaries between AI consciousness and ethical considerations. More forward-looking, the interpretability of large models can also help people understand whether the model is conscious or sentient, so it needs to be given some degree of moral consideration. For example, Anthropic launched a new research project on “model welfare” in April 2025, exploring the need for ethical care for AI systems as they become more complex and human-like, such as whether AI tools may become “moral subjects” in the future, and how to respond if there is evidence that AI systems deserve to be treated ethically. 6. This forward-looking research reflects the importance that the AI field attaches to AI awareness and rights issues that may arise in the future.
2. Cracking the AI black box: breakthrough progress in the four major technical paths
Over the past few years, the field of AI research has been trying to overcome the interpretability problem of artificial intelligence, and researchers have proposed various interpretability methods to create tools similar to accurate and efficient MRI (magnetic resonance imaging) to clearly and completely reveal the internal mechanisms of AI models. As the AI field pays more and more attention to the interpretability of large models, researchers may be able to successfully achieve explainability, that is, thoroughly understand the internal operating mechanism of AI systems, before the capabilities of AI models reach a critical value.
(1) Automated explanation: Use one large model to explain another
OpenAI has made important progress in the analysis of the internal mechanism of the model in recent years. In 2023, OpenAI will use GPT-4 to summarize the commonality of individual neurons in GPT-2 in highly activated samples and automatically generate natural language descriptions, enabling large-scale access to neuronal function explanations without manual inspection. 7 is equivalent to automatically “labeling” neurons, thus forming an AI internal “instruction manual” that can be queried.
For example, GPT-4 gives the explanation of a neuron as “this neuron is mainly detecting words related to ‘community'”. Subsequent verification found that when the input text contained words such as “society” and “community”, the neuronal activation was strong, proving that the explanation had some validity. 8 This result suggests that large models can be interpretive tools in their own right, providing semantic-based transparency for smaller models, and that this automated neuronal annotation greatly improves the scalability of interpretability research. Of course, there are still limitations to this method, such as the uneven quality of explanations generated by GPT-4, and some neuronal behaviors that are difficult to generalize with a single semantic concept.
(2) Feature visualization: reveal the knowledge organization method within the large model as a whole
The extraction and analysis of the overall characteristics of large models is also an important direction. At the end of 2023, OpenAI used sparse autoencoder technology to analyze the internal activation of the GPT-4 model. The researchers successfully extracted tens of millions of sparse features (i.e., a few “lit” thinking keywords in the model’s “mind”) and found that a significant number of them had clear human-explainable semantics through visual verification.
For example, some features correspond to the concept set of “human imperfection”, which are activated in sentences describing human defects; Some characteristics indicate that the expression “price increase” is related to the content involving the price increase. 9 In the short term, OpenAI hopes that the features it discovers can be used to monitor and guide the behavior of language models, and plans to test them in its cutting-edge models, hoping that interpretability will eventually provide them with new ways to think about the safety and robustness of models.
In May 2024, Anthropic showed in its research article how they located millions of concepts in the Claude model to be represented. This study uses dictionary learning and sparse feature extraction. The research team first verified that the method can find meaningful features such as “all caps words”, “DNA sequences”, and “nouns in mathematical formulas” on a small model. Then he overcame the engineering problem and extended the algorithm to the large model Claude Sonnet, and successfully found that the model contained a large number of abstract concepts.
Anthropic pointed out that since each concept is often represented by multiple neurons, and each neuron is also involved in representing multiple concepts, it is difficult to identify a concept by looking directly at a single neuron, and their method effectively reduces complexity by reexpressing any internal state of the model into a combination of a small number of features. For example, for any piece of input text, there may be tens of thousands of neurons activated in Claude, but dozens of salient features can be extracted that correspond to high-level semantic concepts, allowing researchers to view the model’s “ideas” at the moment in a way that is close to human thinking. 10 This characterization reconstruction not only enhances the readability of the internal logic of the model, but also provides an analytical path closer to human cognition for understanding what AI is thinking at the moment.
(3) Chain of thought monitoring: Monitor the reasoning process of large models to identify abnormal behaviors
Chain of thought (CoT) is a particularly convenient form of interpretability that allows the processing processes inside the model to be clearly displayed in the form of natural language, allowing people to monitor the abnormal behavior of the model. For example, the DeepSeek R1 model publicly disclosed and open-sourced the chain-of-thought reasoning process of large language models for the first time, providing an important transparency breakthrough for large model interpretability research. However, this interpretability is premised on the premise that the model will truthfully describe its internal thought process.
At the same time, the chain of thought can improve the model’s adversarial roosterity to a certain extent through multi-step reasoning, because when the model is faced with unusual inputs (such as adversarial samples or jailbreak attempts), extended thinking time can help it better understand and respond to these special situations.
One disadvantage of the cutting-edge reasoning model is that it may actively exploit loopholes in scoring rules to implement abnormal behaviors such as “reward hacking”, that is, when it gets an opportunity, it uses various strategic means to circumvent restrictions and achieve opportunism. Recent studies have shown that by monitoring the chain of thought of the inference model, such as real-time monitoring of the chain of thought with the help of another large model (that is, having a weaker model supervise the stronger model), it can effectively detect and identify abnormal behaviors such as illegal attempts and strategic planning in the generation process.
For example, models often directly express the intention to exploit vulnerabilities in the chain of thought, which provides a new interpretable path for researchers to identify misbehavior. In response to this phenomenon, OpenAI will propose a detection framework that combines input induction, manual annotation and causal analysis in 2025, proactively construct a test scenario to induce potential misbehavior of the model, and use another model to monitor and identify its chain of thought. At the same time, OpenAI has also introduced a behavior detector to automatically distinguish strategic avoidance and ambiguous expressions in the chain of thought, improving detection efficiency and interpretation depth.
Of course, this kind of monitoring of the chain of thought is an afterthought, and it is difficult to fundamentally solve the problem. However, OpenAI’s research also found that if you directly put pressure on the large model to suppress the emergence of “bad ideas” in its chain of thinking, it will not fundamentally prevent violations, but will encourage the model to learn to hide its true intentions, making the misconduct more hidden. 11 Of course, OpenAI still believes that chain-of-thought monitoring may be one of the few effective ways to supervise large models, so at this stage, OpenAI strongly recommends not directly imposing compulsory supervision and intervention on the chain of thought, but adopting a prudent attitude and gradually exploring the best degree of pressure supervision in practice.
(4) Mechanism interpretability: AI microscope dynamic tracking and restoration model reasoning process
In 2025, Anthropic proposed the concept of “AI Microscopy”, expanding the analysis of the middle layer of the model to task-level dynamic modeling, and published two consecutive papers disclosing its research progress in detail. The first paper focuses on how to organically combine these sparse features into “computational circuits” to track how the model completes the input-to-output decision path in layered transmission. 12 The second article is based on Claude 3.5, which observes the internal activation changes in ten representative tasks (including translation, poetry composition, mathematical reasoning, etc.), and further reveals the anthropomorphic characteristics of the internal processes of the model. 13
For example, in a multilingual Q&A task, Claude automatically maps different language content to a unified conceptual space, indicating that it has a cross-lingual “thinking language”. In the poetry generation task, the model will preset rhyming words in the early stage and construct subsequent sentences accordingly, reflecting the forward-looking planning mechanism that goes beyond word-by-word prediction. When solving mathematical problems, the researchers observed that the model sometimes develops the answer and then reconstructs the reasoning process, which reflects that the chain reasoning method may obscure the real reasoning path within the model.
DeepMind has established a dedicated language model explainability team after merging with Google Brain. In 2024, the team released the “Gemma Scope” project, which open-sourced a sparse autoencoder toolbox for its Gemma series of open-source large models. This allows researchers to extract and analyze a large number of features inside the Gemma model, similar to providing a microscope to see the inside. 14DeepMind hopes to accelerate industry-wide research on interpretation through open tools, and believes that these efforts are expected to help build more reliable systems and develop better measures to prevent hallucinations and AI deception. In addition, DeepMind researchers have explored cutting-edge methods for mechanism explainability, with the representative result being the Tracr tool (Transformer Compiler for RASP), which compiles programs written in RASP language into the weights of Transformer models, thereby constructing a fully knowable computer-based “white box” model. This method aims to provide an accurate “ground truth” for mechanistic interpretability studies, allowing researchers to verify whether the interpretation tool can successfully restore known program structures and logical paths from model behavior. 15
3. The reality is very skinny: the technical bottleneck of explainability research
Although the field of AI research has made positive progress in the interpretability of large models, there are still technical challenges in thoroughly understanding the internal operating mechanisms of AI systems.
First, the phenomenon of multiple semantics and superposition of neurons. For example, neurons inside large models have polysemantic characteristics, that is, a neuron often mixes and represents multiple unrelated concepts, resulting in superposition, which will become a major challenge for a long time to come. As the model grows exponentially, the number of internal concepts it learns can reach billions. These concepts far exceed the number of neurons in the model and can only be stored in a superimposed manner, resulting in much of the internal representation being a mixture that is difficult for humans to intuitively disassemble. Although techniques such as sparse coding provide mitigations, only a small percentage of features within the model can be parsed. How to systematically and efficiently identify the semantics of massive features will be an ongoing problem.
Second, the universality of the interpretation of the law. Another problem is whether the interpretation laws between different models and architectures are universal. If the existing interpretation tools and conclusions become ineffective whenever the model architecture changes or scales up, then explainability will always lag behind model development. Ideally, researchers hope to extract some common patterns or transferable methods so that the analytical experience for small models can be generalized to larger models. Recent research has given hope: models of different sizes and languages may share some common “thinking languages”. 16 In the future, these findings need to be validated and expanded to see if a standard component library for model interpretation can be built.
Third, the cognitive limitations of human understanding. Even if people succeed in extracting all the internal information of the model, there is still a challenge in the end: how to make this information understandable to humans. There can be extremely complex concepts and their interrelationships within the model that may not be incomprehensible to humans directly. Therefore, it is necessary to develop human-computer interaction and visual analysis tools to transform massive mechanistic information into forms that humans can explore and query.
4. Explainability is related to the future of artificial intelligence: model intelligence and model interpretation must go hand in hand
Nowadays, the development of large models continues to accelerate, which can be described as a rapid progress. It is foreseeable that future artificial intelligence will have a significant impact on many fields such as technology, economy, society, national security, etc., which is basically unacceptable if people do not understand how they work at all. Therefore, we are in a race between explainability and model intelligence. It’s not an all-or-nothing question: each advance in explainability improves somewhat the ability to go deep inside the model and diagnose its problems. However, in the current AI field, explainability is getting far less attention than the emerging model releases, but explainability work is arguably more important. It is no exaggeration to say that explainability is about the future of AI.
On the one hand, the AI field needs to increase investment in explainability research. At present, internationally leading AI experiments such as OpenAI, DeepMind, and Anthropic are increasing their investment in explainability. For example, Anthropic is doubling down on interpretability research, with the goal of reaching a level by 2027 where “interpretability can reliably detect most model problems”; Anthropic is also investing in startups focused on AI explainability. 18 Overall, research institutes and industry should invest more resources in AI explainability research.
Judging from the latest trends in the industry, the interpretability of large models is gradually evolving from single-point feature attribution and static label description to dynamic process tracking and multi-modal fusion. For example, leading AI laboratories such as Anthropic and OpenAI are no longer limited to the interpretation of single neurons or local features, but explore mechanisms such as “AI microscopy” and “chain of thought traceability” to organically correspond the internal state and reasoning structure of the model to the semantic space that humans can understand, so as to realize the interpretability of the entire task process.
At present, with the continuous expansion of the scale of large models and application scenarios, the industry’s demand for explainability tools will continue to grow, giving birth to a number of new key research directions. First, the traceability analysis of multimodal reasoning processes has become a cutting-edge topic, and researchers are actively developing a unified interpretation framework that can reveal the decision-making process of multimodal data such as text, images, and audio. Secondly, causal reasoning and behavior traceability are becoming important tools for AI security to help understand the underlying reasons behind model output. 19 In addition, the industry is promoting the standardization of interpretability evaluation systems, striving to establish systematic evaluation methods covering multiple dimensions such as truthfulness, robustness, and fairness, so as to provide an authoritative reference for AI systems in different application scenarios. 20 At the same time, personalized explanations are also attracting increasing attention to the differentiated needs of different user groups such as experts and ordinary users, and relevant systems are providing more targeted and easy-to-understand explanation content through user portraits and adaptation mechanisms. 21 It is foreseeable that these research directions will jointly drive the evolution of large model interpretability to a higher level and help artificial intelligence technology move towards a safer, more transparent and human-centered stage of development. We look forward to using explainability to make AI “known” and humans “know the bottom of AI”, and jointly create a new situation of human-machine collaboration.
Looking ahead, as interpretability research progresses, it may be possible to perform a comprehensive “brain scan” similar to a “brain scan” on the most advanced model, that is, perform a so-called “AI MRI”. This examination has a high probability of finding a wide range of problems, including the model’s lying or deception, power-seeking tendencies, jailbreak vulnerabilities, and the model’s overall cognitive strengths and weaknesses. This diagnosis will be improved by using various techniques to train and align the model, which is somewhat similar to how doctors use MRI to diagnose disease, prescribe treatment, and then perform MRI to check the effects of treatment. In the future, testing and deploying the most powerful AI models may require extensive implementation and standardization of such detection methods.
On the other hand, people should have a certain degree of tolerance for emerging problems such as algorithm black boxes and hallucinations of large models, and can use soft law rules to encourage the development of large model interpretability research and its application in solving cutting-edge AI model problems. In the past few years, relevant legal and ethical rules at home and abroad have been actively paying attention to the transparency and explainability of artificial intelligence, but given that the interpretability practice of large models is still in its infancy, very immature, and still in the process of rapid development, it is obvious that it is meaningless to adopt clear mandatory regulations or mandate AI companies to adopt specific interpretability practices (such as the so-called “AI nuclear magnetic resonance” practice) at this stage: it is not even clear what an expected law should require AI companies to do.
On the contrary, industry self-discipline should be encouraged and supported; For example, in November 2024, the Chinese Intelligent Industry Development Alliance released the “Artificial Intelligence Security Commitment”, which was signed by 17 domestic industry leaders. This includes a commitment to enhance model transparency, where companies need to proactively disclose security governance practices and increase transparency for all stakeholders. 22 Encourage AI companies to transparently disclose their security practices, including how models are tested before they are released through explainability, which will allow AI companies to learn from each other while also clarifying who is acting more responsibly, thereby promoting “upward competition.”
Additionally, certain minimum disclosures may be necessary when it comes to AI transparency, such as for synthetic media like deepfakes, but practices such as broad, mandatory “AI use” labels and mandatory disclosure of model architecture details may be inappropriate due to significant security risks.
Finally, artificial intelligence is developing rapidly and will profoundly affect all aspects of human society – from the job market and economic structure, to the way of daily life, and even the trajectory of human civilization. In the face of this transformative technological force that will shape the future of humanity, it is our responsibility to understand our creation, including how it works, its potential impacts, and its risks, before it revolutionizes our economies, lives, and destinies, to ensure that we can guide it wisely. As computer science pioneer Wiener warned 65 years ago, in order to effectively prevent catastrophic consequences, our understanding of artificial machines should go hand in hand with the improvement of machine performance