In natural language processing (NLP), BERT (Bidirectional Encoder Representations from Transformers) is a landmark model that profoundly changed the paradigm of language understanding with its powerful pre-training and broad practical value. This article explores in detail how BERT adapts to a wide range of NLP tasks through the “pre-training + fine-tuning” paradigm, and how it is used across the Internet industry, for example how Meituan uses BERT to improve the accuracy and efficiency of user review sentiment analysis, search intent recognition, and search term rewriting.
As is customary, conclusions come first
What is this article about?
This article discusses BERT, a large model built from the Transformer encoder. It is a milestone model that has long sat at the core of many Internet applications (such as search and recommendation), and its technical influence runs deep, so learning BERT helps us understand the technical foundation and practical value behind the capabilities of today's large models.
What are the core issues and conclusions discussed in the article?
(1) What is BERT and how does it relate to Transformers?
BERT is a model built from the encoder part of the Transformer, and through pre-training it is designed to be a “general-purpose” language understanding model. BERT and GPT (which is built on the Transformer decoder) are the two most prestigious branches of the Transformer architecture.
(2) How does BERT achieve its “universality” and solve multiple natural language processing tasks?
BERT's “universality” is achieved through pre-training: once pre-trained, simple fine-tuning is enough to solve a variety of NLP tasks. This “pre-training + fine-tuning” paradigm is the key to BERT's success.
(3) How is BERT pre-trained?
Through two unsupervised pre-training tasks: “cloze fill-in-the-blank” and “judging the next sentence”. Cloze means randomly masking some of the words in the input sentence and asking the model to predict what the masked words are, which forces the model to use contextual information to infer word meanings. Judging the next sentence means feeding the model two sentences A and B and asking it to judge whether sentence B follows sentence A in the original text, so that the model learns the relationships and coherence between sentences.
(4) What is the practical application value and impact of BERT?
Academically, BERT's original paper has about 130,000 citations, far more than GPT's 13,000. In industry, BERT's “one training, multiple reuse” characteristic fits the rapid-iteration needs of the Internet industry very well, and it has been widely used across Internet services. Meituan's application of BERT in its business improved the accuracy of user review sentiment analysis, search intent recognition, and search term rewriting, and is estimated to bring significant annual revenue growth.
“BERT has seen all the love words in the world, just to be appropriately gentle every time you call him.”
——Inscription
In the previous article, “Building a Large Model Knowledge System from 0 (4): The Father of Large Models, the Transformer”, we talked about the encoder-decoder architecture the Transformer builds around the attention mechanism: the encoder converts the input into a representation the machine can understand, and the decoder converts that representation into output humans can understand. This architecture, built entirely on attention, solves the RNN problems of forgetfulness and slow training and achieved the best results on multiple language translation tasks.
Since then, a large number of Transformer-based models have been proposed, the two most prestigious being BERT and GPT. The former is built on the Transformer's encoder, the latter on its decoder. This article discusses BERT in detail.
Eh~ I seem to hear someone in front of the screen say, “GPT I know, but I've never heard of BERT? Are you sure it's famous?” Indeed, BERT is not a chatbot that non-technical people can pick up and use the way GPT is, so outside the technology circle it is not as famous as GPT. But inside the tech world, BERT is ten times more famous than GPT: as of May 2025, BERT's original paper has about 130,000 citations, ten times GPT's 13,000.
A big reason BERT is so popular in the tech world is that it can handle a whole batch of natural language tasks with only simple adjustments. This has let researchers optimize and build on BERT, which both solves practical problems and produces academic results. In other words, BERT truly fulfilled the long-held dream of NLP researchers: a “universal” language model.
Let’s start with the general language model
Before BERT, each type of task needed its own specially designed model. For example, in earlier articles we used RNNs for text generation and the Transformer for translation. But every time a new task comes along, you have to design a new model, which is far too troublesome.
After BERT, solving a new type of task only requires simple modifications on top of BERT. With BERT, we just attach a few “accessories” to it to solve the problem, much as a motor becomes a car when fitted with four wheels, a blender when fitted with a mixing rod, and a cutting machine when fitted with a disc blade. Likewise, by attaching some simple structures, BERT can handle tasks such as reading comprehension, text classification, and semantic matching all at once, which is why it is called “general-purpose”.
BERT's “universality” comes mainly from its design based on the Transformer encoder. To recap, the Transformer is a translation model with an encoder-decoder architecture: the encoder understands the semantics of the source text, and the decoder translates that meaning into the target language. The Transformer's encoder is therefore already a powerful semantic understander. If we take it out on its own and strengthen it, wouldn't that understanding ability take off? Eh, that's right, that is exactly what BERT did, and it worked. So let's look at BERT's specific design.
This is BERT
BERT's basic component is the Transformer Encoder Block. In the previous article, “Building a Large Model Knowledge System from 0 (4): The Father of Large Models, the Transformer”, our Transformer architecture diagram looked like this:
For ease of understanding, only the attention layer was drawn in the encoder part, but in a real Transformer encoder each attention layer is also accompanied by a feed-forward and normalization layer, as shown in the figure below.
Simply put, this layer integrates the output of the attention layer, and non-technical readers do not need to pay much attention to it. An [attention layer] plus a [feed-forward & normalization layer] together form a Transformer Encoder Block.
BERT is a stack of 12 Transformer Encoder Blocks. The Transformer's encoder has 6 such blocks in total, while BERT has 12; the core difference between the two is simply the number of blocks, as shown below:
And then... BERT is built. That's right, it's that simple. In fact, in the original paper, besides building BERT from 12 such blocks (BERT-base), the authors also tested what happens with 24 blocks (BERT-large), and concluded that bigger is better.
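To make the “just stack 12 blocks” point concrete, here is a minimal sketch in PyTorch. It only shows the stacking idea with BERT-base's published dimensions (hidden size 768, 12 heads, feed-forward size 3072); embeddings, special tokens, and the real pre-trained weights are all omitted, so treat it as an illustration, not BERT's actual implementation.

```python
# A minimal sketch of "BERT-base = 12 stacked Transformer encoder blocks".
# Illustration only: real BERT also has token/position embeddings and
# pre-trained weights, which are omitted here.
import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(
    d_model=768,           # size of each token's vector representation
    nhead=12,              # number of self-attention heads
    dim_feedforward=3072,  # inner size of the feed-forward sub-layer
    activation="gelu",     # the activation BERT uses
    batch_first=True,
)
bert_like_encoder = nn.TransformerEncoder(block, num_layers=12)  # 12 blocks, as in BERT-base

# Two "sentences" of 8 tokens each, every token a 768-dim vector; the same shape comes out.
dummy_tokens = torch.randn(2, 8, 768)
print(bert_like_encoder(dummy_tokens).shape)  # torch.Size([2, 8, 768])
```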
At this point, we have finished building BERT. But construction is only the beginning; more importantly, BERT must acquire the ability to understand natural language and become a general language model. Two training tasks equip BERT with this ability: “cloze fill-in-the-blank” and “judging the next sentence”.
BERT’s two major training tasks: “cloze and judge the next sentence”
Cloze fill-in-the-blank: randomly mask a word in a sentence and let the model predict what the masked word is. For example:
- Original sentence: This boss is really water
- Mask the word “boss”: This ____ is really water
- Task: have BERT predict what the masked word is; we expect the model to output the word “boss”
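If you want to see this cloze behaviour for yourself, the sketch below assumes the Hugging Face transformers library and the public English bert-base-uncased checkpoint (the Chinese slang example above does not translate well, so an English sentence is used instead); both the library and the checkpoint are assumptions of this illustration, not something the original paper prescribes.

```python
# A minimal cloze ("fill in the blank") demo, assuming the Hugging Face
# transformers library and the public bert-base-uncased checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills the blanked-out token using context on BOTH sides of the blank.
for candidate in fill_mask("The weather is so [MASK] today, let's go to the park."):
    print(candidate["token_str"], round(candidate["score"], 3))
```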
Judge the next sentence: give the model two sentences A and B, and let it judge whether sentence B is the next sentence after sentence A in the original text. Two examples:
Example 1:
- Sentence A: The weather is so nice today
- Sentence B: Let’s go to the park to play
- Task: have BERT judge whether sentence B is the next sentence after sentence A in the original text; we expect BERT to output “yes”
Example 2:
- Sentence A: The weather is so nice today
- Sentence B: This boss is really water
- Task: have BERT judge whether sentence B is the next sentence after sentence A in the original text; we expect BERT to output “no”
In this way, BERT learns to infer word meanings from context through cloze, and learns the relationships and coherence between sentences through “judge the next sentence”. Step by step, BERT learns to understand the meaning of words and the logical relationships between sentences.
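The “judge the next sentence” behaviour can be poked at the same way; the sketch below again assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.

```python
# A minimal "judge the next sentence" demo, assuming the Hugging Face
# transformers library and the bert-base-uncased checkpoint.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The weather is so nice today."
sentence_b = "Let's go to the park to play."

inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # index 0 = "B follows A", index 1 = "B is random"

print("B follows A" if logits.argmax().item() == 0 else "B does not follow A")
```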
The model structure is in place and the training method is clear; the next step is to prepare the training data and start the actual training.
Before training: Prepare the training data
Document-level corpora: BooksCorpus and English Wikipedia. BERT's training data comes from BooksCorpus (about 7,000 books, roughly 800 million English words) and English Wikipedia (roughly 2.5 billion English words). Note that both datasets are document-level corpora; the advantage is that document-level text preserves the original structure and context, making it easy to extract long continuous text sequences so the model can learn complex semantics.
The training data can be constructed without manual annotation. For the “cloze” task, we only need to randomly mask a few words in existing sentences, which can be done automatically and quickly. For “judging the next sentence”, we already know which sentences are adjacent in the original text, so no manual annotation is needed either. We can therefore quickly construct training data like the following:
Cloze training data:
Training data for judging the next sentence:
Combining the two gives the final training data:
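To show just how mechanical this construction is, here is a hypothetical sketch (plain Python; all function and variable names are invented for illustration) of turning raw documents into cloze and next-sentence training pairs:

```python
# A hypothetical sketch of building the two kinds of training pairs
# automatically from raw documents, with no manual labelling.
import random

def make_cloze_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide some tokens; the hidden words become the labels."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)       # the model must recover this word
        else:
            masked.append(tok)
            labels.append(None)      # nothing to predict at this position
    return masked, labels

def make_next_sentence_example(doc, all_docs):
    """Pick sentence A; half the time pair it with its true next sentence."""
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        return sentence_a, doc[i + 1], "is_next"                  # genuine neighbour
    random_doc = random.choice(all_docs)
    return sentence_a, random.choice(random_doc), "not_next"      # random sentence
```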
During training: This is pre-training
Take the first data item, (The _____ is so nice today, let's go out to play) with labels (weather, yes), as an example:
So the whole training process can be expressed as follows:
Looking at the process alone, pre-training is no different from the model training we introduced earlier: feed the model input data, let it produce an output, compute the error of that output, and backpropagate to update the parameters. So what is the core difference between the two?
The core difference between pre-training and ordinary training is whether the training objective points directly at a specific task. Take the specific task “determine whether a user review's sentiment is positive or negative”: the “cloze” and “judge the next sentence” training cannot make the model complete this task directly, but we know these two tasks first help the model build basic semantic understanding, which will certainly help it complete such tasks later. That is why this training is called “pre-training”.
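For readers who like to see the shape of a pre-training step in code, here is a rough sketch that combines the two losses in a single backward pass. It assumes the Hugging Face transformers library (whose BertForPreTraining class bundles both heads); the masked word and the labels are set by hand purely for illustration.

```python
# A rough sketch of one pre-training step: cloze loss + next-sentence loss,
# assuming the Hugging Face transformers library.
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("The [MASK] is so nice today.", "Let's go out to play.",
                   return_tensors="pt")

# Cloze labels: -100 means "ignore this position"; only the masked position
# carries the true word ("weather"). Next-sentence label: 0 means "B follows A".
mlm_labels = torch.full_like(inputs["input_ids"], -100)
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0]
mlm_labels[mask_pos[0], mask_pos[1]] = tokenizer.convert_tokens_to_ids("weather")

outputs = model(**inputs, labels=mlm_labels,
                next_sentence_label=torch.tensor([0]))
outputs.loss.backward()  # one combined loss drives backpropagation for both tasks
```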
After pre-training: fine-tuning to complete numerous NLP tasks
Fine-tuning = adding accessories + training with a small amount of data. Fine-tuning BERT generally involves two steps:
- Add an appropriate input converter, output converter, or both to BERT, depending on the target task.
- Adjust the model's parameters with a relatively small amount of data (“relatively small” means the amount of data needed here is far less than what pre-training BERT itself required).
As an example, let BERT judge the sentiment of user reviews. That is, given a user review, BERT decides whether the user is praising or complaining. Since BERT can already read sentences, the input needs no modification, but a layer must be added on the output side to ensure that BERT's output is one of “positive”, “negative”, or “neutral”, as shown in the figure below:
Then prepare the training data as shown in the following image:
Finally, just train the whole thing end to end.
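As a concrete (and deliberately tiny) sketch of this fine-tuning recipe, the code below bolts a 3-way classification head onto a pre-trained BERT and runs one training step on three made-up reviews. The Hugging Face transformers library, the bert-base-uncased checkpoint, the label encoding, and the hyper-parameters are all assumptions for illustration, not Meituan's or the paper's actual setup.

```python
# A hedged sketch of fine-tuning BERT for 3-way review sentiment classification.
# Checkpoint, labels, and hyper-parameters are illustrative placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3   # the added "accessory": a 3-way output head
)

reviews = ["The food was great", "The service was slow", "It was okay"]
labels = torch.tensor([0, 1, 2])        # hypothetical encoding: 0=positive, 1=negative, 2=neutral

inputs = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**inputs, labels=labels)  # pre-trained body + new head trained together
outputs.loss.backward()
optimizer.step()                          # in practice: loop over many small batches
```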
So you see, BERT was not born to solve one specific problem; instead, it learns to understand human language “universally” from a large corpus. When we want it to solve a specific problem, only a simple modification and a little training are needed, and that process is fine-tuning.
In the original BERT paper, the authors fine-tuned BERT in the way described above to solve 4 categories of tasks, 11 specific NLP tasks in total: sentence-pair classification, single-sentence classification, question answering, and single-sentence tagging. For example:
- Sentence-pair classification: for example, given a pair of sentences, judge whether the second sentence contradicts, supports, or is neutral toward the view expressed in the first
- Single-sentence classification: for example, as in the example just given, given a sentence, judge whether its sentiment is positive, negative, or neutral
- Question answering: for example, given a question and an article containing the answer, have the model locate where in the article the answer comes from
- Single-sentence tagging: identify entities with specific meanings in the text, mainly person names, place names, times, and so on. For example, given “Zhang San ate an apple this morning”, the model should mark [Zhang San: person name], [this morning: time], [ate: action], [apple: noun] (see the sketch after this list)
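For single-sentence tagging, here is a minimal sketch using a publicly shared English NER fine-tune of BERT. The checkpoint name dslim/bert-base-NER is an assumption of this example, and it tags persons, locations, and organizations rather than the times and actions in the Chinese example above.

```python
# A minimal single-sentence tagging (NER) demo with a BERT-based checkpoint,
# assuming the Hugging Face transformers library and the public
# dslim/bert-base-NER English fine-tune.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("Zhang San ate an apple in Beijing this morning."):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```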
So to sum up, BERT has demonstrated strong adaptability to many NLP tasks with its pre-training + fine-tuning technical paradigm.
This “one training, multiple reuse” feature is particularly suited to the rapid-iteration mode of the Internet industry. So although users outside the technology circle have rarely heard of or used BERT, it has already blossomed in all kinds of Internet apps, Meituan's included.
Application of BERT in Meituan’s business
Note: this section is based on Meituan's public write-up of its exploration and practice with BERT.
Application 1: Use BERT to improve the accuracy of fine-grained sentiment analysis of user reviews, so that merchant evaluation tags give users more accurate consumption guidance. Fine-grained sentiment analysis means the machine can identify the sentiment toward different aspects within a single piece of text; for example, from “the food at this place is great, but the service is not so good” it should recognize that “food” is positive and “service” is negative. Meituan has accumulated a huge number of user reviews, and after introducing BERT, the accuracy of fine-grained sentiment analysis on these reviews reached 72.04% (no public pre-BERT figure was found, but industry experience puts it at 65%-70%).
In product terms, fine-grained sentiment analysis lets Meituan accurately aggregate a merchant's many reviews, so it can directly present evaluation tags with counts, such as [Beautifully Decorated 999+] in the figure below, together with the related reviews, and even highlight which part of the text reflects that sentiment, so users can efficiently extract merchant information from reviews.
Application 2: Use BERT to improve the accuracy of search intent recognition, so users find what they want faster. Search intent recognition means determining what type of need a user's search term expresses. For example, although “霸王别姬” (the film Farewell My Concubine) and “霸王茶姬” (the tea chain Bawang Chaji) differ by only one Chinese character, the former is a movie intent, so a list of films should be shown, while the latter is a merchant intent, so a list of merchants should be shown. Obviously, inaccurate intent recognition produces results that do not match user expectations at all, turning a poor search experience into users impatiently switching to other apps to place their orders. After Meituan introduced BERT, search intent recognition accuracy reached 93.24% (no public pre-BERT figure was found, but industry experience puts it at 85%-90%).
In terms of business revenue, the launch of BERT could plausibly add revenue on the order of 200 million yuan a year. According to data released by Meituan's technical team in 2019, Meituan's search QV-CTR rose from about 57.60% to about 58.80% after BERT went live. Let's do a rough calculation along the business funnel “user triggers a search → user clicks a search result → user places an order”. Assume the average daily search QV at the time was 5 million, that 10% of clicks led to an order, and, since the Meituan food channel is mainly meals, that an order brings in 100 yuan of revenue. Then:
- Daily revenue before BERT = 5 million × 57.60% × 10% × 100 ≈ 28.8 million yuan
- Daily revenue after BERT = 5 million × 58.80% × 10% × 100 ≈ 29.4 million yuan
WoW~ that is about 0.6 million yuan more per day, roughly 219 million more a year, and the search technology team's year-end bonus should be full hahaha
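For readers who want to check the arithmetic, here is the same back-of-the-envelope estimate in a few lines of code, using exactly the assumptions stated above (5 million daily searches, 10% order rate, 100 yuan per order):

```python
# Back-of-the-envelope check of the revenue estimate, using the article's own assumptions.
daily_qv = 5_000_000        # assumed daily search volume
order_rate = 0.10           # assumed probability of ordering after a click
revenue_per_order = 100     # assumed yuan per order

before = daily_qv * 0.5760 * order_rate * revenue_per_order  # ≈ 28.8 million yuan/day
after = daily_qv * 0.5880 * order_rate * revenue_per_order   # ≈ 29.4 million yuan/day

print(f"daily lift: {(after - before) / 1e6:.1f} million yuan")         # ≈ 0.6
print(f"annual lift: {(after - before) * 365 / 1e6:.0f} million yuan")  # ≈ 219
```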
Application 3: Use BERT to improve the accuracy of search term rewriting, making search results more relevant. After we type a search term into the search box, the system does not necessarily search with that exact term; it may rewrite it first. For example, if the original query is “KFC nearby”, the system may automatically rewrite it to “肯德基 nearby” before searching, because the merchant's registered store name is the three Chinese characters “肯德基” rather than the English “KFC”. Optimizing and adjusting the original query without changing the user's intent, so that it matches more relevant results, is called search term rewriting. Under Meituan's business volume, the number of queries that need rewriting is huge, and it is impossible to manually verify that every rewrite is accurate, so Meituan introduced BERT to judge whether the original query and the rewritten query are semantically consistent. Experiments showed that the BERT-based rewriting scheme surpassed the original XGBoost scheme (a widely used gradient-boosted tree classifier) in both precision and recall, though exact figures were not disclosed.
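Judging whether the original query and its rewrite mean the same thing is exactly a sentence-pair classification problem, so the fine-tuning recipe above applies directly. The sketch below only shows the input format: the bert-base-chinese checkpoint and the label meanings are assumptions, Meituan's actual model and data are not public, and the classification head would still need to be fine-tuned on labelled query pairs before its output means anything.

```python
# A hedged sketch of BERT as a sentence-pair classifier for query-rewrite
# consistency. Checkpoint and label order are illustrative placeholders;
# the head must be fine-tuned on labelled pairs before use.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

original, rewritten = "KFC nearby", "肯德基 nearby"
inputs = tokenizer(original, rewritten, return_tensors="pt")  # encoded as one sentence pair

with torch.no_grad():
    logits = model(**inputs).logits
print("consistent" if logits.argmax().item() == 1 else "inconsistent")  # hypothetical label order
```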
Let’s review what we have learned
BERT is a general-purpose language model. “General-purpose” means it was not born to solve one specific problem; rather, it has the underlying ability to understand natural language, and when we need it to solve a specific problem, only simple modifications are required.
BERT is a stack of Transformer Encoder Blocks. Stacking 12 Transformer Encoder Blocks gives BERT.
BERT learns to understand natural language through two tasks: “cloze” and “judge the next sentence”. “Cloze” means digging a word out of a sentence and having BERT predict what the missing word is. “Judge the next sentence” means giving BERT two sentences, A and B, and having it judge whether B is the next sentence after A in the original text.
BERT’s training data does not need to be manually annotated.Whether it’s cloze or judging the next sentence, its training data can be efficiently constructed in an automated way, which allows BERT to fully train on large-scale corpora without data annotation costs being a hindrance.
Pre-training refers to training whose objective is not aimed directly at the target task. For example, we use the two tasks “cloze” and “judge the next sentence” to let BERT first learn to understand natural language, and only afterwards train it to complete other tasks; those two rounds of training are pre-training.
Fine-tuning lets BERT complete many specific tasks. Fine-tuning specifically means adding accessories to BERT so that it can accept the target task's input and give the expected output, and then training the whole model with a small amount of relevant data so that BERT becomes competent at the target task.
BERT's “one training, multiple reuse” feature is particularly suited to the rapid iteration of the Internet industry. For example, Meituan uses BERT to improve the accuracy of user review sentiment analysis, search intent recognition, and search term rewriting.
Welcome to 2018
BERT was proposed in the 2018 Google paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”.
At this point, congratulations: you have caught up with large model knowledge as of 2018, still 7 years before the release of DeepSeek-R1.
AI Heroes
Jacob Devlin, first author of BERT
A distinguished scientist known for BERT. BERT revolutionized the field of NLP, bringing major advances in how machines understand and process human language; its paper had 130,000 citations as of May 2025, which as far as I know is second only to the Transformer paper's 180,000 in the NLP field.
A career at top research institutions. Jacob is a senior research scientist at Google, where much of his most impactful work, including BERT, was done. He previously served as a Principal Research Scientist at Microsoft Research, where he led Microsoft Translator's transition to neural networks and developed advanced on-device models for mobile neural machine translation. Reports in 2023 noted his return to Google after briefly joining OpenAI.
He is dedicated to developing fast, powerful, and scalable deep learning models for language understanding. His current work spans information retrieval, question answering, and machine translation, continuing to push the boundaries of natural language processing and its applications.
“He built the Silk Road on the wasteland of language, and once rammed the earth into a foundation, tens of millions of caravans could go to different city-states along the pre-trained masonry.”
——Postscript
Author: Ye Yu Sihan, product manager focusing on AI. Official account: Moonlight on the eve of the launch
Previous articles:
Build a large model knowledge system from 0 (4): The father of large models, Transformer
Constructing a large model knowledge system from 0 (3): the ancestor RNN of large models
Build a large model knowledge system from 0 (2): CNN that opens the eyes of the model
Building a large model knowledge system from 0 (1): What is a model?