Model fine-tuning: in-depth analysis from theory to practice

In the realm of artificial intelligence, model fine-tuning has emerged as a key technology for enhancing model performance and adapting it to specific tasks. This article will comprehensively and systematically introduce all aspects of model fine-tuning to help readers gain an in-depth understanding of this important technique.

1. What is model fine-tuning?

Model fine-tuning refers to further adjusting and optimizing an already pre-trained model so that its output better matches the needs of a specific application. Essentially, fine-tuning is itself a form of model training, and the process has much in common with training a model from scratch.

Who does the fine-tuning

Fine-tuning is usually done by experienced R&D staff or algorithm engineers. The work requires not only solid technical skills but also two core resources: the ability to implement the code and sufficient computing power. It is worth noting that although some platforms now provide visual interfaces to assist with fine-tuning, these interfaces tend to be limited in functionality and can play only a supporting role.


What kind of model can be fine-tuned

Not all models are suitable for fine-tuning, and the following two types of models are the most common fine-tuning objects:

  1. Open-source models such as LLaMA, Qwen, and GLM. These models have open architectures and parameters, making it easy for users to adapt them to their own needs.
  2. Closed-source models or platforms that expose a fine-tuning interface in their API, such as Wenxin and Zhipu AI. There is, however, an important difference between the two: fine-tuning an open-source model produces a new model that you own, whereas fine-tuning a closed-source model happens on the platform’s servers, and users never gain direct access to the model’s parameters.

Core factors that influence fine-tuning

The effect of fine-tuning is affected by a combination of factors, the most important of which include:

  • Base model selection: The performance and characteristics of the base model largely determine the upper limit of what fine-tuning can achieve.
  • Selection of fine-tuning methods: Different fine-tuning methods are suitable for different scenarios and needs.
  • Data Quality: High-quality data is a critical foundation for ensuring fine-tuning success.

2. Model fine-tuning workflow

Step 1: Demand analysis and goal setting

This stage is primarily led by the project team or product manager and is the starting point and key to fine-tuning efforts.

When fine-tuning is needed

In practical applications, the following situations usually require consideration of fine-tuning the model:

  • Project-driven requirements: For example, the client explicitly demands it, whether for capital-market considerations or to deliver a showcase project. Fine-tuning is also a common way to quickly obtain a large model tailored to a specific field, such as a mining-industry model or other vertical-domain models.
  • Special requirements for communication or language style: when the base model cannot reliably maintain a specific style through prompting alone, for example in AI children’s storytelling scenarios.
  • The base model lacks vertical-domain data: in fields that demand deep professional knowledge, such as healthcare or the military, the base model may be unable to complete professional tasks because public internet data does not cover the domain.
  • The base model cannot complete specific tasks: for example, tasks that require the model to operate a computer, phone, or other device automatically.

Things to consider before fine-tuning

Before deciding to fine-tune, the following aspects need to be thoroughly evaluated:

  • Have you fully explored prompting techniques (including few-shot and chain-of-thought prompting) and RAG?
  • Can you ensure the amount and quality of data required for fine-tuning?
  • Since base models are released and improved continuously, have you considered that you may need to re-fine-tune later?

Core work steps

1) Clarify business requirements and fine-tune model goals:

  • Carefully examine how the chosen base model performs in real-world scenarios to determine whether it really needs fine-tuning.
  • Check whether the various prompting methods have been tried.
  • Consider whether the task has been broken down sensibly.
  • Confirm that the RAG system has been fully built out. It is worth noting that in most cases, fine-tuning turns out to be unnecessary.

2) Identify specific problems that need to be addressed.

3) Set expected performance improvement goals.

4) Clarify specific business metrics or constraints.

Step 2: Data collection and preparation

Data collection and preparation are primarily led by the product manager, which is the basis for fine-tuning efforts.

Data collection

Collect relevant data from sources such as enterprise databases, log files, and user interaction records, based on your specific needs. It is especially important to collect real data from real scenarios, as this directly affects the effectiveness of fine-tuning.

Data cleaning

The collected data is cleaned to remove noise, errors, and inconsistencies to ensure the quality of the data. High-quality data is a prerequisite for models to learn effectively.

Data annotation

If you fine-tune with supervised learning, the data must be annotated. This step may be done by hiring an external team or by using internal resources. Accurately annotated data is essential for model training and performance improvement.

Data division

Divide the dataset into training, validation, and test sets to evaluate the model’s performance:

  • Training set: 70-80% for actual model training and learning.
  • Validation set: 10-15% to evaluate the model’s performance during training to make timely adjustments.
  • Test set: 10-15%, used to finally evaluate the model’s performance after model training is completed.
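As a sketch, the split above can be implemented in a few lines (the function name and exact ratios here are illustrative):

```python
import random

def split_dataset(samples, train=0.8, val=0.1, seed=42):
    """Shuffle and split samples into train/validation/test sets.

    `train` and `val` are fractions; the remainder becomes the test set.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```

Shuffling before splitting matters: if the raw data is ordered (by date, by user, by topic), an unshuffled split would make the test set unrepresentative.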

How much data fine-tuning needs (LoRA fine-tuning)

Model size and the required amount of fine-tuning data are roughly correlated: larger models generally call for more, and higher-quality, examples to fine-tune reliably.

Data quality standards

Taking the fine-tuning of a conversational model for an intelligent customer-service system as an example, data quality standards can be defined along several dimensions (for instance accuracy, consistency, and coverage of the scenario).

Step 3: Model selection

Model selection is usually led by the algorithm team, with the product manager actively involved.

Premise considerations

Before choosing a model, you also need to consider the following questions:

  • Have you fully explored prompting techniques (including few-shot and chain-of-thought prompting) and RAG?
  • Can you ensure the amount and quality of data required for fine-tuning?
  • Since base models are constantly updated, have you considered the need to re-fine-tune?

Principles and methodologies for selecting models

  • Open vs. closed source: In principle, prefer open-source models, but the final decision depends on the specific business scenario.
  • Choice of model vendor: For example, Zhipu AI’s GLM series (note that its strongest model, GLM-4, is not open source), while Alibaba’s Qwen models are a recommended choice.
  • Model size: Weigh effect against cost; within one project you may need different sizes for different scenarios. A common approach is to start with the largest model to establish the best achievable result, then step down to the smallest feasible size based on actual needs.

Scenario-based model selection

In actual projects, the selection of models needs to consider both effect and cost. A project may contain multiple scenarios, so different models may need to be selected. Experimentation and experience are often required to determine the best model selection and fine-tuning, such as:

  • Some complex tasks may require full fine-tuning of a 33B+ model.
  • For other tasks, it may be more appropriate to take a larger model (e.g., 110B-scale) and fine-tune it with some parameters frozen.

Step 4: Model fine-tuning

Model fine-tuning is carried out by algorithm engineers and is the core step of the entire workflow.

Model fine-tuning (essentially SFT)

1) Full model fine-tuning: Adjust all parameters of the entire model.

2) Parameter-Efficient Fine-tuning (PEFT):

  • Low-Rank Adaptation (LoRA): one of the most commonly used fine-tuning techniques.
  • Prompt Tuning.
  • P-Tuning.
  • Prefix-Tuning.

3) Freeze some parameters for fine-tuning: Only some parameters of the model are adjusted, and the rest of the parameters remain frozen.

4) Progressive fine-tuning: Gradually adjust the parameters of the model to improve the effectiveness and stability of fine-tuning.

5) Multi-task fine-tuning: Fine-tune multiple tasks simultaneously to improve the generalization capabilities of the model.

LoRA fine-tuning principle

The core idea of LoRA is to leave the original model weights untouched and instead learn a low-rank offset for a selected subset of weight matrices. For each chosen matrix, LoRA trains two small matrices whose product has rank r (typically 4, 8, 16, or 32); this product is added to the frozen original weights to produce the adapted parameters. Because only the small matrices are trained, the method is efficient and saves computing power, which is why it is the most commonly used approach in practice.
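A minimal numeric sketch of this idea, assuming NumPy and toy dimensions (all names here are illustrative): the frozen weight W never changes, while the trainable pair B, A supplies the rank-r offset.

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 16   # toy sizes; r is the LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init -> offset starts at 0

def lora_forward(x):
    # The original weight is untouched; the low-rank product B @ A is the offset.
    delta = (alpha / r) * (B @ A)
    return (W + delta) @ x

x = rng.standard_normal(d_in)
# With B zero-initialised, the adapted model exactly matches the base model.
assert np.allclose(lora_forward(x), W @ x)
```

Note the parameter savings: the full matrix has d_out × d_in = 4096 entries, while B and A together have r × (d_out + d_in) = 1024 trainable entries, and the gap widens rapidly at realistic layer sizes.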

QLoRA fine-tuning

The main purpose of QLoRA is to reduce excessive GPU memory usage. A rough estimate of full fine-tuning memory is: number of parameters × 4 bytes (fp32) × 4 copies (weights, gradients, and the two Adam optimizer states) / (1024 × 1024 × 1024) = x GiB. Taking a 7B model as an example, 7,000,000,000 parameters gives about 112,000,000,000 bytes, or roughly 104 GiB, which would require five NVIDIA 4090 cards (24 GB each). By quantizing the 4-byte floating-point weights down to compact integer formats (a 1-byte format already cuts that memory to a quarter, and QLoRA’s NF4 format goes further, to 4 bits), QLoRA greatly improves the feasibility of model fine-tuning.
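The same estimate, as a small helper (a rough sketch that ignores activations and framework overhead; the function name is illustrative):

```python
def finetune_vram_gib(n_params, bytes_per_value=4, copies=4):
    """Rough full-fine-tuning VRAM estimate in GiB.

    copies=4 covers fp32 weights, gradients, and the two Adam optimizer
    moment buffers; activations and framework overhead are ignored.
    """
    return n_params * bytes_per_value * copies / (1024 ** 3)

full = finetune_vram_gib(7_000_000_000)          # fp32: ~104.3 GiB
quantized = finetune_vram_gib(7_000_000_000, 1)  # 1-byte values: ~26.1 GiB
print(round(full, 1), round(quantized, 1))  # 104.3 26.1
```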

Step 5: Model evaluation

Model evaluation is led by product managers and is a critical part of ensuring that the model meets the expected requirements.

Evaluation method: approval rate

The approval rate is an important metric for evaluating a fine-tuned model’s capability in its target scenario; evaluations on general-domain benchmarks usually have no practical meaning here. The evaluation works as follows:

1) Design a question-and-answer task: have the model before fine-tuning and the model after fine-tuning each answer the same questions, then let human judges choose which answer they prefer without knowing which model produced it.

2) Evaluation criteria:

  • If the fine-tuned model’s approval rate is below 50%, the fine-tuning not only failed to improve the model but actually damaged the original model’s capabilities.
  • If the approval rate is around 50%, fine-tuning made little difference.
  • If the approval rate is between 50% and 70%, the result is below expectations.
  • If the approval rate is between 70% and 80%, the fine-tuning is a success.
  • If the approval rate exceeds 80%, the fine-tuning achieved a significant improvement in most scenarios.
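A minimal sketch of such a blind preference evaluation (the judge here is a toy stand-in for a human rater; all names are illustrative):

```python
import random

def blind_preference_round(question, base_answer, tuned_answer, judge, rng=random):
    """Present the two answers in random order so the judge cannot tell
    which model produced which; return True if the tuned answer wins."""
    pair = [("base", base_answer), ("tuned", tuned_answer)]
    rng.shuffle(pair)
    picked = judge(question, pair[0][1], pair[1][1])  # judge returns 0 or 1
    return pair[picked][0] == "tuned"

def approval_rate(wins):
    return sum(wins) / len(wins)

# Toy run: a judge that always prefers the longer answer.
judge = lambda q, a, b: 0 if len(a) >= len(b) else 1
wins = [blind_preference_round("q", "short", "a longer answer", judge)
        for _ in range(100)]
print(approval_rate(wins))  # 1.0 here, since the tuned answers are always longer
```

The randomized ordering is the essential part: without it, judges tend to develop positional biases that distort the approval rate.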

Step 6: Model deployment

Once the model passes the evaluation, it can be deployed to production by R&D personnel, making it available for the actual business.

Step 7: Monitoring and maintenance

After the model is deployed to production, the product manager is responsible for monitoring and maintaining it:

  • Performance Monitoring: Regularly check the model’s performance to ensure it continues to meet business needs.
  • Update and retrain: As new data is acquired or the business environment changes, the model may need to be retrained or fine-tuned to adapt to new situations.

Step 8: Feedback loop

Product managers need to design feedback and supervision mechanisms to establish an effective feedback loop: collect feedback information during model use to guide future improvement and optimization work, so that the model can continue to evolve and improve.

3. Data engineering

To be clear, fine-tuning is not a one-time project, and continuous data collection and systematic data processing are more important than the fine-tuning technology itself.

How to collect preference data (equivalent to manual annotation)

  • Likes and Clicks: Collect preference data based on user likes or clicks on the model’s output.
  • Multi-option selection: for example, show the user 4 images at once and let them pick the one they prefer, or have the model generate two answers for the user to choose between.
  • Agent Workbench Assistance: In the agent workbench, the model generates 4 assisted responses, 2 from the original model and 2 from the fine-tuned model, allowing the agent to select the most suitable reply to collect preference data.

Product function design

  • Data collection capabilities: In product function design, data collection capabilities must be available to obtain user feedback and preference data in a timely manner.
  • Targeted data collection: Design targeted data collection mechanisms for product functions for specific scenarios to improve data pertinence and effectiveness.

Design a good data management platform

  • Leverage LLM capabilities: Give the data management platform a certain level of intelligence to improve the efficiency and quality of data management.
  • Reference cases: such as Baidu Intelligent Cloud’s data management and data annotation platform.

Data management platform product framework

  • Data source management: manage data sources such as proprietary data, public data, user-generated data, expert-written data, model-synthesized data, and crowdsourced data, and support direct import from online systems as well as real-time integration between the data platform and those systems.
  • Systematic annotation: including label definition, label-hierarchy construction, annotation-task management and task tiering, supporting annotation by users, service staff, experts, and AI, as well as cross-annotation, cross-review, and expert spot checks.
  • Data deduplication and augmentation: deduplicate via prompt-similarity calculation and grade data sources by overall quality; augment data with synonym substitution, word-order shuffling, back-translation, data blending, and similar techniques. For example, 200 Q&A pairs can be scaled to 400 by having a model paraphrase each question, and translating data into another language and back again introduces diversity of expression.
  • Data packaging: automatically split training, validation, and test sets; link datasets to model versions and evaluation results; and assess data availability and reusability by label.
  • Model evaluation: support evaluation methods such as preference annotation, expert scoring, and user-preference collection; report results by label; and reuse evaluation annotations as training data.
  • Use powerful models to assist the work: for example, use strong models for data filtering, auto-annotation, cross-checking, prompt-similarity calculation, response-quality comparison, data augmentation, and model evaluation to improve the efficiency and quality of data engineering.
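The synonym-substitution augmentation mentioned above can be sketched as follows (the synonym table is a toy placeholder; in practice it would come from a thesaurus or an LLM):

```python
import itertools

# Toy synonym table -- in practice, sourced from a thesaurus or an LLM.
SYNONYMS = {
    "refund": ["reimbursement"],
    "order": ["purchase"],
}

def synonym_variants(sentence, table=SYNONYMS):
    """Yield the original sentence plus versions with words swapped for synonyms."""
    words = sentence.split()
    slots = [[w] + table.get(w, []) for w in words]
    for combo in itertools.product(*slots):
        yield " ".join(combo)

variants = list(synonym_variants("how do i get a refund for my order"))
print(len(variants))  # 4: two choices for "refund" x two for "order"
```

Augmented variants should still be spot-checked for quality: mechanical substitution can produce unnatural phrasings, which is why the text above pairs augmentation with grading and review.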

4. Postscript

Principles

  • Without high-quality data, fine-tuning in practice is of little value. Data quality is the key factor determining fine-tuning results; even with state-of-the-art fine-tuning techniques, it is hard to achieve the desired outcome without high-quality data.
  • What matters is not the sheer volume of data but its quality. In data engineering, pay more attention to data quality than quantity; high-quality data lets the model learn the required knowledge and patterns more effectively.

With the comprehensive introduction above, readers should now have a solid understanding of model fine-tuning. In practice, choose fine-tuning methods and data-processing approaches based on the specific business needs and scenarios to effectively improve model performance.
