Constrained by the computing power and storage space of devices, making the device-side model smarter and more efficient with limited resources has become a key challenge for AI product managers. This article introduces nine cutting-edge techniques in detail, hoping to help you.
A device-side model is an AI that runs directly on your device.
Why run AI on the device side? Isn’t the cloud model good enough?
The benefits of device-side AI are plentiful: it protects privacy (data stays local, with no need to upload to the cloud), it responds as fast as lightning (after all, the “brain” is right there, no need to travel thousands of miles to visit the “cloud”), and it does not depend on the network, so it can be used anytime, anywhere (there isn’t a signal everywhere when you are out and about).
With so many benefits on the device side, the problem is that the “brain” on the device has limited space and limited energy; it is nothing like the cloud supercomputers running on clusters of ten thousand GPU cards. How can we make the device-side model smarter and more capable under such “tight” conditions?
This article unveils the nine techniques that make the device-side model “small yet smart”.
The first trick: a famous teacher produces an outstanding student – the wisdom inheritance of “Knowledge Distillation”. “What is distilled is the essence!”
Imagine a knowledgeable old professor (let’s call him the “teacher model”) and a clever student (our “student model”, the device-side model). Although the old professor knows a great deal, he is “huge” and cannot be stuffed directly into your mobile phone. What to do?
“Knowledge distillation” is like having this old professor teach the student step by step. The student not only learns the standard answers in the textbook (the technical term is “hard labels”), but more importantly learns the way the old professor thinks about a problem and the subtleties of his judgment (for example, when the old professor sees a cat, he not only knows it is a cat, but also judges it is 90% likely to be a British Shorthair and 10% likely to be an American Shorthair; this probability distribution is the “soft label”).
In this way, although the student model is “petite”, it learns the master’s “inner techniques”, and its performance naturally far exceeds what it could achieve by studying alone “behind closed doors”.
In short, let a “student model” with few parameters and low computational cost learn the essence of a large, highly capable “teacher model”, rather than just the training data itself.
How to achieve it?
First, train a full-strength “teacher model” (for example, on a powerful cloud server).
Then, have the “student model” learn the true labels of the data while also imitating the teacher model’s outputs on the same data (those probability distributions, the “soft labels”).
Eventually, this “student model” is small yet strong enough to run on your phone.
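To make the idea concrete, here is a minimal sketch of a distillation loss in PyTorch. It assumes you already have a frozen teacher and a student classifier of your own; the temperature T and weight alpha are typical illustrative values, not prescriptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend the usual hard-label loss with a soft-label (teacher) loss."""
    # Hard-label term: standard cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    return alpha * soft + (1 - alpha) * hard

# Inside a training loop (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, labels)
```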
For example:
Google has shown in its research that knowledge distillation can transfer the knowledge of a large image-recognition model into a much smaller mobile model, which keeps latency low while losing very little accuracy. For example, a large model may reach 85% accuracy, while the distilled small model may reach 83%, with the model size and compute reduced by several times or even dozens of times.
The second trick: living frugally – the magic of “Quantization”. “From ‘luxury’ to ‘economy’!”
We know that numbers in computers are represented by a string of 0s and 1s. The more precise the representation, the larger the space it occupies and the slower it is to calculate.
“Quantization” is like switching these numbers from the “high-precision luxury edition” to the “economy edition”. For example, where we used to spend 32 bits on each number (FP32), we now find a way to approximate it with 16-bit floating point (FP16) or even 8-bit integers (INT8).
It is like measuring with a ruler accurate to several decimal places versus a slightly coarser ruler that is still good enough. The model’s “weight” drops instantly, computation speeds up with a “whoosh”, and in many cases the loss in accuracy of the final result is tiny.
In short, it reduces the representation accuracy of numbers (weights and activation values) in the model, such as changing from a 32-bit floating-point number to an 8-bit integer, thereby greatly reducing the model size and computational effort.
How to achieve it?
Post-Training Quantization (PTQ): the model is trained at high precision and then converted directly to low precision, like “rapid slimming”. This requires a “calibration” step to see roughly what range the numbers fall in, and then map them accordingly.
Quantization-Aware Training (QAT): a more advanced way to play! During training, the model is told: “You will live a tight, ‘low-precision’ life in the future.” The model actively adapts to this change during training, so the accuracy loss is usually smaller.
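Here is a minimal sketch of the core PTQ idea in NumPy: pick a scale from calibration data, then map FP32 values into the INT8 range. Real toolchains add per-channel scales, zero points, and fused operators; this shows only the essence.

```python
import numpy as np

def calibrate_scale(calibration_values: np.ndarray) -> float:
    """Pick a scale so the observed value range maps onto the symmetric INT8 range [-127, 127]."""
    max_abs = np.max(np.abs(calibration_values))
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """FP32 -> INT8 (symmetric, per-tensor)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """INT8 -> approximate FP32."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)  # stand-in for one layer's FP32 weights
scale = calibrate_scale(weights)                     # the "calibration" step
w_int8 = quantize(weights, scale)                    # 4x smaller to store than FP32
print("max error:", np.max(np.abs(weights - dequantize(w_int8, scale))))
```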
For example
According to reports from chip manufacturers such as Qualcomm, quantizing a model from FP32 to INT8 can shrink the model size by about 4 times and speed up inference by 2 to 4 times on hardware that supports INT8 computation (such as the AI engine in its Snapdragon processors), while also significantly cutting power consumption. In some image classification tasks, for example, the accuracy lost after INT8 quantization was less than 1%.
The third trick: make the decisive cut, pull the weeds and keep the essence – the art of “Pruning”. “Cut off unnecessary branches to make the trunk stronger!”
A trained neural network is like a leafy tree. But if you look closely, some “branches” (connections or neurons in the network) actually contribute little to the final “outcome” (the model’s prediction), or even a little redundant. The “pruning” technique is like a master gardener, pruning these “unfruitful” or “messy” branches.
In this way, the “body size” of the model becomes smaller, the amount of computation is reduced, and it naturally runs more briskly. Studies have shown that pruning some classical image recognition models (such as VGG and ResNet) can reduce parameters and computational costs by 50% or more without losing accuracy.
In a nutshell
Remove less important parameters (weights) or structures (neurons, channels) from the neural network to make the model smaller and faster.
How to achieve it?
- Magnitude pruning: the simplest and crudest approach. Whichever connection has a small weight value (close to 0) is deemed unimportant, so snip it off (a sketch follows this list).
- Structured pruning: a more refined technique. Instead of cutting one connection at a time, it removes an entire “department” in one go (such as a whole neuron or a convolutional kernel channel). The benefit is that the pruned model keeps a regular structure, which is easier for hardware to accelerate.
- Iterative pruning: cut a little, then “tutor” the model again (fine-tune it) to restore its strength, cut a little more, then tutor again… this usually works best.
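Below is a minimal magnitude-pruning sketch in PyTorch. It assumes `model` is any trained `nn.Module` and simply zeroes the smallest-magnitude weights in each Linear/Conv layer (unstructured pruning); a real pipeline would also keep a mask or sparse format and fine-tune between rounds.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the `sparsity` fraction of smallest-magnitude weights, layer by layer."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            # Threshold = k-th smallest absolute weight in this layer.
            threshold = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > threshold).float()
            module.weight.data.mul_(mask)  # keep large weights, zero out the rest

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
magnitude_prune(model, sparsity=0.5)
# In practice you would now fine-tune, prune a bit more, and repeat (iterative pruning).
```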
An analogy
It’s a bit like our brain’s learning process. Neuroscientists have found that there are many connections between brain neurons in infancy, but as they grow and learn, some uncommonly used connections weaken or even disappear, while commonly used connections strengthen to form efficient neural networks. Pruning is also simulating this process of “survival of the fittest”.
Trick 4: AI designs AI – the automation revolution of “Neural Architecture Search (NAS)”. “Let AI design its own structure!”
In the past, designing neural networks relied heavily on the experience and inspiration of human experts, just like architects designing houses, which required repeated debugging. But if you want to build a “good and fast” house on a “small foundation” like a mobile phone, the challenge is even greater.
“Neural Architecture Search” (NAS) lets AI explore and design the network structure that best suits the device. You set the goals (for example: I want a model that is highly accurate, fast, and power-efficient), and the AI automatically assembles and tests candidate architectures from a huge “pool of building blocks” (all the possible network components and connections), like playing with Lego, and finally picks the best one.
In a nutshell
Use algorithms, instead of manual design, to find the neural network structure that performs best on specific hardware, such as a mobile phone chip.
How to achieve it?
- Define the search space: first, tell the AI what “building blocks” are available (such as different types of convolution or pooling layers) and how they may be combined.
- Employ a search strategy: use techniques such as reinforcement learning, evolutionary algorithms, or gradient-based methods to guide the AI in efficiently “trying” different combinations.
- Evaluate performance: quickly assess the quality of each “design” and select the champion (a toy random-search sketch follows this list).
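As a toy illustration only, here is a random-search sketch over a tiny hand-made search space. The `evaluate_on_device` scoring function is hypothetical; a real NAS system would train (or cheaply estimate) each candidate and measure accuracy and latency on the target hardware.

```python
import random

SEARCH_SPACE = {
    "block": ["conv3x3", "conv5x5", "depthwise_separable"],
    "width": [16, 32, 64],
    "depth": [4, 8, 12],
}

def sample_architecture():
    """Randomly pick one option per dimension of the search space."""
    return {key: random.choice(options) for key, options in SEARCH_SPACE.items()}

def evaluate_on_device(arch) -> float:
    # Hypothetical proxy score trading accuracy against latency;
    # replace with real training plus on-device measurement.
    accuracy_proxy = arch["width"] * arch["depth"]
    latency_proxy = arch["width"] * arch["depth"] * (2 if arch["block"] == "conv5x5" else 1)
    return accuracy_proxy - 0.5 * latency_proxy

best = max((sample_architecture() for _ in range(100)), key=evaluate_on_device)
print("best candidate:", best)
```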
For example
Google’s EfficientNet series is a representative result of NAS. The researchers used NAS to find a base network architecture, then generated a family of models with a unified set of scaling rules, improving both accuracy and efficiency and making them well suited to deployment on mobile devices. For example, EfficientNet-B0 achieves ImageNet accuracy similar to ResNet-50 while using significantly fewer parameters and less computation.
Trick 5: Mixed-Precision Training and Inference. “Use good steel on the blade’s edge; allocate precision where it is needed!”
Not every step of an AI model’s computation needs the highest precision. Some calculations must be precise, or the result will be wrong (such as key decision steps); others can be a little rough without doing any harm, and being rough greatly improves speed (such as passing along certain intermediate features).
In a nutshell
The “mixed precision” technique is like an experienced master who knows when to use a vernier caliper (high-precision FP32) and when to use a tape measure (low-precision FP16 or even INT8). It cleverly combines high-precision and low-precision calculations when training and inference (model prediction): high precision ensures accuracy in critical parts, and low precision in non-critical parts to improve efficiency, reduce memory footprint and power consumption.
How to achieve it?
- Hardware support: today’s AI chips (NVIDIA GPUs have included Tensor Cores since the Volta architecture, and many device-side AI accelerators do the same) natively support low-precision computation such as FP16 at throughput far exceeding FP32.
- Framework support: mainstream deep learning frameworks such as TensorFlow and PyTorch have built-in Automatic Mixed Precision (AMP) modules. Developers only need to turn on the option, and the framework automatically decides which operations can run at low precision and which must stay at high precision, applying the corresponding conversions and compensations (e.g., loss scaling to keep gradients from vanishing). A training-step sketch follows this list.
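For reference, here is a minimal PyTorch AMP training-step sketch. It assumes you already have a `model`, an `optimizer`, a `loss_fn`, and a CUDA device; it illustrates the pattern, not a complete training script.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # handles loss scaling to avoid FP16 gradient underflow

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # eligible ops run in FP16, sensitive ops stay in FP32
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(optimizer)             # unscale gradients, then take the optimizer step
    scaler.update()                    # adjust the scale factor for the next iteration
    return loss.item()
```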
Practical effect
In many cases, using mixed precision (e.g., FP32 and FP16 mixing) can increase training speed by 2-3 times, significantly improve inference speed, and reduce memory footprint by about half, with little loss of model accuracy. This is undoubtedly a big plus for scenarios where you want to run larger, more complex models on the device. For example, in image recognition or natural language processing tasks, with mixed precision, the model responds faster and the user experience is better.
Sixth trick: when everyone gathers firewood, the flames rise high – the privacy protection and collective wisdom of “Federated Learning”. “Data stays home, yet intelligence improves together!”
We want the device-side model to learn from more diverse data and become smarter. But users’ personal data is highly sensitive, and uploading it to the cloud to train models carries the risk of privacy leaks. What to do?
“Federated learning” offers an elegant solution. It is like a “decentralized” study group. Everyone’s data stays on their own phone or device (the data never leaves home); only the “knowledge” of each model update (the parameter updates) is sent to a central server to be aggregated into a stronger “collective intelligence model”, and this “upgraded” model is then distributed back to every device.
This not only protects user privacy but also allows the model to benefit from the massive amount of scattered data.
In a nutshell
Multiple devices, such as many mobile phones, collaboratively train a machine learning model without sharing their local data. Each device trains the model on its own data, then sends only the model updates (not the data itself) to a central server for aggregation, which is finally fed back to the devices so the device-side model performs better.
How to achieve it?
Step 1: The server initializes the device-side model.
Step 2: The model is distributed to the selected devices.
Step 3: Each device trains the model locally on its own data.
Step 4: Each device encrypts the model updates produced by training (such as weight changes) and sends them to the server.
Step 5: The server aggregates the updates from all devices (e.g., by averaging) into a better global model.
Step 6: Repeat steps 2-5 until the device-side model reaches the desired performance (a FedAvg sketch follows these steps).
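Here is a minimal sketch of the server-side aggregation step (FedAvg-style averaging) in PyTorch. Client-side training, device selection, and the encryption/secure-aggregation layer are out of scope, and the function and variable names are illustrative.

```python
import copy
import torch

def federated_average(client_state_dicts):
    """Average the parameters uploaded by all clients into a new global model state."""
    global_state = copy.deepcopy(client_state_dicts[0])
    for name in global_state:
        stacked = torch.stack([sd[name].float() for sd in client_state_dicts], dim=0)
        global_state[name] = stacked.mean(dim=0)
    return global_state

# One communication round, in outline:
#   1. server sends global_model.state_dict() to the selected devices
#   2. each device trains locally and returns its updated state_dict
#   3. server aggregates and updates the global model:
# global_model.load_state_dict(federated_average(collected_state_dicts))
```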
For example
Google’s Gboard keyboard uses federated learning to improve its next-word prediction model. As millions of users type, their devices use local input history (which never leaves the device) to improve a small part of the prediction model; the updates are then securely aggregated into a global model that benefits all users. Prediction gets more and more accurate while users’ input content stays private.
Trick 7: Dynamic Inference and Adaptive Computation for device-side models. “Eat according to your appetite – AI’s ‘smart gearbox’!”
Imagine driving: when cruising on a flat road, you use a more fuel-efficient gear and speed; when you hit a steep climb, you shift into a more powerful low gear and give it more throttle. “Dynamic inference”, or “adaptive computation”, gives the device-side AI model this same “smart shifting” ability.
In a nutshell
The model dynamically adjusts how much computation and “depth of thought” it spends based on the “difficulty” of the current input (for example, whether an image is a plain solid background or a detail-packed complex scene) or the device’s current “physical condition” (whether the battery is sufficient, whether the CPU/NPU is idle). For simple tasks, the model stops after a shallow pass and returns a result quickly with little computation; for complex tasks, or when resources are plentiful, the model goes all out, calling more network layers or more complex computational paths for the best result.
How to achieve it?
- Multi-branch networks / early exit: design a network with several “exits” at different depths. For a simple input, a shallow layer may already be confident enough, so the model outputs the answer from that “early exit” without running the whole network. It is like an exam: you can answer an easy question at a glance without checking it repeatedly. Anytime-prediction models embody this idea (a sketch follows this list).
- Conditional computation: certain modules or computational paths in the network are activated only when specific conditions are met. For example, a module specialized in fine-grained cat-breed recognition is activated only after a blurry outline that might be a “cat” is detected. Google’s Switch Transformers use a similar sparse-activation idea, though aimed at very large models; on the device side, it means the most expensive parts of the computation can be run selectively, as needed.
- Resource-aware scheduling: the model or AI framework senses the device’s real-time resources (battery, temperature, available compute, etc.) and adjusts the model’s complexity or inference strategy accordingly.
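Here is a minimal early-exit sketch in PyTorch: if the shallow classifier is already confident enough, its answer is returned and the deeper layers are skipped. The layer sizes and confidence threshold are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, num_classes=10, threshold=0.9):
        super().__init__()
        self.threshold = threshold
        self.shallow = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
        self.exit1 = nn.Linear(64, num_classes)   # early-exit head
        self.deep = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                                  nn.Linear(64, 64), nn.ReLU())
        self.exit2 = nn.Linear(64, num_classes)   # final head

    def forward(self, x):
        h = self.shallow(x)
        early_logits = self.exit1(h)
        confidence = F.softmax(early_logits, dim=1).max(dim=1).values
        if confidence.min() >= self.threshold:    # easy input: answer early, save compute
            return early_logits
        return self.exit2(self.deep(h))           # hard input: run the full depth

model = EarlyExitNet()
logits = model(torch.randn(1, 128))
```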
Practical effect
This technique can significantly improve the device-side model’s energy efficiency and the user experience. Most inputs are relatively simple, and the model can respond quickly at very low power; for the minority of complex cases, quality is still guaranteed. AI applications thus last longer on power-sensitive mobile devices or IoT nodes and stay reasonably smooth under different loads. For example, a smart camera’s object detection can lower the frame rate or model complexity when the scene is still or objects are sparse, and ramp up the computation as soon as it detects fast-moving or dense objects.
Eighth trick: born to be useful – the innate advantage of “Efficient Model Architectures”. “Born ‘fast’ and ‘small’!”
Besides the “after-the-fact tuning” methods above, we can also start from the “genes” and directly design network structures that are inherently light on parameters and computation. It is like an athlete whose build and muscle composition are optimized from the start for sprints or marathons.
These efficient model architectures are designed with the limitations of the device in mind.
In a nutshell
Design, from the very beginning, a neural network with a lightweight structure and high computational efficiency.
Notable representatives
- MobileNets (v1, v2, v3): their core is the “depthwise separable convolution”, which splits a traditional convolution into two steps and greatly cuts computation and parameters. Imagine traditional convolution as “stir-frying one big pot of everything over high heat”, while depthwise separable convolution “fries each ingredient separately first, then mixes them with a small amount of seasoning”, which is more efficient. MobileNetV3 additionally uses NAS to push efficiency and performance further (a sketch of the core block follows this list).
- ShuffleNets: use techniques such as “pointwise group convolution” and “channel shuffle”, cleverly rearranging and recombining the building blocks to cut computational cost further.
- SqueezeNets: with “Squeeze” and “Expand” modules, approach the accuracy of large networks with far fewer parameters.
- EfficientNets: as mentioned earlier, NAS finds the optimal combination of network depth, width, and input resolution and scales them in proportion, achieving the best performance under different compute budgets.
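To show the core idea behind MobileNets, here is a minimal depthwise separable convolution block in PyTorch: a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution. The channel sizes below are arbitrary examples.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        # Pointwise: a 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

block = DepthwiseSeparableConv(32, 64)
y = block(torch.randn(1, 32, 56, 56))  # far fewer multiply-adds than a standard 3x3 convolution
```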
Data is proof
Taking MobileNetV2 as an example: compared with the classic VGG16 model, it achieves similar accuracy on ImageNet image classification, yet with roughly 25 times fewer parameters and roughly 30 times less computation. That is what lets it run smoothly on devices like mobile phones.
The ninth trick: put the AI brain on a “diet” – memory optimization black tech. “Squeeze every last drop out of that precious ‘brain capacity’!”
Beyond the general techniques above for making models smaller and faster, there are also tricks aimed specifically at saving memory, so that the AI brain does not freeze our phones and computers by hogging space while it runs.
A. Weight Sharing/Clustering
“Birds of a feather flock together; parameters of a kind share a group!”
In a nutshell
Imagine a neural network with thousands or even millions of parameters (weights). “Weight sharing” or “weight clustering” notices that many of these parameters are actually very similar, or can be grouped into a few categories. Instead of storing an exact weight value for every connection individually, we let many connections share the same weight value (or the same small set of values). It is like a closet with 100 white shirts that differ only slightly in shade and cut, each taking up its own hanger. We realize they can be grouped into a few categories such as “pure white”, “off-white”, and “mercerized white”, with each category represented by one “standard white shirt”. The number of “standard white shirts” to store shrinks dramatically, and memory naturally loosens up (see the sketch below).
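Here is a minimal weight-clustering sketch using k-means from scikit-learn. Each weight is replaced by its cluster centroid, so only the small codebook of centroids plus a compact index per weight needs to be stored; the cluster count is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(weights: np.ndarray, n_clusters: int = 16):
    """Return (centroids, indices) such that centroids[indices] approximates `weights`."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    centroids = km.cluster_centers_.flatten()   # the tiny codebook of shared values
    indices = km.labels_.astype(np.uint8)       # one small index per original weight
    return centroids, indices

weights = np.random.randn(4096).astype(np.float32)
centroids, indices = cluster_weights(weights, n_clusters=16)
approx = centroids[indices].reshape(weights.shape)
print("mean abs error:", np.abs(weights - approx).mean())
# With bit-packing, 16 clusters needs only 4 bits per weight plus the codebook,
# versus 32 bits per weight in FP32.
```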
Effect
This approach can significantly reduce the memory required to store weights. For example, the Deep Compression technology proposed by Stanford University combines weight clustering (a form of weight sharing) with pruning and quantization to successfully compress large networks such as AlexNet and VGG by 35 to 49 times with little loss of accuracy, which is crucial for deploying these complex models on memory-limited mobile devices.
B. Low-Rank Factorization
“Break the ‘big fat matrix’ into two ‘skinny little matrices’!”
In a nutshell
In neural networks, the computation of many layers is essentially a huge matrix multiplication. If a weight matrix is very “fat” (high-dimensional), it contains many parameters and takes up a lot of memory. “Low-rank factorization” discovers that this “big fat matrix” can be approximated by the product of two or more much “thinner” small matrices. It is like finding that a complex pattern (the large matrix) can be composed by layering a few simple base patterns (the small matrices): we only need to store the base patterns to reconstruct the original, which saves a great deal of storage (see the sketch below).
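Here is a minimal low-rank factorization sketch with NumPy’s SVD: approximate a large weight matrix W (m x n) by the product of two thin matrices A (m x r) and B (r x n). The matrix size and rank are arbitrary illustrative numbers.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Truncated SVD: keep only the top-`rank` singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (m, rank)
    B = Vt[:rank, :]             # shape (rank, n)
    return A, B

W = np.random.randn(512, 512).astype(np.float32)   # 262,144 parameters
A, B = low_rank_factorize(W, rank=64)               # 512*64 + 64*512 = 65,536 parameters
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```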
Effect
For example, in recommendation systems, user-item interaction matrices are often huge and sparse; low-rank factorization (such as SVD or its variants) can efficiently extract latent features and store those feature vectors in far less memory than the original matrix, enabling efficient personalized recommendation. Applying the same idea to neural network layers significantly reduces the storage needed for weight parameters.
C. Activation Memory Optimization
“Account for every bit of ‘temporary memory’!”
In a nutshell
When the model runs prediction (inference), it is not only the model’s own weights that occupy memory: the intermediate results produced by each layer’s computation, called “activations”, must also be held temporarily because the next layer needs them. In a deep network these activations can add up to a lot of memory, like cooking: if the half-finished ingredients of every step stay on the counter, the kitchen fills up fast. Activation memory optimization finds ways to cut this “temporary memory” overhead, for example by clearing away a half-finished item as soon as it has been used, or by holding it in smaller containers (lower precision).
How to achieve and effect
- Activation quantization: activations, too, can be reduced from 32-bit floating point to 8-bit integers, cutting activation memory by 75% outright.
- In-place computation: the computation is arranged so that new results directly overwrite old activations that are no longer needed, avoiding extra memory allocation.
- Activation recomputation: if some activations are cheap to recompute, recompute them when they are needed instead of keeping them in memory the whole time. It is like a reference book: rather than memorizing the entire book (using up brain memory), you flip to the page (recompute) when you need a particular fact. This matters especially when deploying AI models on memory-tight microcontrollers (MCUs). A checkpointing sketch follows this list.
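Here is a minimal activation-recomputation (gradient checkpointing) sketch in PyTorch using `torch.utils.checkpoint`: activations inside the checkpointed block are not kept during the forward pass and are recomputed during backward, trading extra compute for lower peak memory. The layer sizes are arbitrary, and `use_reentrant=False` assumes a reasonably recent PyTorch.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                       nn.Linear(1024, 1024), nn.ReLU())
block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

x = torch.randn(32, 1024, requires_grad=True)
h = checkpoint(block1, x, use_reentrant=False)  # block1's activations are recomputed on backward
out = block2(h)
out.sum().backward()                            # backward works as usual, with lower peak memory
```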
Putting it all together to build the ultimate “strongest brain on the device”
By now you should have a fairly comprehensive picture of how to give device-side AI “more intelligence with less burden”.
From the wisdom inheritance of knowledge distillation, to the careful budgeting of quantization and pruning, to the automated design of neural architecture search and the clever use of federated learning; supplemented by the innate advantages of efficient model architectures and memory-focused “slimming tricks” such as weight sharing, low-rank factorization, and activation optimization; plus mixed-precision computing and flexible dynamic inference in pursuit of ultimate efficiency. Together they form a powerful toolbox.
This is how clever end-side AI might be built:
- First, use neural architecture search to find an efficient model architecture that is naturally suited to the device side; this architecture may itself support some dynamic inference features, such as multiple compute branches.
- Then, use knowledge distillation to “learn from the master”, the more powerful cloud “teacher model”, raising the student’s “IQ ceiling”.
- Next, use pruning and low-rank factorization to remove redundant parameters and structures and slim down the model’s “skeleton”.
- Then apply quantization (to both weights and activations) and weight sharing/clustering, combined with mixed-precision training and inference, to squeeze the model’s “weight” and memory footprint to the extreme while speeding up computation.
- At runtime, adopt activation memory optimization strategies and combine them with dynamic inference, spending every KB of “temporary memory” and every bit of compute carefully.
- In scenarios where data privacy is critical, or where data from many parties needs to be pooled, use federated learning to keep improving the model’s collective intelligence.
Looking to the future: device-side AI that is not only “smart”, but also “light”, “inclusive”, and “imperceptible”
Making the device-side model “smarter” and “lighter” bears directly on the digital-life experience each of us will have.
As these technologies continue to evolve and converge, future device-side AI will no longer be just a tool for executing simple commands. It will have stronger understanding, reasoning, and personalized adaptability, while demanding less and less from the device’s resources, even reaching a state of “imperceptible” operation – you can hardly notice it consuming resources, yet you enjoy the intelligent convenience it brings all the time.
Imagine this:
- The phone can truly understand your emotions and intentions, becoming your caring personal assistant, and all done locally, smoothly and privately.
- Smart devices at home can actively learn your living habits and create the most comfortable and convenient environment for you without worrying about privacy leaks or network latency.
- Robotic arms in factories can use end-side AI for more precise self-calibration and failure prediction, greatly improving production efficiency and safety, with models small enough to be embedded in each sensor node.
- In areas with scarce medical resources, portable diagnostic devices equipped with powerful and lightweight end-side AI can assist doctors in making quick and accurate initial diagnoses, and can even be easily operated by community health workers.
Device-side AI lets cutting-edge technology serve everyone at lower cost, lower power, higher efficiency, and greater safety, truly realizing inclusive AI.