If ever there were a salient example of a counter-intuitive technique, it would be the quantization of neural networks. Quantization reduces the precision of the weights and other tensors in neural network models, often drastically. It’s no surprise that reducing the precision of weights and other parameters from, say, 32-bit floats to 8-bit integers makes a model run faster and lets it run on less powerful processors with far less memory. The stunning, counter-intuitive finding is that quantization can be done while largely preserving the model’s accuracy.
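To make the idea concrete, here is a minimal sketch of symmetric int8 quantization in Python with NumPy. The weight tensor is a made-up stand-in (real weights would come from a trained model), and the scheme shown is the simplest per-tensor variant, not any particular framework's implementation:

```python
import numpy as np

# Stand-in for a trained weight tensor (hypothetical values).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(4, 4)).astype(np.float32)

# Symmetric per-tensor quantization: map [-max|w|, +max|w|]
# onto int8's symmetric range [-127, 127] with a single scale factor.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to recover an approximation of the original weights.
deq = q.astype(np.float32) * scale

# Rounding error is at most half a quantization step (scale / 2).
max_err = np.abs(weights - deq).max()
```

Storage drops fourfold (one byte per weight instead of four), and the worst-case error per weight is bounded by half the quantization step; for typical weight distributions that error barely moves the model's outputs.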
Why do we need quantization? Current large language models (LLMs) are enormous. The best models need to run on a cluster of server-class GPUs; gone are the days when you could run a state-of-the-art model locally on one GPU and get quick results. Quantization not only makes it possible to run an LLM on a single GPU; it also allows you to run one on a CPU or an edge device.