Llama Quantization Methods

Midjourney-generated llama

“Llama” refers to a Large Language Model (LLM). “Local llama” refers to a locally-hosted (typically open source) llama, in contrast to commercially hosted ones.

Quantization Methods

Quantization is the process of “compressing” a model’s weights by converting them to lower-precision representations. Typically this goes from a 32-bit float down to around 4 bits per weight, which is important for low-memory systems. The quantized weights are then dynamically cast back to bfloat16 at runtime for inference.
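
As a rough sketch of the idea (not any particular format’s actual scheme), symmetric 4-bit quantization of a weight tensor can be illustrated in a few lines of NumPy; the per-tensor scale and the float16 dequantization target are simplifications for the example:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Naive symmetric 4-bit quantization: map each float to an integer in [-8, 7]."""
    scale = np.abs(weights).max() / 7.0        # one scale per tensor (real formats use per-block scales)
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Cast back up to a float type at inference time (float16 stands in for bfloat16 here)."""
    return (q.astype(np.float32) * scale).astype(np.float16)

w = np.random.randn(4096).astype(np.float32)   # stand-in for one layer's weights
q, scale = quantize_4bit(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```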

Quantization saves space, but it makes inference slower (due to the dynamic cast) and loses precision, making models worse. The losses are typically acceptable, however: a 4-bit quantized 56B model outperforms an unquantized 7B model with roughly the same memory footprint. For local llamas quantization is even more important, since most people don’t have computers with several hundred gigabytes of VRAM, making quantization necessary for running these models in the first place.
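
The back-of-the-envelope arithmetic behind that comparison:

```python
def weight_storage_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, ignoring per-format overhead."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_storage_gb(7, 32))   # unquantized 7B at 32-bit floats -> 28.0 GB
print(weight_storage_gb(56, 4))   # 4-bit quantized 56B             -> 28.0 GB
print(weight_storage_gb(7, 4))    # 4-bit quantized 7B              -> 3.5 GB
```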

Pre-quantization means quantizing a model ahead of time and distributing the already-quantized weights. This makes it possible for people without enough RAM to load the full-precision model to obtain quantized versions. Currently, TheBloke is the de facto standard distributor of pre-quantized models; he will often upload pre-quantized versions of the latest models within a few days of their release.
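
For instance, fetching a single pre-quantized GGUF file from one of TheBloke’s repositories with the huggingface_hub library might look like the sketch below; the repo and file names follow his usual naming scheme but are illustrative:

```python
from huggingface_hub import hf_hub_download

# Grab only the 4-bit GGUF file rather than the full-precision weights.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",     # illustrative repo name
    filename="llama-2-7b.Q4_K_M.gguf",      # illustrative quant variant
)
print("Downloaded to:", path)
```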

There are currently three competing standards for quantization, each with their own pros and cons. In general, for your own local llamas, you likely want GGUF.

GGUF

GGUF is currently the most widely used standard amongst local llama enthusiasts. It allows using CPU RAM, in addition to VRAM, to run models, which means the maximum model size is RAM + VRAM rather than just VRAM. The more of the model is loaded into RAM, the slower inference becomes.
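
With the llama-cpp-python bindings, for example, the RAM/VRAM split is controlled by how many layers you offload to the GPU; the file name and layer count below are assumptions for the sketch:

```python
from llama_cpp import Llama

# Layers offloaded via n_gpu_layers live in VRAM; the rest stay in system RAM.
# More layers left in RAM means a bigger model fits, at the cost of slower inference.
llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",   # any GGUF file
    n_gpu_layers=20,                       # e.g. 20 layers in VRAM, remainder in RAM
)

out = llm("Q: What is a local llama? A:", max_tokens=32)
print(out["choices"][0]["text"])
```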

Background:

  • Developed by the llama.cpp project to supersede (and deprecate) GGML, arriving around the release of Llama 2.
  • Currently the most popular format amongst hobbyists.

Pros:

  • The only one of the three formats that can split a model across RAM and VRAM.
  • Offloads as much computation as possible onto the GPU.
  • Works spectacularly on Apple Silicon’s unified memory model.

Cons:

  • Theoretically slower than pure-VRAM formats for models that fit entirely into VRAM, though this is rarely noticeable in practice.
  • Requires an additional conversion step, since base models are typically released in the safetensors format (though TheBloke often prioritizes GGUF quantizations).
  • Quantization into GGUF can fail, meaning some bleeding-edge models aren’t available in this format.

Being the most popular local quant format, GGUF has several internal versions. The original GGUF quants (e.g. Q4_0, Q4_1) quantized all the weights directly to the same precision. K-quants are more recent and don’t quantize uniformly: some layers are quantized more, some less, and bits can be shared between weights. For example, Q4_K_M is a 4-bit K-quant in the “medium” size variant (the S/M/L suffix trades file size against quality). In early 2024, I-quants (e.g. IQ4_XS) were also introduced. I-quants involve more CPU-heavy work, which means they can run much slower than K-quants in some cases, but faster in others.

GPTQ

GPTQ is the standard for models that are fully loaded into VRAM. If you have enough VRAM, this is a good default choice. You’ll usually see these models distributed with the “.safetensors” file extension, and occasionally “.bin”, on Huggingface.
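
Loading a GPTQ checkpoint with transformers (plus a GPTQ backend such as the optimum/auto-gptq integration installed) might look roughly like this sketch; the repo name is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"   # illustrative repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
# With enough VRAM, device_map="auto" places the whole model on the GPU;
# unlike GGUF, there is no practical RAM fallback here.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("A local llama is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```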

Background:

Pros:

  • Very fast, due to running entirely in VRAM.

Cons:

  • Can’t run any model that exceeds VRAM capacity.

AWQ

Stands for “Activation-aware Weight Quantization”.

This is the bleeding edge of quantization standards and a direct competitor to GPTQ. It uses “mixed quantization”, which means it doesn’t quantize all the weights. Leaving the n% most frequently used weights unquantized is primarily meant as a way to avoid the computational cost of constantly casting those weights to bfloat16. However, it also helps with model accuracy, as the most frequently used weights retain their full precision.
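
A heavily simplified sketch of that mixed-quantization idea (not the actual AWQ algorithm, which rescales weights based on activation statistics rather than simply skipping them):

```python
import numpy as np

def mixed_quantize(weights: np.ndarray, act_magnitude: np.ndarray, keep_frac: float = 0.01):
    """4-bit round-trip most rows, but keep the most activation-salient rows at full precision."""
    n_keep = max(1, int(keep_frac * weights.shape[0]))
    salient = np.argsort(act_magnitude)[-n_keep:]          # rows whose activations are largest

    scale = np.abs(weights).max() / 7.0
    quantized = np.clip(np.round(weights / scale), -8, 7) * scale   # lossy 4-bit round-trip
    quantized[salient] = weights[salient]                           # salient rows stay exact
    return quantized

w = np.random.randn(4096, 4096).astype(np.float32)   # stand-in for a weight matrix
act = np.abs(np.random.randn(4096))                  # stand-in for measured activation statistics
w_mixed = mixed_quantize(w, act)
```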

Background:

Pros:

  • Doesn’t quantize the top n% most used weights.
  • Very fast, due to running entirely in VRAM.

Cons:

  • Not yet supported on some major backends such as Ollama, though support has been merged into llama.cpp.
  • Slightly bigger file size, as some weights aren’t quantized.
  • Can’t run any model that exceeds VRAM capacity.
  • The format is new, so older models often don’t have AWQ pre-quantizations available.

Sources