
ExLlamaV2: The Fastest Library to Run LLMs

Quantize and run EXL2 models

Quantizing Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers excellent performance on GPUs. Compared to unquantized models, this method uses almost 3x less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library.
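For reference, loading a ready-made GPTQ checkpoint through transformers looks roughly like the sketch below. This is not part of the EXL2 workflow covered in this article; the repository name is illustrative, and the optimum and auto-gptq packages are assumed to be installed.

# Sketch: loading a pre-quantized GPTQ model via the transformers integration.
# Assumes: pip install transformers accelerate optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-beta-GPTQ"  # illustrative repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("I have a dream", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))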

ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it is optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.

In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab.

To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows:

git clone https://github.com/turboderp/exllamav2
pip install exllamav2

Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let's use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.

We download zephyr-7B-beta using the following command (this can take a while since the model is about 15 GB):

git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:

wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
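As an optional sanity check before launching the conversion, you can peek at the downloaded file; this is a minimal sketch assuming pandas and pyarrow are installed, and the column layout may differ from what is shown.

# Optional: inspect the calibration data we just downloaded.
import pandas as pd

df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)
print(df.head())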

Once it's done, we can leverage the convert.py script provided by the ExLlamaV2 library. We are mostly concerned with four arguments:

  • -i: Path of the base model to convert in HF format (FP16).
  • -o: Path of the working directory with temporary files and final output.
  • -c: Path of the calibration dataset (in Parquet format).
  • -b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store weights in 4-bit precision.

The complete list of arguments is available on this page. Let's start the quantization process using the convert.py script with the following arguments:

mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0

Note that you will need a GPU to quantize this model. The official documentation specifies that you need approximately 8 GB of VRAM for a 7B model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
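Before starting a multi-hour conversion job, it can be worth confirming which GPU you actually have. A quick check with PyTorch (already required by ExLlamaV2) could look like this:

# Quick check of the available GPU and its VRAM before quantizing.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device detected: quantization will not run on CPU.")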

Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. You can find more details about the GPTQ algorithm in this article.

So why are we using the "EXL2" format instead of the regular GPTQ format? EXL2 comes with a few new features:

  • It supports different levels of quantization: it is not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
  • It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.

ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5, for example.
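The selection step can be pictured as a small optimization problem. The toy sketch below is not ExLlamaV2's actual solver, and the layer names, candidate options, and error values are invented for illustration; it simply upgrades layers to higher precision wherever that buys the largest error reduction per extra bit, while staying under the target average bpw.

# Toy illustration of the idea behind EXL2 bit allocation (not the real solver).
# Hypothetical per-layer candidates: (bits_per_weight, measured_error).
layers = {
    "q_proj": [(2.19, 0.0112), (3.22, 0.0031), (4.15, 0.0008)],
    "k_proj": [(2.19, 0.0154), (3.22, 0.0046), (4.15, 0.0011)],
    "v_proj": [(2.19, 0.0203), (3.22, 0.0060), (4.15, 0.0013)],
}
target_bpw = 3.0

# Start from the cheapest option everywhere, then greedily upgrade the layer
# that gives the biggest error drop per extra bit until the budget is spent.
choice = {name: 0 for name in layers}

def avg_bpw():
    return sum(layers[n][choice[n]][0] for n in layers) / len(layers)

while True:
    best, best_gain = None, 0.0
    for name, opts in layers.items():
        i = choice[name]
        if i + 1 < len(opts):
            extra_bits = opts[i + 1][0] - opts[i][0]
            err_drop = opts[i][1] - opts[i + 1][1]
            gain = err_drop / extra_bits
            # Only consider upgrades that keep the average within the target.
            if avg_bpw() + extra_bits / len(layers) <= target_bpw and gain > best_gain:
                best, best_gain = name, gain
    if best is None:
        break
    choice[best] += 1

for name in layers:
    bpw, err = layers[name][choice[name]]
    print(f"{name}: {bpw} bpw, err={err}")
print(f"average bpw: {avg_bpw():.2f}")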

The benchmark of different parameters it creates is saved in the measurement.json file. The following JSON shows the measurement for one layer:

"key": "mannequin.layers.0.self_attn.q_proj",
"numel": 16777216,
"choices": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},

In this trial, ExLlamaV2 used 5% 3-bit and 95% 2-bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error that is taken into account to select the best parameters.
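Those numbers can be checked quickly: the 3-bit/2-bit mixture alone accounts for 2.05 bpw, the 4-bit scale per group of 32 weights adds 0.125 bpw, and the small remaining gap up to 2.188 presumably comes from additional quantization metadata counted in total_bits.

# Rough sanity check of the measurement above (values copied from the JSON).
numel = 16777216
total_bits = 36706304

weight_bits = 0.05 * 3 + 0.95 * 2  # 2.05 bpw from the 3-bit/2-bit mix
scale_bits = 4 / 32                # 4-bit scale per group of 32 weights
print(weight_bits + scale_bits)    # ~2.175, close to the reported value
print(total_bits / numel)          # 2.1878662109375, the exact reported bpw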

Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden (.*) or a safetensors file. Additionally, we do not need the out_tensor directory that was created by ExLlamaV2 during quantization.

In bash, you can implement this as follows:

!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/

Our EXL2 model is ready and we have several options to run it. The most straightforward method consists of using the test_inference.py script in the ExLlamaV2 repo (note that I don't use a chat template here):

python exllamav2/test_inference.py -m quant/ -p "I have a dream"

Generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between different solutions in this excellent article from oobabooga.

In my case, the LLM returned the following output:

 -- Model: quant/
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!

Absolutely! Here's your updated speech:

Dear fellow citizens,

Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors

 -- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)

Alternatively, you can use a chat version with the chatcode.py script for more flexibility:

python exllamav2/examples/chatcode.py -m quant -mode llama

If you're planning to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga's text generation web UI. Note that it requires FlashAttention 2 to work properly, which currently requires CUDA 12.1 on Windows (something you can configure during the installation process).
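You can also drive ExLlamaV2 directly from Python rather than through the bundled scripts. The sketch below follows the example code shipped with the repo at the time of writing; treat the exact class names and arguments as assumptions, since the API evolves quickly.

# Minimal generation sketch with the exllamav2 Python API (API may have changed).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "quant"   # directory produced by convert.py
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # split the weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

print(generator.generate_simple("I have a dream", settings, 128))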

Now that we have tested the model, we are ready to upload it to the Hugging Face Hub. You can change the name of your repo in the following code snippet and simply run it.

from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()
api = HfApi()
api.create_repo(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    repo_type="model"
)
api.upload_folder(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    folder_path="quant",
)

Great, the model can now be found on the Hugging Face Hub. The code in the notebook is quite general and should allow you to quantize different models, using different values of bpw. This is ideal for creating models dedicated to your hardware.

In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them, since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.

If you're interested in more technical content around LLMs, follow me on Medium.


