The Python implementation is quite straightforward:

```python
import torch

def zeropoint_quantize(X):
    # Calculate value range (denominator)
    x_range = torch.max(X) - torch.min(X)
    x_range = 1 if x_range == 0 else x_range

    # Calculate scale
    scale = 255 / x_range

    # Shift by zero-point
    zeropoint = (-scale * torch.min(X) - 128).round()

    # Scale and round the inputs
    X_quant = torch.clip((X * scale + zeropoint).round(), -128, 127)

    # Dequantize
    X_dequant = (X_quant - zeropoint) / scale

    return X_quant.to(torch.int8), X_dequant
```
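To make the arithmetic concrete, here is a minimal sketch that mirrors the function above on a plain Python list instead of a tensor, so it runs without PyTorch. The helper name `zeropoint_quantize_list` and the four sample values are assumptions made for illustration; they are not part of the original code.

```python
def zeropoint_quantize_list(values):
    """Plain-Python mirror of zeropoint_quantize for a flat list of floats."""
    lo, hi = min(values), max(values)
    x_range = (hi - lo) or 1                 # avoid dividing by zero
    scale = 255 / x_range                    # 255 integer steps cover the range
    zeropoint = round(-scale * lo - 128)     # maps the minimum to about -128
    quantized = [max(-128, min(127, round(v * scale + zeropoint))) for v in values]
    dequantized = [(q - zeropoint) / scale for q in quantized]
    return quantized, dequantized, scale, zeropoint

# Illustrative input: range is 4, so scale = 63.75 and zeropoint = -64
sample = [-1.0, 0.0, 0.5, 3.0]
q, dq, scale, zeropoint = zeropoint_quantize_list(sample)
print(q)          # [-128, -64, -32, 127]
print(zeropoint)  # -64
print(dq)         # close to the inputs, but not identical
```

The minimum lands on -128 and the maximum on 127, so the full INT8 range is used even though the input is not centered on zero.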

Instead of relying on complete toy examples, we can use these two functions on a real model thanks to the `transformers` library.

We start by loading the model and tokenizer for GPT-2. This is a very small model we probably don't want to quantize, but it will be good enough for this tutorial. First, we want to observe the model's size so we can compare it later and evaluate the **memory savings** due to 8-bit quantization.

```
!pip install -q "bitsandbytes>=0.39.0"
!pip install -q git+https://github.com/huggingface/accelerate.git
!pip install -q git+https://github.com/huggingface/transformers.git
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

# Set device to CPU for now
device = 'cpu'

# Load model and tokenizer
model_id = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Print model size
print(f"Model size: {model.get_memory_footprint():,} bytes")
```

```
Model size: 510,342,192 bytes
```

The size of the GPT-2 model is approximately 487MB in FP32. The next step consists of quantizing the weights using zero-point and absmax quantization. In the following example, we apply these techniques to the first attention layer of GPT-2 to see the results.
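As a quick check of that figure, the sketch below converts the printed footprint to mebibytes and estimates what an INT8 copy would occupy. The estimate rests on a simplifying assumption that every stored value shrinks from 4 bytes to 1 byte; a real quantized model may keep some tensors in higher precision.

```python
fp32_bytes = 510_342_192               # footprint reported for the FP32 model
fp32_mib = fp32_bytes / 1024**2        # bytes -> MiB

# Assumption: every 4-byte FP32 value is replaced by a 1-byte INT8 value
int8_bytes_estimate = fp32_bytes // 4
int8_mib = int8_bytes_estimate / 1024**2

print(f"FP32: {fp32_mib:.1f} MiB")
print(f"INT8 estimate: {int8_mib:.1f} MiB")
```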

```python
# Extract weights of the first layer
weights = model.transformer.h[0].attn.c_attn.weight.data
print("Original weights:")
print(weights)

# Quantize layer using absmax quantization
weights_abs_quant, _ = absmax_quantize(weights)
print("\nAbsmax quantized weights:")
print(weights_abs_quant)

# Quantize layer using zero-point quantization
weights_zp_quant, _ = zeropoint_quantize(weights)
print("\nZero-point quantized weights:")
print(weights_zp_quant)
```

```
Original weights:
tensor([[-0.4738, -0.2614, -0.0978,  ...,  0.0513, -0.0584,  0.0250],
        [ 0.0874,  0.1473,  0.2387,  ..., -0.0525, -0.0113, -0.0156],
        [ 0.0039,  0.0695,  0.3668,  ...,  0.1143,  0.0363, -0.0318],
        ...,
        [-0.2592, -0.0164,  0.1991,  ...,  0.0095, -0.0516,  0.0319],
        [ 0.1517,  0.2170,  0.1043,  ...,  0.0293, -0.0429, -0.0475],
        [-0.4100, -0.1924, -0.2400,  ..., -0.0046,  0.0070,  0.0198]])

Absmax quantized weights:
tensor([[-21, -12,  -4,  ...,   2,  -3,   1],
        [  4,   7,  11,  ...,  -2,  -1,  -1],
        [  0,   3,  16,  ...,   5,   2,  -1],
        ...,
        [-12,  -1,   9,  ...,   0,  -2,   1],
        [  7,  10,   5,  ...,   1,  -2,  -2],
        [-18,  -9, -11,  ...,   0,   0,   1]], dtype=torch.int8)

Zero-point quantized weights:
tensor([[-20, -11,  -3,  ...,   3,  -2,   2],
        [  5,   8,  12,  ...,  -1,   0,   0],
        [  1,   4,  18,  ...,   6,   3,   0],
        ...,
        [-11,   0,  10,  ...,   1,  -1,   2],
        [  8,  11,   6,  ...,   2,  -1,  -1],
        [-18,  -8, -10,  ...,   1,   1,   2]], dtype=torch.int8)
```

The difference between the original (FP32) and quantized values (INT8) is clear, but the difference between absmax and zero-point weights is more subtle. In this case, the inputs look shifted by a value of -1. This suggests that the weight distribution in this layer is quite symmetric.
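The shift can be checked directly on the values that are visible in the printed tensors. The sketch below copies only those 36 corner entries (the elided `...` parts are not reconstructed) and counts the differences between the absmax and zero-point results, so it is a spot check rather than a statistic over the whole layer.

```python
from collections import Counter

# Visible corner values copied from the printed absmax tensor
absmax_vals = [
    [-21, -12, -4, 2, -3, 1],
    [4, 7, 11, -2, -1, -1],
    [0, 3, 16, 5, 2, -1],
    [-12, -1, 9, 0, -2, 1],
    [7, 10, 5, 1, -2, -2],
    [-18, -9, -11, 0, 0, 1],
]
# Visible corner values copied from the printed zero-point tensor
zeropoint_vals = [
    [-20, -11, -3, 3, -2, 2],
    [5, 8, 12, -1, 0, 0],
    [1, 4, 18, 6, 3, 0],
    [-11, 0, 10, 1, -1, 2],
    [8, 11, 6, 2, -1, -1],
    [-18, -8, -10, 1, 1, 2],
]

# Difference (absmax - zero-point) for every visible entry
shifts = Counter(
    a - z
    for row_a, row_z in zip(absmax_vals, zeropoint_vals)
    for a, z in zip(row_a, row_z)
)
print(shifts)
```

Nearly all visible entries differ by exactly -1, with a couple of exceptions caused by the slightly different scales of the two schemes.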

We can compare these techniques by quantizing every layer in GPT-2 (linear layers, attention layers, etc.) and create two new models: `model_abs` and `model_zp`. To be precise, we will actually replace the original weights with **de**-quantized ones. This has two benefits: it allows us to 1/ compare the distribution of our weights (same scale) and 2/ actually run the models.

Indeed, PyTorch doesn't allow INT8 matrix multiplication by default. In a real scenario, we would dequantize them to run the model (in FP16 for example) but store them as INT8. In the next section, we will use the `bitsandbytes` library to solve this issue.
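The "store as INT8, dequantize to compute" idea can be sketched without any framework. The example below is only an illustration: it assumes absmax quantization uses the scale 127 / max|x|, the helper names are hypothetical, and Python's `array` module stands in for real tensor storage.

```python
from array import array

def absmax_quantize_list(values):
    """Absmax-quantize a list of floats; returns INT8 storage and the scale."""
    peak = max(abs(v) for v in values)
    scale = 127 / peak if peak > 0 else 1.0
    # array('b') holds signed 8-bit integers, one byte per value
    return array('b', [round(v * scale) for v in values]), scale

def dequantize(stored, scale):
    """Recover approximate floats right before they are needed for computation."""
    return [q / scale for q in stored]

layer_weights = [0.12, -0.5, 0.33, 0.07]       # illustrative values
stored, scale = absmax_quantize_list(layer_weights)
restored = dequantize(stored, scale)

print(list(stored))                            # integers kept in memory
print(stored.itemsize)                         # bytes per stored value
print(restored)                                # floats used for the matmul
```

Only `stored` and `scale` need to be kept between uses; `restored` is a temporary that can be discarded after each computation.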

```python
import numpy as np
from copy import deepcopy

# Store original weights
weights = [param.data.clone() for param in model.parameters()]

# Create model to quantize
model_abs = deepcopy(model)

# Quantize all model weights
weights_abs = []
for param in model_abs.parameters():
    _, dequantized = absmax_quantize(param.data)
    param.data = dequantized
    weights_abs.append(dequantized)

# Create model to quantize
model_zp = deepcopy(model)

# Quantize all model weights
weights_zp = []
for param in model_zp.parameters():
    _, dequantized = zeropoint_quantize(param.data)
    param.data = dequantized
    weights_zp.append(dequantized)
```

Now that our models have been quantized, we want to check the impact of this process. Intuitively, we want to make sure that the quantized weights are **close to the original ones**. A visual way to check it is to plot the distribution of the dequantized and original weights. If the quantization is lossy, it would drastically change the weight distribution.

The following figure shows this comparison, where the blue histogram represents the original (FP32) weights, and the purple one represents the dequantized (from INT8) weights. Note that we only display this plot between -2 and 2 because of outliers with very high absolute values (more on that later).

Both plots are quite similar, with a surprising spike around 0. This spike shows that our quantization is quite lossy since reversing the process doesn't output the original values. This is particularly true for the absmax model, which displays both a lower valley and a higher spike around 0.
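One way to see where such a spike can come from is to quantize synthetic data. The sketch below does not use GPT-2 weights: it assumes a hypothetical layer made of small Gaussian values plus a single large outlier, and assumes absmax quantization with scale 127 / max|x|. Every value that rounds to the integer 0 comes back as exactly 0.0 after dequantization.

```python
import random

random.seed(0)
# Hypothetical layer: 10,000 small Gaussian weights plus one large outlier
weights = [random.gauss(0.0, 0.1) for _ in range(10_000)] + [10.0]

# Absmax scale, dominated by the outlier
scale = 127 / max(abs(w) for w in weights)

# Quantize, then dequantize
dequantized = [round(w * scale) / scale for w in weights]

zeros_before = sum(w == 0 for w in weights)
zeros_after = sum(d == 0 for d in dequantized)
print(f"exact zeros before: {zeros_before}, after: {zeros_after}")
```

With these assumed numbers, any weight smaller in magnitude than half a quantization step (0.5 / scale, about 0.04) collapses to zero, which piles up a large share of the values in a single histogram bin.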

Let's compare the performance of the original and quantized models. For this purpose, we define a `generate_text()` function that uses top-k sampling to generate text up to 50 tokens long (prompt included).

```python
def generate_text(model, input_text, max_length=50):
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    output = model.generate(inputs=input_ids,
                            max_length=max_length,
                            do_sample=True,
                            top_k=30,
                            pad_token_id=tokenizer.eos_token_id,
                            attention_mask=input_ids.new_ones(input_ids.shape))
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate text with original and quantized models
original_text = generate_text(model, "I have a dream")
absmax_text   = generate_text(model_abs, "I have a dream")
zp_text       = generate_text(model_zp, "I have a dream")

print(f"Original model:\n{original_text}")
print("-" * 50)
print(f"Absmax model:\n{absmax_text}")
print("-" * 50)
print(f"Zeropoint model:\n{zp_text}")
```

```
Original model:
I have a dream, and it's a dream I consider I might get to dwell in my future. I like my mom, and there was that one time I had been informed that my household wasn't even that robust. After which I obtained the
--------------------------------------------------
Absmax model:
I have a dream to search out out the origin of her hair. She loves it. However there is not any manner you may be trustworthy about how her hair is made. She have to be loopy.We discovered a photograph of the coiffure posted on
--------------------------------------------------
Zeropoint model:
I have a dream of making two full-time jobs in America—one for folks with psychological well being points, and one for individuals who don't undergo from psychological sickness—or not less than have an employment and household historical past of substance abuse, to work half
```

Instead of trying to see if one output makes more sense than the others, we can quantify it by calculating the **perplexity** of each output. This is a common metric used to evaluate language models, which measures the uncertainty of a model in predicting the next token in a sequence. In this comparison, we make the common assumption that the lower the score, the better the model is. In practice, a sentence with a high perplexity could also be correct.
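Numerically, perplexity is the exponential of the average negative log-likelihood that the model assigns to the observed tokens. The toy sketch below uses made-up token probabilities (an assumption for illustration, not model output) to show the computation.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities assigned to three consecutive tokens
confident = perplexity([0.5, 0.25, 0.125])   # geometric mean of 1/p, about 4
uniform = perplexity([0.1] * 5)              # like guessing among 10 options, about 10

print(confident, uniform)
```

A perplexity of about 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 tokens at each step.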

We implement the perplexity computation using a minimal function: since our sentences are short, it doesn't need to consider details like the length of the context window.

```python
def calculate_perplexity(model, text):
    # Encode the text
    encodings = tokenizer(text, return_tensors='pt').to(device)

    # Define input_ids and target_ids
    input_ids = encodings.input_ids
    target_ids = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

    # Loss calculation
    neg_log_likelihood = outputs.loss

    # Perplexity calculation
    ppl = torch.exp(neg_log_likelihood)

    return ppl

ppl     = calculate_perplexity(model, original_text)
ppl_abs = calculate_perplexity(model_abs, absmax_text)
ppl_zp  = calculate_perplexity(model_zp, zp_text)

print(f"Original perplexity:  {ppl.item():.2f}")
print(f"Absmax perplexity:    {ppl_abs.item():.2f}")
print(f"Zeropoint perplexity: {ppl_zp.item():.2f}")
```

```
Original perplexity:  15.53
Absmax perplexity:    17.92
Zeropoint perplexity: 17.97
```

We see that the perplexity of the original model is **slightly lower** than the two others. A single experiment is not very reliable, but we could repeat this process multiple times to see the difference between each model. In theory, zero-point quantization should be slightly better than absmax, but is also more costly to compute.

In this example, we applied quantization techniques to entire layers (per-tensor basis). However, we could apply it at different granularity levels: from the entire model to individual values. Quantizing the entire model in one pass would seriously degrade the performance, while quantizing individual values would create a big overhead. In practice, we often prefer the **vector-wise quantization**, which considers the variability of values in rows and columns inside the same tensor.
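The effect of granularity shows up even on a tiny matrix. The sketch below is illustrative only: it assumes absmax quantization (scale 127 / max|x|), uses a made-up 2x3 matrix whose rows have very different magnitudes, and compares one scale for the whole tensor against one scale per row.

```python
def absmax_dequant(values, scale):
    """Quantize to the INT8 grid with `scale`, then map back to floats."""
    return [round(v * scale) / scale for v in values]

W = [[0.01, -0.02, 0.03],      # row with small values
     [10.0, -20.0, 30.0]]      # row with large values

# Per-tensor: a single scale shared by every entry
tensor_scale = 127 / max(abs(v) for row in W for v in row)
per_tensor = [absmax_dequant(row, tensor_scale) for row in W]

# Vector-wise (here row-wise): one scale per row
per_row = [absmax_dequant(row, 127 / max(abs(v) for v in row)) for row in W]

print(per_tensor[0])   # the small row is wiped out by the shared scale
print(per_row[0])      # the small row survives with its own scale
```

With a shared scale, the large row dictates the step size and the small row rounds to zero; with per-row scales, each row keeps roughly the same relative precision at the cost of storing one extra scale per row.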

However, even vector-wise quantization doesn't solve the problem of outlier features. Outlier features are extreme values (negative or positive) that appear in all transformer layers when the model reaches a certain scale (>6.7B parameters). This is an issue since a single outlier can reduce the precision for all other values. But discarding these outlier features is not an option since it would **greatly degrade** the model's performance.

Introduced by Dettmers et al. (2022), LLM.int8() is a solution to the outlier problem. It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. This means that outlier features are processed in a FP16 format to retain their precision, while the other values are processed in an INT8 format. As outliers represent about 0.1% of values, this effectively reduces the memory footprint of the LLM by almost 2x.
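The "almost 2x" figure follows from simple arithmetic under a simplified accounting, sketched below. It assumes the baseline is an FP16 model (2 bytes per value), that 0.1% of values stay in FP16 and that the rest take 1 byte each; scales and other bookkeeping are ignored.

```python
outlier_fraction = 0.001           # about 0.1% of values kept in FP16
bytes_fp16, bytes_int8 = 2, 1

# Average storage per value in the mixed-precision scheme
avg_bytes = outlier_fraction * bytes_fp16 + (1 - outlier_fraction) * bytes_int8

# Reduction relative to storing everything in FP16
reduction = bytes_fp16 / avg_bytes
print(f"{avg_bytes:.3f} bytes per value, {reduction:.3f}x smaller than FP16")
```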

LLM.int8() works by conducting matrix multiplication computation in three key steps:

- Extract columns from the input hidden states **X** containing outlier features using a custom threshold.
- Perform the matrix multiplication of the outliers using FP16 and the non-outliers using INT8 with vector-wise quantization (row-wise for the hidden state **X** and column-wise for the weight matrix **W**).
- Dequantize the non-outlier results (INT8 to FP16) and add them to the outlier results to get the full result in FP16.
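The three steps can be sketched in plain Python on small lists of rows. This is a toy illustration under stated assumptions, not the `bitsandbytes` implementation: the function names are hypothetical, the threshold of 6.0 is an illustrative choice, ordinary Python floats stand in for FP16, and Python integers stand in for INT8 values.

```python
def absmax_scale(values):
    """Scale mapping the largest magnitude in `values` to 127 (1.0 if empty or all zero)."""
    peak = max((abs(v) for v in values), default=0.0)
    return 127 / peak if peak > 0 else 1.0

def matmul(A, B, n_cols):
    """Product of two matrices given as lists of rows; n_cols is the output width."""
    cols = list(zip(*B)) if B else [()] * n_cols
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def mixed_precision_matmul(X, W, threshold=6.0):
    n_out = len(W[0])
    # Step 1: find the columns of X that contain at least one outlier
    outlier = [k for k in range(len(W)) if any(abs(row[k]) >= threshold for row in X)]
    regular = [k for k in range(len(W)) if k not in outlier]

    # Step 2a: outlier columns stay in floating point (stand-in for FP16)
    hi = matmul([[row[k] for k in outlier] for row in X], [W[k] for k in outlier], n_out)

    # Step 2b: remaining columns are quantized with vector-wise absmax scales
    X_reg = [[row[k] for k in regular] for row in X]
    W_reg = [W[k] for k in regular]
    sx = [absmax_scale(row) for row in X_reg]              # one scale per row of X
    w_cols = list(zip(*W_reg)) if W_reg else [()] * n_out
    sw = [absmax_scale(col) for col in w_cols]             # one scale per column of W
    Xq = [[round(v * s) for v in row] for row, s in zip(X_reg, sx)]
    Wq = [[round(v * s) for v, s in zip(row, sw)] for row in W_reg]
    acc = matmul(Xq, Wq, n_out)                            # integer accumulation

    # Step 3: dequantize the integer result and add the outlier result
    return [[acc[i][j] / (sx[i] * sw[j]) + hi[i][j] for j in range(n_out)]
            for i in range(len(X))]

# Illustrative inputs: the third column of X holds an outlier (8.0)
X = [[0.5, -1.0, 8.0], [0.25, 0.75, -0.5]]
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
approx = mixed_precision_matmul(X, W)
exact = matmul(X, W, 2)
print(approx)
print(exact)
```

Because the outlier column bypasses quantization, it does not inflate the row scale of **X**, and the remaining values keep a fine quantization step.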