## Define and train GPT-2 on your MacBook

My goal with this post is to walk you through defining and training GPT-2 from scratch with MLX, Apple’s machine-learning library for Apple silicon. I want to leave no stone unturned from tokenizer to sampling. In the spirit of Karpathy’s excellent GPT from scratch tutorial, we’ll train a model on the works of Shakespeare [1]. We will start with a blank Python file and end with a piece of software that can write Shakespeare-like text. And we’ll build it all in MLX, which makes training and inference on Apple silicon much faster.

This post is best experienced by following along. The code is contained in the following repo, which I suggest opening and referencing.

Install mlx and run the following imports.

    import mlx.core as mx
    import mlx.nn as nn
    import mlx.optimizers as optim
    import mlx.utils as utils
    import numpy as np
    import math

The first step to training an LLM is collecting a large corpus of text data and then tokenizing it. Tokenization is the process of mapping text to integers, which can be fed into the LLM. Our training corpus for this model will be the works of Shakespeare concatenated into one file. This is roughly 1 million characters and looks like this:

    First Citizen:
    Before we proceed any further, hear me speak.

    All:
    Speak, speak.

    First Citizen:
    You are all resolved rather to die than to famish?

    All:
    Resolved. resolved.

    First Citizen:
    First, Caius Marcius is chief enemy to the people.
    ...

First, we read the file as a single long string into the `text` variable. Then we use the `set()` function to get all the unique characters in the text, which will be our vocabulary. By printing `vocab` you can see all the characters in our vocabulary as one string, and we have a total of 65 characters which will be our tokens.

    # Creating the vocabulary
    with open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()
    vocab = sorted(list(set(text)))
    vocab_size = len(vocab)

    print(''.join(vocab))
    # !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
    print(vocab_size)
    # 65

Production models will use tokenization algorithms like byte-pair encoding to generate a larger vocabulary of sub-word chunks. Since our focus today is on the architecture, we will continue with character-level tokenization. Next, we will map our vocabulary to integers known as token IDs. Then we can encode our text into tokens and decode them back to a string.

    # Create mapping from vocab to integers
    itos = {i:c for i,c in enumerate(vocab)} # int to string
    stoi = {c:i for i,c in enumerate(vocab)} # string to int
    encode = lambda x: [stoi[c] for c in x]
    decode = lambda x: ''.join([itos[i] for i in x])

    print(encode("hello world"))
    # [46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
    print(decode(encode("hello world")))
    # hello world

We use the `enumerate()` function to iterate over all characters and their index in the vocabulary and create a dictionary `itos` which maps integers to characters and `stoi` which maps strings to integers. Then we use these mappings to create our encode and decode functions. Now we can encode the entire text and split training and validation data.

    data = encode(text)
    split = int(0.9 * len(data))
    train_data = data[:split]
    val_data = data[split:]
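
To try the tokenizer and the split without downloading the corpus, the same steps can be run on a toy string. This is only a sketch: the string `toy_text` and every `toy_` name are made up for illustration and are not part of the tutorial code.

```python
# Toy corpus standing in for the Shakespeare file (an assumption for illustration).
toy_text = "to be or not to be"

# Vocabulary: the sorted set of unique characters.
toy_vocab = sorted(set(toy_text))

# Lookup tables in both directions.
toy_stoi = {c: i for i, c in enumerate(toy_vocab)}
toy_itos = {i: c for i, c in enumerate(toy_vocab)}

def toy_encode(s):
    """Map a string to a list of token IDs."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of token IDs back to a string."""
    return "".join(toy_itos[i] for i in ids)

toy_data = toy_encode(toy_text)

# 90/10 split, the same rule the tutorial applies to the full corpus.
toy_split = int(0.9 * len(toy_data))
toy_train, toy_val = toy_data[:toy_split], toy_data[toy_split:]

print(toy_vocab)
print(toy_decode(toy_data) == toy_text)
print(len(toy_train), len(toy_val))
```

Decoding the encoded text should give back the original string, which is a quick sanity check worth running on the real corpus too.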

Currently, our training data is just a very long string of tokens. However, we are trying to train our model to predict the next token given some previous tokens. Therefore our dataset should be comprised of examples where the input is some string of tokens and the label is the correct next token. We need to define a model parameter called *context length*, which is the maximum number of tokens used to predict the next token. Our training examples will be the length of our context length.

Let’s look at the first `ctx_len+1` tokens.

    ctx_len = 8
    print(train_data[:ctx_len + 1])
    # [18, 47, 56, 57, 58, 1, 15, 47, 58]
    # x: [18, 47, 56, 57, 58, 1, 15, 47] | y: 58

This is one training example where the input is “18, 47, 56, 57, 58, 1, 15, 47” and the desired output is “58”. This is 8 tokens of context. However, we also want to train the model to predict the next token given only 7, 6, 5 … 1 tokens as context, which is needed during generation. Therefore we also consider the 8 sub-examples packed into this example:

    ctx_len = 8
    print(train_data[:ctx_len + 1])
    # [18, 47, 56, 57, 58, 1, 15, 47, 58]
    # 8 sub examples
    # [18] --> 47
    # [18, 47] --> 56
    # [18, 47, 56] --> 57
    # [18, 47, 56, 57] --> 58
    # [18, 47, 56, 57, 58] --> 1
    # [18, 47, 56, 57, 58, 1] --> 15
    # [18, 47, 56, 57, 58, 1, 15] --> 47
    # [18, 47, 56, 57, 58, 1, 15, 47] --> 58

Notice that the labels are simply the inputs shifted left.

    print("inputs: ", train_data[:ctx_len])
    print("labels: ", train_data[1:ctx_len+1]) # labels = inputs indexed 1 higher
    # inputs: [18, 47, 56, 57, 58, 1, 15, 47]
    # labels: [47, 56, 57, 58, 1, 15, 47, 58]

At index 0 the input is 18 and the label is 47. At index 1 the input is everything before and including index 1, which is [18, 47], and the label is 56, etc. Now that we understand that the labels are simply the input sequence indexed one higher, we can build our datasets.

    # Creating training and validation datasets
    ctx_len = 8
    X_train = mx.array([train_data[i:i+ctx_len] for i in range(0, len(train_data) - ctx_len, ctx_len)])
    y_train = mx.array([train_data[i+1:i+ctx_len+1] for i in range(0, len(train_data) - ctx_len, ctx_len)])
    X_val = mx.array([val_data[i:i+ctx_len] for i in range(0, len(val_data) - ctx_len, ctx_len)])
    y_val = mx.array([val_data[i+1:i+ctx_len+1] for i in range(0, len(val_data) - ctx_len, ctx_len)])

We loop through the data and take chunks of size `ctx_len` as the inputs (X) and then take the same chunks but at 1 higher index as the labels (y). Then we take these Python lists and create mlx array objects from them. The model internals will be written with mlx, so we want our inputs to be mlx arrays.
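
The chunking logic can be checked in isolation with plain Python lists. The sketch below uses a hypothetical helper `make_examples` and made-up token IDs (`range(20)`), so that the shift between inputs and labels is easy to read off.

```python
def make_examples(tokens, ctx_len):
    """Cut a token list into non-overlapping (input, label) windows."""
    xs, ys = [], []
    for i in range(0, len(tokens) - ctx_len, ctx_len):
        xs.append(tokens[i:i + ctx_len])          # ctx_len input tokens
        ys.append(tokens[i + 1:i + ctx_len + 1])  # same window, one index higher
    return xs, ys

demo_tokens = list(range(20))  # made-up token IDs 0..19
demo_X, demo_y = make_examples(demo_tokens, ctx_len=8)

for x_row, y_row in zip(demo_X, demo_y):
    print(x_row, "->", y_row)
```

Because the windows do not overlap, 20 tokens with a context length of 8 give two examples, and the trailing tokens that do not fill a whole window are dropped.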

One more thing. During training we don’t want to feed the model one example at a time, we want to feed it multiple examples in parallel for efficiency. This group of examples is called our batch, and the number of examples in a group is our batch size. Thus we define a function to generate batches for training.

    def get_batches(X, y, b_size, shuffle=True):
        if shuffle:
            ix = np.arange(X.shape[0])
            np.random.shuffle(ix)
            ix = mx.array(ix)
            X = X[ix]
            y = y[ix]
        for i in range(0, X.shape[0], b_size):
            input = X[i:i+b_size]
            label = y[i:i+b_size]
            yield input, label

If shuffle=True, we shuffle the data by indexing it with a randomly shuffled index. Then we loop through our dataset and return batch-size chunks from the input and label datasets. These chunks are known as mini-batches and are just stacked examples that we process in parallel. These mini-batches will be our input to the model during training.

Here’s an example of a minibatch of 4 examples with context length 8.

This minibatch packs 32 next-token prediction problems. The model will predict the next token for each token in the input and the labels will be used to calculate the loss. Notice that the labels contain the next token for each index of the inputs.

You’ll want to keep this picture in your mind because the shapes of these tensors will get hairy. For now, just remember that we will input a tensor of shape (batch_size, ctx_len) to the model.
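
To make the minibatch of 4 examples with context length 8 concrete, here is a list-based sketch of the same batching idea. `toy_batches` is a hypothetical stand-in for `get_batches` that uses Python lists and the standard `random` module instead of mlx and NumPy, and the token IDs are made up.

```python
import random

def toy_batches(X, y, b_size, shuffle=True, seed=0):
    """Yield (inputs, labels) mini-batches from parallel lists of examples."""
    order = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(order)  # shuffle example indices, not tokens
    for i in range(0, len(order), b_size):
        idx = order[i:i + b_size]
        yield [X[j] for j in idx], [y[j] for j in idx]

toy_ctx_len = 8
toy_tokens = list(range(100))  # made-up token IDs
toy_X = [toy_tokens[i:i + toy_ctx_len]
         for i in range(0, len(toy_tokens) - toy_ctx_len, toy_ctx_len)]
toy_y = [toy_tokens[i + 1:i + toy_ctx_len + 1]
         for i in range(0, len(toy_tokens) - toy_ctx_len, toy_ctx_len)]

inputs, labels = next(toy_batches(toy_X, toy_y, b_size=4))
n_problems = sum(len(row) for row in inputs)

for x_row, y_row in zip(inputs, labels):
    print(x_row, "|", y_row)
print("prediction problems in this minibatch:", n_problems)
```

Each of the 4 rows holds 8 inputs with 8 matching labels, which is where the count of 32 prediction problems comes from.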

Let’s look at the GPT-2 architecture to get an overview of what we are trying to implement.

Don’t worry if this looks confusing. We will implement it step by step from bottom to top. Let’s start by implementing the input embeddings.

## Input Embeddings

The purpose of the input embedding layer is to map token IDs to vectors. Each token will be mapped to a vector which will be its representation as it is forwarded through the model. The vectors for each token will accumulate and exchange information as they pass through the model and ultimately be used to predict the next token. These vectors are called embeddings.

The simplest way to map token IDs to vectors is through a lookup table. We create a matrix of size (vocab_size, n_emb) where each row is the embedding vector for the corresponding token. This matrix is known as the embedding weights.

The diagram shows an example embedding layer of size (65, 6). This means there are 65 tokens in the vocabulary and each one will be represented by a length 6 embedding vector. The inputted sequence will be used to index the embedding weights to get the vector corresponding to each token. Remember the minibatches we input into the model? Originally the minibatch is size (batch_size, ctx_len). After passing through the embedding layer it is size (batch_size, ctx_len, n_emb). Instead of each token being a single integer, each token is now a vector of length n_emb.
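
The lookup-table idea can be sketched without any library. The snippet below is an illustration only: the random weights, the helper `embed`, and the example batch are assumptions, and a real embedding layer stores its weights as a trainable array rather than nested lists.

```python
import random

demo_vocab_size, demo_n_emb = 65, 6
rng = random.Random(0)

# Embedding weights: one row of n_emb numbers per token ID.
emb_weights = [[rng.gauss(0, 1) for _ in range(demo_n_emb)]
               for _ in range(demo_vocab_size)]

def embed(batch):
    """Replace every token ID in a (batch_size, ctx_len) batch by its weight row."""
    return [[emb_weights[tok] for tok in seq] for seq in batch]

demo_batch = [[18, 47, 56, 57], [58, 1, 15, 47]]  # (batch_size=2, ctx_len=4)
embedded = embed(demo_batch)
embedded_shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(embedded_shape)  # (batch_size, ctx_len, n_emb)
```

Note that token 47 appears in both sequences and receives the same vector each time, because the lookup depends only on the token ID.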

Let’s define the embedding layer in code now.

    n_emb = 6 # You can add these hyperparams at the top of your file

    class GPT(nn.Module):
        def __init__(self):
            super().__init__()
            self.wte = nn.Embedding(vocab_size, n_emb)

We will define a class to organize our implementation. We subclass nn.Module to take advantage of mlx’s features. Then in the init function, we call the superclass constructor and initialize our token embedding layer called `wte`.

## Positional Embeddings

Next up is the positional embeddings. The purpose of positional embeddings is to encode information about the position of each token in the sequence. This can be added to our input embeddings to get a complete representation of each token that contains information about the token’s position in the sequence.

    class GPT(nn.Module):
        def __init__(self):
            super().__init__()
            self.wte = nn.Embedding(vocab_size, n_emb) # token embeddings
            self.wpe = nn.Embedding(ctx_len, n_emb) # position embeddings

The position embeddings work the same as token embeddings, except instead of having a row for each token we have a row for each possible position index. This means our embedding weights will be of shape (ctx_len, n_emb). Now we implement the `__call__` function in our GPT class. This function will contain the forward pass of the model.

    # Tensor shapes commented
    def __call__(self, x):
        B, T = x.shape # (B = batch_size, T = ctx_len)
        tok_emb = self.wte(x) # (B, T, n_emb)
        pos_emb = self.wpe(mx.arange(T)) # (T, n_emb)
        x = tok_emb + pos_emb # (B, T, n_emb)

First, we break out the dimensions of our input into variables B and T for easy handling. In sequence modeling contexts B and T are usually used as shorthand for “batch” and “time” dimensions. In this case, the “time” dimension of our sequence is the context length.

Next, we calculate token and position embeddings. Notice that for the position embeddings, our input is `mx.arange(T)`. This will output an array of consecutive integers from 0 to T-1, which is exactly what we want because those are the positions we want to embed. After passing that through the embedding layer we will have a tensor of shape (T, n_emb) because the embedding layer plucks out the n_emb length vector for each of the T positions. Note that even though pos_emb is not the same shape as tok_emb, we can add the two because mlx will broadcast, or replicate, pos_emb across the batch dimension to allow elementwise addition. Finally, we perform the addition to get the new representations of the tokens with positional information.
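
What broadcasting does in `tok_emb + pos_emb` can be spelled out with explicit loops. The helper `add_positions` and the small integer values below are made up so the result is easy to verify by eye; mlx performs the equivalent replication internally.

```python
def add_positions(tok_emb, pos_emb):
    """Add a (T, n_emb) position table to every sequence of a (B, T, n_emb) batch."""
    return [
        [[t + p for t, p in zip(tok_vec, pos_vec)]
         for tok_vec, pos_vec in zip(seq, pos_emb)]
        for seq in tok_emb  # the same pos_emb is reused for each batch element
    ]

demo_tok_emb = [[[10, 10], [20, 20]],
                [[30, 30], [40, 40]]]   # (B=2, T=2, n_emb=2)
demo_pos_emb = [[1, 2], [3, 4]]         # (T=2, n_emb=2)

combined = add_positions(demo_tok_emb, demo_pos_emb)
print(combined)
```

Position 0 adds `[1, 2]` in both sequences and position 1 adds `[3, 4]` in both, which is exactly the replication across the batch dimension described above.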

## Self-Attention

So far the representation vectors for each token have been calculated independently. They have not had the opportunity to exchange any information. This is intuitively bad in language modeling because the meaning and usage of words depend on the surrounding context. Self-attention is how we incorporate information from previous tokens into a given token.

First, let’s consider a naive approach. What if we simply represented each token as the average of its representation vector and the vectors of all the tokens before it? This achieves our goal of packing information from previous tokens into the representation for a given token. Here’s what it would look like.

But self-attention doesn’t involve writing a for-loop. The key insight is we can achieve this previous token averaging with matrix multiplication!

By multiplying our input sequence on the left by a special matrix we get the desired result. This matrix is known as the attention weights. Notice that each row of the attention weight matrix specifies “how much” of each other token goes into the representation for any given token. For example in row two, we have [0.5, 0.5, 0, 0]. This means that row two of the result will be `0.5*token1 + 0.5*token2 + 0*token3 + 0*token4`, or the average of token1 and token2. Note that the attention weights are a lower-triangular matrix (zeros in upper right entries). This ensures that future tokens will not be included in the representation of a given token, so tokens can only communicate with previous tokens. That matches generation, where the model only has access to previous tokens.
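
The claim that the for-loop average and the matrix product agree can be checked directly. This is a plain-Python sketch with made-up token vectors; `running_average` and `matmul` are hypothetical helpers written for this illustration.

```python
def running_average(tokens):
    """Naive loop: row t is the mean of rows 0..t."""
    out = []
    for t in range(len(tokens)):
        prev = tokens[:t + 1]
        out.append([sum(col) / len(prev) for col in zip(*prev)])
    return out

def matmul(a, b):
    """Plain (n, k) @ (k, m) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # (T=4, n_emb=2)

# Row t holds 1/(t+1) for positions 0..t and 0 for future positions.
n_tok = len(seq)
avg_weights = [[1 / (t + 1) if j <= t else 0.0 for j in range(n_tok)]
               for t in range(n_tok)]

loop_result = running_average(seq)
matrix_result = matmul(avg_weights, seq)
print(avg_weights[1])
print(loop_result)
print(matrix_result)
```

The two results match up to floating-point rounding, and the second row of `avg_weights` is the `[0.5, 0.5, 0, 0]` row discussed above.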

Let’s look at how we can construct the attention weight matrix.

Notice that if we create an array of zeros with -inf in the upper right entries and then perform row-wise softmax, we get the desired attention weights. A good exercise is to step through the softmax calculation for a row to see how this works. The takeaway is that we can take some array of size (ctx_len, ctx_len) and softmax each row to get attention weights that sum to 1.
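
The softmax exercise can also be run as code. The sketch below uses only the standard library; `softmax_row` is a hypothetical helper, and the matrix size of 4 is chosen to match the small examples in this section.

```python
import math

def softmax_row(row):
    """Numerically stable softmax of one row; -inf entries come out as 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

size = 4
neg_inf = float("-inf")

# Zeros on and below the diagonal, -inf above it.
pre_softmax = [[0.0 if j <= i else neg_inf for j in range(size)] for i in range(size)]
uniform_weights = [softmax_row(row) for row in pre_softmax]

for row in uniform_weights:
    print([round(w, 3) for w in row])
```

Since `exp(-inf)` is 0 and `exp(0)` is 1, row i ends up with i+1 equal entries that sum to 1, which reproduces the averaging matrix.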

Now we can leave the realm of naive self-attention. Instead of simply averaging previous tokens, we use arbitrary weighted sums over previous tokens. Notice what happens when we do row-wise softmax of an arbitrary matrix.

We still get weights that sum to 1 on each row. During training, we can learn the numbers in the matrix on the left which will specify how much each token goes into the representation for another token. This is how tokens pay “attention” to each other. But we still haven’t understood where this matrix on the left came from. These pre-softmax attention weights are calculated from the tokens themselves, but indirectly through three linear projections.

**Keys, Queries, and Values**

Each token in our sequence emits 3 new vectors. These vectors are called keys, queries, and values. We use the dot product of the query vector of one token and the key vector of another token to quantify the “affinity” those two tokens have. We want to calculate the pairwise affinities of each token with every other token, therefore we multiply the query matrix (4×3) with the key matrix transposed (3×4) to get the raw attention weights (4×4). Due to the way matrix multiplication works, the (i,j) entry in the raw attention weights will be the query of token i dot the key of token j, or the “affinity” between the two. Thus we have calculated interactions between every pair of tokens. However, we don’t want past tokens interacting with future tokens, so we apply a mask of -inf to the upper right entries to ensure they will zero out after softmax. Then we perform row-wise softmax to get the final attention weights. Instead of multiplying these weights directly with the input, we multiply them with the value projection. This results in the new representations.
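
The whole pipeline from this paragraph fits in a few lines of plain Python. This is a sketch under stated assumptions: the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned weights, the sizes (4 tokens, 6-dimensional embeddings, 3-dimensional keys) are chosen to mirror the 4×3 example, and the scaling factor introduced later in the post is left out.

```python
import math
import random

rng = random.Random(0)
n_tokens, emb_dim, key_dim = 4, 6, 3

def matmul(a, b):
    """Plain (n, k) @ (k, m) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_row(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(n_tokens, emb_dim)   # token representations, (4, 6)
W_q = rand_matrix(emb_dim, key_dim)  # hypothetical projection weights
W_k = rand_matrix(emb_dim, key_dim)
W_v = rand_matrix(emb_dim, key_dim)

Q, K, V = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)  # each (4, 3)

scores = matmul(Q, transpose(K))     # (4, 4): entry (i, j) = query i . key j
masked = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
          for i, row in enumerate(scores)]
attn = [softmax_row(row) for row in masked]  # rows sum to 1, future entries are 0
new_repr = matmul(attn, V)           # (4, 3): weighted sums of value vectors

for row in attn:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its new representation is just its own value vector.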

Now that we understand attention conceptually, let’s implement it.

    class Attention(nn.Module):
        def __init__(self, head_size):
            super().__init__()
            self.head_size = head_size
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)

We start by defining the key, query, and value projection layers. Note that instead of going from n_emb to n_emb, we project from n_emb to head_size. This doesn’t change anything, it just means the new representations calculated by attention will be dimension head_size.

    class Attention(nn.Module):
        def __init__(self, head_size):
            super().__init__()
            self.head_size = head_size
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
        def __call__(self, x): # shapes commented
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)

The forward pass begins by calculating the key, query, and value projections. We also break out the input shape into the variables B, T, and C for future convenience.

    class Attention(nn.Module):
        def __init__(self, head_size):
            super().__init__()
            self.head_size = head_size
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
        def __call__(self, x):
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)
            attn_weights = (Q @ K.transpose([0, 2, 1])) / math.sqrt(self.head_size)
            # attn_weights.shape = (B, T, T)

Next, we calculate the attention weights. We only want to transpose the last two dimensions of the key tensor, because the batch dimension is just there so we can forward multiple training examples in parallel. The mlx transpose function expects the new order of the dimensions as input, so we pass it [0, 2, 1] to transpose the last two dimensions. One more thing: we scale the attention weights by the inverse square root of head_size. This is known as scaled attention, and the purpose is to ensure that when Q and K are unit variance, attn_weights will be unit variance. If the variance of attn_weights is high, then the softmax will map these small and large values to 0 or 1, which results in less complex representations.
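
The variance argument can be checked empirically. The sketch below draws random unit-variance query and key vectors (an assumption standing in for real projections) and compares the variance of their dot products with and without the `1 / sqrt(head_size)` scaling. The names `demo_head_size` and `n_samples` are made up for this illustration.

```python
import math
import random

rng = random.Random(0)
demo_head_size = 64
n_samples = 2000

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

raw_scores, scaled_scores = [], []
for _ in range(n_samples):
    q = [rng.gauss(0, 1) for _ in range(demo_head_size)]  # unit-variance query
    k = [rng.gauss(0, 1) for _ in range(demo_head_size)]  # unit-variance key
    dot = sum(a * b for a, b in zip(q, k))
    raw_scores.append(dot)
    scaled_scores.append(dot / math.sqrt(demo_head_size))

print("variance without scaling:", round(variance(raw_scores), 2))
print("variance with scaling:   ", round(variance(scaled_scores), 2))
```

The unscaled variance should land near head_size and the scaled variance near 1, since a dot product of head_size independent unit-variance terms has variance head_size.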

The next step is to apply the mask to ensure we are doing causal language modeling, i.e. ensuring tokens cannot attend to future tokens.

    class Attention(nn.Module):
        def __init__(self, head_size):
            super().__init__()
            self.head_size = head_size
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
            indices = mx.arange(ctx_len)
            mask = indices[:, None] < indices[None] # broadcasting trick
            self._causal_mask = mask * -1e9
        def __call__(self, x):
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)
            attn_weights = (Q @ K.transpose([0, 2, 1])) / math.sqrt(self.head_size)
            # attn_weights.shape = (B, T, T)

We create the mask with a clever broadcasting trick. Let’s say our ctx_len=4 like in the diagrams above. First, we use mx.arange(4) to set the indices variable to [0, 1, 2, 3].

Then we can index like so `indices[:, None]` to generate a column vector with the values of indices. Similarly, we can get a row vector using `indices[None]`. Then when we do the < comparison, mlx broadcasts the vectors because they have mismatching shapes so they can’t be compared elementwise. Broadcasting means mlx will replicate the vectors along the lacking dimension. This results in an elementwise comparison of two (4, 4) matrices, which makes sense. Side note: I recommend familiarizing yourself with the details of broadcasting by reading this, it comes up all the time when dealing with tensors.

After the elementwise comparison, we’re left with the following tensor:

    [[False, True, True, True],
     [False, False, True, True],
     [False, False, False, True],
     [False, False, False, False]]

Multiplying this tensor by -1e9, we get:

    [[-0e+00, -1e+09, -1e+09, -1e+09],
     [-0e+00, -0e+00, -1e+09, -1e+09],
     [-0e+00, -0e+00, -0e+00, -1e+09],
     [-0e+00, -0e+00, -0e+00, -0e+00]]

Now we have an additive mask. We can add this matrix to our attention weights to make all the upper right entries very large negative numbers. This will cause them to be zeroed out after the softmax operation. Also, note that we add “_” as a prefix to the attribute name `_causal_mask`, which marks it as a private variable. This signals to mlx that it is not a parameter and should not be updated during training.
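
The broadcast comparison can be written out as explicit loops to see where each `True` comes from. This is a plain-Python sketch, not the mlx code: `bool_mask`, `additive_mask`, and the constant score of 2.0 are made up for illustration.

```python
import math

def softmax_row(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

demo_ctx_len = 4
idx = list(range(demo_ctx_len))

# Column index i compared against row index j for every pair:
# entry (i, j) is True when position j lies in the future of position i.
bool_mask = [[i < j for j in idx] for i in idx]

# Multiplying booleans by -1e9 gives the additive mask.
additive_mask = [[-1e9 * m for m in row] for row in bool_mask]

# Add the mask to some made-up raw scores and softmax each row.
raw = [[2.0] * demo_ctx_len for _ in idx]
masked_scores = [[s + m for s, m in zip(s_row, m_row)]
                 for s_row, m_row in zip(raw, additive_mask)]
masked_weights = [softmax_row(row) for row in masked_scores]

for row in masked_weights:
    print(row)
```

A finite value like -1e9 behaves like -inf here because `exp` of such a large negative number underflows to 0.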

    class Attention(nn.Module):
        def __init__(self, head_size):
            super().__init__()
            self.head_size = head_size
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
            indices = mx.arange(ctx_len)
            mask = indices[:, None] < indices[None] # broadcasting trick
            self._causal_mask = mask * -1e9
        def __call__(self, x):
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)
            attn_weights = (Q @ K.transpose([0, 2, 1])) / math.sqrt(self.head_size)
            # attn_weights.shape = (B, T, T)
            attn_weights = attn_weights + self._causal_mask
            attn_weights = mx.softmax(attn_weights, axis=-1)
            o = (attn_weights @ V) # (B, T, head_size)

Now we can softmax row-wise to get the final attention weights and multiply these weights by the values to get our output. Note we pass `axis=-1` to softmax, which specifies that we want to softmax across the last dimension, which are the rows.

The final step is output linear projection and dropout.

    dropout = 0.1 # add this with hyperparams at top of file

    class Attention(nn.Module):
        def __init__(self, head_size):
            super().__init__()
            self.head_size = head_size
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
            indices = mx.arange(ctx_len)
            mask = indices[:, None] < indices[None] # broadcasting trick
            self._causal_mask = mask * -1e9
            self.c_proj = nn.Linear(head_size, n_emb) # output projection
            self.resid_dropout = nn.Dropout(dropout)
        def __call__(self, x):
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)
            attn_weights = (Q @ K.transpose([0, 2, 1])) / math.sqrt(self.head_size)
            # attn_weights.shape = (B, T, T)
            attn_weights = attn_weights + self._causal_mask
            attn_weights = mx.softmax(attn_weights, axis=-1)
            o = (attn_weights @ V) # (B, T, head_size)
            o = self.c_proj(self.resid_dropout(o))
            return o

We add two new layers, `c_proj` and `resid_dropout`, which are the output projection and residual dropout. The output projection is to return the vectors to their original dimension n_emb. The dropout is added for regularization and training stability, which is important as we start layering the transformer blocks to get a deep network. And that’s it for implementing one attention head!

## Multi-Head Attention

Instead of having just one attention head, LLMs often use multiple attention heads in parallel and concatenate their outputs to create the final representation. For example, let’s say we had one attention head with head_size=64 so the vector it produced for each token was 64 dimensional. We could achieve the same thing with 4 parallel attention heads each with head_size=16 by concatenating their outputs to produce a 16×4 = 64 dimensional output. Multi-head attention allows the model to learn more complex representations because each head learns different projections and attention weights.

    n_heads = 4

    class MultiHeadAttention(nn.Module): # naive implementation
        def __init__(self):
            super().__init__()
            self.heads = [Attention(head_size // n_heads) for _ in range(n_heads)]
        def __call__(self, x):
            return mx.concatenate([head(x) for head in self.heads], axis=-1)

The straightforward implementation is to create a list of `n_heads` attention heads where each one has size equal to our final head size divided by n_heads. Then we concatenate the output of each head over the last axis. However, this implementation is inefficient and does not take advantage of the speed of tensors. Let’s implement multi-head attention with the power of tensors.

    head_size = 64 # put at top of file

    class MultiHeadAttention(nn.Module):
        def __init__(self):
            super().__init__()
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
            indices = mx.arange(ctx_len)
            mask = indices[:, None] < indices[None] # broadcasting trick
            self._causal_mask = mask * -1e9
            self.c_proj = nn.Linear(head_size, n_emb) # output projection
            self.resid_dropout = nn.Dropout(dropout)
        def __call__(self, x):
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)

We start with our single-head attention implementation. The `__init__()` function has not changed. The forward pass begins as normal with the creation of the key, query, and value projections.

    head_size = 64 # put at top of file
    n_heads = 8 # put at top of file

    class MultiHeadAttention(nn.Module):
        def __init__(self):
            super().__init__()
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
            indices = mx.arange(ctx_len)
            mask = indices[:, None] < indices[None] # broadcasting trick
            self._causal_mask = mask * -1e9
            self.c_proj = nn.Linear(head_size, n_emb) # output projection
            self.resid_dropout = nn.Dropout(dropout)
        def __call__(self, x):
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)
            mha_shape = (B, T, n_heads, head_size//n_heads)
            K = mx.as_strided(K, (mha_shape)) # (B, T, n_heads, head_size//n_heads)
            Q = mx.as_strided(Q, (mha_shape)) # (B, T, n_heads, head_size//n_heads)
            V = mx.as_strided(V, (mha_shape)) # (B, T, n_heads, head_size//n_heads)

The next thing we need to do is introduce a new dimension for the number of heads `n_heads`. In the naive implementation, we had separate attention objects each with their own key, query, and value tensors, but now we have them all in one tensor, therefore we need a dimension for the heads. We define the new shape we want in `mha_shape`. Then we use `mx.as_strided()` to reshape each tensor to have the head dimension. Used this way, the function works like `view` from PyTorch and tells mlx to treat this array as a different shape. But we still have a problem. Notice that if we try to multiply `Q @ K_t` (where K_t is K transposed over its last 2 dims) to compute attention weights as we did before, we will be multiplying the following shapes:

    (B, T, n_heads, head_size//n_heads) @ (B, T, head_size//n_heads, n_heads)
    Result shape: (B, T, n_heads, n_heads)

This would result in a tensor of shape `(B, T, n_heads, n_heads)`, which is incorrect. With one head our attention weights were shape `(B, T, T)`, which makes sense because it gives us the interaction between each pair of tokens. So now our shape should be the same but with a heads dimension: `(B, n_heads, T, T)`. We achieve this by transposing the dimensions of keys, queries, and values after we reshape them to make `n_heads` dimension 1 instead of 2.
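
The shape bookkeeping can be checked with a tiny helper that applies the batched matmul rule: leading dimensions must match, and the last two dimensions multiply like ordinary matrices. `matmul_shape` is a hypothetical helper written for this sketch, and the sizes are made up.

```python
def matmul_shape(a, b):
    """Shape of a @ b for same-rank arrays whose leading (batch) dimensions match."""
    assert a[:-2] == b[:-2], "batch dimensions must match"
    assert a[-1] == b[-2], "inner dimensions must match"
    return a[:-2] + (a[-2], b[-1])

demo_B, demo_T, demo_heads, demo_hs = 2, 8, 4, 64
d = demo_hs // demo_heads  # size of one head

# Heads left at axis 2: the product pairs heads with heads at each token.
wrong_shape = matmul_shape((demo_B, demo_T, demo_heads, d),
                           (demo_B, demo_T, d, demo_heads))

# Heads moved to axis 1: the product pairs tokens with tokens inside each head.
right_shape = matmul_shape((demo_B, demo_heads, demo_T, d),
                           (demo_B, demo_heads, d, demo_T))

print(wrong_shape)
print(right_shape)
```

Only the second layout yields a (T, T) matrix of token interactions per head.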

    head_size = 64 # put at top of file
    n_heads = 8 # put at top of file

    class MultiHeadAttention(nn.Module):
        def __init__(self):
            super().__init__()
            self.k_proj = nn.Linear(n_emb, head_size, bias=False)
            self.q_proj = nn.Linear(n_emb, head_size, bias=False)
            self.v_proj = nn.Linear(n_emb, head_size, bias=False)
            indices = mx.arange(ctx_len)
            mask = indices[:, None] < indices[None] # broadcasting trick
            self._causal_mask = mask * -1e9
            self.c_proj = nn.Linear(head_size, n_emb) # output projection
            self.attn_dropout = nn.Dropout(dropout)
            self.resid_dropout = nn.Dropout(dropout)
        def __call__(self, x):
            B, T, C = x.shape # (batch_size, ctx_len, n_emb)
            K = self.k_proj(x) # (B, T, head_size)
            Q = self.q_proj(x) # (B, T, head_size)
            V = self.v_proj(x) # (B, T, head_size)
            mha_shape = (B, T, n_heads, head_size//n_heads)
            K = mx.as_strided(K, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)
            Q = mx.as_strided(Q, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)
            V = mx.as_strided(V, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)
            attn_weights = (Q @ K.transpose([0, 1, 3, 2])) / math.sqrt(Q.shape[-1]) # (B, n_heads, T, T)
            attn_weights = attn_weights + self._causal_mask[:T, :T]
            attn_weights = mx.softmax(attn_weights, axis=-1)
            attn_weights = self.attn_dropout(attn_weights)
            o = (attn_weights @ V) # (B, n_heads, T, head_size//n_heads)

Now we can calculate the correct attention weights. Notice that we scale the attention weights by the size of an individual attention head rather than head_size, which would be the size after concatenation. We also apply dropout to the attention weights.

Finally, we perform the concatenation and apply the output projection and dropout.

`head_size = 64 # put at prime of file`

n_heads = 8 # put at prime of file

class MultiHeadAttention(nn.Module):

def __init__(self):

tremendous().__init__()

self.k_proj = nn.Linear(n_emb, head_size, bias=False)

self.q_proj = nn.Linear(n_emb, head_size, bias=False)

self.v_proj = nn.Linear(n_emb, head_size, bias=False)

indices = mx.arange(ctx_len)

masks = indices[:, None] < indices[None] # broadcasting trick

self._causal_mask = masks * -1e9

self.c_proj = nn.Linear(head_size, n_emb) # output projection

self.attn_dropout = nn.Dropout(dropout)

self.resid_dropout = nn.Dropout(dropout)

def __call__(self, x):

B, T, C = x.form # (batch_size, ctx_len, n_emb)

Okay = self.k_proj(x) # (B, T, head_size)

Q = self.q_proj(x) # (B, T, head_size)

V = self.v_proj(x) # (B, T, head_size)

mha_shape = (B, T, n_heads, head_size//n_heads)

Okay = mx.as_strided(Okay, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)

Q = mx.as_strided(Q, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)

V = mx.as_strided(V, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)

attn_weights = (Q @ Okay.transpose([0, 1, 3, 2])) / math.sqrt(Q.form[-1]) # (B, n_heads, T, T)

attn_weights = attn_weights + self._causal_mask[:T, :T]

attn_weights = mx.softmax(attn_weights, axis=-1)

attn_weights = self.attn_dropout(attn_weights)

o = (attn_weights @ V) # (B, n_heads, T, head_size//n_heads)

o = o.transpose([0, 2, 1, 3]).reshape((B, T, head_size)) # concat heads

o = self.c_proj(self.resid_dropout(o))

return o

Since we have everything in one tensor, we can do some shape manipulation to do the concatenation. First, we move `n_heads` back to the second-to-last dimension with the transpose function. Then we reshape back to the original size to undo the splitting into heads we performed earlier. This is the same as concatenating the final vectors from each head. And that’s it for multi-head attention! We’ve gotten through the most intense part of our implementation.
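A quick way to convince yourself that transpose plus reshape really is concatenation is to compare the two routes on a small array. The sketch below uses NumPy rather than MLX (an assumption made only so it runs anywhere), and the toy sizes are arbitrary.

```python
import numpy as np

B, T, n_heads, head_dim = 1, 2, 3, 4   # toy sizes, chosen only for illustration

# Pretend this is the per-head attention output: (B, n_heads, T, head_dim)
o = np.arange(B * n_heads * T * head_dim).reshape(B, n_heads, T, head_dim)

# Route 1: the article's approach, transpose then reshape
merged = o.transpose(0, 2, 1, 3).reshape(B, T, n_heads * head_dim)

# Route 2: explicitly concatenate each head's (B, T, head_dim) output
concatenated = np.concatenate([o[:, h] for h in range(n_heads)], axis=-1)

print(merged.shape)
print(np.array_equal(merged, concatenated))
```

Both routes produce the same `(B, T, n_heads * head_dim)` array, which is why the single reshape is enough.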

The next part of the architecture is the multilayer perceptron, or MLP. This is a fancy way of saying two stacked linear layers. There’s not much to be said here; it’s a standard neural network.

```python
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.c_fc = nn.Linear(n_emb, 4 * n_emb)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * n_emb, n_emb)
        self.dropout = nn.Dropout(dropout)

    def __call__(self, x):
        x = self.gelu(self.c_fc(x))
        x = self.c_proj(x)
        x = self.dropout(x)
        return x
```

We take the input and project it to a higher dimension with `c_fc`. Then we apply the GELU nonlinearity and project it back down to the embedding dimension with `c_proj`. Finally, we apply dropout and return. The purpose of the MLP is to allow for some computation after the vectors have communicated during attention. We will stack these communication layers (attention) and computation layers (MLP) into a block.
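For intuition, here is a tiny pure-Python sketch of the same expand, activate, project pattern on a single vector. The weights are made up, dropout is left out (as in eval mode), and `gelu` is written in its exact form; whether `nn.GELU` matches this exactly depends on how it is configured, so treat the numbers as illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """y = x @ weight + bias for one vector x (weight is n_in x n_out)."""
    return [sum(xi * w_row[j] for xi, w_row in zip(x, weight)) + bias[j]
            for j in range(len(bias))]

n_emb = 2                      # toy embedding size (the article uses 128)
hidden = 4 * n_emb             # the MLP widens to 4x before projecting back
w_fc = [[0.5] * hidden for _ in range(n_emb)]      # made-up weights
w_proj = [[0.25] * n_emb for _ in range(hidden)]   # made-up weights

x = [1.0, -1.0]
h = [gelu(v) for v in linear(x, w_fc, [0.1] * hidden)]   # (hidden,)
out = linear(h, w_proj, [0.0] * n_emb)                   # back to (n_emb,)

print(len(h), len(out))
```

The hidden vector has `4 * n_emb` entries and the output is back at `n_emb`, so the block can be stacked without changing the shape of `x`.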

A GPT block consists of attention followed by an MLP. These blocks will be repeated to make the architecture deep.

```python
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = MLP()
        self.mha = MultiHeadAttention()

    def __call__(self, x):
        x = self.mha(x)
        x = self.mlp(x)
        return x
```

Now, we need to add two more features to improve training stability. Let’s take a look at the architecture diagram again.

## Layernorms and Skip Connections

We still need to implement the components highlighted in the diagram. The arrows are skip connections. Instead of the input being transformed directly, the effect of the attention and MLP layers is additive: their result is added to the input instead of replacing it. This is good for the training stability of deep networks, since in the backward pass the operands of an addition receive the same gradient as their sum. Gradients can thus flow backwards freely, which helps prevent the vanishing and exploding gradients that plague deep networks. Layernorm also helps with training stability by normalizing each token’s activations to zero mean and unit variance before a learned scale and shift are applied. Here is the final implementation.

```python
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = MLP()
        self.mha = MultiHeadAttention()
        self.ln_1 = nn.LayerNorm(dims=n_emb)
        self.ln_2 = nn.LayerNorm(dims=n_emb)

    def __call__(self, x):
        x = x + self.mha(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```

Layernorm is applied before multi-head attention and the MLP. The skip connections are added with `x = x + ...`, making the operations additive.
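The gradient argument for skip connections can be checked numerically on a scalar. In the rough sketch below, `f` is a made-up stand-in for a layer whose own gradient has collapsed to nearly zero, and `numerical_grad` is a hypothetical finite-difference helper, not anything from MLX.

```python
import math

def f(x):
    """A saturated 'layer': its slope at x = 2 is practically zero."""
    return math.tanh(10.0 * x)

def numerical_grad(fn, x, eps=1e-6):
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x = 2.0
plain_grad = numerical_grad(f, x)                      # layer replaces its input
residual_grad = numerical_grad(lambda v: v + f(v), x)  # layer is added to its input

print(plain_grad, residual_grad)
```

Without the skip connection the gradient reaching `x` is essentially zero; with `x + f(x)` it stays close to 1, because the addition passes the upstream gradient straight through to `x` regardless of what `f` does.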

With the Block defined, we can finish the full GPT-2 forward pass.

```python
n_layers = 3 # put at top of file

class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_emb) # token embeddings
        self.wpe = nn.Embedding(ctx_len, n_emb) # position embeddings
        self.blocks = nn.Sequential(
            *[Block() for _ in range(n_layers)],
        ) # transformer blocks
        self.ln_f = nn.LayerNorm(dims=n_emb) # final layernorm
        self.lm_head = nn.Linear(n_emb, vocab_size) # output projection

    # Tensor shapes commented
    def __call__(self, x):
        B, T = x.shape # (B = batch_size, T = ctx_len)
        tok_emb = self.wte(x) # (B, T, n_emb)
        pos_emb = self.wpe(mx.arange(T)) # (T, n_emb)
        x = tok_emb + pos_emb # (B, T, n_emb)
        x = self.blocks(x) # (B, T, n_emb)
        x = self.ln_f(x) # (B, T, n_emb)
        logits = self.lm_head(x) # (B, T, vocab_size)
        return logits
```

We create a container for the blocks using `nn.Sequential`, which takes any input and passes it sequentially through the contained layers. Then we can apply all the blocks with `self.blocks(x)`. Finally, we apply a layer norm and then the `lm_head`. The `lm_head`, or language modeling head, is just a linear layer that maps from the embedding dimension to the vocab size. For each position, the model outputs a vector containing one value for each token in our vocabulary: the logits. We can softmax the logits to get a probability distribution over the vocabulary, which we can sample from to get the next token. We will also use the logits to calculate the loss during training. There are just two more things we need to implement before we begin training.

We need to write a generate function to sample from the model once training is complete. The idea is that we start with some sequence of our choice, then we predict the next token and append it to our sequence. Then we feed the new sequence in and predict the next token again. This continues until we decide to stop.

```python
# method of GPT class
def generate(self, max_new_tokens):
    ctx = mx.zeros((1, 1), dtype=mx.int32)
```

We prompt the model with a single token, zero. Zero is the newline character, so it’s a natural place to start the generation since we just want to see how Shakespeare-like our model can get. Note that we initialize the shape to (1, 1) to simulate a single batch with a sequence length of 1.

```python
# method of GPT class
def generate(self, max_new_tokens):
    ctx = mx.zeros((1, 1), dtype=mx.int32)
    for _ in range(max_new_tokens):
        logits = self(ctx[:, -ctx_len:]) # pass in last ctx_len characters
        logits = logits[:, -1, :] # get logits for the next token
        next_tok = mx.random.categorical(logits, num_samples=1)
        ctx = mx.concatenate((ctx, next_tok), axis=1)
    return ctx
```

Then we get the logits for the next token by passing the last `ctx_len` characters to the model. However, our model output is of shape `(B, T, vocab_size)`, since it predicts the next-token logits for each token in the input. We use all of that in training, but now we only want the logits for the last token, because we can use them to sample a new token. Therefore we index the logits to get the last element along axis 1, which is the sequence dimension. Then we sample the next token using the `mx.random.categorical()` function, which takes the logits and the number of samples we want as input. This function will softmax the logits to turn them into a probability distribution and then randomly sample a token according to the probabilities. Finally, we concatenate the new token to the context and repeat the process `max_new_tokens` times.
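The loop is easier to follow with the model swapped out for something trivial. The sketch below is pure Python: `toy_model` is a hypothetical stand-in that strongly prefers "previous token plus one", and `sample_categorical` imitates softmax-then-sample with the standard library. Neither is MLX code, and the tiny `ctx_len` and `vocab_size` are chosen only for the demo.

```python
import math
import random

ctx_len = 4        # toy context window (the article uses 128)
vocab_size = 5     # toy vocabulary

def toy_model(tokens):
    """Hypothetical stand-in for the GPT: logits for the next token only."""
    assert len(tokens) <= ctx_len          # the model never sees more than ctx_len
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 50.0
    return logits

def sample_categorical(logits, rng):
    """Softmax the logits (up to normalization), then draw one index."""
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(max_new_tokens, seed=0):
    rng = random.Random(seed)
    ctx = [0]                                  # start from token 0
    for _ in range(max_new_tokens):
        logits = toy_model(ctx[-ctx_len:])     # crop to the last ctx_len tokens
        ctx.append(sample_categorical(logits, rng))
    return ctx

tokens = generate(6)
print(tokens)
```

The context keeps growing by one token per step, while the slice `ctx[-ctx_len:]` guarantees the model is never asked about more positions than it has position embeddings for.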

The last thing to do is handle weight initialization, which is important for training dynamics.

```python
# method of GPT
def _init_parameters(self):
    normal_init = nn.init.normal(mean=0.0, std=0.02)
    residual_init = nn.init.normal(mean=0.0, std=(0.02 / math.sqrt(2 * n_layers)))
```

First, we define two different `nn.init.normal` functions. The first one is for initializing all linear and embedding layers. The second is for initializing linear layers that are specifically residual projections, i.e. the last linear layer inside multi-head attention and the MLP. The reason for this special initialization is that it keeps accumulation along the residual path in check as model depth increases, according to the GPT-2 paper [2].
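As a quick arithmetic check, here is what the two standard deviations work out to for `n_layers = 3`. The last line reflects one common intuition for the scaling, under the simplifying assumption that the `2 * n_layers` residual contributions are independent; that reading is an interpretation, not something taken from the article’s code.

```python
import math

n_layers = 3
base_std = 0.02

# Each block adds to the residual stream twice (attention and MLP),
# so there are 2 * n_layers residual projections in total.
residual_std = base_std / math.sqrt(2 * n_layers)

# If the contributions were independent, their standard deviations would add
# in quadrature, bringing the accumulated spread back to roughly base_std.
accumulated_std = residual_std * math.sqrt(2 * n_layers)

print(f"residual std: {residual_std:.6f}")
print(f"accumulated:  {accumulated_std:.6f}")
```

So the residual projections start out at a bit under half the spread of the other layers, and the deeper the model, the smaller they start.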

In mlx we can change the parameters of the model using the `update()` method of `nn.Module`. Checking the docs, it expects a complete or partial dictionary of the new model parameters. We can see what this dictionary looks like by printing out `self.parameters()` inside the GPT class.

```
{'wte': {'weight': array([[-0.025084, -0.0197523, -0.0341617, ..., -0.0979123, -0.0830218, -0.0784692],
[-0.00777913, -0.117002, -0.0310708, ..., 0.0128591, 0.122941, 0.000414443],
[0.0240044, -0.0859084, 0.0253116, ..., 0.108967, 0.0767123, 0.0221565],
...,
[0.050729, -0.04578, 0.0685943, ..., -0.0496998, -0.00350879, -0.00631825],
[0.00518804, 0.0499818, 0.0330045, ..., 0.0300661, 0.0431054, 0.000958906],
[-0.0323007, 0.0132046, 0.0208218, ..., -0.0785159, 0.00436121, -0.00726994]], dtype=float32)}, 'wpe': {'weight': array([[0.000797923, -0.0396898, -0.029047, ..., -0.0132273, 0.00684483, -0.0067624],
[-0.0247021, -0.0274349, 0.0310587, ..., -0.100099, 0.0301566, -0.0178732],
[0.0929172, -0.0468649, 0.0101506, ..., -0.0341086, -0.0516283, 0.0447596],
...,
[-0.0508172, 0.0892201, -0.00183612, ..., -0.00341944, 0.023437, 0.0296461],
[0.0105829, 0.0688093, 0.146744, ..., -0.0836337, 0.0206679, 0.0184166],
[-0.00578717, -0.0606196, -0.0917056, ..., -0.0641549, -0.0490424, 0.0998114]], dtype=float32)}, 'blocks': {'layers': [{'mlp': {'c_fc': {'weight': array([[0.0169199, 0.00264431, 0.0316978, ..., -0.0596867, -0.0153549, 0.0176386],
...
```

It’s a nested dictionary containing each model weight as an `mx.array`. So to initialize the parameters of our model we need to build up a dictionary like this with our new params and pass it to `self.update()`. We can achieve this as follows:

```python
# method of GPT
def _init_parameters(self):
    normal_init = nn.init.normal(mean=0.0, std=0.02)
    residual_init = nn.init.normal(mean=0.0, std=(0.02 / math.sqrt(2 * n_layers)))
    new_params = []
    for name, module in self.named_modules():
        if isinstance(module, nn.layers.linear.Linear):
            new_params.append((name + '.weight', normal_init(module.weight)))
        elif isinstance(module, nn.layers.embedding.Embedding):
            new_params.append((name + '.weight', normal_init(module.weight)))
```

We maintain a list of tuples called `new_params`, which will contain tuples of (parameter_name, new_value). Next, we loop through each `nn.Module` object in our model with `self.named_modules()`, which returns tuples of (name, module). If we print out the module names within the loop, we see that they look like this:

```
lm_head
blocks
blocks.layers.4
blocks.layers.3
blocks.layers.3.ln_2
blocks.layers.3.ln_1
blocks.layers.3.mha
blocks.layers.3.mha.resid_dropout
blocks.layers.3.mha.c_proj
blocks.layers.3.mha.attn_dropout
blocks.layers.3.mha.c_attn
...
blocks.layers.0.mlp.dropout
blocks.layers.0.mlp.c_proj
blocks.layers.0.mlp.gelu
blocks.layers.0.mlp.c_fc
wpe
wte
```

We use the `isinstance()` function to find the linear and embedding layers and then add them to our list. For example, say we are looping and reach “blocks.layers.0.mlp.c_fc”, which is the first linear layer in the MLP. This would trigger the first if statement, and the tuple `("blocks.layers.0.mlp.c_fc.weight", [<normally initialized weight here>])` would be added to our list. We have to add “.weight” to the name because we specifically want to initialize the weight in this way, not the bias. Now we need to handle the residual projection initialization.

```python
# method of GPT
def _init_parameters(self):
    normal_init = nn.init.normal(mean=0.0, std=0.02)
    residual_init = nn.init.normal(mean=0.0, std=(0.02 / math.sqrt(2 * n_layers)))
    new_params = []
    for name, module in self.named_modules():
        if isinstance(module, nn.layers.linear.Linear):
            if 'c_proj' in name: # residual projection
                new_params.append((name + '.weight', residual_init(module.weight)))
            else:
                new_params.append((name + '.weight', normal_init(module.weight)))
        elif isinstance(module, nn.layers.embedding.Embedding):
            new_params.append((name + '.weight', normal_init(module.weight)))
```

After checking whether the module is a linear layer, we check if “c_proj” is in the name, because that’s how we named the residual projections. Then we can apply the special initialization. Finally, we need to initialize the biases to zero.

```python
# method of GPT
def _init_parameters(self):
    normal_init = nn.init.normal(mean=0.0, std=0.02)
    residual_init = nn.init.normal(mean=0.0, std=(0.02 / math.sqrt(2 * n_layers)))
    new_params = []
    for name, module in self.named_modules():
        if isinstance(module, nn.layers.linear.Linear):
            if 'c_proj' in name:
                new_params.append((name + '.weight', residual_init(module.weight)))
            else:
                new_params.append((name + '.weight', normal_init(module.weight)))
            if 'bias' in module:
                new_params.append((name + '.bias', mx.zeros(module.bias.shape)))
        elif isinstance(module, nn.layers.embedding.Embedding):
            new_params.append((name + '.weight', normal_init(module.weight)))
    self.update(utils.tree_unflatten(new_params))
```

We add another if statement under our linear branch to check whether the `nn.Module` object has a bias attribute. If it does, we add it to the list initialized to zeros. Finally, we need to transform our list of tuples into a nested dictionary. Luckily mlx has some functions implemented for dealing with parameter dictionaries, and we can use `utils.tree_unflatten()` to convert this list of tuples to a nested parameter dictionary. This is passed into the update method to initialize the parameters. Now we can call `_init_parameters()` in the constructor.

```python
class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_emb) # token embeddings
        self.wpe = nn.Embedding(ctx_len, n_emb) # position embeddings
        self.blocks = nn.Sequential(
            *[Block() for _ in range(n_layers)],
        ) # transformer blocks
        self.ln_f = nn.LayerNorm(dims=n_emb) # final layernorm
        self.lm_head = nn.Linear(n_emb, vocab_size) # output projection
        self._init_parameters() # <-- initialize params
        # print total number of params on initialization
        total_params = sum([p.size for n,p in utils.tree_flatten(self.parameters())])
        print(f"Total params: {(total_params / 1e6):.3f}M")

    # Tensor shapes commented
    def __call__(self, x):
        B, T = x.shape # (B = batch_size, T = ctx_len)
        tok_emb = self.wte(x) # (B, T, n_emb)
        pos_emb = self.wpe(mx.arange(T)) # (T, n_emb)
        x = tok_emb + pos_emb # (B, T, n_emb)
        x = self.blocks(x) # (B, T, n_emb)
        x = self.ln_f(x) # (B, T, n_emb)
        logits = self.lm_head(x) # (B, T, vocab_size)
        return logits

    def generate(self, max_new_tokens):
        ctx = mx.zeros((1, 1), dtype=mx.int32)
        for _ in range(max_new_tokens):
            logits = self(ctx[:, -ctx_len:]) # pass in last ctx_len characters
            logits = logits[:, -1, :] # get logits for the next token
            next_tok = mx.random.categorical(logits, num_samples=1)
            ctx = mx.concatenate((ctx, next_tok), axis=1)
        return ctx

    def _init_parameters(self):
        normal_init = nn.init.normal(mean=0.0, std=0.02)
        residual_init = nn.init.normal(mean=0.0, std=(0.02 / math.sqrt(2 * n_layers)))
        new_params = []
        for name, module in self.named_modules():
            if isinstance(module, nn.layers.linear.Linear):
                if 'c_proj' in name:
                    new_params.append((name + '.weight', residual_init(module.weight)))
                else:
                    new_params.append((name + '.weight', normal_init(module.weight)))
                if 'bias' in module:
                    new_params.append((name + '.bias', mx.zeros(module.bias.shape)))
            elif isinstance(module, nn.layers.embedding.Embedding):
                new_params.append((name + '.weight', normal_init(module.weight)))
        self.update(utils.tree_unflatten(new_params))
```

We also add two lines of code in the constructor to print the total number of params. Finally, we’re ready to build the training loop.
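Both the initialization and the parameter count lean on converting between a flat list of dotted names and a nested dictionary. The pure-Python sketch below shows the idea with made-up sizes in place of arrays. The `unflatten` and `flatten` helpers are simplified, hypothetical stand-ins for `utils.tree_unflatten` and `utils.tree_flatten`; in particular they keep numeric keys such as `"0"` as dictionary keys, whereas, as far as I know, the real utilities rebuild lists for them.

```python
def unflatten(pairs):
    """Dotted names -> nested dicts (simplified stand-in, dicts only)."""
    tree = {}
    for name, value in pairs:
        *parents, leaf = name.split(".")
        node = tree
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return tree

def flatten(tree, prefix=""):
    """Nested dicts -> list of (dotted name, value) pairs."""
    pairs = []
    for key, value in tree.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs.extend(flatten(value, name))
        else:
            pairs.append((name, value))
    return pairs

# Made-up entries: the values are parameter counts rather than real arrays.
pairs = [
    ("wte.weight", 8320),
    ("blocks.layers.0.mlp.c_fc.weight", 65536),
    ("blocks.layers.0.mlp.c_fc.bias", 512),
]
tree = unflatten(pairs)
total = sum(size for _, size in flatten(tree))

print(tree["blocks"]["layers"]["0"]["mlp"]["c_fc"]["bias"])
print(total)
```

Flattening the nested structure gives back the same list of pairs, which is what makes it convenient to build the new parameters as a flat list and to count them with a single `sum`.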

To train the model we need a loss function. Since we are predicting classes (the next token), we use cross-entropy loss.

```python
def loss_fn(model, x, y):
    logits = model(x)
    B, T, C = logits.shape # (batch_size, seq_len, vocab_size)
    logits = logits.reshape(B*T, C)
    y = y.reshape(B*T)
    loss = nn.losses.cross_entropy(logits, y, reduction='mean')
    return loss
```

First, we get the logits from the model. Then we reshape the logits into a list of `vocab_size`-length arrays. We also reshape `y`, the correct token ids, to have the same length. Then we use the built-in cross-entropy loss function to calculate the loss for every example and average them to get a single value.
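A small pure-Python sketch of the same flatten-then-average computation gives a useful reference point: if the model knows nothing and outputs identical logits for all 65 characters, the loss is `ln(65)`. The `cross_entropy` helper and the toy targets below are illustrative assumptions, not the MLX implementation.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[target]

vocab_size = 65
B, T = 2, 3

# A clueless model: the same logit for every character at every position.
logits = [[[0.0] * vocab_size for _ in range(T)] for _ in range(B)]
targets = [[1, 5, 9], [0, 64, 3]]            # arbitrary correct token ids

flat_logits = [row for batch in logits for row in batch]    # (B*T, C)
flat_targets = [t for batch in targets for t in batch]      # (B*T,)

losses = [cross_entropy(l, t) for l, t in zip(flat_logits, flat_targets)]
mean_loss = sum(losses) / len(losses)

print(f"{mean_loss:.4f} vs ln(65) = {math.log(vocab_size):.4f}")
```

A loss near 4.17 therefore means the model is doing no better than uniform guessing, and anything well below that reflects real learning.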

```python
model = GPT()
mx.eval(model.parameters()) # Create the model params (mlx is lazy evaluation)
loss_and_grad = nn.value_and_grad(model, loss_fn)
lr = 0.1
optimizer = optim.AdamW(learning_rate=lr)
```

Next, we instantiate the model, but since mlx uses lazy evaluation it won’t allocate and create the parameters. We need to call `mx.eval` on the parameters to ensure they get created. Then we can use `nn.value_and_grad()` to get a function that returns the loss and the gradient of the loss w.r.t. the model parameters. This is all we need to optimize. Finally, we initialize an AdamW optimizer.

A quick note on `nn.value_and_grad()`. If you are used to PyTorch you might expect us to use `loss.backward()`, which goes through the computation graph and updates the `.grad` attribute of each tensor in our model. However, mlx automatic differentiation works on functions instead of computation graphs [3]. Therefore, mlx has built-ins that take in a function and return the gradient function, such as `nn.value_and_grad()`.
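To illustrate the function-transform style without depending on MLX, here is a toy sketch. The `value_and_grad` below is a hypothetical stand-in that uses central finite differences, so it is not how MLX computes gradients. What it shares with `nn.value_and_grad()` is only the calling pattern: a function goes in, and a new function that returns `(loss, grads)` comes out.

```python
def value_and_grad(fn, eps=1e-6):
    """Toy stand-in: wraps fn so it also returns finite-difference gradients."""
    def wrapped(params, *args):
        value = fn(params, *args)
        grads = []
        for i in range(len(params)):
            up = list(params)
            down = list(params)
            up[i] += eps
            down[i] -= eps
            grads.append((fn(up, *args) - fn(down, *args)) / (2 * eps))
        return value, grads
    return wrapped

def loss_fn(params, x, y):
    w, b = params
    return (w * x + b - y) ** 2      # squared error of a one-weight linear model

loss_and_grad = value_and_grad(loss_fn)

params = [2.0, 0.5]
loss, grads = loss_and_grad(params, 3.0, 5.0)

# One plain gradient-descent step; nothing is stored on the parameters themselves.
lr = 0.01
new_params = [p - lr * g for p, g in zip(params, grads)]
new_loss = loss_fn(new_params, 3.0, 5.0)

print(loss, grads, new_loss)
```

The gradients come back as a separate value with the same structure as the parameters, which is why the training loop hands both the model and `grads` to the optimizer instead of calling a `backward()` method.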

Now we define the training loop.

```python
num_epochs=20
batch_size=32

for epoch in range(num_epochs):
    model.train(True)
    running_loss = 0
    batch_cnt = 0
    for input, label in get_batches(X_train, y_train, batch_size):
        batch_cnt += 1
        loss, grads = loss_and_grad(model, input, label)
        optimizer.update(model, grads)
        running_loss += loss.item()
        # compute new parameters and optimizer state
        mx.eval(model.parameters(), optimizer.state)
    avg_train_loss = running_loss / batch_cnt
    model.train(False) # set eval mode
    running_loss = 0
    batch_cnt = 0
    for input, label in get_batches(X_val, y_val, batch_size):
        batch_cnt += 1
        loss = loss_fn(model, input, label)
        running_loss += loss.item()
    avg_val_loss = running_loss / batch_cnt
    print(f"Epoch {epoch:2} | train = {avg_train_loss:.4f} | val = {avg_val_loss:.4f}")
```

The outer loop runs through the epochs. We first set the model to training mode because some modules, such as dropout, behave differently during training and testing. Then we use our `get_batches` function from earlier to loop through batches of the training data. We get the loss over the batch and the gradient using `loss_and_grad`. Then we pass the model and gradients to the optimizer to update the model parameters. Finally we call `mx.eval` (remember mlx does lazy evaluation) to ensure the parameters and optimizer state get updated. Then we calculate the average train loss over the data to print later. This is one pass through the training data. Similarly, we calculate the validation loss and then print the average train and val loss over the epoch.

```python
completion = decode(model.generate(1000)[0].tolist())
print(completion)
with open('completions.txt', 'w') as f:
    f.write(completion)
```

Finally, we add some code to generate from our model. Since the generation output is still in the (B, T) shape, we have to index it at 0 to make it 1D and then convert it from an mlx array to a Python list. Then we can pass it to our decode function from earlier and write it to a file.
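As a reminder of what happens on the decoding side, here is a small self-contained sketch. It assumes the vocabulary is the sorted set of characters in the text, and the `encode` and `decode` helpers are hypothetical reconstructions that may differ in detail from the ones defined earlier. A nested Python list stands in for the `(B, T)` array returned by `generate`.

```python
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

vocab = sorted(set(text))                     # assumption: sorted character set
itos = {i: ch for i, ch in enumerate(vocab)}  # token id -> character
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> token id

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

# Stand-in for model.generate(...): a batch of one sequence, shape (1, T).
generated = [encode("\nFirst")]
completion = decode(generated[0])             # index 0 drops the batch dimension

print(repr(vocab[0]))
print(repr(completion))
```

Under the sorted-vocabulary assumption the newline character lands at id 0, which is consistent with prompting the model with a single zero token.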

These are the hyperparameters we’ll use for training (you can play around with these):

```python
ctx_len = 128
n_emb = 128
dropout = 0.1
head_size = 128
n_heads = 4
n_layers = 3
num_epochs = 20
batch_size = 64
lr = 1e-3
```
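For a sense of scale, the parameter count implied by these settings can be worked out by hand. The sketch below makes a few assumptions that should be checked against your own run: `vocab_size = 65` (from the tokenizer section), every `nn.Linear` carries a bias unless created with `bias=False`, each `nn.LayerNorm` has a learned scale and shift, and the causal mask is not counted as a parameter.

```python
vocab_size = 65      # assumption: 65 unique characters in the corpus
ctx_len = 128
n_emb = 128
head_size = 128
n_layers = 3

def linear(n_in, n_out, bias=True):
    """Parameter count of a linear layer."""
    return n_in * n_out + (n_out if bias else 0)

layernorm = 2 * n_emb                                   # scale + shift

attention = (3 * linear(n_emb, head_size, bias=False)   # k_proj, q_proj, v_proj
             + linear(head_size, n_emb))                # c_proj
mlp = linear(n_emb, 4 * n_emb) + linear(4 * n_emb, n_emb)
block = attention + mlp + 2 * layernorm                 # ln_1 and ln_2

total = (vocab_size * n_emb          # wte
         + ctx_len * n_emb           # wpe
         + n_layers * block
         + layernorm                 # ln_f
         + linear(n_emb, vocab_size))  # lm_head

print(f"Total params: {total / 1e6:.3f}M")
```

Under these assumptions the model has roughly 0.63M parameters, most of them in the three transformer blocks.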

Now we can run the file to start training. With the settings above, training took around 10 minutes on my M2 MacBook. I achieved the following loss on the last epoch.

`Epoch 19 | train = 1.6961 | val = 1.8143`

Let’s look at some output.

```
GLOUCESTER:
However accomes mo transfer it.

KING EDWARD:
The place our that proclaim that I curse, or I sprithe.
CORIOLANUS:
Not need:
His bops to thy father
At with hath folks; by son and fproathead:
The nice nor could prosperson prefer it not,
What, the beggares
Extra hath, when that made a,
Your vainst Citizen:
Let listed below are go in queen me and knife
To my deserved me you promise: not a fettimes,
That one the is not going to.
CORIOLANUS:
And been of queens,
Thou to will we finest!
JULIET:
Not, brother recourable this doth our accuse
Into combat!
```

Not bad for just 10 minutes of training with a tiny model that is predicting characters! It clearly has the form of Shakespeare, although it is nonsense. The only difference between our model and the real GPT-2 now is scale! Now I encourage you to experiment: try out different settings, maybe tinker with the architecture, and see how low of a loss you can achieve.

[1] A. Karpathy, *Tiny Shakespeare* [Data set] (2015), https://github.com/karpathy/char-rnn (MIT license)

[2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised Multitask Learners (2019), OpenAI

[3] Automatic Differentiation, mlx docs