Structured Generative AI | by Oren Matar | Apr, 2024

Learn how to constrain your model to output defined formats

Towards Data Science

In this post I'll explain and demonstrate the concept of "structured generative AI": generative AI constrained to defined formats. By the end of the post, you'll understand where and when it can be used and how to implement it, whether you're crafting a transformer model from scratch or using Hugging Face's models. Additionally, we will cover an important tip for tokenization that is especially relevant for structured languages.

One of the many uses of generative AI is as a translation tool. This often involves translating between two human languages but can also include computer languages or formats. For example, your application may need to translate natural (human) language to SQL:

Natural language: "Get customer names and emails of customers from the US"

SQL: "SELECT title, electronic mail FROM prospects WHERE nation = 'USA'"

Or to convert text data into a JSON format:

Natural language: "I'm John Doe, phone number is 555-123-4567,
my friends are Anna and Sara"

JSON: {name: "John Doe",
       phone_number: "555-123-4567",
       friends: {
           name: ["Anna", "Sara"]}
      }

Naturally, many more applications are possible, for other structured languages. The training process for such tasks involves feeding examples of natural language alongside structured formats to an encoder-decoder model. Alternatively, leveraging a pre-trained Large Language Model (LLM) can suffice.

While achieving 100% accuracy is unattainable, there is one class of errors that we can eliminate: syntax errors. These are violations of the format of the language, like replacing commas with dots, using table names that are not present in the SQL schema, or omitting bracket closures, which render the SQL or JSON non-executable.

The fact that we're translating into a structured language means that the list of legitimate tokens at every generation step is limited and pre-determined. If we could insert this knowledge into the generative AI process, we could avoid a wide range of incorrect results. This is the idea behind structured generative AI: constrain it to a list of legitimate tokens.

A quick reminder on how tokens are generated

Whether using an encoder-decoder or GPT architecture, token generation operates sequentially. Each token's selection relies on both the input and the previously generated tokens, continuing until an <end> token is generated, signifying the completion of the sequence. At each step, a classifier assigns logit values to all tokens in the vocabulary, representing the probability of each token being the next selection. The next token is sampled based on these logits.
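
To make this concrete, here is a minimal sketch of a single decoding step, with random logits standing in for a real classifier and a roughly BART-sized vocabulary (both are assumptions for illustration): the logits are converted to probabilities with a softmax, and the next token ID is sampled from that distribution.

import torch

# minimal sketch of one decoding step; the vocabulary size and logits are made up
vocab_size = 50265                               # roughly the size of BART's vocabulary
logits = torch.randn(vocab_size)                 # scores the classifier assigned to every token
probs = torch.softmax(logits, dim=-1)            # turn logits into a probability distribution
next_token_id = torch.multinomial(probs, 1)      # sample the next token from that distribution
print(int(next_token_id))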

The decoder classifier assigns a logit to every token in the vocabulary (Image by author)

Limiting token generation

To constrain token generation, we incorporate knowledge of the output language's structure. Illegitimate tokens have their logits set to -inf, ensuring their exclusion from selection. For instance, if only a comma or "FROM" is valid after "SELECT name", all other token logits are set to -inf.
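
As a minimal sketch of that masking step on its own (the token IDs below are made up, standing in for "," and " FROM"):

import torch

# sketch of constraining one step: only a handful of (made-up) token IDs stay
# eligible; every other logit is pushed to -inf and gets zero probability
logits = torch.randn(50265)
allowed_ids = [6, 566]                            # hypothetical IDs for "," and " FROM"
masked = torch.full_like(logits, float('-inf'))   # start with every token forbidden
masked[allowed_ids] = logits[allowed_ids]         # restore the scores of the legal tokens
probs = torch.softmax(masked, dim=-1)             # illegal tokens now have probability 0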

If you're using Hugging Face, this can be implemented using a "logits processor". To use it you need to implement a class with a __call__ method, which is called after the logits are calculated, but before the sampling. This method receives all token logits and the generated input IDs, and returns modified logits for all tokens.

The logits returned from the logits processor: all illegitimate tokens get a value of -inf (Image by author)

I'll demonstrate the code with a simplified example. First, we initialize the model; we will use Bart in this case, but this can work with any model.

from transformers import BartForConditionalGeneration, BartTokenizerFast, PreTrainedTokenizer
from transformers.generation.logits_process import LogitsProcessorList, LogitsProcessor
import torch

name = 'facebook/bart-large'
tokenizer = BartTokenizerFast.from_pretrained(name, add_prefix_space=True)
pretrained_model = BartForConditionalGeneration.from_pretrained(name)

If we want to generate a translation from natural language to SQL, we can run:

to_translate = 'customers emails from the us'
words = to_translate.split()
tokenized_text = tokenizer([words], is_split_into_words=True)

out = pretrained_model.generate(
    torch.tensor(tokenized_text["input_ids"]),
    max_new_tokens=20,
)
print(tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(
        out[0], skip_special_tokens=True)))

Returning

'More emails from the us'

Since we didn't fine-tune the model for text-to-SQL tasks, the output doesn't resemble SQL. We will not train the model in this tutorial, but we will guide it to generate an SQL query. We'll achieve this by employing a function that maps each generated token to a list of permissible next tokens. For simplicity, we'll focus only on the immediately preceding token, but more complicated mechanisms are easy to implement. We'll use a dictionary defining, for each token, which tokens are allowed to follow it. E.g. the query must begin with "SELECT" or "DELETE", and after "SELECT" only "name", "email", or "id" are allowed, since those are the columns in our schema.

rules = {'<s>': ['SELECT', 'DELETE'],  # beginning of the generation
         'SELECT': ['name', 'email', 'id'],  # names of columns in our schema
         'DELETE': ['name', 'email', 'id'],
         'name': [',', 'FROM'],
         'email': [',', 'FROM'],
         'id': [',', 'FROM'],
         ',': ['name', 'email', 'id'],
         'FROM': ['customers', 'vendors'],  # names of tables in our schema
         'customers': ['</s>'],
         'vendors': ['</s>'],  # end of the generation
         }

Now we need to convert these tokens to the IDs used by the model. This will happen inside a class inheriting from LogitsProcessor.

def convert_token_to_id(token):
    return tokenizer(token, add_special_tokens=False)['input_ids'][0]

class SQLLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer
        self.rules = {convert_token_to_id(k): [convert_token_to_id(v0) for v0 in v] for k, v in rules.items()}

Finally, we will implement the __call__ function, which is called after the logits are calculated. The function creates a new tensor of -infs, checks which IDs are legitimate according to the rules (the dictionary), and places their scores in the new tensor. The result is a tensor that only has valid values for the legitimate tokens.

class SQLLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer
        self.rules = {convert_token_to_id(k): [convert_token_to_id(v0) for v0 in v] for k, v in rules.items()}

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        if not (input_ids == self.tokenizer.bos_token_id).any():
            # we must allow the start token to appear before we start processing
            return scores
        # create a new tensor of -inf
        new_scores = torch.full((1, self.tokenizer.vocab_size), float('-inf'))
        # ids of legitimate tokens
        legit_ids = self.rules[int(input_ids[0, -1])]
        # place their values in the new tensor
        new_scores[:, legit_ids] = scores[0, legit_ids]
        return new_scores

And that's it! We can now run a generation with the logits processor:

to_translate = 'customers emails from the us'
words = to_translate.split()
tokenized_text = tokenizer([words], is_split_into_words=True, return_offsets_mapping=True)

logits_processor = LogitsProcessorList([SQLLogitsProcessor(tokenizer)])

out = pretrained_model.generate(
    torch.tensor(tokenized_text["input_ids"]),
    max_new_tokens=20,
    logits_processor=logits_processor
)
print(tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(
        out[0], skip_special_tokens=True)))

Returning

 SELECT email , email , id , email FROM customers

The result is a little strange, but remember: we didn't even train the model! We only enforced token generation based on specific rules. Notably, constraining generation doesn't interfere with training; the constraints only apply during generation, post-training. Thus, when appropriately implemented, these constraints can only improve generation accuracy.

Our simplistic implementation falls short of covering all the SQL syntax. A real implementation must support more of the syntax, potentially considering not just the last token but several, and enable batch generation. Once these enhancements are in place, our trained model can reliably generate executable SQL queries, constrained to valid table and column names from the schema. A similar approach can enforce constraints when generating JSON, ensuring key presence and bracket closure.
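
As a sketch of what that might look like, the same "token → allowed next tokens" pattern applies to JSON; the dictionary below is hypothetical (it reuses the keys from the JSON example at the start of the post and ignores quoting and string values for brevity):

# a hypothetical rules dictionary for constraining JSON generation, following the
# same pattern as the SQL rules; quoting and string values are omitted for brevity
json_rules = {'{': ['name', 'phone_number', 'friends'],  # an object opens with a known key
              'name': [':'],
              'phone_number': [':'],
              'friends': [':'],
              ',': ['name', 'phone_number', 'friends'],
              ']': [',', '}'],
              '}': ['</s>'],  # closing the object ends the generation
              }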

Be careful with tokenization

Tokenization is often overlooked, but correct tokenization is crucial when using generative AI for structured output. Under the hood, tokenization can have an impact on the training of your model. For example, you may fine-tune a model to translate text into JSON. As part of the fine-tuning process, you provide the model with examples of text-JSON pairs, which it tokenizes. What will this tokenization look like?

(Picture by creator)

While you read "[[" as two square brackets, the tokenizer converts them into a single ID, which will be treated as a completely distinct class from the single bracket by the token classifier. This makes the entire logic that the model must learn more complicated (for example, remembering how many brackets to close). Similarly, adding a space before words may change their tokenization and their class ID. For example:

(Image by author)

Again, this complicates the logic the model will have to learn since the weights connected to each of these IDs will have to be learned separately, for slightly different cases.
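
A quick way to see this for yourself (assuming the same BART tokenizer initialized in the earlier snippets; the exact tokens you get back depend on the tokenizer's vocabulary) is to tokenize the variants directly:

# inspect how bracket runs and leading spaces change tokenization; the exact
# results depend on the tokenizer's vocabulary, so the comments are illustrative
print(tokenizer.tokenize('[['))     # may come back as one combined "[[" token
print(tokenizer.tokenize('[ ['))    # spaced brackets come back as separate bracket tokens
print(tokenizer.tokenize('name'))   # a word at the start of the text
print(tokenizer.tokenize(' name'))  # the same word with a leading space may map to a different token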

For simpler learning, ensure each concept and punctuation mark is consistently converted to the same token, by adding spaces before words and characters.

Spaced-out words lead to more consistent tokenization (Image by author)

Inputting spaced examples during fine-tuning simplifies the patterns the model has to learn, enhancing model accuracy. During prediction, the model will output the JSON with spaces, which you can then remove before parsing.

Summary

Generative AI offers a valuable approach for translating into a formatted language. By leveraging the knowledge of the output structure, we can constrain the generative process, eliminating a class of errors and ensuring the executability of queries and parse-ability of data structures.

Additionally, these formats may use punctuation and keywords to signify certain meanings. Making sure that the tokenization of these keywords is consistent can dramatically reduce the complexity of the patterns that the model has to learn, thus reducing the required size of the model and its training time, while increasing its accuracy.

Structured generative AI can effectively translate natural language into any structured format. These translations enable information extraction from text or query generation, which is a powerful tool for numerous applications.


