
How to build an OpenAI-compatible API | by Saar Berkovich | Mar, 2024



We'll start with implementing the non-streaming bit. Let's begin with modeling our request:

from typing import List, Optional

from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False

The Pydantic model represents the request from the client, aiming to replicate the API reference. For the sake of brevity, this model doesn't implement the entire spec, but rather the bare bones needed for it to work. If you're missing a parameter that is part of the API spec (like top_p), you can simply add it to the model.

The ChatCompletionRequest models the parameters OpenAI uses in its requests. The chat API spec requires specifying a list of ChatMessage (like a chat history; the client is usually in charge of keeping it and feeding it back in at every request). Each chat message has a role attribute (usually system, assistant, or user) and a content attribute containing the actual message text.
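For example, a minimal sketch of adding a couple of extra spec parameters (top_p and presence_penalty; the field names follow the OpenAI API reference, while the defaults here are assumptions) could look like this:

from typing import List, Optional

from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False
    # illustrative additions: nucleus sampling cutoff and repetition control
    top_p: Optional[float] = 1.0
    presence_penalty: Optional[float] = 0.0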

Next, we'll write our FastAPI chat completions endpoint:

import time

from fastapi import FastAPI

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages and request.messages[0].role == "user":
        resp_content = "As a mock AI Assistant, I can only echo your last message:" + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there were no messages!"

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }

That simple.

Testing our implementation

Assuming both code blocks are in a file called main.py, we'll install two Python libraries in our environment of choice (it's always best to create a new one), pip install fastapi openai (plus uvicorn, if it isn't already installed, to run the server), and launch the server from a terminal:

uvicorn main:app
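As a quick sanity check (a sketch that isn't part of the walkthrough; it assumes the requests library is installed and the server is running on the default port 8000), you can hit the endpoint with a plain HTTP request first:

import requests

# minimal request body: only "messages" is required, the other fields have defaults
resp = requests.post(
    "http://localhost:8000/chat/completions",
    json={"messages": [{"role": "user", "content": "hello there"}]},
)
print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])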

Using another terminal (or by launching the server in the background), we will open a Python console and copy-paste the following code, taken straight from OpenAI's Python Client Reference:

from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # change the default port if needed
)

# call API
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-1337-turbo-pro-max",
)

# print the top "choice"
print(chat_completion.choices[0].message.content)

If you've done everything correctly, the response from the server should be printed correctly. It's also worth inspecting the chat_completion object to verify that all relevant attributes are as sent from our server. You should see something like this:

Code by the author, formatted using Carbon
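If you'd rather dump the whole object than dig through its attributes, the 1.x openai client returns Pydantic models, so something along these lines should work:

# pretty-print the full ChatCompletion object returned by our mock server
print(chat_completion.model_dump_json(indent=2))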

As LLM generation tends to be slow (computationally expensive), it's worth streaming your generated content back to the client, so that the user can see the response as it's being generated, without having to wait for it to finish. If you recall, we gave ChatCompletionRequest a boolean stream property: it lets the client request that the data be streamed back to it, rather than sent all at once.

This makes things just a bit more complex. We'll create a generator function to wrap our mock response (in a real-world scenario, we will want a generator that is hooked up to our LLM generation):

import asyncio
import json

async def _resp_async_generator(text_resp: str):
    # let's pretend every word is a token and return it over time
    tokens = text_resp.split(" ")

    for i, token in enumerate(tokens):
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": time.time(),
            "model": "blah",
            "choices": [{"delta": {"content": token + " "}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(1)
    yield "data: [DONE]\n\n"
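In a real-world setup, the split-by-spaces loop above would be replaced by whatever your model actually emits. As a rough sketch (the token_stream parameter is a placeholder for whichever async iterator your LLM backend exposes, not a real library call), the same chunk format can wrap any token source:

import json
import time
from typing import AsyncIterator

async def _llm_async_generator(token_stream: AsyncIterator[str], model_name: str):
    # wrap tokens coming from a real model in the same SSE-style chunk format
    i = 0
    async for token in token_stream:
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": time.time(),
            "model": model_name,
            "choices": [{"delta": {"content": token}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        i += 1
    yield "data: [DONE]\n\n"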

And now, we can modify our original endpoint to return a StreamingResponse when stream==True:

import time

from starlette.responses import StreamingResponse

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message:" + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there wasn't one!"
    if request.stream:
        return StreamingResponse(_resp_async_generator(resp_content), media_type="application/x-ndjson")

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }

Testing the streaming implementation

After restarting the uvicorn server, we'll open up a Python console and paste in this code (again, taken from OpenAI's library docs):

from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # change the default port if needed
)

stream = client.chat.completions.create(
    model="mock-gpt-model",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "")

You should see each word in the server's response being printed one by one, mimicking token generation. We can inspect the last chunk object to see something like this:

Code by the author, formatted using Carbon
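If you also want the assembled text at the end, a small variation on the loop above (a sketch; run it in place of the earlier loop, since the stream can only be consumed once) accumulates the deltas as they arrive:

full_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    full_text += delta
    print(delta, end="", flush=True)
print()
print("assembled response:", full_text)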

Putting it all together

Finally, in the gist below, you can see the entire code for the server.


