Add multimodal API such as using image as part of prompt #40
+1. The current OpenAI API looks like this:

```js
import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
}

main();
```

And Claude:

```js
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// image_media_type and image_data (a base64 string) are assumed to be
// prepared elsewhere by the application.
const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: image_media_type,
            data: image_data,
          },
        },
      ],
    },
  ],
});
```

I think this will also be (partially) the solution for #8; let's not reinvent the wheel too much.
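For completeness, here is a hedged sketch (not from the Anthropic docs; the URL is a placeholder) of one way the `image_media_type` and `image_data` values used above could be prepared in Node.js:

```js
// Hypothetical preparation of the variables referenced in the Anthropic example.
// Assumes Node.js 18+ (global fetch and Buffer) and an ES module (top-level await).
const imageUrl = "https://example.com/photo.jpg"; // placeholder URL
const imageResponse = await fetch(imageUrl);
const image_media_type = imageResponse.headers.get("content-type"); // e.g. "image/jpeg"
const image_data = Buffer.from(await imageResponse.arrayBuffer()).toString("base64");
```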
---

My initial design here was to follow the usual HTTP APIs (e.g. OpenAI, Anthropic, Gemini, etc.). That ended up looking something like:

```js
await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  { type: "image", value: image },                               // A user image message
  { role: "assistant", content: { type: "image", value: image } } // An assistant image message
]);
```

Other cosmetic variations might be using MIME types instead of plain type strings. I didn't like the shape that we saw in OpenAI/Anthropic of having different fields per type. But then I realized this was all unnecessary: strings are different than images and audio! We can just use the type of the input. So my current plan is the following:

```js
await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  image,                                 // A user image message
  { role: "assistant", content: image }  // An assistant image message
]);
```
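As a rough illustration only (the helper name, defaults, and accepted types here are assumptions, not part of the proposal), dispatching on the JavaScript type of each entry could look roughly like this:

```js
// Hypothetical sketch: normalize a prompt entry based on its JavaScript type.
// ImageBitmap/Blob stand in for whichever image types end up being accepted.
function normalizeEntry(entry) {
  if (typeof entry === "string") {
    // Bare strings become user text messages.
    return { role: "user", content: { type: "text", value: entry } };
  }
  if (entry instanceof ImageBitmap || entry instanceof Blob) {
    // Bare images become user image messages.
    return { role: "user", content: { type: "image", value: entry } };
  }
  // Otherwise assume it is already a { role, content } message object.
  return entry;
}
```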
---

Wait, no, that doesn't work, because then we can't distinguish an image …
---

In #71 (comment), @michaelwasserman points out that the proposal currently in #71 is more complicated than it needs to be, and suggests a simpler alternative. Let me outline the two possibilities in full.

**Option 1: Double nesting**

In this version the core example is:

```js
await session.prompt({
  role: "user",
  content: {
    type: "image",
    data: bytesOrImageBitmapOrWhatever
  }
});
```

Here, the message's `content` wraps a `{ type, data }` object, hence the double nesting. This is inspired by, but not exactly the same as, various existing APIs for LLMs (e.g. the OpenAI and Anthropic request shapes quoted earlier in this thread).
Exactly aligning with any of these APIs is not a good idea for the web. For example, we should not accept input as base64 strings or as URLs to remote resources; we should accept the web's existing types for image and audio inputs. And on the web, we rarely require you to specify the MIME type for images and audio, instead relying on MIME sniffing.

Given that, it's not clear to me whether semi-aligning with these APIs is important. Since we're not 100% aligned, people won't be able to exactly reuse existing code. So the benefit is some vague familiarity.
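As a rough sketch of what "accept the web's existing types" could mean in practice (the exact accepted types are an open question; `createImageBitmap` and `decodeAudioData` are just plausible sources, and the `session.prompt()` call below uses the option 1 strawman shape, not a shipped API):

```js
// Hedged sketch: produce web-native image/audio values instead of base64 or URLs.
const imageBlob = await (await fetch("photo.jpg")).blob();
const imageBitmap = await createImageBitmap(imageBlob); // no MIME type needed; sniffed from the bytes

const audioCtx = new AudioContext();
const audioBuffer = await audioCtx.decodeAudioData(
  await (await fetch("clip.mp3")).arrayBuffer()
);

// Feeding the image into the option 1 shape discussed above:
await session.prompt({
  role: "user",
  content: { type: "image", data: imageBitmap }
});
```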
role: "user",
type: "image",
data: bytesOrImageBitmapOrWhatever // or maybe `content` instead of `data`
}); This is just an unnested version of our proposal above. It seems pretty nice! Why make the developer nest if you don't need to, right? But it moves away from the This might get slightly more awkward if we added some type-specific fields. In option 1, we could nest those under Overall I'm leaning in this direction, but I wanted to give developers who are familiar with the various AI APIs a chance to weigh in. Side note: abbreviated typesBoth versions allow abbreviated versions for simple string prompts, or for omitting the "a string" // = option 1 { role: "user", content: { type: "text", data: "a string" } } }
// = option 2 { role: "user", type: "text", data: "a string" }
{ role: "assistant", content: "a string" } // = option 1 { role: "assistant", content: { type: "text", data: "a string" } }
{ role: "assistant", data: "a string" } // = option 2 { role: "assistant", type: "text", data: "a string" }
{ type: "audio", data: someAudio } // = option 1 { role: "user", content: { type: "audio", data: "a string" } }
// = option 2 { role: "user", type: "audio" data: someAudio } |
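A small illustrative helper (not part of either proposal; the `"user"` and `"text"` defaults are read off the examples above) showing how the option 2 abbreviations could be expanded:

```js
// Hypothetical sketch: expand an abbreviated prompt entry into the full option 2 shape.
function expandOption2(entry) {
  if (typeof entry === "string") {
    return { role: "user", type: "text", data: entry };
  }
  return {
    role: entry.role ?? "user",
    type: entry.type ?? "text",
    data: entry.data ?? entry.content // treat `content` as an alias in the abbreviated form
  };
}

expandOption2("a string");
// → { role: "user", type: "text", data: "a string" }
expandOption2({ role: "assistant", content: "a string" });
// → { role: "assistant", type: "text", data: "a string" }
```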
---

Gemini Nano XS claims to be multimodal, but I did not find any corresponding API in Chrome on desktop. Could you add such APIs? Thank you.