Add multimodal API such as using image as part of prompt #40

Closed
yaoyaoumbc opened this issue Sep 6, 2024 · 4 comments · Fixed by #71
Labels
enhancement New feature or request

Comments

@yaoyaoumbc

Gemini Nano XS claims to be multimodal, but I did not find any corresponding API in Chrome on desktop. Could you add such APIs? Thank you.

domenic added the enhancement (New feature or request) label on Oct 9, 2024
@basvandorst

+1

The current languageModel context/prefix is too specific and doesn't account for future AI capabilities like image/voice/video interactions. I'd suggest having a look at OpenAI (and others) to see how they aren't strictly tied to a "language model".

OpenAI

import OpenAI from "openai";

// Assumes OPENAI_API_KEY is set in the environment.
const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What's in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0].message.content);
}

main();

Claude

import Anthropic from "@anthropic-ai/sdk";

// Assumes ANTHROPIC_API_KEY is set in the environment; image_media_type
// (e.g. "image/jpeg") and image_data (base64-encoded bytes) are defined elsewhere.
const anthropic = new Anthropic();

const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: image_media_type,
            data: image_data,
          },
        },
      ],
    },
  ],
});

I think this will also be (part of) the solution for #8; let's not reinvent the wheel too much.

@domenic
Collaborator

domenic commented Jan 20, 2025

My initial design here was to follow the usual HTTP APIs (e.g. OpenAI, Anthropic, Gemini, etc.). That ended up looking something like

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  { type: "image", value: image }, // A user image message
  { role: "assistant", content: { type: "image", value: image } }, // An assistant image message
]);

Other cosmetic variations might be using MIME types instead of strings like "text" or "image" (but this seems kind of dumb; why make the web developer care about the difference between image/png and image/jpeg?), or using data instead of value.
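
For concreteness, those two variations would look something like this (purely illustrative; neither shape is being proposed here):

{ type: "image/png", value: image }  // MIME type instead of a broad "image" string
{ type: "image", data: image }       // data instead of value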

I didn't like the shape that we saw in OpenAI/Anthropic of having different fields per type, i.e. { type: "image", image: image } or { type: "audio", audio: audio }. That seemed unnecessary.

...But then I realized this was all unnecessary. Strings are different than images and audio! We can just use the type of the input.

So my current plan is the following:

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  image, // A user image message
  { role: "assistant", content: image }, // An assistant image message
]);

@domenic
Collaborator

domenic commented Jan 20, 2025

Wait, no, that doesn't work, because then we can't distinguish an image Blob from an audio Blob. OK, back to the initial design.
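
For illustration, here is where the ambiguity comes from (bytes is a hypothetical ArrayBuffer of media data):

const imageBlob = new Blob([bytes]); // might hold a PNG...
const audioBlob = new Blob([bytes]); // ...or an MP3; both are just Blobs
// Without an explicit type tag (and with blob.type often empty or unreliable),
// the API cannot tell whether this input is meant as image or audio:
await session.prompt([imageBlob]);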

domenic added a commit that referenced this issue Jan 20, 2025
Closes #40. Somewhat helps with #70.
@domenic
Collaborator

domenic commented Feb 4, 2025

In #71 (comment) @michaelwasserman points out that the proposal currently in #71 is more complicated than it needs to be, and suggests a simpler alternative. Let me outline the two possibilities in full.

Option 1: Double nesting

In this version the core example is

await session.prompt({
  role: "user",
  content: {
    type: "image",
    data: bytesOrImageBitmapOrWhatever
  }
});

Here type is one of "text", "image", or "audio".

This is inspired by, but not exactly the same as, various existing APIs for LLMs. For example:

  • OpenAI: { role, content: { type, typeSpecificProperty } }
    • type is one of "text", "image_url", or "input_audio"
    • typeSpecificProperty is one of text, image_url, and input_audio
    • Many others are compatible with this de-facto standard, e.g. Gemini, DeepSeek
  • Anthropic: { role, content: { type, typeSpecificProperty } }
    • type is one of "text", "image", "document", or some tool-use related ones
    • typeSpecificProperty is one of text or source
      • source is { data, media_type, type: "base64" }
  • Gemini: { role, parts: [{ typeSpecificProperty }] }
    • typeSpecificProperty is one of text, inlineData, fileData, or some others
    • inlineData is { mimeType, data }
    • As noted above Gemini also has an OpenAI-compatible version
  • Vercel AI SDK: { role, content: { type, typeSpecificProperty, mimeType } }
    • type is one of "text", "image", "file"
    • typeSpecificProperty is one of text, image, data
    • mimeType is omitted for type: "text" and optional for type: "image"
    • This one is especially interesting since it's a JavaScript API instead of an HTTP API

Exactly aligning with any of these APIs is not a good idea for the web. For example, we should not accept input as base64 strings or as URLs to remote resources; we should accept the web's existing types for image and audio inputs. And on the web, we rarely require you to specify the MIME type for images and audio, instead relying on MIME sniffing. The type + typeSpecificProperty design is also pretty strange; on the web we'd either do what Gemini did (only use typeSpecificProperty) or something more generic (our proposed type + data).
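
For example, under Option 1 a web developer could pass the platform's existing types directly, with no base64 encoding and no explicit MIME type (a sketch; the fetched URL is just a placeholder):

const imageBlob = await (await fetch("photo.png")).blob();
await session.prompt({
  role: "user",
  content: { type: "image", data: imageBlob }
});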

Given that, it's not clear to me whether semi-aligning with these APIs is important. Since we're not 100% aligned, people won't be able to exactly reuse existing code. So the benefit is some vague familiarity.

Option 2: unnest

In this version the core example is

await session.prompt({
  role: "user",
  type: "image",
  data: bytesOrImageBitmapOrWhatever  // or maybe `content` instead of `data`
});

This is just an unnested version of our proposal above. It seems pretty nice! Why make the developer nest if you don't need to, right? But it moves away from the { role, content } tuple that is familiar from almost all of the above APIs.

This might get slightly more awkward if we added some type-specific fields. In option 1, we could nest those under content. Now we'd make them a sibling of content. I can't think of too many examples here: most of the types we'd put into data already encapsulate any metadata like image width/height, audio sample rate, etc. One possible example is language, for text input (although even that is weak since the model should be able to figure it out). Another is OpenAI's detail parameter for image understanding. But keeping those as siblings, instead of nested under content, seems fine?
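
To make that concrete, here is how a hypothetical type-specific field (OpenAI-style detail, used purely as an illustration and not proposed here) would sit in each shape:

// Option 1: nested under content
await session.prompt({
  role: "user",
  content: { type: "image", data: image, detail: "high" }
});

// Option 2: a sibling of type and data
await session.prompt({
  role: "user",
  type: "image",
  data: image,
  detail: "high"
});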

Overall I'm leaning in this direction, but I wanted to give developers who are familiar with the various AI APIs a chance to weigh in.

Side note: abbreviated types

Both options allow abbreviated forms for simple string prompts, or for omitting the role: "user" part, so don't worry about that. Examples:

"a string" // = option 1 { role: "user", content: { type: "text", data: "a string" } } }
           // = option 2 { role: "user", type: "text", data: "a string" }

{ role: "assistant", content: "a string" } // = option 1 { role: "assistant", content: { type: "text", data: "a string" } }
{ role: "assistant", data: "a string" }    // = option 2 { role: "assistant", type: "text", data: "a string" }

{ type: "audio", data: someAudio }  // = option 1 { role: "user", content: { type: "audio", data: "a string" } }
                                    // = option 2 { role: "user", type: "audio" data: someAudio }
