Add multimodal API such as using image as part of prompt #40

Closed
yaoyaoumbc opened this issue Sep 6, 2024 · 4 comments · Fixed by #71
Labels
enhancement New feature or request

Comments

@yaoyaoumbc

Gemini Nano XS claims to be multimodal, but I did not find any corresponding API in Chrome on desktop. Could you add such APIs? Thank you.

domenic added the enhancement (New feature or request) label on Oct 9, 2024
@basvandorst

+1

The current languageModel context/prefix is too specific and doesn't account for future AI capabilities like image/voice/video interactions. I'd suggest having a look at OpenAI (and others) to see how they aren't strictly tied to a "language model".

OpenAI

import OpenAI from "openai";

// Assumes OPENAI_API_KEY is set in the environment.
const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What's in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0].message.content);
}

main();

Claude

import Anthropic from "@anthropic-ai/sdk";

// Assumes ANTHROPIC_API_KEY is set in the environment; image_media_type
// (e.g. "image/jpeg") and image_data (base64-encoded bytes) are defined elsewhere.
const anthropic = new Anthropic();

const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: image_media_type,
            data: image_data,
          },
        },
      ],
    },
  ],
});

I think this will also be (part of) the solution for #8; let's not reinvent the wheel too much.

@domenic
Collaborator

domenic commented Jan 20, 2025

My initial design here was to follow the usual HTTP APIs (e.g. OpenAI, Anthropic, Gemini, etc.). That ended up looking something like

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  { type: "image", value: image }, // A user image message
  { role: "assistant", content: { type: "image", value: image } }, // An assistant image message
]);

Other cosmetic variations might be using MIME types instead of strings like "text" or "image" (but this seems kind of dumb; why make the web developer care about the difference between image/png and image/jpeg?), or using data instead of value.
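
For concreteness, those two variations would look something like this (purely illustrative; neither shape is being proposed here):

{ type: "image/png", value: image }  // MIME type instead of a broad "image" string
{ type: "image", data: image }       // data instead of value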

I didn't like the shape that we saw in OpenAI/Anthropic of having different fields per type, i.e. { type: "image", image: image } or { type: "audio", audio: audio }. That seemed unnecessary.

...But then I realized this was all unnecessary. Strings are different than images and audio! We can just use the type of the input.

So my current plan is the following:

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  image, // A user image message
  { role: "assistant", content: image }, // An assistant image message
]);

@domenic
Collaborator

domenic commented Jan 20, 2025

Wait, no, that doesn't work, because then we can't distinguish an image Blob from an audio Blob. OK, back to the initial design.
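
For illustration, here is where the ambiguity comes from (bytes is a hypothetical ArrayBuffer of media data):

const imageBlob = new Blob([bytes]); // might hold a PNG...
const audioBlob = new Blob([bytes]); // ...or an MP3; both are just Blobs
// Without an explicit type tag (and with blob.type often empty or unreliable),
// the API cannot tell whether this input is meant as image or audio:
await session.prompt([imageBlob]);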

domenic added a commit that referenced this issue Jan 20, 2025
Closes #40. Somewhat helps with #70.
@domenic
Collaborator

domenic commented Feb 4, 2025

In #71 (comment) @michaelwasserman points out that the proposal currently in #71 is more complicated than it needs to be, and suggests a simpler alternative. Let me outline the two possibilities in full.

Option 1: Double nesting

In this version the core example is

await session.prompt({
  role: "user",
  content: {
    type: "image",
    data: bytesOrImageBitmapOrWhatever
  }
});

Here type is one of "text", "image", or "audio".

This is inspired by, but not exactly the same as, various existing APIs for LLMs. For example:

  • OpenAI: { role, content: { type, typeSpecificProperty } }
    • type is one of "text", "image_url", or "input_audio"
    • typeSpecificProperty is one of text, image_url, and input_audio
    • Many others are compatible with this de-facto standard, e.g. Gemini, DeepSeek
  • Anthropic: { role, content: { type, typeSpecificProperty } }
    • type is one of "text", "image", "document", or some tool-use related ones
    • typeSpecificProperty is one of text or source
      • source is { data, media_type, type: "base64" }
  • Gemini: { role, parts: [{ typeSpecificProperty }] }
    • typeSpecificProperty is one of text, inlineData, fileData, or some others
    • inlineData is { mimeType, data }
    • As noted above Gemini also has an OpenAI-compatible version
  • Vercel AI SDK: { role, content: { type, typeSpecificProperty, mimeType } }
    • type is one of "text", "image", "file"
    • typeSpecificProperty is one of text, image, data
    • mimeType is omitted for type: "text" and optional for type: "image"
    • This one is especially interesting since it's a JavaScript API instead of an HTTP API

Exactly aligning with any of these APIs is not a good idea for the web. For example, we should not accept input as base64 strings or as URLs to remote resources; we should accept the web's existing types for image and audio inputs. And on the web, we rarely require you to specify the MIME type for images and audio, instead relying on MIME sniffing. The type + typeSpecificProperty design is also pretty strange; on the web we'd either do what Gemini did (only use typeSpecificProperty) or something more generic (our proposed type + data).
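
For example, under Option 1 a web developer could pass the platform's existing types directly, with no base64 encoding and no explicit MIME type (a sketch; the fetched URL is just a placeholder):

const imageBlob = await (await fetch("photo.png")).blob();
await session.prompt({
  role: "user",
  content: { type: "image", data: imageBlob }
});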

Given that, it's not clear to me whether semi-aligning with these APIs is important. Since we're not 100% aligned, people won't be able to exactly reuse existing code. So the benefit is some vague familiarity.

Option 2: unnest

In this version the core example is

await session.prompt({
  role: "user",
  type: "image",
  data: bytesOrImageBitmapOrWhatever  // or maybe `content` instead of `data`
});

This is just an unnested version of our proposal above. It seems pretty nice! Why make the developer nest if you don't need to, right? But it moves away from the { role, content } tuple that is familiar from almost all of the above APIs.

This might get slightly more awkward if we added some type-specific fields. In option 1, we could nest those under content. Now we'd make them a sibling of content. I can't think of too many examples here: most of the types we'd put into data already encapsulate any metadata like image width/height, audio sample rate, etc. One possible example is language, for text input (although even that is weak since the model should be able to figure it out). Another is OpenAI's detail parameter for image understanding. But keeping those as siblings, instead of nested under content, seems fine?
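
To make that concrete, here is how a hypothetical type-specific field (OpenAI-style detail, used purely as an illustration and not proposed here) would sit in each shape:

// Option 1: nested under content
await session.prompt({
  role: "user",
  content: { type: "image", data: image, detail: "high" }
});

// Option 2: a sibling of type and data
await session.prompt({
  role: "user",
  type: "image",
  data: image,
  detail: "high"
});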

Overall I'm leaning in this direction, but I wanted to give developers who are familiar with the various AI APIs a chance to weigh in.

Side note: abbreviated types

Both options allow abbreviated forms for simple string prompts, or for omitting the role: "user" part, so don't worry about that. Examples:

"a string" // = option 1 { role: "user", content: { type: "text", data: "a string" } } }
           // = option 2 { role: "user", type: "text", data: "a string" }

{ role: "assistant", content: "a string" } // = option 1 { role: "assistant", content: { type: "text", data: "a string" } }
{ role: "assistant", data: "a string" }    // = option 2 { role: "assistant", type: "text", data: "a string" }

{ type: "audio", data: someAudio }  // = option 1 { role: "user", content: { type: "audio", data: "a string" } }
                                    // = option 2 { role: "user", type: "audio" data: someAudio }
